DATA 303/473 Assignment 4 -Solved

Background and Data 
When data are collected through surveys involving human subjects, those who design the survey need to 
think carefully about factors that might influence whether respondents refuse to provide specific information 
or are unable to do so, a problem known as “non-response”. Respondents are less likely to provide responses 
to questions on sensitive subjects. One such sensitive subject is income, with Juster et al. (2006) estimating 
that roughly one-third of questions related to income result in non-response, although this is highly variable 
from survey to survey. In the United States, Lillard, Smith, and Welch (1986) reported non-response for total 
income of 2.5% in the 1940 Current Population Survey (CPS) and then a steady increase to 26.6% for the 
1982 CPS. This high non-response rate has seemingly stabilized or decreased around the turn of the century 
with Moore, Stinson, and Welniak (2000) reporting a non-response rate for income questions of roughly 25% 
for the 1996 CPS and Dixon (2005) showing non-response for income questions to have dropped to 14.2% by 
the 2002-2003 CPS. 
In the context of sub-Saharan African countries, Argent (2009) reported non-response according to a variety 
of income categories in the National Income Dynamics Study in South Africa with non-response rates ranging 
from 2.3% to 52.4%. In Mozambique, Fonseca (2014) reported a non-response rate of 39.5% for income 
questions for household surveys administered to 1,710 households across 68 communities as part of the 
WASHCost programme. Although estimates were not provided, Fonseca reported non-response rates to 
be even higher for income questions for similar surveys administered to households in Ghana as part of 
WASHCost. 
Although the reasons for non-response may be varied, non-response to income-related questions is generally 
believed to be related to income of the respondent and so not missing at random. Lillard, Smith, and Welch 
(1986) found non-response to income-related questions to increase with income of the respondent. Biewen 
(2001), on the other hand, found non-response to income-related questions to be highest for those in the 
tails of the income distribution. In his examination of the German Socio-Economic Panel (GSOEP) study, 
Schräpler (2004) looked at refusals and responses of “don’t know” to income-related questions and found that 
refusals were significantly higher with those reporting vocational positions classified as “high” (e.g., executives, 
civil servants) while responses of “don’t know” were significantly higher with those reporting vocational 
positions classified as “low” (e.g., unskilled workers). As there is likely to be a fairly strong relationship 
between vocation classification and income, this result would appear to be in line with the findings of Biewen 
(2001). And Argent (2009) noted that there “is a general consensus that refusals to income questions are 
unlikely to be random with respect to income, with those of very high and very low incomes being less likely 
to respond,” also in agreement with Biewen (2001). 
In this assignment, we consider income data collected as part of a study carried out in three towns in northern 
Mozambique. This study sought to understand how much people would be willing to pay for water piped to 
their premises and factors that may influence how much they would be willing to pay. At the end of the 
survey, participants were asked a number of income-related questions, and our focus will be on factors that 
are associated with whether respondents provide a numeric value for their total income. A subset of variables 
collected as part of this study is contained in the dataset WTP.csv, and a list of the variables is presented in 
the variable table below. Like many social surveys, there are a variety of special codes used in this dataset 
to reflect different types of missing data, and these are as follows: 
Code    Description 
-1      Respondent refused to answer the question. 
9998    The question is not applicable to the respondent. This is most commonly due to a 
        response to a previous question. 
9999    The respondent specifies that they do not know. 
NA      A response was not recorded. 
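
As a minimal sketch of how the special codes above might be handled in R, the block below recodes -1, 9998, and 9999 to NA. The data frame `toy` and its values are made up for illustration and are not taken from WTP.csv. The blanket recoding assumes that none of these codes is a genuine value for the variables involved, which should be checked variable by variable.

```r
# Toy data frame standing in for two columns of WTP.csv (values are made up).
toy <- data.frame(
  AGE          = c(34, 9999, 51, -1, NA),
  TOTAL_INCOME = c(2500, -1, 9999, 4000, 9998)
)

# Special codes listed in the table above.
special_codes <- c(-1, 9998, 9999)

# Replace every special code with NA, column by column.
# Assumption: no column uses these codes as a genuine value.
toy_clean <- toy
toy_clean[] <- lapply(toy_clean, function(x) {
  x[x %in% special_codes] <- NA
  x
})

# Count missing values per column after recoding.
colSums(is.na(toy_clean))
```

Recoding to NA first means that functions such as `is.na()` and `complete.cases()` treat all four kinds of missing data in the same way.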
Variable            Description 
HH                  Household identifier for a given enumeration area (1-15) 
DAY                 Day of interview 
TOWN                Town (0 = “Nampula”, 1 = “Liupo”, 2 = “Ribaue”) 
YEARS               Years household has lived in <TOWN> 
SEX                 Sex of the respondent (0 = “Male”, 1 = “Female”) 
AGE                 Age of the respondent (in years) 
STATUS              Marital status of the respondent (0 = “Single”, 1 = “Married”, 2 = “Marital union”, 
                    3 = “Divorced”, 4 = “Separated”, 5 = “Widowed”) 
EDUC                Education level of the respondent (0 = “None”, 1 = “Primary of the 1st degree”, 
                    2 = “Primary of the 2nd degree”, 3 = “Secondary of the 1st degree”, 
                    4 = “Secondary of the 2nd degree”, 5 = “Higher level”) 
DISABLED            Disability status of the respondent (0 = “None”, 1 = “Physical”, 2 = “Sight/visual”, 
                    3 = “Other sensory”, 4 = “Mental”) 
HEAD                Is the respondent the head of the household? (0 = “No”, 1 = “Yes”) 
SEX_HH              Sex of the head of the household (0 = “Male”, 1 = “Female”) 
AGE_HH              Age of the head of the household 
STATUS_HH           Marital status of the head of the household 
EDUC_HH             Education level of the head of the household 
DISABLED_HH         Disability status of the head of the household 
AGE_HH_SPOUSE       Age of the spouse of the head of the household 
EDUC_HH_SPOUSE      Education level of the spouse of the head of the household 
DISABLED_HH_SPOUSE  Disability status of the spouse of the head of the household 
HH_SIZE             Number of persons regularly living in the household 
N_ADULTS            Number of adults regularly living in the household 
PRIMARY_WS          Primary water source used by household (1 = “Tap in the house”, 2 = “Tap in the 
                    yard”, 3 = “Public tap”, 4 = “Tap of a neighbour”, 5 = “Public borehole”, 6 = “Well”, 
                    7 = “Protected spring”, 8 = “Unprotected spring”, 9 = “River, lake, or stream”). 
                    These are considered to be hierarchical with 1 considered to be the best type of water 
                    source and 9 the worst 
SUFFICIENT_WATER    Does the household have sufficient access to water for its daily needs? (0 = “No”, 
                    1 = “Yes”) 
TOTAL_TIME          Total time required to travel to water source, queue for water and return home when 
                    collecting water. 
PAY_WATER           Does the household pay for water? (0 = “No”, 1 = “Yes”) 
TOTAL_COST          Average amount (in Mozambican meticals) the household spends each month on 
                    water-related costs (including the cost of water, water treatment, transportation of 
                    water to the home, etc.) 
ELECTRIC            Is the household connected to the electrical grid? (0 = “No”, 1 = “Yes”) 
TOTAL_INCOME        Average total monthly income of the household (in Mozambican meticals) 
TIME_LENGTH         How long did the survey take to complete (in minutes)? 
The data are available in the file WTP.csv, which can be read into R using the code below but with the path 
changed to point to the location of the file on your computer. 
# Read in the Mozambican willingness to pay dataset. 
wtp <- read.csv("WTP.csv") 
Assignment Questions 
1. Data pre-processing: 
a. (2 marks) The variable TOTAL_INCOME records the numeric value for total income for respondents 
who could or were willing to provide this information. Use this variable to add a new variable 
INCOME_NONRESPONSE to the data frame wtp. The new variable INCOME_NONRESPONSE should be a 
binary variable indicating whether the person did not provide numeric total income data (0 = 
“Provided numeric income data”, 1 = “Did not provide numeric income data”). Show your code to 
produce this new variable as well as a table of the frequency of outcomes of 0 and 1. 
b. (2 marks) We will restrict our focus to a subset of demographic variables and variables that are 
generally associated with income (and so can be considered as proxies for income). In particular, 
we will consider a reduced dataset consisting only of the following nine variables: 
TOWN SEX AGE EDUC HEAD 
PAY_WATER ELECTRIC TIME_LENGTH INCOME_NONRESPONSE 
Create a new data frame wtp.reduced that consists of only these variables. Show your code to 
produce this new data frame. 
c. Now create a new data frame called wtp.complete, which only keeps respondents/observations 
from wtp.reduced that have no missing data. Show your code to produce 
this new data frame. (Note: Pay close attention to special codes for missing values.) In 
total, what proportion (to 3dp) of respondents/observations have been removed from the original 
dataset to produce this final data frame? 
d. Which variables contained in wtp.complete are factors? List these variables, and 
show code to overwrite these variables in the data frame wtp.complete so that they are recognised 
by R as factors. (Note: DO NOT convert the variable INCOME_NONRESPONSE to a factor 
and overwrite the original variable.) 
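
The pre-processing steps in this question can be sketched on a small made-up data frame. The names `toy`, `reduced`, and `complete` are hypothetical, the values are invented, and only one predictor is carried along to keep the example short. Treating the “not applicable” code 9998 as non-response is an assumption made here for illustration.

```r
# Toy data (made-up values); -1, 9998, 9999 and NA are the special codes.
toy <- data.frame(
  TOTAL_INCOME = c(2500, -1, 9999, 4000, NA, 9998),
  SEX          = c(0, 1, 1, 0, -1, 0)
)
special_codes <- c(-1, 9998, 9999)

# 1 = did not provide numeric income, 0 = provided numeric income.
# Assumption: every special code, including 9998, counts as non-response.
toy$INCOME_NONRESPONSE <- as.integer(
  is.na(toy$TOTAL_INCOME) | toy$TOTAL_INCOME %in% special_codes
)
table(toy$INCOME_NONRESPONSE)

# Keep the variables of interest, recode special codes to NA, drop incomplete rows.
reduced <- toy[, c("SEX", "INCOME_NONRESPONSE")]
reduced$SEX[reduced$SEX %in% special_codes] <- NA
complete <- reduced[complete.cases(reduced), ]

# Proportion of rows removed relative to the original data (to 3dp).
prop_removed <- round(1 - nrow(complete) / nrow(toy), 3)
prop_removed

# Store a coded categorical variable as a factor with readable labels.
complete$SEX <- factor(complete$SEX, levels = c(0, 1), labels = c("Male", "Female"))
str(complete)
```

The indicator is built before any rows are dropped, so that it reflects the original TOTAL_INCOME column rather than the reduced data.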
2. Inferential analysis:
Now we will focus on how non-response to a question asking for a numeric value for the total average 
monthly income of the household is related to demographic factors of the respondent and proxies for 
income. 
a. (3 marks) Fit a logistic regression model of income non-response (INCOME_NONRESPONSE) on sex 
of the respondent (SEX), their age (AGE), highest level of education completed (EDUC), whether the 
household pays for water (PAY_WATER), and whether the household is connected to the electrical 
grid (ELECTRIC). For this logistic regression model, calculate the variance inflation factors for 
predictors (to 3dp) to determine whether or not there is evidence of significant multicollinearity 
among the predictors in the model. If so, comment on which predictor(s) should be removed, and 
use this model for subsequent parts of this question. 
b. Provide summary output for the logistic regression model specified in part (a). Explain 
what you can conclude based on Wald tests of coefficients. Provide evidence to support your 
conclusion. 
c. For any significant Wald tests in part (b), provide a precise interpretation of what the 
estimated coefficient suggests about the “effect” of the predictor on the response, and calculate a 
corresponding 95% confidence interval (to 3dp) for the estimated “effect”. 
d. (3 marks) Fit the model considered in part (a) but additionally include interactions between 
• sex of the respondent and whether the household pays for water and 
• sex of the respondent and whether the household is connected to the electrical grid. 
Provide summary output for this model. For this model, explain what it means for sex of the 
respondent to interact with i) whether the household pays for water and ii) whether the household 
is connected to the electrical grid. 
e. (3 marks) Perform an appropriate test to determine if the logistic regression model fit in part (d) 
provides a significantly better fit than the model that was fit in part (a). Be sure to write out the 
full form of the logistic regression models fit in parts (a) and (d), clearly explaining what variables 
represent, and state 
i) the hypotheses of the test, 
ii) the value of the test statistic, 
iii) the distribution of the test statistic, 
iv) the p-value of the test, and 
v) your conclusion. 
f. (3 marks) Finally, for the best model of the two you fit (in parts (a) and (d)), perform a 
Hosmer-Lemeshow test for g = 5, 10, and 15 groups, and comment on what these suggest about 
the goodness-of-fit of this model to the income non-response data. 
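
The kind of workflow described in this question can be sketched in base R on simulated data. The data frame `sim`, the data-generating coefficients, and the reduced set of predictors are all assumptions chosen for illustration and are not the WTP data. The sketch fits a main-effects logistic regression, reports Wald tests and odds ratios with Wald 95% confidence intervals, and compares it with a nested interaction model using a likelihood ratio test, whose statistic is the difference in deviances and is compared against a chi-squared distribution with degrees of freedom equal to the number of added parameters.

```r
set.seed(1)
n <- 400

# Simulated stand-in data (not the WTP data); names mirror the assignment.
sim <- data.frame(
  SEX      = factor(rbinom(n, 1, 0.5), labels = c("Male", "Female")),
  AGE      = round(runif(n, 18, 80)),
  ELECTRIC = factor(rbinom(n, 1, 0.4), labels = c("No", "Yes"))
)

# Assumed data-generating model, chosen only for illustration.
eta <- -1 + 0.8 * (sim$SEX == "Female") + 0.01 * sim$AGE
sim$INCOME_NONRESPONSE <- rbinom(n, 1, plogis(eta))

# Main-effects model and a model adding a SEX:ELECTRIC interaction.
fit_main <- glm(INCOME_NONRESPONSE ~ SEX + AGE + ELECTRIC,
                family = binomial, data = sim)
fit_int  <- update(fit_main, . ~ . + SEX:ELECTRIC)

# Wald z-tests for each coefficient.
summary(fit_main)$coefficients

# Odds ratios with Wald 95% confidence intervals (to 3dp).
round(exp(cbind(OR = coef(fit_main), confint.default(fit_main))), 3)

# Likelihood ratio test of the nested models.
lrt <- anova(fit_main, fit_int, test = "Chisq")
lrt
```

Variance inflation factors and the Hosmer-Lemeshow test are not part of base R. Add-on packages such as car (`vif()`) and ResourceSelection (`hoslem.test()`) provide them, and they are left out here to keep the sketch self-contained.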
3. Statistical learning:
Now we perform an exploratory analysis to try to identify the best set of predictors in predicting 
whether a respondent will not report a numeric value for income. Consider as predictors all variables in 
wtp.complete (other than the outcome of interest, INCOME_NONRESPONSE). 
a. Find the optimal models identified by forward and backward selection algorithms. 
Report the predictors included in these optimal models. If these models are different, highlight 
how they differ, and explain why forward and backward selection algorithms may not arrive at the 
same optimal model. 
b. Find the optimal models identified by best subset selection using AIC and BIC as 
selection criteria. Report the predictors included in these optimal models. If these models are 
different, highlight how they differ, and explain why the criteria of AIC and BIC may lead to 
different “best” models. 
c. Consider all possible combinations of the eight predictor variables for a cross-validation 
routine to select the optimal model(s) based on maximising area under the receiver operating 
characteristic curve (AUC). Use 50 repetitions of 10-fold cross-validation. If this model (or these 
models) differ from those identified as “best” in parts (a) and (b), explain why this may be the
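
Forward and backward selection of the kind mentioned in part (a) can be sketched with base R's `step()` on simulated data. The predictors `x1`, `x2`, `x3` and the data-generating model are hypothetical. The last call uses the penalty `k = log(n)`, which makes the stepwise search use a BIC-type criterion. This is still a stepwise search and is not the best subset selection asked for in part (b).

```r
set.seed(2)
n <- 300

# Simulated data (hypothetical predictors); only x1 truly affects the outcome.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1))

# Smallest and largest models considered by the search.
null_fit <- glm(y ~ 1, family = binomial, data = sim)
full_fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = sim)

# Forward selection starts from the null model and adds terms by AIC.
forward <- step(null_fit,
                scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                direction = "forward", trace = 0)

# Backward selection starts from the full model and drops terms by AIC.
backward <- step(full_fit, direction = "backward", trace = 0)

# Same backward search with a BIC-type penalty (k = log(n)).
backward_bic <- step(full_fit, direction = "backward", k = log(n), trace = 0)

formula(forward)
formula(backward)
formula(backward_bic)
```

Because each search only ever adds or only ever drops one term at a time, the two directions visit different sequences of models and need not end at the same one.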
