DATA 303/473 Assignment 1 - Solved

Q1. Data on US cancer mortality rates for over 3000 counties are available in the dataset
cancer_reg.csv on Blackboard. The data were obtained from the Data World website
(https://data.world/nrippner/ols-regression-challenge). Read the dataset into R and use it to answer the questions
that follow. We’ll use the subset of variables listed below:
• incidencerate: Mean per capita (100,000) cancer diagnoses [1]
• medincome: Median annual income (dollars) per county [2]
• povertypercent: Percent of county population in poverty [2]
• studypercap: Per capita number of cancer-related clinical trials per county [1]
• medianage: Median age (in years) of county residents [2]
• pctunemployed16_over: Percent of county residents aged 16 and over that are unemployed [2]
• pctprivatecoverage: Percent of county residents with private health coverage [2]
• pctbachdeg25_over: Percent of county residents aged 25 and over with bachelor’s degree as highest education attained [2]
• target_deathrate: Response variable. Mean per capita (100,000) cancer mortalities [1]
[1] Years 2010-2016. [2] 2013 Census Estimates.
a. Create a new dataset called cancer2 that contains only the subset of variables listed above. 
Based on a summary of the variables in the dataset and the plots below, identify any variable or 
variables that have obviously incorrect values. For the variables you identify, write and implement code 
to filter out the incorrect values. Give the number of observations left in the dataset.
[Figure: scatterplots of cancer mortality (vertical axis) against each predictor: mean cancer diagnoses per 100,000; median income per county; percent of population in poverty; number of cancer-related clinical trials per county; median age of county; % aged 16 and over who are unemployed; % with private health coverage; % aged 25 and over with Bachelor's degree as highest qualification.]
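The select-then-filter workflow in part (a) could be sketched in base R as follows. The toy data frame, the column other_column and the cut-off of 100 years for medianage are assumptions made for illustration only. With the real data, read.csv("cancer_reg.csv") would replace the toy frame, and the summary and plots should decide which variable to filter and where to put the threshold.

```r
# Toy stand-in for read.csv("cancer_reg.csv"); all values are made up.
cancer <- data.frame(
  incidencerate    = c(450, 480, 430, 510),
  medianage        = c(39.5, 41.2, 498.0, 36.8),  # 498 cannot be a real age
  target_deathrate = c(170, 185, 160, 190),
  other_column     = c("a", "b", "c", "d")        # hypothetical unwanted column
)

# Keep only the variables of interest (extend this vector to the full list).
keep_vars <- c("incidencerate", "medianage", "target_deathrate")
cancer2 <- cancer[, keep_vars]

summary(cancer2)  # look for minima or maxima that cannot be real

# Assumed rule for illustration: a county median age above 100 is an error.
cancer2 <- subset(cancer2, medianage <= 100)
nrow(cancer2)     # number of observations left
```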
b. Some data cleaning is done on cancer2 and a new dataset cancer3.csv (available on
Blackboard) is created. Construct a scatterplot matrix of all variables in the new dataset. List any 
key points of note from the scatterplot matrix, including any considerations you might make during a 
regression analysis. 
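One way to draw the scatterplot matrix asked for in part (b) is base R's pairs(). The sketch below runs on simulated data with only three of the variables, so the numbers carry no meaning. With the real file, read.csv("cancer3.csv") would supply the data frame.

```r
set.seed(1)
# Simulated stand-in for read.csv("cancer3.csv"); three variables only.
n <- 50
cancer3 <- data.frame(
  medincome      = rnorm(n, 47000, 12000),
  povertypercent = rnorm(n, 17, 6)
)
cancer3$target_deathrate <- 180 + 1.5 * cancer3$povertypercent + rnorm(n, 0, 20)

# Scatterplot matrix of every pair of variables.
pairs(cancer3, pch = 20, cex = 0.6)

# Pairwise correlations give a numeric companion to the plot.
cor_mat <- round(cor(cancer3), 2)
cor_mat
```

Strong correlations between pairs of predictors in such a matrix are worth noting, since they can signal multicollinearity in a later regression.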

c. Fit a linear model to the data in cancer3, including all predictors with no transformations
or interactions. Present a summary of the model in a table. Give an estimate of σ², the error variance.
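A minimal sketch of part (c), assuming a toy data frame with hypothetical columns x1, x2 and y in place of cancer3. It shows how the coefficient table and the error variance estimate can be pulled from a fitted lm object.

```r
set.seed(2)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 3)

fit <- lm(y ~ ., data = toy)   # "." means all other columns as predictors

# Estimates, standard errors, t values and p-values as a matrix.
coef_table <- summary(fit)$coefficients
round(coef_table, 3)

# Error variance estimate: residual sum of squares over residual df.
# This equals the squared "Residual standard error" reported by summary().
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)
sigma2_hat
```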
d. Suppose two counties differ by 1 per 100,000 in mean cancer diagnoses with all else being
equal. Based on the model fitted in part (c), what is the difference in expected cancer mortality for
these two counties?
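The reasoning behind part (d) can be checked numerically. In a linear model with no interactions, the difference in expected response between two units that differ by one unit in a single predictor is that predictor's coefficient. The sketch below uses simulated data and hypothetical names x1 and x2.

```r
set.seed(3)
n <- 80
toy <- data.frame(x1 = rnorm(n, 450, 50), x2 = rnorm(n, 17, 5))
toy$y <- 100 + 0.2 * toy$x1 + 1.5 * toy$x2 + rnorm(n, sd = 10)
fit <- lm(y ~ x1 + x2, data = toy)

# Two hypothetical units that differ by 1 in x1 only.
pair <- data.frame(x1 = c(450, 451), x2 = c(17, 17))
diff_pred <- diff(predict(fit, newdata = pair))

# The difference in fitted means matches the x1 coefficient.
c(difference = unname(diff_pred), coefficient = unname(coef(fit)["x1"]))
```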
e. Does it make practical sense to interpret the intercept for the model in part (c)? Justify 
your answer. 
f. The model fitted in part (c) is to be used to predict cancer mortality for a county with
the predictor values below. Obtain 95% confidence and prediction intervals for such a county. Explain
briefly why the prediction interval is wider than the confidence interval.
• incidencerate: 452 
• medincome: 23000 
• povertypercent: 16 
• studypercap: 150 
• medianage: 40 
• pctunemployed16_over: 8 
• pctprivatecoverage: 70 
• pctbachdeg25_over: 50 
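Both intervals in part (f) are available from predict() on an lm object. The sketch below is fitted to simulated data with two hypothetical predictors, x1 and x2. For the real model, the new data frame would need one column for each of the eight predictors listed above.

```r
set.seed(4)
n <- 120
toy <- data.frame(x1 = rnorm(n, 450, 50), x2 = rnorm(n, 17, 5))
toy$y <- 100 + 0.2 * toy$x1 + 1.5 * toy$x2 + rnorm(n, sd = 10)
fit <- lm(y ~ x1 + x2, data = toy)

# One row whose column names match the predictors in the model.
new_county <- data.frame(x1 = 452, x2 = 16)

conf_int <- predict(fit, newdata = new_county,
                    interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_county,
                    interval = "prediction", level = 0.95)
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

The confidence interval covers the mean response at those predictor values. The prediction interval also allows for the error term of one new observation, so it is the wider of the two.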
g. (3 marks) Assuming all regression assumptions hold, are the intervals you obtained in part (f) likely 
to be valid? Explain your answer briefly.
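For part (g), one thing worth checking is whether the new predictor values lie inside the range of the data used to fit the model, because intervals for extrapolated points are less trustworthy. The sketch below uses simulated data and hypothetical columns x1 and x2. It says nothing about what the real cancer3 ranges are.

```r
set.seed(5)
toy <- data.frame(x1 = runif(60, 30000, 90000), x2 = runif(60, 5, 40))
new_obs <- data.frame(x1 = 23000, x2 = 50)

# For each predictor: observed minimum, observed maximum and the new value.
ranges <- sapply(names(new_obs), function(v) {
  c(min = min(toy[[v]]), max = max(toy[[v]]), new = new_obs[[v]])
})
ranges

# TRUE flags a predictor value outside the observed range.
outside <- ranges["new", ] < ranges["min", ] | ranges["new", ] > ranges["max", ]
outside
```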
h. (3 marks) Based on a global usefulness test, is it worth going on to further analyse and interpret a 
model of target_deathrate against each of the predictors? Carry out the test, give the conclusion 
and justify your answer. 
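The global usefulness test in part (h) is the overall F test reported at the foot of summary(). The sketch below, on simulated data with hypothetical predictors x1 and x2, extracts the statistic and its p-value and shows the equivalent comparison with an intercept-only model.

```r
set.seed(6)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = toy)

# summary() stores the F statistic with its numerator and denominator df.
f <- summary(fit)$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
p_value

# Equivalent test: compare against the intercept-only model.
null_fit <- lm(y ~ 1, data = toy)
f_table <- anova(null_fit, fit)
f_table
```

The null hypothesis is that all slope coefficients are zero. A small p-value is evidence that at least one predictor is useful.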
i. The plots below are constructed from the cleaned dataset cancer3. Which predictors, if
any, would you consider applying log or polynomial transformations to? Explain your answer briefly.
[Figure: scatterplots of cancer mortality (vertical axis) against each predictor in the cleaned dataset cancer3: mean cancer diagnoses per 100,000; median income per county; percent of population in poverty; number of cancer-related clinical trials per county; median age of county; % aged 16 and over who are unemployed; % with private health coverage; % aged 25 and over with Bachelor's degree as highest qualification.]
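Log and polynomial terms for part (i) can be written directly in the model formula. The sketch below uses simulated data in which the true relationship is logarithmic, so the outcome is by construction. The names x and y are hypothetical, and adjusted R-squared is only one of several possible ways to compare the fits.

```r
set.seed(7)
n <- 150
toy <- data.frame(x = runif(n, 1, 100))
toy$y <- 50 + 10 * log(toy$x) + rnorm(n, sd = 2)

fit_linear <- lm(y ~ x, data = toy)
fit_log    <- lm(y ~ log(x), data = toy)      # log-transformed predictor
fit_quad   <- lm(y ~ poly(x, 2), data = toy)  # quadratic polynomial in x

# Adjusted R-squared for each candidate model.
adj_r2 <- sapply(
  list(linear = fit_linear, log = fit_log, quadratic = fit_quad),
  function(m) summary(m)$adj.r.squared
)
adj_r2
```

A log transformation is usually considered for a right-skewed predictor whose points bunch near zero, and a polynomial term for a relationship that is visibly curved.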
Q2. Francis Galton’s 1866 dataset (cleaned) lists individual observations on height for 899
children. Galton coined the term “regression” following his study of how children’s heights related to heights
of their parents. The data are available in the file galton.csv and contain the following variables:
• familyID: Family ID 
• father: Height of father 
• mother: Height of mother 
• gender: gender of child 
• height: Height of child 
• kids: Number of children in family
• midparent: Mid-parent height calculated as (father + 1.08*mother)/2
• adltchld: height if gender = M, otherwise 1.08*height if gender = F
All heights are measured in inches. 
a. Read the data into R and fit a linear model for height with the variables father, mother,
gender, kids and midparent as predictors. Provide a summary of the fitted model. You will notice
that estimates for midparent are listed as NA. Why might this be the case and what regression problem 
does this point to? 
b. (2 marks) What action might you take to resolve the problem identified in part (a)?
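The NA estimate described in part (a) can be reproduced on simulated heights. In the sketch below, midparent is built from father and mother with the formula given in the variable list, so it is an exact linear combination of two other predictors. The data values are made up, and dropping midparent is one possible remedy rather than the only one.

```r
set.seed(8)
n <- 60
toy <- data.frame(father = rnorm(n, 69, 2.5), mother = rnorm(n, 64, 2.3))
toy$midparent <- (toy$father + 1.08 * toy$mother) / 2  # exact linear combination
toy$height <- 20 + 0.4 * toy$father + 0.3 * toy$mother + rnorm(n, sd = 2)

fit <- lm(height ~ father + mother + midparent, data = toy)
coef(fit)    # the midparent entry is NA
alias(fit)   # reports midparent as a combination of father and mother

# One possible remedy: refit without the redundant predictor.
fit2 <- update(fit, . ~ . - midparent)
coef(fit2)
```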
c. (2 marks) Based on the model fitted in part (a) give an interpretation of the coefficient for genderM.
d. (2 marks) Determine the number of families in the dataset. 
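Counting families in part (d) amounts to counting distinct values of familyID. A minimal sketch on a made-up six-row data frame follows. With the real data, read.csv("galton.csv") would supply the familyID column.

```r
# Toy stand-in for the galton data; IDs and heights are made up.
toy <- data.frame(
  familyID = c("001", "001", "002", "003", "003", "003"),
  height   = c(73.2, 69.0, 65.5, 71.0, 68.5, 64.0)
)

n_families <- length(unique(toy$familyID))
n_families

# Number of children recorded per family.
table(toy$familyID)
```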
e. (3 marks) The problem in part (a) is resolved and a new linear model is fitted. No observations are
excluded. The plots below are obtained to investigate regression assumptions for this new model. Based
on your answer in part (d) and the plots below, do the data meet all the regression assumptions?
Explain your answer briefly.
[Figure: regression diagnostic plots for the new model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance). Labelled observations include 60, 126, 289, 479 and 815.]
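Diagnostic panels like those in part (e) are what plot() produces for an lm object by default. The sketch below uses a simulated one-predictor model with hypothetical names x and y.

```r
set.seed(9)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)
fit <- lm(y ~ x, data = toy)

op <- par(mfrow = c(2, 2))  # 2 x 2 grid for the four panels
plot(fit)  # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(op)    # restore the previous graphics settings
```

These panels address linearity, normality of errors, constant variance and influential points. Independence of observations cannot be read from them and has to be judged from how the data were collected, for instance whether several children come from the same family.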
Assignment total: 40 marks 
