Exercise 1: Hospital SUPPORT Data
For this exercise, we will use the data stored in hospital.csv on Canvas. It contains a random sample of 580 seriously ill hospitalized patients from a famous study called “SUPPORT” (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:
• Days - Day to death or hospital discharge
• Age - Age on day of hospital admission
• Sex - Female or male
• Comorbidity - Patient diagnosed with more than one chronic disease
• EdYears - Years of education
• Education - Education level; high or low
• Income - Income level; high or low
• Charges - Hospital charges, in dollars
• Care - Level of care required; high or low
• Race - Non-white or white
• Pressure - Blood pressure, in mmHg
• Blood - White blood cell count, in gm/dL
• Rate - Heart rate, in bpm
part a
For this part, first fit an additive multiple regression model (that is, one without any interaction terms) predicting Charges from Age, EdYears, Pressure, Days, and Care. Report the summary of this model.
# Use this code chunk for your answer.
setwd("~/Desktop/data")
hospital = read.csv("hospital.csv")
lm1 = lm(Charges ~ Age + EdYears + Pressure + Days + Care, data = hospital)
summary(lm1)
##
## Call:
## lm(formula = Charges ~ Age + EdYears + Pressure + Days + Care,
## data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -177021 -26315 -5311 5149 435820
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44960.9 15090.3 2.979 0.00301 **
## Age -187.4 147.0 -1.275 0.20293
## EdYears        1637.1      623.3   2.627  0.00886 **
## Pressure       -163.8       88.1  -1.860  0.06341 .
## Days           2132.4      106.7  19.990  < 2e-16 ***
## Carelow      -42472.9     4862.7  -8.734  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54750 on 574 degrees of freedom
## Multiple R-squared: 0.5397, Adjusted R-squared: 0.5357
## F-statistic: 134.6 on 5 and 574 DF, p-value: < 2.2e-16
summary(lm1$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -177021 -26315 -5311 0 5149 435820
part b
Provide an interpretation for the intercept, the slope coefficient fitted for the Days variable, and the slope coefficient fitted for the Care variable.
Answer: Intercept: I expect an average estimated charge of 44960.90 dollars for an individual who is 0 years old, has 0 years of education and a blood pressure of 0, stays in the hospital for 0 days to death or hospital discharge, and requires a high level of care.
Days: For each additional day to death or hospital discharge an individual stays, I expect the estimated charge to increase by 2132.40 dollars on average, holding age, education, blood pressure, and level of care constant.
Care: I expect individuals who require a low level of care to be charged an estimated 42472.90 dollars less than those who require a high level of care, holding age, education, blood pressure, and days to death or hospital discharge constant.
part c
What is the largest and the smallest residual from this model? From the summary of the model, does it seem like the distribution of residuals is roughly symmetric?
Answer: The smallest residual is -177021 and the largest is 435820. From the output, the distribution of residuals does not appear to be roughly symmetric, as there are more, and larger, positive residuals compared to negative residuals.
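As a quick numeric sketch, symmetry can be checked by comparing the magnitudes of matching quantiles. The values below are hardcoded from the residual summary reported above, not recomputed from the data:

```r
# Residual summary values copied from summary(lm1$residuals) above
res_min <- -177021; res_q1 <- -26315; res_med <- -5311
res_q3 <- 5149; res_max <- 435820

# For a roughly symmetric distribution we would expect |min| ~ max and
# |1Q| ~ 3Q; here the right tail is far longer than the left.
abs(res_min) / res_max  # about 0.41: the max is ~2.5x the magnitude of the min
abs(res_q1) / res_q3    # about 5.1: the lower quartile sits ~5x farther from 0
```

The mean residual is 0 by construction, so the clearly negative median is another sign of right skew.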
part d
Generate the default plots in R. Then, interpret each of these plots.
# Use this code chunk for your answer.
plot(lm1)
[Figure: the four default diagnostic plots for lm1 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage — for lm(Charges ~ Age + EdYears + Pressure + Days + Care).]
Exercise 2: Model Selection in Hospital Support Data
part a
Fit a multiple regression model with Charges as the response. Use the main effects of Age, EdYears, Pressure, and Days, along with all second, third, and fourth order interaction terms. For this model, you will not use the Care variable that you used in Exercise 1. Report the summary of this model. How many coefficients are included in this model? How many of these coefficients are significantly different from 0?
# Use this code chunk for your answer.
lm2 = lm(Charges ~ Age * EdYears * Pressure * Days, data = hospital)
summary(lm2)
##
## Call:
## lm(formula = Charges ~ Age * EdYears * Pressure * Days, data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -141516 -21298 -11751 3315 447916
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)               -2.263e+05  1.995e+05  -1.134    0.257
## Age                        2.402e+03  2.839e+03   0.846    0.398
## EdYears                    1.608e+04  1.587e+04   1.013    0.312
## Pressure                   2.236e+03  2.199e+03   1.017    0.310
## Days                       5.046e+03  7.246e+03   0.696    0.486
## Age:EdYears               -1.323e+02  2.281e+02  -0.580    0.562
## Age:Pressure              -2.349e+01  3.142e+01  -0.748    0.455
## EdYears:Pressure          -1.454e+02  1.774e+02  -0.820    0.413
## Age:Days                   1.014e+01  1.023e+02   0.099    0.921
## EdYears:Days               2.701e+02  6.004e+02   0.450    0.653
## Pressure:Days             -1.976e+01  6.960e+01  -0.284    0.777
## Age:EdYears:Pressure       1.135e+00  2.559e+00   0.443    0.658
## Age:EdYears:Days          -8.025e+00  8.746e+00  -0.918    0.359
## Age:Pressure:Days         -2.969e-01  9.731e-01  -0.305    0.760
## EdYears:Pressure:Days     -2.552e+00  5.878e+00  -0.434    0.664
## Age:EdYears:Pressure:Days  8.879e-02  8.540e-02   1.040    0.299
##
## Residual standard error: 56950 on 564 degrees of freedom
## Multiple R-squared: 0.5108, Adjusted R-squared: 0.4978
## F-statistic: 39.26 on 15 and 564 DF, p-value: < 2.2e-16
part b
Perform model selection using a forward searching mechanism, BIC as the metric, and the full interaction model from part a as the scope of the search. Make sure to define n in your code before using the model searching method.
# Use this code chunk for your answer.
null_model = lm(Charges ~ 1, data = hospital)
n = 580
bic_measure = step(null_model, scope = Charges ~ Age * EdYears * Pressure * Days,
direction = 'forward', k = log(n))
## Start: AIC=13106.64
## Charges ~ 1
##
## Df Sum of Sq RSS AIC
## + Days 1 1.7141e+12 2.0244e+12 12757
## + Age 1 6.9087e+10 3.6694e+12 13102
## <none> 3.7385e+12 13107
## + EdYears 1 2.6150e+10 3.7124e+12 13109
## + Pressure 1 4.8299e+08 3.7380e+12 13113
##
## Step: AIC=12757.22
## Charges ~ Days
##
## Df Sum of Sq RSS AIC
## + EdYears 1 3.1660e+10 1.9927e+12 12754
## + Pressure 1 2.6739e+10 1.9977e+12 12756
## + Age 1 2.5451e+10 1.9989e+12 12756
## <none> 2.0244e+12 12757
##
## Step: AIC=12754.44
## Charges ~ Days + EdYears
##
## Df Sum of Sq RSS AIC
## + Pressure 1 2.3833e+10 1.9689e+12 12754
## <none> 1.9927e+12 12754
## + Age 1 1.8243e+10 1.9745e+12 12756
## + EdYears:Days 1 1.5287e+10 1.9775e+12 12756
##
## Step: AIC=12753.83
## Charges ~ Days + EdYears + Pressure
##
## Df Sum of Sq RSS AIC
## + Pressure:Days 1 4.8807e+10 1.9201e+12 12746
## <none> 1.9689e+12 12754
## + Age 1 1.9278e+10 1.9496e+12 12754
## + EdYears:Days 1 1.5128e+10 1.9538e+12 12756
## + EdYears:Pressure 1 7.5542e+07 1.9688e+12 12760
##
## Step: AIC=12745.63
## Charges ~ Days + EdYears + Pressure + Days:Pressure
##
## Df Sum of Sq RSS AIC
## + Age 1 2.2773e+10 1.8973e+12 12745
## <none> 1.9201e+12 12746
## + EdYears:Days 1 5.4562e+09 1.9146e+12 12750
## + EdYears:Pressure 1 8.5093e+08 1.9192e+12 12752
##
## Step: AIC=12745.07
## Charges ~ Days + EdYears + Pressure + Age + Days:Pressure
##
## Df Sum of Sq RSS AIC
## <none> 1.8973e+12 12745
## + Age:EdYears 1 4035858491 1.8933e+12 12750 ## + EdYears:Days 1 3544558406 1.8938e+12 12750 ## + Age:Days 1 2889007407 1.8944e+12 12751 ## + Age:Pressure 1 1277572323 1.8960e+12 12751 ## + EdYears:Pressure 1 1009563696 1.8963e+12 12751
final_model = lm(Charges ~ Days + EdYears + Pressure + Age + Days:Pressure, data = hospital)
summary(final_model)
##
## Call:
## lm(formula = Charges ~ Days + EdYears + Pressure + Age + Days:Pressure,
## data = hospital)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158287 -23119 -12661 3052 449983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13191.669 16792.404 0.786 0.4324
## Days 3575.876 309.656 11.548 < 2e-16 ***
## EdYears 1689.961 654.428 2.582 0.0101 *
## Pressure 7.306 112.673 0.065 0.9483
## Age -401.751 153.059 -2.625 0.0089 **
## Days:Pressure -11.558 2.906 -3.978 7.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57490 on 574 degrees of freedom
## Multiple R-squared: 0.4925, Adjusted R-squared: 0.4881
## F-statistic: 111.4 on 5 and 574 DF, p-value: < 2.2e-16
part c
Report the predictor variables for the model chosen after the first step.
part d
Report the final fitted model chosen by model selection; write out the full fitted model with all coefficients.
part e
The final model selected in part d should have a total of six coefficients, including the intercept. Interpret the two coefficients pertaining to Days. Then, calculate the slopes for Days for an individual whose Pressure is 115 and again for an individual whose Pressure is 67.
# Use this code chunk for your answer.
# slope for Days = coefficient on Days + coefficient on Days:Pressure * Pressure
3575.876 + (-11.558 * 67)
## [1] 2801.49
3575.876 + (-11.558 * 115)
## [1] 2246.706
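The same slopes can be computed as a small function. This sketch hardcodes the two relevant coefficients from the summary(final_model) output above rather than extracting them from the fitted object:

```r
# Coefficients taken from the summary(final_model) output above
b_days <- 3575.876          # main effect of Days
b_days_pressure <- -11.558  # Days:Pressure interaction

# Slope of Days at a given blood pressure:
# d(Charges)/d(Days) = b_days + b_days_pressure * Pressure
days_slope <- function(pressure) b_days + b_days_pressure * pressure

days_slope(67)   # 2801.490
days_slope(115)  # 2246.706
```

In a live session, the more robust version would be `coef(final_model)["Days"] + coef(final_model)["Days:Pressure"] * pressure`, which survives refitting the model.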
part f
Suppose that we meant to use a backward searching mechanism for part d. Without actually running any code, what would be the variable removed from the model in the first step?
Exercise 3: Hospital SUPPORT Data: Unusual Observations
We’ll continue exploring the hospital dataset in this question. We’ll use a model with Charges as the response, and with predictors of Days, EdYears, Pressure, Age, and the interaction of Days with Pressure.
part a
Calculate the leverages for each observation in the dataset. Print only those leverages that are above the threshold defined in the lecture. After looking through these leverages by eye, the leverages for what specific observations, if any, appear to be especially large? Make a histogram of all leverages for the dataset.
# Use this code chunk for your answer.
hatvalues = hatvalues(final_model)
hatvalues[hatvalues > (2 * 6/580)]
## 10 15 26 111 116 118 130
## 0.04634381 0.02640902 0.02772571 0.03715324 0.02587398 0.02102872 0.02762897
## 141 191 198 204 205 207 224
## 0.04295369 0.26713797 0.02441086 0.08683177 0.04311469 0.03044227 0.14264613
## 249 252 257 317 327 368 402
## 0.04864141 0.12447852 0.03974212 0.02069030 0.06980435 0.27460392 0.28258640
## 407 423 443 467 479 499 511
## 0.03130539 0.02375375 0.03807016 0.02449678 0.06011985 0.02118676 0.02147177
# big leverage = 10, 141, 204, 205, 224, 249, 257, 327, 402, 479
# biggest leverage = 191, 402, 252, 224
hist(hatvalues)
[Figure: histogram of the leverages (hatvalues).]
part b
Calculate the standardized residuals for each observation in the dataset. Generate a histogram of all standardized residuals for the dataset. Then, print only those standardized residuals that have a magnitude greater than 2. What observations, if any, were identified as having both a large leverage in part a and as having a high standardized residual in this part? Did you define any of these observations as having an especially large leverage?
# Use this code chunk for your answer.
print(rstandard(final_model)[abs(rstandard(final_model)) > 2])
## 2 3 8 14 16 23 24 26
## 3.841903 4.095629 2.391056 3.674166 4.682015 2.075751 4.062163 3.206343
## 34 35 38 39 53 56 58 66
## 2.880794 6.258869 3.781971 5.782725 5.780338 2.993832 7.848393 2.421559
## 67 73 74 75 77 80 116 141
## 5.074611 3.594462 2.871110 3.934032 2.496430 2.372697 -2.077392 -2.805432
## 191 204 218 333 351 368
## -3.216032 2.059669 3.528306 2.097318 3.552764 -2.474411
hist(rstandard(final_model))
[Figure: histogram of the standardized residuals, rstandard(final_model).]
part c
Calculate the Cook’s distance for each observation in the dataset. Print only those observations that are above the threshold defined in lecture. After looking through these Cook’s distances by eye, the Cook’s distance for what specific observations, if any, appear to be especially large? Finally, what is Cook’s distance used to measure?
# Use this code chunk for your answer.
cooks.distance(final_model)[cooks.distance(final_model) > (4/580)]
## 2 3 14 16 23 24
## 0.040024463 0.026616539 0.020896820 0.070609267 0.012838702 0.010341465
## 26 34 35 38 39 53
## 0.048861008 0.014618233 0.078228116 0.013923893 0.031448957 0.074117884
## 56 58 66 67 74 75
## 0.009422662 0.056850484 0.007970262 0.023146989 0.015951410 0.011472095
## 77 116 141 191 204 218
## 0.011965780 0.019104409 0.058872958 0.628351971 0.067231322 0.018476456
## 224 230 252 305 327 333
## 0.046861256 0.007125327 0.009141393 0.008505841 0.028897794 0.009041465
## 351 368 479
## 0.014652282 0.386299268 0.038422047
part d
In order to assess the fit of this model, calculate the value of the RMSE using leave-one-out cross validation.
# Use this code chunk for your answer.
sqrt(mean((resid(final_model)/(1 - hatvalues(final_model)))^2))
## [1] 58364.68
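The one-liner above relies on the linear-model identity that the leave-one-out residual equals e_i / (1 - h_i), so no refitting is needed. As a sanity check, this sketch (on simulated data, not the hospital data) shows the leverage shortcut agrees with an explicit leave-one-out loop:

```r
# Simulated data for demonstration only
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

# Shortcut: LOO residual for observation i is e_i / (1 - h_i)
rmse_shortcut <- sqrt(mean((resid(fit) / (1 - hatvalues(fit)))^2))

# Explicit loop: refit without observation i, then predict it
loo_errors <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y[-i] ~ x[-i])
  y[i] - (coef(fit_i)[1] + coef(fit_i)[2] * x[i])
})
rmse_loop <- sqrt(mean(loo_errors^2))

all.equal(rmse_shortcut, rmse_loop)  # TRUE
```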
Exercise 4: Scottish Hill Races
For this last exercise, we’ll use the races.table dataset that includes information on record-winning times for 35 hill races in Scotland, as reported by Atkinson (1986). The additional variables record the overall distance travelled and the height climbed in the race. Below, we are reading in the data from an online source. We do correct one error reported by Atkinson before beginning our analysis.
Source: Atkinson, A. C. (1986). Comment: Aspects of diagnostic regression analysis (discussion of paper by Chatterjee and Hadi). Statistical Science, 1, 397-402.
url = 'http://www.statsci.org/data/general/hills.txt'
races.table = read.table(url, header=TRUE, sep='\t')
races.table[18,4] = 18.65
head(races.table)
## Race Distance Climb Time
## 1 Greenmantle 2.5 650 16.083
## 2 Carnethy 6.0 2500 48.350
## 3 CraigDunain 6.0 900 33.650
## 4 BenRha 7.5 800 45.600
## 5 BenLomond 8.0 3070 62.267
## 6 Goatfell 8.0 2866 73.217
part a
Create a scatterplot matrix of the quantitative variables contained in the races.table dataset. Interpret this scatterplot matrix. What variable do you think will be more important in predicting the record time of that race?
# Use this code chunk for your answer.
for_matrix1 = races.table[,2:4]
pairs(for_matrix1)
part b
Fit a multiple regression model predicting the record time of a race from the distance travelled, the height climbed, and an interaction of the two variables. Report the summary of the model. What is the R² for this model? What does this suggest about the strength of the model?
# Use this code chunk for your answer.
lm3 = lm(Time ~ Distance * Climb, data = races.table)
summary(lm3)
##
## Call:
## lm(formula = Time ~ Distance * Climb, data = races.table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.3078 -2.8309 0.7048 2.2312 18.9270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3532285 3.9121887 -0.090 0.928638
## Distance 4.9290166 0.4750168 10.377 1.32e-11 ***
## Climb 0.0035217 0.0023686 1.487 0.147156
## Distance:Climb 0.0006731 0.0001746 3.856 0.000545 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.35 on 31 degrees of freedom
## Multiple R-squared: 0.9806, Adjusted R-squared: 0.9787
## F-statistic: 521.1 on 3 and 31 DF, p-value: < 2.2e-16
part c
Identify any influential points as defined in the lecture. Which of these observations, if any, are especially influential based on their values? For these influential points, do they have high leverage, high standardized residual, both, or neither?
# Use this code chunk for your answer.
cooks.distance(lm3)[cooks.distance(lm3) > (4/35)]
## 7 11 35
## 3.758307 2.704165 1.805942
# 7, 11, 35
hatvalues(lm3)[hatvalues(lm3) > (8/35)]
## 7 11 33 35
## 0.5207512 0.7182517 0.2379383 0.3261854
(rstandard(lm3)[abs(rstandard(lm3)) > 2])
## 7 11 35
## 3.719559 2.059866 -3.862957
part d
Refit the model from part b without any points that you identified as influential. Note: this is not something that we should automatically do, but we will do it for now as a demonstration of how much our model may be affected by these points! Print the coefficients for this model. How do they compare to the coefficients from the model in part b?
Hint: Create a subset of your data that only includes those points that are not influential before fitting your data.
# Use this code chunk for your answer.
remove_obs = races.table[-c(7, 11, 35),]
lm4 = lm(Time ~ Distance * Climb, data = remove_obs)
summary(lm4)
##
## Call:
## lm(formula = Time ~ Distance * Climb, data = remove_obs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1574 -2.7089 0.3387 2.2074 10.3180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6141193 3.3490412 0.183 0.855828
## Distance 5.1003107 0.6078802 8.390 3.98e-09 ***
## Climb 0.0018117 0.0017531 1.033 0.310273
## Distance:Climb 0.0007105 0.0001663 4.272 0.000202 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.877 on 28 degrees of freedom
## Multiple R-squared: 0.9778, Adjusted R-squared: 0.9754
## F-statistic: 411.1 on 3 and 28 DF, p-value: < 2.2e-16
part e
How much does this updated model affect our actual predictions for the response? Let’s create a scatterplot that compares our fitted values from our original model to those from our newer model (influential points removed).
Calculate and save each of the fitted values (for the original model and for the newer model) to their own named object in R. Note: If you are using the predict function, you can supply as an argument newdata = races.table since we will use all of the variables and all of the data.
Then, create a dataframe in R by providing your two named objects with fitted values as two arguments inside the data.frame function, and save the result to a new named object in R.
Now, create a scatterplot to compare the fitted values for each model. Include an appropriate title and axes labels. All other formatting is optional and up to you!
It might be helpful to add a line with intercept 0 and slope 1 to represent what perfect matching would look like.
Finally, briefly comment on what this plot reveals. Would you say there are big differences in the predictions made by each model, or would you say the predictions by each model are quite similar? Is this what you would expect from the results in part c?
# Use this code chunk for your answer.
old_fitted = fitted(lm3)
old_fitted = old_fitted[-c(7, 11, 35)]
new_fitted = fitted(lm4)
df = data.frame(old_fitted, new_fitted)
library(ggplot2)
ggplot(df, aes(x = old_fitted, y = new_fitted)) +
  geom_point() +
  labs(x = 'Old Fitted Value', y = 'New Fitted Value',
       title = 'Comparison of Fitted Values With and Without Influential Points') +
  geom_abline(slope = 1, intercept = 0, color = 'blue')
[Figure: scatterplot comparing the fitted values from the two models, with a y = x reference line.]
Exercise 5: Formatting
The last five points of the assignment will be earned for properly formatting your final document. Check that you have:
• included your name on the document
• properly assigned pages to exercises on Gradescope
• selected page 1 (with your name) and this page for this exercise (Exercise 5)
• all code is printed and readable for each question
• all output is printed
• generated a pdf file