Exercise 1: Hockey Goalies
We will use the data stored in goalies.csv, which contains career data for 462 players in the National
Hockey League who played goaltender at some point up to and including the 2014-2015 season. The variables in this dataset are:
• W - Wins
• GA - Goals Against
• SA - Shots Against
• SV - Saves
• SV_PCT - Save Percentage
• GAA - Goals Against Average
• SO - Shutouts
• MIN - Minutes
• PIM - Penalties in Minutes
part a
Read in the data. Then fit the following multiple linear regression model in R. Save the model to a name and run a summary of the model.
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i.$$
Here,
• Yi is W (Wins)
• xi1 is GAA (Goals Against Average)
• xi2 is SV_PCT (Save Percentage)
• xi3 is MIN (Minutes)
# Use this code chunk for your answer.
setwd("~/Desktop/data")
goalies = read.csv("goalies.csv")
lm1 = lm(W ~ GAA + SV_PCT + MIN, data = goalies)
summary(lm1)
##
## Call:
## lm(formula = W ~ GAA + SV_PCT + MIN, data = goalies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.527 -4.948 1.923 4.831 98.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.358e+01 2.206e+01 -0.616 0.538
## GAA -5.822e-01 6.384e-01 -0.912 0.362
## SV_PCT 1.269e+01 2.313e+01 0.549 0.584
## MIN 7.998e-03 6.113e-05 130.844 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.67 on 458 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9744
## F-statistic: 5855 on 3 and 458 DF, p-value: < 2.2e-16
part b
Use an F-test to test the significance of the regression.
Report the following:
• The null and alternative hypotheses
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.01
# Use this code chunk for your answer, as needed.
intercept_model = lm(W ~ 1, data = goalies)
anova(intercept_model, lm1)
## Analysis of Variance Table
##
## Model 1: W ~ 1
## Model 2: W ~ GAA + SV_PCT + MIN
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 461 5008654
## 2 458 127285 3 4881368 5854.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
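The test of overall significance has null hypothesis H0: β1 = β2 = β3 = 0 against the alternative that at least one slope is nonzero. As a rough cross-check on the ANOVA table, the F statistic can also be recovered from the reported R-squared. The sketch below plugs in rounded values copied from the summary output, so it only approximately reproduces 5854.7; the variable names are illustrative.

```r
# Rounded values copied from the summary(lm1) output.
r_squared <- 0.9746
k <- 3           # number of predictors
df_resid <- 458  # n - k - 1 = 462 - 3 - 1

# Overall F: explained variation per predictor divided by
# unexplained variation per residual degree of freedom.
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Upper-tail p-value from the F(3, 458) distribution.
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

alpha <- 0.01
reject_null <- p_value < alpha
print(c(f_stat = f_stat, p_value = p_value))
print(reject_null)
```

Because R-squared is rounded to four digits, the result differs slightly from the table value, but the decision at α = 0.01 is unaffected.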
part c
Consider this statement: “Since the F-test result gives a very low p-value, then we can conclude that knowing the goals against average, save percentage, and minutes of an NHL goalie allows you to make a highly accurate prediction of that goalie’s wins.” Do you think this is a good conclusion to draw, or not? Explain your answer.
part d
Use your model to predict the number of Wins for famous NHL goalie Tony Esposito, who has 2.93 Goals Against Average, 0.906 Save Percentage, and 52476 Minutes.
# Use this code chunk for your answer.
predict(lm1, data.frame('GAA' = 2.93, 'SV_PCT' = 0.906,
'MIN' = 52476))
## 1
## 415.9203
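To see what predict() is doing here, the same point estimate can be assembled by hand from the fitted coefficients. This is a minimal sketch that uses the rounded estimates printed in the summary, so it matches 415.9203 only to about two decimal places.

```r
# Rounded coefficient estimates copied from summary(lm1).
beta <- c(intercept = -13.58, GAA = -0.5822, SV_PCT = 12.69, MIN = 0.007998)

# Tony Esposito's values from the exercise, with a leading 1 for the intercept.
x <- c(1, 2.93, 0.906, 52476)

# The fitted value is the sum of coefficient-times-predictor terms.
predicted_wins <- sum(beta * x)
print(predicted_wins)
```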
part e
Point estimates may have some error, so let’s instead create an interval for wins that should contain the true wins of a goalie with these stats 90% of the time.
Create (and print) an interval to estimate the wins of a goalie with Tony Esposito’s stats with 90% confidence.
# Use this code chunk for your answer.
predict(lm1, data.frame('GAA' = 2.93, 'SV_PCT' = 0.906,
'MIN' = 52476),
interval = 'prediction', level = 0.9)
## fit lwr upr
## 1 415.9203 388.0814 443.7591
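A prediction interval has the form fit ± t × sqrt(s_e² + se_fit²), where se_fit is the standard error of the estimated mean response. The sketch below works backward from the printed interval to show how much of its width comes from residual noise and how much from uncertainty in the fitted mean. All inputs are copied from the output above; the variable names are illustrative.

```r
# Values copied from the predict() and summary() output.
fit <- 415.9203
lwr <- 388.0814
upr <- 443.7591
df_resid <- 458
s_e <- 16.67

# Critical value for a two-sided 90% interval.
t_crit <- qt(0.95, df = df_resid)

# Standard error of a new observation implied by the interval width.
half_width <- (upr - lwr) / 2
se_pred <- half_width / t_crit

# Back out the standard error of the fitted mean.
se_fit <- sqrt(se_pred^2 - s_e^2)
print(c(t_crit = t_crit, se_pred = se_pred, se_fit = se_fit))
```

The implied se_pred is only slightly larger than s_e, which suggests that most of the interval's width comes from goalie-to-goalie variation rather than from uncertainty in the coefficients.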
part f
Calculate the standard deviation $s_y$ for the observed values of the Wins variable. Report the value of $s_e$ from your multiple regression model.
Briefly interpret what each measure represents.
Do these two measures together communicate anything about the strength of this model? Hint: think about how each of these values is related to our SS terms from the semester.
# Use this code chunk for your answer.
s_y = sd(goalies$W)
s_y
## [1] 104.2342
sst = (s_y)^2 * (nrow(goalies) - 1) # 461 = 462 - 1
summary(lm1)
##
## Call:
## lm(formula = W ~ GAA + SV_PCT + MIN, data = goalies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.527 -4.948 1.923 4.831 98.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.358e+01 2.206e+01 -0.616 0.538
## GAA -5.822e-01 6.384e-01 -0.912 0.362
## SV_PCT 1.269e+01 2.313e+01 0.549 0.584
## MIN 7.998e-03 6.113e-05 130.844 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.67 on 458 degrees of freedom
## Multiple R-squared: 0.9746, Adjusted R-squared: 0.9744
## F-statistic: 5855 on 3 and 458 DF, p-value: < 2.2e-16
s_e = 16.67
s_e
## [1] 16.67
mse = s_e^2
mse
## [1] 277.8889
sse = mse * 458 # 458 = n - p
sse
## [1] 127273.1
1 - sse/sst
## [1] 0.9745894
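The link between the two measures and the sums of squares can be written out explicitly. With n = 462 observations and p = 4 estimated coefficients, the standard identities are sketched below; the numerical value uses the rounded $s_y$ and $s_e$ reported in this part.

```latex
\begin{aligned}
s_y^2 &= \frac{SST}{n-1}, \qquad s_e^2 = \frac{SSE}{n-p}, \\
R^2 &= 1 - \frac{SSE}{SST}
     = 1 - \frac{(n-p)\, s_e^2}{(n-1)\, s_y^2}
     = 1 - \frac{458 \times 16.67^2}{461 \times 104.2342^2}
     \approx 0.9746 .
\end{aligned}
```

Since $s_e$ is much smaller than $s_y$, the typical prediction error after using the model is far smaller than the typical deviation of Wins from its mean, which is another way of saying that $R^2$ is close to 1.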
Exercise 2: Hockey Goalies, Testing
We will consider four models, each with Wins as the response. The predictors for these models are:
• Model 1: Goals Against, Saves
• Model 2: Shots Against, Minutes, Shutouts
• Model 3: Goals Against, Saves, Shots Against, Minutes, Shutouts
• Model 4: All Available Variables
part a
An F-test allows us to compare two models. An F-test will not provide interpretable results for one set of two models. Which set is it?
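An F-test comparison is only interpretable when one model's predictors are a subset of the other's, that is, when the models are nested. One way to check every pair is a simple set-containment test. The sketch below uses the column abbreviations from the variable list in Exercise 1; the helper `is_nested` is a hypothetical name introduced for illustration.

```r
# Predictor sets for the four candidate models.
models <- list(
  model_1 = c("GA", "SV"),
  model_2 = c("SA", "MIN", "SO"),
  model_3 = c("GA", "SV", "SA", "MIN", "SO"),
  model_4 = c("GA", "SA", "SV", "SV_PCT", "GAA", "SO", "MIN", "PIM")
)

# Two models are nested if one predictor set is contained in the other.
is_nested <- function(a, b) all(a %in% b) || all(b %in% a)

# Check every pair of models.
pairs <- combn(names(models), 2)
nested <- apply(pairs, 2, function(p) is_nested(models[[p[1]]], models[[p[2]]]))
names(nested) <- apply(pairs, 2, paste, collapse = " vs ")
print(nested)
```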
part b
Use an F-test to compare Models 2 and 3. Report the following:
• The null hypothesis (you can write this in words or symbols)
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.01
• Your model preference (given this test result).
# Use this code chunk for your answer.
model_2 = lm(data = goalies,
W ~ SA + MIN + SO)
model_3 = lm(data = goalies,
W ~ SA + MIN + SO + SV + GA)
anova(model_2, model_3)
## Analysis of Variance Table
##
## Model 1: W ~ SA + MIN + SO
## Model 2: W ~ SA + MIN + SO + SV + GA
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 458 84129
## 2 456 72899 2 11230 35.124 6.496e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
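The null hypothesis here is that the coefficients on the two extra predictors in Model 3 (SV and GA) are both zero. The F statistic in the table can be reproduced from the two residual sums of squares and their degrees of freedom. This is a minimal sketch using the rounded values printed above.

```r
# Values copied from the anova() table.
rss_reduced <- 84129  # Model 2: W ~ SA + MIN + SO
df_reduced <- 458
rss_full <- 72899     # Model 3: adds SV and GA
df_full <- 456

# Number of coefficients being tested.
q <- df_reduced - df_full

# Partial F: drop in RSS per added predictor, relative to the full model's MSE.
f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)
print(c(f_stat = f_stat, p_value = p_value))
```

The p-value is far below α = 0.01, so the test favors the larger model.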
part c
Use a t-test to test if the variable Minutes (MIN) has a linear relationship with Wins after accounting for all other predictors in the dataset. In other words, test $H_0: \beta_{\text{MIN}} = 0$ vs. $H_1: \beta_{\text{MIN}} \neq 0$ for a specific model (which model is it?). Report the following:
• The value of the test statistic
• The p-value of the test
• A statistical decision at α = 0.05
# Use this code chunk for your answer.
model_t_test = lm(data = goalies,
W ~ GA + SA + SV + SV_PCT + GAA + SO + MIN + PIM)
summary(model_t_test)
##
## Call:
## lm(formula = W ~ GA + SA + SV + SV_PCT + GAA + SO + MIN + PIM,
## data = goalies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.204 -3.126 0.935 2.835 64.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2651619 16.8181423 0.313 0.754376
## GA -0.1132805 0.0148085 -7.650 1.22e-13 ***
## SA 0.0516385 0.0135565 3.809 0.000159 ***
## SV -0.0582151 0.0150905 -3.858 0.000131 ***
## SV_PCT -8.0475191 17.6600154 -0.456 0.648830
## GAA -0.0496006 0.4821957 -0.103 0.918116
## SO 0.4599359 0.1989567 2.312 0.021240 *
## MIN 0.0131790 0.0009504 13.867 < 2e-16 ***
## PIM 0.0468422 0.0136373 3.435 0.000647 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.52 on 453 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9856
## F-statistic: 3938 on 8 and 453 DF, p-value: < 2.2e-16
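The t statistic for MIN is the estimate divided by its standard error, compared against a t distribution with the model's residual degrees of freedom. The sketch below reproduces the MIN row of the table from the printed values.

```r
# Values copied from the MIN row of summary(model_t_test).
estimate <- 0.0131790
std_error <- 0.0009504
df_resid <- 453

# Test statistic for H0: beta_MIN = 0.
t_stat <- estimate / std_error

# Two-sided p-value.
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

alpha <- 0.05
reject_null <- p_value < alpha
print(c(t_stat = t_stat, p_value = p_value))
print(reject_null)
```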
Exercise 3: Model Selection by Hand
Using the goalies dataset, we’ll perform model selection by hand. We would like to choose a model to predict the number of wins from the other variables in the dataset.
part a
We’ll perform model selection in this exercise “by hand”. That means you should not use the step function in R for this exercise; if you do, you will not receive credit. We will use a backward searching process and will use the coefficient p-values to determine which variables to remove from the model, with an α of 0.01.
Show the starting model and any subsequent models fit during your searching process here.
# Use this code chunk for your answer.
summary(model_t_test)
##
## Call:
## lm(formula = W ~ GA + SA + SV + SV_PCT + GAA + SO + MIN + PIM,
## data = goalies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.204 -3.126 0.935 2.835 64.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2651619 16.8181423 0.313 0.754376
## GA -0.1132805 0.0148085 -7.650 1.22e-13 ***
## SA 0.0516385 0.0135565 3.809 0.000159 ***
## SV -0.0582151 0.0150905 -3.858 0.000131 ***
## SV_PCT -8.0475191 17.6600154 -0.456 0.648830
## GAA -0.0496006 0.4821957 -0.103 0.918116
## SO 0.4599359 0.1989567 2.312 0.021240 *
## MIN 0.0131790 0.0009504 13.867 < 2e-16 ***
## PIM 0.0468422 0.0136373 3.435 0.000647 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.52 on 453 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9856
## F-statistic: 3938 on 8 and 453 DF, p-value: < 2.2e-16
step1 = lm(data = goalies,
W ~ GA + SA + SV + SO + MIN + PIM + SV_PCT)
summary(step1)
##
## Call:
## lm(formula = W ~ GA + SA + SV + SO + MIN + PIM + SV_PCT, data = goalies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.201 -3.110 0.936 2.796 64.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.958596 11.011238 0.360 0.719384
## GA -0.113379 0.014761 -7.681 9.8e-14 ***
## SA 0.051681 0.013535 3.818 0.000153 ***
## SV -0.058266 0.015066 -3.867 0.000126 ***
## SO 0.459474 0.198689 2.313 0.021195 *
## MIN 0.013186 0.000947 13.924 < 2e-16 ***
## PIM 0.046831 0.013622 3.438 0.000640 ***
## SV_PCT -6.759465 12.439442 -0.543 0.587128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.51 on 454 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9856
## F-statistic: 4511 on 7 and 454 DF, p-value: < 2.2e-16
step2 = lm(data = goalies,
W ~ GA + SA + SV + SO + MIN + PIM)
summary(step2)
##
## Call:
## lm(formula = W ~ GA + SA + SV + SO + MIN + PIM, data = goalies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.206 -3.067 1.187 2.696 64.059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0111736 0.7420675 -2.710 0.006978 **
## GA -0.1129541 0.0147289 -7.669 1.06e-13 ***
## SA 0.0520814 0.0135049 3.856 0.000132 ***
## SV -0.0587246 0.0150306 -3.907 0.000108 ***
## SO 0.4655961 0.1982159 2.349 0.019254 *
## MIN 0.0131616 0.0009452 13.925 < 2e-16 ***
## PIM 0.0469398 0.0136100 3.449 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.5 on 455 degrees of freedom
## Multiple R-squared: 0.9858, Adjusted R-squared: 0.9856
## F-statistic: 5271 on 6 and 455 DF, p-value: < 2.2e-16
step3 = lm(data = goalies,
W ~ GA + SA + SV + MIN + PIM)
summary(step3)
##
## Call:
## lm(formula = W ~ GA + SA + SV + MIN + PIM, data = goalies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.922 -3.546 1.294 2.737 63.656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0198636 0.7457250 -2.709 0.007011 **
## GA -0.1359994 0.0110400 -12.319 < 2e-16 ***
## SA 0.0512308 0.0135668 3.776 0.000180 ***
## SV -0.0581577 0.0151029 -3.851 0.000135 ***
## MIN 0.0148741 0.0006045 24.607 < 2e-16 ***
## PIM 0.0426871 0.0135557 3.149 0.001746 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.56 on 456 degrees of freedom
## Multiple R-squared: 0.9856, Adjusted R-squared: 0.9855
## F-statistic: 6262 on 5 and 456 DF, p-value: < 2.2e-16
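The by-hand search above follows a fixed rule: refit, find the predictor with the largest p-value, drop it if that p-value exceeds α, and stop once every remaining predictor is significant. For reference, that rule can be sketched as a loop. Since goalies.csv is not available inside a self-contained snippet, the example runs on R's built-in mtcars data as a stand-in; the function name `backward_by_pvalue` is hypothetical, and the sketch assumes all predictors are numeric columns.

```r
# Backward elimination driven by coefficient p-values (illustrative sketch).
backward_by_pvalue <- function(data, response, alpha = 0.01) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # p-values for the slopes only (row 1 is the intercept).
    pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    worst <- which.max(pvals[, 1])
    # Stop when the least significant predictor passes the threshold,
    # or when only one predictor is left.
    if (pvals[worst, 1] <= alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, rownames(pvals)[worst])
  }
  fit
}

# Stand-in data: predict mpg from the other mtcars columns.
final_fit <- backward_by_pvalue(mtcars, "mpg", alpha = 0.01)
print(formula(final_fit))
```

Applied to the goalies data with response W, the same rule would drop GAA, then SV_PCT, then SO, matching the sequence of models shown above.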
part b
Report the predictor variables included in your selected model from part a.
Report the fitted model for your selected model.
Exercise 4: Chick-fil-A Searching Methods
For this exercise, we’ll analyze the nutritional value of menu items from Chick-fil-A, a fast food restaurant specializing in chicken sandwiches. This data is contained in the chickfila.csv file on Canvas.
We’ll be interested in fitting a model to predict the Calories in a menu item from the other nutritional characteristics of that menu item.
part a
Read in the chickfila.csv data file. How many models predicting the number of Calories in a menu item are possible from this dataset? (Consider only first-order terms, which means include all of the variables once and exactly as they appear in the dataset.)
# Use this code chunk for your answer, as needed.
setwd("~/Desktop/data")
cfa = read.csv("chickfila.csv")
2^10
## [1] 1024
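The count 2^10 comes from each of the 10 predictors being either in or out of the model. An equivalent way to see it, sketched below, is to count the models of each size and add them up.

```r
# Number of candidate predictors in the dataset.
p <- 10

# Number of distinct models with exactly 0, 1, ..., 10 predictors.
models_by_size <- choose(p, 0:p)
names(models_by_size) <- 0:p
print(models_by_size)

# The total matches 2^p, counting the intercept-only model.
total_models <- sum(models_by_size)
print(total_models)
```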
part b
Perform model selection, using BIC as the metric and backward searching.
Report the predictor variables selected for the final model. No need to report the fitted coefficients.
# Use this code chunk for your answer.
start_model = lm(Calories ~ Fat + SatFat + TransFat + Cholesterol + Sodium + Carbs + Fiber + Sugar + Protein + Serving, data = cfa)
step(data = cfa, object = start_model ,
direction = 'backward', k = log(290))
## Start: AIC=1465.63
## Calories ~ Fat + SatFat + TransFat + Cholesterol + Sodium + Carbs +
## Fiber + Sugar + Protein + Serving
##
## Df Sum of Sq RSS AIC
## - TransFat 1 34 36667 1460.2
## - Cholesterol 1 207 36841 1461.6
## - Sugar 1 293 36926 1462.3
## <none> 36634 1465.6
## - Sodium 1 954 37588 1467.4
## - SatFat 1 1630 38263 1472.6
## - Serving 1 2241 38875 1477.2
## - Fiber 1 2904 39538 1482.1
## - Protein 1 173882 210515 1967.0
## - Carbs 1 465541 502175 2219.2
## - Fat 1 2334536 2371170 2669.3
##
## Step: AIC=1460.23
## Calories ~ Fat + SatFat + Cholesterol + Sodium + Carbs + Fiber +
## Sugar + Protein + Serving
##
## Df Sum of Sq RSS AIC
## - Cholesterol 1 201 36868 1456.1
## - Sugar 1 357 37024 1457.4
## <none> 36667 1460.2
## - Sodium 1 935 37602 1461.9
## - SatFat 1 2017 38684 1470.1
## - Serving 1 2262 38929 1471.9
## - Fiber 1 2871 39538 1476.4
## - Protein 1 180272 216940 1970.1
## - Carbs 1 492671 529339 2228.8
## - Fat 1 2380741 2417409 2669.2
##
## Step: AIC=1456.14
## Calories ~ Fat + SatFat + Sodium + Carbs + Fiber + Sugar + Protein +
## Serving
##
## Df Sum of Sq RSS AIC
## - Sugar 1 325 37193 1453.0
## <none> 36868 1456.1
## - Sodium 1 808 37676 1456.8
## - SatFat 1 1982 38850 1465.7
## - Serving 1 2301 39169 1468.0
## - Fiber 1 2694 39562 1470.9
## - Protein 1 233337 270205 2028.1
## - Carbs 1 493324 530192 2223.6
## - Fat 1 2562021 2598889 2684.6
##
## Step: AIC=1453.02
## Calories ~ Fat + SatFat + Sodium + Carbs + Fiber + Protein +
## Serving
##
## Df Sum of Sq RSS AIC
## <none> 37193 1453.0
## - Sodium 1 1131 38324 1456.0
## - Serving 1 2850 40043 1468.8
## - SatFat 1 3579 40772 1474.0
## - Fiber 1 4934 42127 1483.5
## - Protein 1 237877 275070 2027.6
## - Fat 1 2677776 2714969 2691.6
## - Carbs 1 8554823 8592016 3025.7
##
## Call:
## lm(formula = Calories ~ Fat + SatFat + Sodium + Carbs + Fiber +
## Protein + Serving, data = cfa)
##
## Coefficients:
## (Intercept) Fat SatFat Sodium Carbs Fiber
## 2.002934 8.664515 0.580707 0.006094 3.799963 0.857368
## Protein Serving
## 3.872489 -0.006290
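The column labeled AIC in the step() output is really BIC here, because k = log(290) replaces the default penalty of 2. For a linear model, extractAIC (which step() uses) computes n·log(RSS/n) + k·edf, where edf counts the intercept plus the slopes. The sketch below reproduces the final value 1453.02 from the printed RSS; it uses the rounded RSS, so the last digits may differ slightly.

```r
# Values taken from the final step of the backward search.
n <- 290      # number of menu items, as used in k = log(290)
rss <- 37193  # residual sum of squares of the selected model
edf <- 8      # 7 slopes + 1 intercept

# Criterion as computed by extractAIC() for lm objects, with k = log(n).
bic_step <- n * log(rss / n) + log(n) * edf
print(bic_step)
```

This scale differs from the value returned by BIC() by an additive constant that is the same for every model fit to the same data, so the ranking of models is unchanged.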
part c
Perform model selection, using BIC as the metric and forward searching.
Report the predictor variables selected for the model after the first step and for the final model. No need to report the fitted coefficients.
# Use this code chunk for your answer.
step(data = cfa, object = lm(Calories ~ 1, data = cfa),
scope = Calories ~ Fat + SatFat + TransFat + Cholesterol +
Sodium + Carbs + Fiber + Sugar + Protein + Serving, direction = 'forward', k = log(290))
## Start: AIC=3856.55
## Calories ~ 1
##
## Df Sum of Sq RSS AIC
## + Fat 1 146048680 23521053 3289.4
## + SatFat 1 144469314 25100419 3308.2
## + Sodium 1 137298325 32271408 3381.1
## + Protein 1 130049765 39519968 3439.8
## + TransFat 1 127307413 42262320 3459.3
## + Carbs 1 96177539 73392194 3619.4
## + Cholesterol 1 94557536 75012197 3625.7
## + Fiber 1 86783514 82786219 3654.3
## + Serving 1 50847203 118722530 3758.8
## + Sugar 1 21764608 147805125 3822.4
## <none> 169569733 3856.5
##
## Step: AIC=3289.36
## Calories ~ Fat
##
## Df Sum of Sq RSS AIC
## + Carbs 1 22510439 1010614 2382.3
## + Sugar 1 20202810 3318243 2727.1
## + Serving 1 13606617 9914436 3044.5
## + Fiber 1 2670597 20850456 3260.1
## + SatFat 1 1911124 21609929 3270.5
## + Protein 1 695114 22825939 3286.3
## + Sodium 1 652095 22868958 3286.9
## + TransFat 1 515617 23005436 3288.6
## <none> 23521053 3289.4
## + Cholesterol 1 11977 23509076 3294.9
##
## Step: AIC=2382.3
## Calories ~ Fat + Carbs
##
## Df Sum of Sq RSS AIC
## + Protein 1 961520 49093 1510.8
## + Sodium 1 699414 311199 2046.4
## + Sugar 1 353311 657302 2263.2
## + Cholesterol 1 190365 820249 2327.4
## + SatFat 1 148774 861839 2341.8
## + Fiber 1 30032 980581 2379.2
## <none> 1010614 2382.3
## + TransFat 1 6510 1004104 2386.1
## + Serving 1 405 1010209 2387.9
##
## Step: AIC=1510.84
## Calories ~ Fat + Carbs + Protein
##
## Df Sum of Sq RSS AIC
## + Sugar 1 6704.3 42389 1473.9
## + Fiber 1 3792.2 45301 1493.2
## + SatFat 1 3665.2 45428 1494.0
## + Serving 1 2249.8 46843 1502.9
## + Sodium 1 1170.3 47923 1509.5
## <none> 49093 1510.8
## + TransFat 1 465.3 48628 1513.8
## + Cholesterol 1 154.8 48939 1515.6
##
## Step: AIC=1473.93
## Calories ~ Fat + Carbs + Protein + Sugar
## Df Sum of Sq RSS AIC
## + Serving 1 1324.19 41065 1470.4
## + Fiber 1 1069.50 41319 1472.2
## + SatFat 1 853.52 41535 1473.7
## <none> 42389 1473.9
## + Sodium 1 291.89 42097 1477.6
## + TransFat 1 256.37 42133 1477.8
## + Cholesterol 1 20.31 42369 1479.5
##
## Step: AIC=1470.4
## Calories ~ Fat + Carbs + Protein + Sugar + Serving
##
## Df Sum of Sq RSS AIC
## + SatFat 1 1215.61 39849 1467.3
## + Fiber 1 1132.19 39933 1468.0
## <none> 41065 1470.4
## + Sodium 1 513.40 40551 1472.4
## + TransFat 1 425.02 40640 1473.0
## + Cholesterol 1 4.60 41060 1476.0
##
## Step: AIC=1467.35
## Calories ~ Fat + Carbs + Protein + Sugar + Serving + SatFat
##
## Df Sum of Sq RSS AIC
## + Fiber 1 2172.80 37676 1456.8
## <none> 39849 1467.3
## + Sodium 1 286.74 39562 1470.9
## + Cholesterol 1 5.53 39844 1473.0
## + TransFat 1 1.06 39848 1473.0
##
## Step: AIC=1456.76
## Calories ~ Fat + Carbs + Protein + Sugar + Serving + SatFat +
## Fiber
##
## Df Sum of Sq RSS AIC
## + Sodium 1 808.28 36868 1456.1
## <none> 37676 1456.8
## + Cholesterol 1 73.97 37602 1461.9
## + TransFat 1 12.54 37664 1462.3
##
## Step: AIC=1456.14
## Calories ~ Fat + Carbs + Protein + Sugar + Serving + SatFat +
## Fiber + Sodium
##
## Df Sum of Sq RSS AIC
## <none> 36868 1456.1
## + Cholesterol 1 200.785 36667 1460.2
## + TransFat 1 27.329 36841 1461.6
##
## Call:
## lm(formula = Calories ~ Fat + Carbs + Protein + Sugar + Serving +
## SatFat + Fiber + Sodium, data = cfa)
##
## Coefficients:
## (Intercept) Fat Carbs Protein Sugar Serving
## 1.712288 8.685929 3.897188 3.857196 -0.108422 -0.005800
## SatFat Fiber Sodium
## 0.488492 0.730861 0.005302
part d
Perform model selection, using BIC as the metric and stepwise searching.
Report the predictor variables selected for the final model. No need to report the fitted coefficients. Do you select the same models using backward, forward, and stepwise searching?
# Use this code chunk for your answer.
step(data = cfa, object = lm(Calories ~ 1, data = cfa),
scope = Calories ~ Fat + SatFat + TransFat + Cholesterol +
Sodium + Carbs + Fiber + Sugar + Protein + Serving, direction = 'both', k = log(290))
## Start: AIC=3856.55
## Calories ~ 1
##
## Df Sum of Sq RSS AIC
## + Fat 1 146048680 23521053 3289.4
## + SatFat 1 144469314 25100419 3308.2
## + Sodium 1 137298325 32271408 3381.1
## + Protein 1 130049765 39519968 3439.8
## + TransFat 1 127307413 42262320 3459.3
## + Carbs 1 96177539 73392194 3619.4
## + Cholesterol 1 94557536 75012197 3625.7
## + Fiber 1 86783514 82786219 3654.3
## + Serving 1 50847203 118722530 3758.8
## + Sugar 1 21764608 147805125 3822.4
## <none> 169569733 3856.5
##
## Step: AIC=3289.36
## Calories ~ Fat
##
## Df Sum of Sq RSS AIC
## + Carbs 1 22510439 1010614 2382.3
## + Sugar 1 20202810 3318243 2727.1
## + Serving 1 13606617 9914436 3044.5
## + Fiber 1 2670597 20850456 3260.1
## + SatFat 1 1911124 21609929 3270.5
## + Protein 1 695114 22825939 3286.3
## + Sodium 1 652095 22868958 3286.9
## + TransFat 1 515617 23005436 3288.6
## <none> 23521053 3289.4
## + Cholesterol 1 11977 23509076 3294.9
## - Fat 1 146048680 169569733 3856.5
##
## Step: AIC=2382.3
## Calories ~ Fat + Carbs
##
## Df Sum of Sq RSS AIC
## + Protein 1 961520 49093 1510.8
## + Sodium 1 699414 311199 2046.4
## + Sugar 1 353311 657302 2263.2
## + Cholesterol 1 190365 820249 2327.4
## + SatFat 1 148774 861839 2341.8
## + Fiber 1 30032 980581 2379.2
## <none> 1010614 2382.3
## + TransFat 1 6510 1004104 2386.1
## + Serving 1 405 1010209 2387.9
## - Carbs 1 22510439 23521053 3289.4
## - Fat 1 72381581 73392194 3619.4
##
## Step: AIC=1510.84
## Calories ~ Fat + Carbs + Protein
##
## Df Sum of Sq RSS AIC
## + Sugar 1 6704 42389 1473.9
## + Fiber 1 3792 45301 1493.2
## + SatFat 1 3665 45428 1494.0
## + Serving 1 2250 46843 1502.9
## + Sodium 1 1170 47923 1509.5
## <none> 49093 1510.8
## + TransFat 1 465 48628 1513.8
## + Cholesterol 1 155 48939 1515.6
## - Protein 1 961520 1010614 2382.3
## - Fat 1 8224852 8273945 2992.0
## - Carbs 1 22776846 22825939 3286.3
##
## Step: AIC=1473.93
## Calories ~ Fat + Carbs + Protein + Sugar
##
## Df Sum of Sq RSS AIC
## + Serving 1 1324 41065 1470.4
## + Fiber 1 1070 41319 1472.2
## + SatFat 1 854 41535 1473.7
## <none> 42389 1473.9
## + Sodium 1 292 42097 1477.6
## + TransFat 1 256 42133 1477.8
## + Cholesterol 1 20 42369 1479.5
## - Sugar 1 6704 49093 1510.8
## - Protein 1 614913 657302 2263.2
## - Carbs 1 967691 1010080 2387.8
## - Fat 1 6582934 6625323 2933.3
##
## Step: AIC=1470.4
## Calories ~ Fat + Carbs + Protein + Sugar + Serving
##
## Df Sum of Sq RSS AIC
## + SatFat 1 1216 39849 1467.3
## + Fiber 1 1132 39933 1468.0
## <none> 41065 1470.4
## + Sodium 1 513 40551 1472.4
## + TransFat 1 425 40640 1473.0
## - Serving 1 1324 42389 1473.9
## + Cholesterol 1 5 41060 1476.0
## - Sugar 1 5779 46843 1502.9
## - Protein 1 611594 652659 2266.8
## - Carbs 1 965556 1006621 2392.5
## - Fat 1 6557438 6598503 2937.8
##
## Step: AIC=1467.35
## Calories ~ Fat + Carbs + Protein + Sugar + Serving + SatFat
##
## Df Sum of Sq RSS AIC
## + Fiber 1 2173 37676 1456.8
## <none> 39849 1467.3
## - SatFat 1 1216 41065 1470.4
## + Sodium 1 287 39562 1470.9
## + Cholesterol 1 6 39844 1473.0
## + TransFat 1 1 39848 1473.0
## - Serving 1 1686 41535 1473.7
## - Sugar 1 2876 42725 1481.9
## - Protein 1 605272 645122 2269.1
## - Carbs 1 754075 793925 2329.3
## - Fat 1 3200272 3240121 2737.2
##
## Step: AIC=1456.76
## Calories ~ Fat + Carbs + Protein + Sugar + Serving + SatFat +
## Fiber
##
## Df Sum of Sq RSS AIC
## - Sugar 1 647 38324 1456.0
## + Sodium 1 808 36868 1456.1
## <none> 37676 1456.8
## + Cholesterol 1 74 37602 1461.9
## + TransFat 1 13 37664 1462.3
## - Serving 1 1969 39645 1465.9
## - Fiber 1 2173 39849 1467.3
## - SatFat 1 2256 39933 1468.0
## - Carbs 1 526584 564261 2236.0
## - Protein 1 606297 643973 2274.3
## - Fat 1 2977463 3015139 2722.0
##
## Step: AIC=1456.03
## Calories ~ Fat + Carbs + Protein + Serving + SatFat + Fiber
##
## Df Sum of Sq RSS AIC
## + Sodium 1 1131 37193 1453.0
## <none> 38324 1456.0
## + Sugar 1 647 37676 1456.8
## + TransFat 1 82 38242 1461.1
## + Cholesterol 1 33 38291 1461.5
## - Serving 1 2544 40868 1469.0
## - Fiber 1 4401 42725 1481.9
## - SatFat 1 4897 43221 1485.2
## - Protein 1 786333 824657 2340.3
## - Fat 1 3031000 3069323 2721.5
## - Carbs 1 8572105 8610429 3020.6
##
## Step: AIC=1453.02
## Calories ~ Fat + Carbs + Protein + Serving + SatFat + Fiber +
## Sodium
##
## Df Sum of Sq RSS AIC
## <none> 37193 1453.0
## - Sodium 1 1131 38324 1456.0
## + Sugar 1 325 36868 1456.1
## + Cholesterol 1 169 37024 1457.4
## + TransFat 1 84 37109 1458.0
## - Serving 1 2850 40043 1468.8
## - SatFat 1 3579 40772 1474.0
## - Fiber 1 4934 42127 1483.5
## - Protein 1 237877 275070 2027.6
## - Fat 1 2677776 2714969 2691.6
## - Carbs 1 8554823 8592016 3025.7
##
## Call:
## lm(formula = Calories ~ Fat + Carbs + Protein + Serving + SatFat +
## Fiber + Sodium, data = cfa)
##
## Coefficients:
## (Intercept) Fat Carbs Protein Serving SatFat
## 2.002934 8.664515 3.799963 3.872489 -0.006290 0.580707
## Fiber Sodium
## 0.857368 0.006094
part e
Report the BIC for the final model(s) selected with the three searching methods. Based on the BIC, which model would you select overall?
# Use this code chunk for your answer, if needed.
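The final criterion value for each search is printed at the last "Step: AIC=" line of its output in parts b through d (with k = log(290), these are BIC values on the extractAIC scale). A minimal sketch of the comparison, with the three values copied from those outputs:

```r
# Final BIC reported by step() for each search direction.
bic_final <- c(backward = 1453.02, forward = 1456.14, stepwise = 1453.02)

# Search method(s) reaching the smallest BIC.
best <- names(bic_final)[bic_final == min(bic_final)]
print(bic_final)
print(best)
```

Backward and stepwise searching arrive at the same seven-predictor model, while forward searching keeps Sugar and ends with a slightly higher BIC.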
Exercise 5: Comparing Chick-Fil-A Model Metrics
For this exercise, we’ll continue analyzing the chickfila dataset but now using an exhaustive searching method to identify our optimal model.
part a
First, run the exhaustive searching function. What variables are included in the optimal model with 3 predictor variables? What metric is used to determine the optimal model at each p? Do the optimal models at each p result in nested models for the chickfila data?
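Whether the best models of each size are nested can be checked mechanically: each model's predictors must all appear in the next larger model. The sketch below hard-codes the predictor sets reported by the exhaustive search in this part, since the fitted regsubsets object is not available in a standalone snippet, and arranges them as a logical matrix similar in spirit to the summary's `which` component.

```r
vars <- c("Fat", "SatFat", "TransFat", "Cholesterol", "Sodium",
          "Carbs", "Fiber", "Sugar", "Protein", "Serving")

# Best subset of each size, as reported by the exhaustive search.
selected <- list(
  c("Fat"),
  c("Fat", "Carbs"),
  c("Fat", "Carbs", "Protein"),
  c("Fat", "Carbs", "Sugar", "Protein"),
  c("Fat", "SatFat", "Carbs", "Fiber", "Protein"),
  c("Fat", "SatFat", "Carbs", "Fiber", "Protein", "Serving"),
  c("Fat", "SatFat", "Sodium", "Carbs", "Fiber", "Protein", "Serving"),
  c("Fat", "SatFat", "Sodium", "Carbs", "Fiber", "Sugar", "Protein", "Serving")
)

# One row per model size, one logical column per variable.
which_matrix <- t(sapply(selected, function(s) vars %in% s))
colnames(which_matrix) <- vars

# A model is nested in the next one if it never uses a variable the next one lacks.
nested_in_next <- sapply(seq_len(nrow(which_matrix) - 1), function(i) {
  all(which_matrix[i, ] <= which_matrix[i + 1, ])
})
names(nested_in_next) <- paste(1:7, "->", 2:8)
print(nested_in_next)
```

The only break in nesting is between sizes 4 and 5, where Sugar is dropped in favor of SatFat and Fiber.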
# Use this code chunk for your answer.
library(leaps)
all_calories_model = summary(regsubsets(Calories ~ Fat + SatFat + TransFat + Cholesterol +
Sodium + Carbs + Fiber + Sugar + Protein + Serving, data = cfa))
all_calories_model
## Subset selection object
## Call: regsubsets.formula(Calories ~ Fat + SatFat + TransFat + Cholesterol +
## Sodium + Carbs + Fiber + Sugar + Protein + Serving, data = cfa)
## 10 Variables (and intercept)
##             Forced in Forced out
## Fat             FALSE      FALSE
## SatFat          FALSE      FALSE
## TransFat        FALSE      FALSE
## Cholesterol     FALSE      FALSE
## Sodium          FALSE      FALSE
## Carbs           FALSE      FALSE
## Fiber           FALSE      FALSE
## Sugar           FALSE      FALSE
## Protein         FALSE      FALSE
## Serving         FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Fat SatFat TransFat Cholesterol Sodium Carbs Fiber Sugar Protein
## 1  ( 1 ) "*" " "    " "      " "         " "    " "   " "   " "   " "
## 2  ( 1 ) "*" " "    " "      " "         " "    "*"   " "   " "   " "
## 3  ( 1 ) "*" " "    " "      " "         " "    "*"   " "   " "   "*"
## 4  ( 1 ) "*" " "    " "      " "         " "    "*"   " "   "*"   "*"
## 5  ( 1 ) "*" "*"    " "      " "         " "    "*"   "*"   " "   "*"
## 6  ( 1 ) "*" "*"    " "      " "         " "    "*"   "*"   " "   "*"
## 7  ( 1 ) "*" "*"    " "      " "         "*"    "*"   "*"   " "   "*"
## 8  ( 1 ) "*" "*"    " "      " "         "*"    "*"   "*"   "*"   "*"
##          Serving
## 1  ( 1 ) " "
## 2  ( 1 ) " "
## 3  ( 1 ) " "
## 4  ( 1 ) " "
## 5  ( 1 ) " "
## 6  ( 1 ) "*"
## 7  ( 1 ) "*"
## 8  ( 1 ) "*"
part b
Calculate the AIC for each of the models selected in part a. Based on AIC, which predictor variables should be included in the optimal model?
# Use this code chunk for your answer.
model1 = lm(data = cfa,
Calories ~ Fat)
model2 = lm(data = cfa,
Calories ~ Fat + Carbs)
model3 = lm(data = cfa,
Calories ~ Fat + Carbs + Protein)
model4 = lm(data = cfa,
Calories ~ Fat + Carbs + Sugar + Protein)
model5 = lm(data = cfa,
Calories ~ Fat + SatFat + Carbs + Fiber + Protein)
model6 = lm(data = cfa,
Calories ~ Fat + SatFat + Carbs + Fiber + Protein + Serving)
model7 = lm(data = cfa, Calories ~ Fat + SatFat +
Sodium + Carbs + Fiber + Protein + Serving)
model8 = lm(data = cfa,
Calories ~ Fat + SatFat + Sodium + Carbs + Fiber + Sugar + Protein + Serving)
extractAIC(model1)
## [1] 2.000 3282.022
extractAIC(model2)
## [1] 3.000 2371.294
extractAIC(model3)
## [1] 4.000 1496.163
extractAIC(model4)
## [1] 5.000 1455.581
extractAIC(model5)
## [1] 6.000 1446.985
extractAIC(model6)
## [1] 7.000 1430.345
extractAIC(model7)
## [1] 8.000 1423.658
extractAIC(model8)
## [1] 9.000 1423.114
part c
Calculate the BIC for each of the models selected in part a. Based on BIC, which predictor variables should be included in the optimal model? Does this match any of the models selected in Exercise 4?
# Use this code chunk for your answer.
extractAIC(model1, k = log(290))
## [1] 2.000 3289.362
extractAIC(model2, k = log(290))
## [1] 3.000 2382.304
extractAIC(model3, k = log(290))
## [1] 4.000 1510.843
extractAIC(model4, k = log(290))
## [1] 5.000 1473.931
extractAIC(model5, k = log(290))
## [1] 6.000 1469.005
extractAIC(model6, k = log(290))
## [1] 7.000 1456.034
extractAIC(model7, k = log(290))
## [1] 8.000 1453.017
extractAIC(model8, k = log(290))
## [1] 9.000 1456.143
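The AIC and BIC values from extractAIC differ only in the penalty per estimated coefficient (2 versus log(n)), so the BIC column can be derived from the AIC column as BIC = AIC + (log(n) - 2)·edf. The sketch below checks this against the values printed in parts b and c and reports which model each criterion prefers.

```r
n <- 290
edf <- 2:9  # equivalent degrees of freedom for model1 ... model8

# AIC values copied from part b.
aic <- c(3282.022, 2371.294, 1496.163, 1455.581,
         1446.985, 1430.345, 1423.658, 1423.114)

# BIC values copied from part c, for comparison.
bic_reported <- c(3289.362, 2382.304, 1510.843, 1473.931,
                  1469.005, 1456.034, 1453.017, 1456.143)

# Swap the penalty of 2 per coefficient for log(n) per coefficient.
bic <- aic + (log(n) - 2) * edf
print(round(bic, 3))

# Index of the preferred model under each criterion.
print(c(best_aic = which.min(aic), best_bic = which.min(bic)))
```

AIC is smallest for the eight-predictor model, while the heavier BIC penalty favors the seven-predictor model, the same one selected by backward and stepwise searching in Exercise 4.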
part d
Calculate the adjusted $R^2$ for each of the models selected in part a. Based on the adjusted $R^2$, which predictor variables should be included in the optimal model?
# Use this code chunk for your answer.
all_calories_model$adjr2
## [1] 0.8608082 0.9939986 0.9997074 0.9997465 0.9997547 0.9997692 0.9997752
## [8] 0.9997764
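Adjusted R-squared compares the residual variance estimate RSS/(n - p - 1) with the total variance SST/(n - 1), so it can be reproduced from sums of squares printed earlier. The sketch below does this for the seven-predictor model, taking SST from the RSS of the intercept-only model in the forward step() output of Exercise 4; both inputs are rounded, so agreement is approximate.

```r
n <- 290
sst <- 169569733  # RSS of Calories ~ 1 (total sum of squares)
rss <- 37193      # RSS of the seven-predictor model
p <- 7            # number of predictors

# Adjusted R-squared: one minus the ratio of the two variance estimates.
adj_r2 <- 1 - (rss / (n - p - 1)) / (sst / (n - 1))
print(adj_r2)
```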
part e
Calculate the RMSE for each of the models selected in part a. Based on the RMSE, which predictor variables should be included in the optimal model?
# Use this code chunk for your answer.
sqrt((1/290) * all_calories_model$rss)
## [1] 284.79305 59.03282 13.01105 12.09004 11.87117 11.49571 11.32482
## [8] 11.27526
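The RMSE used here is sqrt(RSS/n), dividing by n rather than by the residual degrees of freedom. Because RSS cannot increase when a predictor is added, this metric always favors the largest model considered. A minimal sketch for the two largest models, using the rounded RSS values from the step() output in Exercise 4:

```r
n <- 290

# Residual sums of squares for the seven- and eight-predictor models.
rss <- c(model7 = 37193, model8 = 36868)

# RMSE with the same divisor (n) as the code in this part.
rmse <- sqrt(rss / n)
print(rmse)
```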
part f
Are the same models selected for each of parts b through e? How many different models are selected from the different metrics but with the same exhaustive searching method?
Exercise 6: Formatting
The last five points of the assignment will be earned for properly formatting your final document. Check that you have:
• included your name on the document
• properly assigned pages to exercises on Gradescope
• selected page 1 (with your name) and this page for this exercise (Exercise 6)
• printed all code readably for each question
• printed all output
• generated a PDF file