DATA 303/473 Assignment 3 -Solved

Q1
We use the Wage data set from the ISLR2 package. The Wage data set contains the following variables.
library(ISLR2) 
#head(Wage) 
summary(Wage) 
## year age maritl race 
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480 
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293 
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190 
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37 
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55 
## Max. :2009 Max. :80.00 
## 
## education region jobclass 
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544 
## 2. HS Grad :971 1. New England : 0 2. Information:1456 
## 3. Some College :650 3. East North Central: 0 
## 4. College Grad :685 4. West North Central: 0 
## 5. Advanced Degree:426 5. South Atlantic : 0 
## 6. East South Central: 0 
## (Other) : 0 
## health health_ins logwage wage 
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09 
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38 
## Median :4.653 Median :104.92 
## Mean :4.654 Mean :111.70 
## 3rd Qu.:4.857 3rd Qu.:128.68 
## Max. :5.763 Max. :318.34 
## 
In the first part of the assignment, we are interested in wage in relation to year, age and education. The
following is a pairs plot of these variables.
pairs(data.frame(Wage$wage, Wage$year, Wage$age, Wage$education)) 
[Figure: pairs plot with panels Wage.wage, Wage.year, Wage.age and Wage.education]
It is known that year has an approximately linear trend and that education is a categorical variable.
We use natural spline curve fitting for the trend in age. For this we use the function ns() in the splines
package together with the lm() function. We fit the following models:
model1: wage ∼ year + ns(age, df = 1) + education,
model2: wage ∼ year + ns(age, df = 3) + education,
model3: wage ∼ year + ns(age, df = 5) + education,
model4: wage ∼ year + ns(age, df = 7) + education,
model5: wage ∼ year + ns(age, df = 9) + education.
(a) Fit the models and use the anova() function to carry out the deviance test to compare them.
Choose the best model.
library(splines) 
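A minimal sketch of one way to approach part (a), assuming all five models are fitted with lm() on the full Wage data. The object names model1 to model5 and anova_tab are illustrative and not part of the assignment.

```r
library(ISLR2)    # provides the Wage data set
library(splines)  # provides ns()

# Fit the five candidate models; only the spline df for age changes
model1 <- lm(wage ~ year + ns(age, df = 1) + education, data = Wage)
model2 <- lm(wage ~ year + ns(age, df = 3) + education, data = Wage)
model3 <- lm(wage ~ year + ns(age, df = 5) + education, data = Wage)
model4 <- lm(wage ~ year + ns(age, df = 7) + education, data = Wage)
model5 <- lm(wage ~ year + ns(age, df = 9) + education, data = Wage)

# Sequential F tests: each row compares a model with the one listed before it
anova_tab <- anova(model1, model2, model3, model4, model5)
print(anova_tab)
```

A small p-value in a row suggests that the extra spline flexibility in that model improves on the previous one. Because ns() places its knots at quantiles of age, spline fits with different df are not exactly nested, so these comparisons are best read as approximate.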
(b) Calculate the AIC for each model fitted in (a). Choose the best model using the value of AIC.
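The AIC comparison in part (b) could be sketched as follows. This refits the five models in a loop so that the block runs on its own; the names dfs and aic_values are assumptions made for illustration.

```r
library(ISLR2)
library(splines)

# Spline degrees of freedom for model1 to model5
dfs <- c(1, 3, 5, 7, 9)

# Fit each model on the full Wage data and extract its AIC
aic_values <- sapply(dfs, function(d) {
  fit <- lm(wage ~ year + ns(age, df = d) + education, data = Wage)
  AIC(fit)
})
names(aic_values) <- paste0("model", seq_along(dfs))

print(aic_values)
# Name of the model with the smallest AIC
names(which.min(aic_values))
```

Smaller AIC is better; the penalty term discourages choosing a larger df unless it improves the fit enough.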
(c) Split the data set (100%) into a training set (70%) and a test set (30%). Then fit
model1–model5 on the training set, and calculate the test MSE for each model. Choose the best model.
set.seed(11) 
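One possible way to carry out part (c) is sketched below. It assumes the split is drawn with sample() straight after set.seed(11); the assignment does not say how the split should be generated, so the resulting rows are an assumption, as are the names train_idx, train, test and test_mse.

```r
library(ISLR2)
library(splines)

set.seed(11)  # seed given in the assignment
n <- nrow(Wage)
train_idx <- sample(n, size = round(0.7 * n))  # 70% of the rows for training
train <- Wage[train_idx, ]
test  <- Wage[-train_idx, ]                    # remaining 30% for testing

dfs <- c(1, 3, 5, 7, 9)

# Fit each model on the training set and evaluate it on the test set
test_mse <- sapply(dfs, function(d) {
  fit  <- lm(wage ~ year + ns(age, df = d) + education, data = train)
  pred <- predict(fit, newdata = test)
  mean((test$wage - pred)^2)
})
names(test_mse) <- paste0("model", seq_along(dfs))

print(test_mse)
```

The model with the smallest test MSE is the preferred one under this criterion. The ranking can change with a different seed or split.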
(d) By combining the results from (a), (b) and (c), decide on the best model. Refit the chosen
model using all of the Wage data set. Interpret the output of the summary() function.
Q2 
Here we will predict the number of applications received (Apps) using the other variables in the College data
set.
The data set contains 777 observations on the following 18 variables.
# Private: A factor with levels No and Yes indicating private or public university 
# Apps: Number of applications received
# Accept: Number of applications accepted
# Enroll: Number of new students enrolled 
# Top10perc: Pct. new students from top 10% of H.S. class 
# Top25perc: Pct. new students from top 25% of H.S. class 
# F.Undergrad: Number of fulltime undergraduates 
# P.Undergrad: Number of parttime undergraduates 
# Outstate: Out-of-state tuition 
# Room.Board: Room and board costs 
# Books: Estimated book costs 
# Personal: Estimated personal spending 
# PhD: Pct. of faculty with Ph.D.'s 
# Terminal: Pct. of faculty with terminal degree 
# S.F.Ratio: Student/faculty ratio 
# perc.alumni: Pct. alumni who donate 
# Expend: Instructional expenditure per student 
# Grad.Rate: Graduation rate 
library(ISLR) 
## 
## Attaching package: 'ISLR' 
## The following objects are masked from 'package:ISLR2': 
## 
## Auto, Credit 
data(College) 
summary(College) 
## Private Apps Accept Enroll Top10perc 
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00 
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00 
## Median : 1558 Median : 1110 Median : 434 Median :23.00 
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56 
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00 
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00 
## Top25perc F.Undergrad P.Undergrad Outstate 
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340 
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990 
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441 
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700 
## Room.Board Books Personal PhD 
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00 
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 
## Median :4200 Median : 500.0 Median :1200 Median : 75.00 
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66 
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00 
## Terminal S.F.Ratio perc.alumni Expend 
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186 
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751 
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377 
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660 
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830 
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233 
## Grad.Rate 
## Min. : 10.00 
## 1st Qu.: 53.00 
## Median : 65.00 
## Mean : 65.46 
## 3rd Qu.: 78.00 
## Max. :118.00 
(a) (5 marks) (Create training set and test set) Split the data set (100%) into a training set (70%) and a
test set (30%).
set.seed(11) 
(b) (LASSO) Fit a lasso model on the training set, with λ chosen by cross-validation with the
1 se rule. Report the test error obtained, along with the number of non-zero coefficient estimates.
library(glmnet) 
## Loading required package: Matrix 
## Loaded glmnet 4.1-3 
grid <- 10^seq(4, -2, length = 100)
• Test MSE
• Non-zero coefficient estimates
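The two quantities above could be obtained along the following lines. This is a sketch under stated assumptions: the 70/30 split is drawn with sample() after set.seed(11), grid is taken to be the λ sequence passed to cv.glmnet(), and the default 10-fold cross-validation is used. The names x, y, cv_fit, test_mse and nonzero are illustrative.

```r
library(ISLR)
library(glmnet)

data(College)
set.seed(11)
n <- nrow(College)
train_idx <- sample(n, size = round(0.7 * n))

# glmnet needs a numeric matrix; model.matrix() turns Private into a dummy column
x <- model.matrix(Apps ~ ., data = College)[, -1]  # drop the intercept column
y <- College$Apps

grid <- 10^seq(4, -2, length = 100)

# alpha = 1 gives the lasso penalty
cv_fit <- cv.glmnet(x[train_idx, ], y[train_idx], alpha = 1, lambda = grid)
lambda_1se <- cv_fit$lambda.1se  # largest lambda within 1 SE of the minimum CV error

# Test MSE at the 1 se lambda
pred <- predict(cv_fit, newx = x[-train_idx, ], s = "lambda.1se")
test_mse <- mean((y[-train_idx] - pred)^2)

# Coefficient estimates that are not shrunk to zero (includes the intercept)
coef_vec <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
nonzero <- coef_vec[coef_vec != 0]

print(lambda_1se)
print(test_mse)
print(nonzero)
```

The 1 se rule deliberately picks a more heavily penalised model than the minimum-error λ, so it typically keeps fewer predictors.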
(c) Carry out best subset selection using BIC and choose the best model.
library(leaps) 
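A sketch of best subset selection with regsubsets() from the leaps package, choosing the model size with the smallest BIC. It assumes the search is run on the training set from (a), which the question does not state explicitly; the names best_fit, best_sum and best_size are illustrative.

```r
library(ISLR)
library(leaps)

data(College)
set.seed(11)
n <- nrow(College)
train_idx <- sample(n, size = round(0.7 * n))
College_train <- College[train_idx, ]

# nvmax = 17 lets the search consider models with up to all 17 predictors
best_fit <- regsubsets(Apps ~ ., data = College_train, nvmax = 17)
best_sum <- summary(best_fit)

# Model size with the smallest BIC, and the coefficients of that model
best_size <- which.min(best_sum$bic)
print(best_size)
print(coef(best_fit, id = best_size))
```

Plotting best_sum$bic against model size is a quick way to see how sharply the minimum is defined.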
(d) Using all of the College data set, refit the models chosen by LASSO in (b) and by best subset
selection in (c). Print the output of the function summary() for these models. Then compute ‘AIC’ and
‘BIC’. Between these 2 models, which is the better model? Give reasons why.
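The refit and comparison in (d) might look like the sketch below. It assumes both selections are made on the training set, that each selected variable set is then refitted by ordinary least squares with lm() on all 777 observations, and that the lasso keeps at least one predictor. The data frame full and the names lasso_vars, subset_vars and comparison are hypothetical helpers, not part of the assignment.

```r
library(ISLR)
library(glmnet)
library(leaps)

data(College)
set.seed(11)
n <- nrow(College)
train_idx <- sample(n, size = round(0.7 * n))

# Numeric design matrix (Private becomes the dummy PrivateYes) and a matching data frame
x <- model.matrix(Apps ~ ., data = College)[, -1]
full <- data.frame(Apps = College$Apps, x)

# Predictors kept by the lasso at the 1 se lambda (selected on the training set)
grid <- 10^seq(4, -2, length = 100)
cv_fit <- cv.glmnet(x[train_idx, ], full$Apps[train_idx], alpha = 1, lambda = grid)
lasso_coef <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
lasso_vars <- setdiff(names(lasso_coef)[lasso_coef != 0], "(Intercept)")

# Predictors in the BIC-best subset (selected on the training set)
best_fit <- regsubsets(Apps ~ ., data = full[train_idx, ], nvmax = ncol(x))
best_size <- which.min(summary(best_fit)$bic)
subset_vars <- setdiff(names(coef(best_fit, id = best_size)), "(Intercept)")

# Refit both selected models by least squares on all of the data
lasso_lm  <- lm(reformulate(lasso_vars, response = "Apps"), data = full)
subset_lm <- lm(reformulate(subset_vars, response = "Apps"), data = full)

print(summary(lasso_lm))
print(summary(subset_lm))

# Side-by-side information criteria; smaller is better for both
comparison <- data.frame(
  model = c("lasso", "best subset"),
  AIC = c(AIC(lasso_lm), AIC(subset_lm)),
  BIC = c(BIC(lasso_lm), BIC(subset_lm))
)
print(comparison)
```

BIC penalises each extra parameter more heavily than AIC at this sample size, so the two criteria can disagree when one model is noticeably larger than the other.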
