Coding Assignment
Download Rdata files from Piazza or Coursera
The assignment is based on the Boston Housing data. The original data set comes from the R library “mlbench” and has 506 observations on 19 variables:
crim      per capita crime rate by town
zn        proportion of residential land zoned for lots over 25,000 sq.ft
indus     proportion of non-retail business acres per town
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox       nitric oxides concentration (parts per 10 million)
rm        average number of rooms per dwelling
age       proportion of owner-occupied units built prior to 1940
dis       weighted distances to five Boston employment centres
rad       index of accessibility to radial highways
tax       full-value property-tax rate per USD 10,000
ptratio   pupil-teacher ratio by town
b         1000(B − 0.63)^2 where B is the proportion of blacks by town
lstat     percentage of lower status of the population
medv      median value of owner-occupied homes in USD 1000’s
cmedv     corrected median value of owner-occupied homes in USD 1000’s
town      name of town
tract     census tract
lon       longitude of census tract
lat       latitude of census tract
First, we apply some suggested transformations to the data, then remove three variables (medv, town, and tract) and use cmedv as the response variable Y.
Consider the following 10 procedures:
• Full: run a linear regression model using all features,
• AIC.F and AIC.B: Forward/backward selection with AIC,
• BIC.F and BIC.B: Forward/backward selection with BIC,
• R.min and R.1se: Ridge regression using lambda.min or lambda.1se,
• L.min and L.1se: Lasso using lambda.min or lambda.1se,
• L.Refit: Refit the model selected by Lasso using lambda.1se.
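To make the AIC/BIC procedures concrete, here is a minimal sketch of forward selection in Python with NumPy. It is an illustration only: the assignment itself is meant to be done in R (for example with `step()` using `k = 2` for AIC or `k = log(n)` for BIC), the data below are synthetic, and the helper names `rss` and `forward_select` are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of OLS with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, k):
    """Greedy forward selection minimizing n*log(RSS/n) + k*(number of coefficients).

    k = 2 gives AIC and k = log(n) gives BIC, both up to an additive constant.
    """
    n, p = X.shape
    selected = []
    best = n * np.log(rss(X, y, selected) / n) + k * 1  # intercept-only model
    while len(selected) < p:
        scores = {}
        for j in range(p):
            if j not in selected:
                cand = selected + [j]
                scores[j] = n * np.log(rss(X, y, cand) / n) + k * (len(cand) + 1)
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:  # no candidate improves the criterion: stop
            break
        selected.append(j_best)
        best = scores[j_best]
    return sorted(selected)

# Synthetic data (assumption): only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

aic_model = forward_select(X, y, k=2.0)        # AIC.F
bic_model = forward_select(X, y, k=np.log(n))  # BIC.F
print("AIC.F:", aic_model, "BIC.F:", bic_model)
```

Because both criteria add the same penalty to every candidate at a given step, they follow the same greedy path; BIC's heavier penalty only makes it stop no later than AIC, so the BIC model is a subset of the AIC model here.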
1. Load BostonHousing1.Rdata, which has 16 variables including the response variable Y. The data has been pre-processed, so there is no need to apply any transformation.
a) Repeat the following simulation 50 times. In each iteration, randomly split the data into two parts: 75% for training and 25% for testing. Fit each model on the training data, obtain predictions on the test data, and record, for each procedure, the mean squared prediction error (MSPE) on the test set, the selected model size (or the effective dimension for Ridge), and the computation time.
Exclude the intercept when computing the model size or effective dimension.
b) Summarize your results on MSPE and model size graphically, e.g., using a boxplot or stripchart.
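The simulation loop in (a) can be sketched as follows, again in Python with NumPy and purely as an illustration. The assumptions: synthetic data stand in for BostonHousing1.Rdata, only the Full procedure is shown, the 25% test share is rounded down, and `fit_full`, `predict`, and `ridge_effective_dim` are hypothetical helper names. The ridge effective dimension is computed as the sum of d_j^2 / (d_j^2 + lambda) over the singular values d_j of the centered design matrix, with lambda on the scale of the penalized criterion RSS + lambda * ||beta||^2; R's glmnet scales lambda differently, so values are not directly interchangeable.

```python
import time
import numpy as np

def fit_full(X, y):
    """OLS with an intercept; returns coefficients with the intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

def ridge_effective_dim(X, lam):
    """Sum of d_j^2 / (d_j^2 + lam) over singular values of the centered X.

    Centering removes the intercept, so it is excluded from the count.
    """
    Xc = X - X.mean(axis=0)
    d = np.linalg.svd(Xc, compute_uv=False)
    return float(np.sum(d**2 / (d**2 + lam)))

# Synthetic stand-in for the real data (assumption): 506 rows, 15 predictors.
rng = np.random.default_rng(1)
n, p = 506, 15
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

T = 50            # number of repetitions
n_test = n // 4   # 25% for testing, rounded down
mspe = np.empty(T)
secs = np.empty(T)
for t in range(T):
    perm = rng.permutation(n)
    test, train = perm[:n_test], perm[n_test:]
    start = time.perf_counter()
    beta = fit_full(X[train], y[train])
    pred = predict(beta, X[test])
    secs[t] = time.perf_counter() - start
    mspe[t] = np.mean((y[test] - pred) ** 2)

model_size = p    # Full always uses every predictor (intercept excluded)
print("mean MSPE:", mspe.mean(), "total seconds:", secs.sum())
```

The recorded `mspe` vector (one entry per split) is what a boxplot or stripchart in (b) would display; the other procedures would add one such vector each, using the same train/test splits so that the comparison is fair.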
2. Load BostonHousing2.Rdata, which has 135 variables including the response variable Y. In addition to the original 15 predictors, the data contains their quadratic and all pairwise interaction terms.
Repeat (a-b) above for only five methods: R.min, R.1se, L.min, L.1se, and L.Refit.
3. Load BostonHousing3.Rdata, which has 635 variables including the response variable Y. In addition to BostonHousing2.Rdata, the data contains 500 noise features.
Repeat (a-b) above for only five methods: R.min, R.1se, L.min, L.1se, and L.Refit.
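As a sanity check on the stated dimensions, the short Python calculation below gives one accounting that reproduces the totals of 135 and 635 variables. It rests on an assumption not stated in the assignment: that the squared term is omitted for the single binary predictor (chas), since the square of a 0/1 variable equals the variable itself.

```python
from math import comb

n_original = 15                        # predictors in BostonHousing1 (16 variables minus Y)
n_binary = 1                           # assumption: only chas is a 0/1 dummy with a redundant square
n_quadratic = n_original - n_binary    # 14 squared terms
n_interactions = comb(n_original, 2)   # 105 pairwise interactions

n_predictors_2 = n_original + n_quadratic + n_interactions
n_predictors_3 = n_predictors_2 + 500  # plus 500 noise features

# Totals including the response Y
print(n_predictors_2 + 1, n_predictors_3 + 1)
```

When loading the files, comparing these counts with the actual number of columns is a quick way to confirm that the data were read correctly.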
A PDF file (maximum t2