PROBLEMS
1. (80%) Linear Regression
(a) (10%) Split train.csv into a training set (80%) and a validation set (20%). Normalize both sets by subtracting the column-wise means of the training set and then dividing by the column-wise standard deviations of the training set. Please explain in your report how you obtained the two sets. Note that you should use the same training and validation sets for (b)-(e); the validation set is what the later parts call the test set.
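The split and normalization described in (a) could be sketched as follows. This is only an illustration under stated assumptions: the features are already in a numeric float array `X`, the target `y` is left unnormalized, and the function name, the shuffling, and the `seed` parameter are hypothetical choices rather than requirements of the assignment.

```python
import numpy as np

def split_and_normalize(X, y, train_frac=0.8, seed=0):
    """Shuffle rows, split them, and z-score both parts with training statistics."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    n_train = int(train_frac * X.shape[0])
    train_idx, val_idx = idx[:n_train], idx[n_train:]

    X_train, X_val = X[train_idx], X[val_idx]

    # Statistics come from the training rows only, and are reused for validation.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Assumption: constant columns are left unscaled to avoid division by zero.
    std[std == 0] = 1.0

    return (X_train - mean) / std, y[train_idx], (X_val - mean) / std, y[val_idx]
```

Fixing the seed keeps the split identical across (b)-(e), which is what the problem asks for.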
(b) (10%) Implement a linear regression model without the bias term to predict G3. Use the pseudoinverse to obtain the weights. Record the root mean squared error (RMSE) on the test set.
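A minimal sketch of the pseudoinverse fit and the RMSE in (b), assuming `X` is the normalized feature matrix and `y` the vector of G3 values. The function names are hypothetical.

```python
import numpy as np

def fit_pinv(X, y):
    """Least-squares weights without a bias term: w = pinv(X) y."""
    return np.linalg.pinv(X) @ y

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Predictions on the held-out rows are then `X_val @ w`, and `rmse(y_val, X_val @ w)` gives the number to record.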
(c) (10%) Regularization is often adopted to avoid over-fitting. A linear regression model is regularized by adding an extra term to the objective function J(w):
$$J(w) = \mathrm{MSE}_{\text{train}} + \frac{\lambda}{2} w^{\top} w$$
Implement a regularized linear regression model without the bias term where λ = 1.0. Please describe how to find the optimal weights in your report. Record the RMSE of the test set.
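One way to obtain the weights in closed form is sketched below. It assumes the penalty is (λ/2) wᵀw and that MSE_train is the mean of the squared residuals over the m training rows; setting the gradient (2/m) Xᵀ(Xw − y) + λw to zero then gives (XᵀX + (λm/2) I) w = Xᵀy. If MSE_train is defined with a different scaling, the factor in front of the identity changes accordingly. The function name is hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Minimize mean((Xw - y)^2) + (lam / 2) * w.T @ w without a bias term."""
    m, n = X.shape
    # Zero-gradient condition: (X^T X + (lam * m / 2) I) w = X^T y
    A = X.T @ X + 0.5 * lam * m * np.eye(n)
    return np.linalg.solve(A, X.T @ y)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse.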
(d) (10%) Repeat (c) but include the bias term in your model.
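A common way to include the bias term is to append a column of ones to the feature matrix, as in the sketch below. It makes two assumptions that the problem statement does not settle: the bias weight is penalized together with the other weights, and the objective is the mean squared error plus (λ/2) wᵀw. The function names are hypothetical.

```python
import numpy as np

def add_bias_column(X):
    """Append a column of ones so the last weight acts as the bias."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ridge_with_bias(X, y, lam=1.0):
    """Regularized least squares on the augmented matrix [X, 1]."""
    Xb = add_bias_column(X)
    m, n = Xb.shape
    # Assumption: the bias is regularized like every other weight.
    A = Xb.T @ Xb + 0.5 * lam * m * np.eye(n)
    return np.linalg.solve(A, Xb.T @ y)
```

Predictions use the same augmentation: `add_bias_column(X_val) @ w`.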
(e) (10%) Follow the example "Bayesian Linear Regression" in the textbook (Chapter 5) and implement a Bayesian linear regression model with the bias term. Let µ0 = 0 and Λ0 = (1/α) I in (5.78), where α = 1.0. Use the mean of the posterior as the weights for your model. Record the RMSE on the test set.
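The posterior mean could be computed as in the sketch below. It assumes a Gaussian prior with mean µ0 = 0 and covariance Λ0 = (1/α) I, and unit noise variance on the targets, so the posterior covariance is Λm = (XᵀX + αI)⁻¹ and the posterior mean is µm = Λm Xᵀy. The function name is hypothetical, and the bias is handled by appending a column of ones.

```python
import numpy as np

def bayes_posterior(X, y, alpha=1.0):
    """Posterior mean and covariance for a zero-mean Gaussian prior with covariance (1/alpha) I."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias as an extra feature
    n = Xb.shape[1]
    # Posterior precision, assuming unit noise variance on y.
    precision = Xb.T @ Xb + alpha * np.eye(n)
    cov = np.linalg.inv(precision)
    mean = cov @ (Xb.T @ y)
    return mean, cov
```

The returned `mean` plays the role of the weight vector; the covariance is not needed for the RMSE but is returned for completeness.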
Figure 1: Regression result comparison. (The figure is just an example; your plot does not need to look the same.)
(f) (20%) Plot the ground truth (real G3) against the predicted values generated by models (b)-(e), as exemplified in Figure 1. Please compare the RMSEs and the predicted G3 values in your report. Also, please explain mathematically why the predicted G3 values are closer to the ground truth for (d) and (e).
(g) (10%) Apply the model from 1. (e) to test_no_G3.csv and save your results as StudentID_1.txt. You are allowed to tune α.
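Writing the predictions to a text file might look like the sketch below. The output layout is not specified in the problem, so one prediction per line with six decimal places is an assumption, and the function name is hypothetical.

```python
import numpy as np

def save_predictions(path, predictions):
    """Write one prediction per line as plain text (assumed layout)."""
    np.savetxt(path, np.asarray(predictions, dtype=float), fmt="%.6f")
```

For example, `save_predictions("StudentID_1.txt", preds)` with `StudentID` replaced by the actual ID.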
2. (20%) Census Income Data Set
(a) Repeat 1. (a)-(e) on the Census Income Data Set (adult.data and adult.test; for more details, see https://archive.ics.uci.edu/ml/datasets/Census+Income). α is tunable. The prediction target is the last column (>50K, ≤50K). Describe your findings.
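Before the models from Problem 1 can be applied, the non-numeric columns and the income label need a numeric encoding. The sketch below shows one possible approach with one-hot encoding and a 0/1 target. Both helper names are hypothetical, and the exact label strings and the choice of which columns to treat as categorical are assumptions that should be checked against the data files.

```python
import numpy as np

def one_hot(column):
    """Return the sorted categories and a one-hot matrix for a 1-D array of labels."""
    categories, codes = np.unique(np.asarray(column), return_inverse=True)
    return categories, np.eye(len(categories))[codes]

def encode_income(labels):
    """Map income labels to 1.0 for the '>50K' class and 0.0 otherwise (assumed label text)."""
    return np.array([1.0 if ">50K" in s else 0.0 for s in labels])
```

To keep the columns aligned, the categories found in the training file should also be used when encoding the test file.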