Problem 1
Download: Codebase
In this coding problem, we will implement the closed-form solution to linear regression. The student will represent data in matrix-vector form and gain some experience with data visualization. In addition, the student will explore the phenomenon of “regression to the mean.”
We generate data with two steps:
Independently generate
Rotate to a degree (e.g.,
The data generation is given by the generate_data(M, var1, var2, degree) function.
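The two steps can be sketched as follows. This is only an illustration, not the codebase's generate_data: the name generate_data_sketch, the seed argument, the zero-mean Gaussian distribution, and the counter-clockwise rotation convention are all assumptions.

```python
import numpy as np

def generate_data_sketch(M, var1, var2, degree, seed=0):
    """Hypothetical stand-in for generate_data; the real function may differ."""
    rng = np.random.default_rng(seed)
    # Step 1: draw the two coordinates independently
    # (zero-mean Gaussians with variances var1 and var2 are an assumption).
    raw = np.stack([rng.normal(0.0, np.sqrt(var1), M),
                    rng.normal(0.0, np.sqrt(var2), M)])  # shape (2, M)
    # Step 2: rotate every point counter-clockwise by `degree` degrees.
    theta = np.deg2rad(degree)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    rotated = R @ raw
    return rotated[0], rotated[1]  # x and y, each of shape (M,)
```

With var1 larger than var2 and a 45-degree rotation, the points spread along the diagonal, so x and y become positively correlated.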
After data generation, we will obtain these two plots of data. In the second plot, the dashed line marks the direction around which most of the data are generated.
Now, we would like to do two linear regressions:
1. Predict y from x (denoted by x2y), and
2. Predict x from y (denoted by y2x).
To accomplish this, the student needs to implement the leastSquares(X, Y) function.
Note: In the function leastSquares(X, Y), X is simply the input and Y is simply the output. It is not related to x2y regression or y2x regression.
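For reference, the closed-form solution solves the normal equations (X^T X) w = X^T Y. A minimal sketch, assuming X is an M-by-d array that already contains any dummy feature and Y has M rows; the name least_squares_sketch is hypothetical and is not the required leastSquares implementation.

```python
import numpy as np

def least_squares_sketch(X, Y):
    """Solve (X^T X) w = X^T Y for w; X has shape (M, d), Y has M rows."""
    # np.linalg.solve avoids forming the explicit inverse of X^T X.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Points on the line y = 2x + 1, with a column of ones for the bias.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
Y = np.array([1.0, 3.0, 5.0])
w = least_squares_sketch(X, Y)  # slope first, bias second
```

This sketch assumes X^T X is invertible; np.linalg.lstsq is an alternative when it is not.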
Then, we would like to use the learned models x2y and y2x to predict a series of points in (-4, 4) as input, and plot the input and the predicted output.
To accomplish this, the student needs to
Implement feature augmentation, i.e., each data sample is concatenated with a dummy feature 1. Notice the augmented feature during prediction has to comply with that in training.
Implement the prediction function model(X, w), where the definition of w should comply with that in the leastSquares function.
Plot the learned models. Notice that a linear regression with a single input feature is just a line. The student is required to plot x2y regression in red and y2x regression in green. These two models MUST be plotted in the same x-y coordinate system, with x being the horizontal axis and y being the vertical axis.
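The steps above can be sketched as follows. The helper names augment and model_sketch and the weight values are made up for illustration; appending the dummy feature as the last column is an assumption and must match whatever is done in training.

```python
import numpy as np

def augment(X):
    """Append a dummy feature 1 to every sample (as the last column)."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    return np.hstack([X, np.ones((X.shape[0], 1))])

def model_sketch(X, w):
    """Linear prediction on augmented inputs; w = [slope, bias]."""
    return augment(X) @ np.asarray(w, dtype=float)

grid = np.linspace(-4, 4, 9)
w_x2y = np.array([0.5, 0.0])  # made-up weights, for illustration only
w_y2x = np.array([0.5, 0.0])  # made-up weights, for illustration only

y_from_x = model_sketch(grid, w_x2y)  # x2y: input is x, output is y
x_from_y = model_sketch(grid, w_y2x)  # y2x: input is y, output is x
```

To draw both models in the same x-y coordinate system, the x2y line is plotted as (grid, y_from_x) and the y2x line as (x_from_y, grid), for example with matplotlib's plt.plot(grid, y_from_x, "r") and plt.plot(x_from_y, grid, "g").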
Problem 2 [50%]
Download: 2a, 2b, 2c
In this problem, we will explore gradient descent optimization for linear regression, applied to the Boston house price prediction. The dataset is loaded by import sklearn.datasets as datasets
where sklearn is a popular machine learning toolkit. Unfortunately, in the coding assignment, we CANNOT use any other API functions in existing machine learning toolkits including sklearn. Again, we shall use linear algebra routines (e.g., numpy) for the assignment.
The dataset contains 506 samples, each with 13 features. We first randomly shuffle all samples, and then take 300 samples as the training set, 100 samples as the validation set, and 106 as the test set.
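A minimal sketch of the shuffle and 300/100/106 split using only numpy. The function name shuffle_and_split and the seed argument are hypothetical; the provided code may split differently.

```python
import numpy as np

def shuffle_and_split(X, t, n_train=300, n_val=100, seed=0):
    """Shuffle samples, then cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])  # same permutation for X and t
    X, t = X[perm], t[perm]
    n_cut = n_train + n_val
    train = (X[:n_train], t[:n_train])
    val = (X[n_train:n_cut], t[n_train:n_cut])
    test = (X[n_cut:], t[n_cut:])       # the remaining 106 of 506 samples
    return train, val, test
```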
We normalize features and output by
We use mean square error as our loss to train our model. The measure of success, however, is the mean of the absolute difference between the predicted price and true price. Here, we call the measure of success the risk or error. This reflects how much money we would lose for a bad prediction. The lower, the better.
In other words, the training loss is
where we compute loss on the normalized output and the prediction .
The measure of success (the lower, the better) is
Here, the risk is defined on the original output (think of it as real money).
Notice that we will use mini-batch gradient descent, and thus, should be the number of samples in a batch.
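The distinction between the training loss (on normalized outputs) and the risk (on original outputs) can be sketched as follows. The function names are hypothetical, and two assumptions are made: the output is normalized by subtracting a mean and dividing by a standard deviation, and the loss carries no extra constant factor. The assignment's own formulas take precedence.

```python
import numpy as np

def mse_loss(y_hat, y):
    """Mean squared error over a batch, on normalized outputs."""
    return np.mean((y_hat - y) ** 2)

def risk(y_hat, t, t_mean, t_std):
    """Mean absolute error on the original (unnormalized) output t.

    y_hat is a normalized prediction; it is mapped back to the original
    scale first (assumes normalization y = (t - t_mean) / t_std).
    """
    t_hat = y_hat * t_std + t_mean
    return np.mean(np.abs(t_hat - t))
```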
We implement the train-validation-test framework, where we train the model by mini-batch gradient descent, and validate model performance after each epoch. After reaching the maximum number of iterations, we pick the epoch that yields the best validation performance (the lowest risk), and test the model on the test set.
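A compact sketch of this framework is given below. The name train_sketch and its default hyperparameters are hypothetical and are not the assignment's defaults. For brevity, the validation risk here is the mean absolute error on whatever scale y_val is given in; denormalization is omitted.

```python
import numpy as np

def train_sketch(X_train, y_train, X_val, y_val,
                 lr=0.01, batch_size=10, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X_train.shape[1])
    best_w, best_epoch, best_risk = w.copy(), -1, np.inf
    losses, risks = [], []
    for epoch in range(max_epochs):
        perm = rng.permutation(X_train.shape[0])  # reshuffle every epoch
        sq_err_sum = 0.0
        for start in range(0, len(perm), batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X_train[idx], y_train[idx]
            err = Xb @ w - yb
            sq_err_sum += np.sum(err ** 2)
            # Gradient of the batch mean squared error with respect to w.
            grad = 2.0 * (Xb.T @ err) / len(idx)
            w = w - lr * grad
        losses.append(sq_err_sum / len(perm))      # training loss this epoch
        val_risk = np.mean(np.abs(X_val @ w - y_val))
        risks.append(val_risk)                     # validation risk this epoch
        if val_risk < best_risk:                   # keep the best epoch so far
            best_w, best_epoch, best_risk = w.copy(), epoch, val_risk
    return best_w, best_epoch, best_risk, losses, risks
```

The weights returned are those of the best validation epoch, so the test risk is computed once, with best_w, after training ends.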
Without changing default hyperparameters, we report three numbers:
1. The number of the epoch that yields the best validation performance,
2. The validation performance (risk) in that epoch, and
3. The test performance (risk) in that epoch,
and two plots:
1. The learning curve of the training loss, and
2. The learning curve of the validation risk,
where the x-axis is the number of epochs, and the y-axis is the training loss and validation risk, respectively.
(b) [10%] We now explore non-linear features in the linear regression model. In particular, we adopt point-wise quadratic features. Suppose the original features are
. We now extend it as .
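One way to read "point-wise quadratic features" is that the square of each original feature is appended to the feature vector, with no cross terms. A minimal sketch under that assumption; the function name is hypothetical.

```python
import numpy as np

def add_quadratic_features(X):
    """Append the element-wise square of every feature: (M, d) -> (M, 2d)."""
    return np.hstack([X, X ** 2])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
X_quad = add_quadratic_features(X)  # columns: x1, x2, x1^2, x2^2
```

With the 13 original features this gives 26 features before the dummy feature is added.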
At the same time, we tune -penalty to prevent overfitting by
The hyperparameter should be tuned from the set {3, 1, 0.3, 0.1, 0.03, 0.01}.
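The tuning step can be sketched as below. The text does not spell out the penalty here, so this assumes a squared L2-norm penalty on all weights, scaled by the hyperparameter (called lam in the code). The function names, and the choice to penalize the bias weight as well, are assumptions.

```python
import numpy as np

CANDIDATES = [3, 1, 0.3, 0.1, 0.03, 0.01]  # the set given in the text

def penalized_gradient(Xb, yb, w, lam):
    """Gradient of mean((Xb w - yb)^2) + lam * ||w||^2 (assumed penalty form)."""
    err = Xb @ w - yb
    return 2.0 * (Xb.T @ err) / len(yb) + 2.0 * lam * w

def select_best(val_risk_by_lam):
    """Pick the candidate with the lowest validation risk."""
    return min(val_risk_by_lam, key=val_risk_by_lam.get)
```

In use, the model would be trained once per value in CANDIDATES, the best validation risk of each run recorded in a dictionary, and select_best applied to that dictionary. The selection relies on validation risk only; the test set is used once, for the chosen value.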
Report the best hyperparameter, i.e., the one that yields the best performance, and under this hyperparameter, the three numbers and two plots required in Problem 2(a).
(c) [10%] Ask a meaningful scientific question on this task by yourself, design an experimental protocol, report experimental results, and draw a conclusion.