Introduction To Machine Learning
1. (30 points) Let X = {x_1, ..., x_N} with x_t ∈ R^D, t = 1, ..., N, be a given training set. Assume that the dataset is centered, i.e., ∑_{t=1}^N x_t = 0. We focus on performing linear dimensionality reduction on the dataset using PCA (principal component analysis). With PCA, for each x_t ∈ R^D, we get z_t = W x_t, where z_t ∈ R^d, d < D, is the low-dimensional projection, and W ∈ R^{d×D} is the PCA projection matrix. Let Σ = (1/N) ∑_{t=1}^N x_t x_t^T be the sample covariance matrix. Further, let v_t = W^T z_t so that v_t ∈ R^D.
(a) (10 points) Professor HighLowHigh claims: v_t = x_t for all t = 1, ..., N. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.
(b) (20 points) Professor HighLowHigh also claims:
∑_{t=1}^N ‖x_t‖_2^2 − ∑_{t=1}^N ‖v_t‖_2^2 = ∑_{t=1}^N ‖x_t − v_t‖_2^2 ,
where for a vector a, ‖a‖_2^2 = ∑_i a_i^2. Is the claim correct? Clearly explain and prove your answer with necessary (mathematical) details.
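As a quick numerical sanity check (not a substitute for the proofs), one can test both claims on random data. The sketch below assumes the rows of W are the top-d eigenvectors of Σ, which is the standard PCA choice; all variable names are illustrative:

```python
import numpy as np

# Sanity check for Problem 1: build a small centered dataset, take W's rows
# to be the top-d eigenvectors of the sample covariance, and test the claims.
rng = np.random.default_rng(0)
N, D, d = 100, 5, 2
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)              # center: sum_t x_t = 0

Sigma = (X.T @ X) / N               # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)
W = eigvecs[:, -d:].T               # top-d eigenvectors as rows, W in R^{d x D}

Z = X @ W.T                         # rows are z_t = W x_t
V = Z @ W                           # rows are v_t = W^T z_t

print(np.allclose(V, X))            # claim (a): is v_t = x_t for all t?
lhs = (X**2).sum() - (V**2).sum()
rhs = ((X - V)**2).sum()
print(np.isclose(lhs, rhs))         # claim (b): does the norm identity hold?
```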
2. (30 points) Let Z = {(x_1, r_1), ..., (x_N, r_N)}, x_t ∈ R^d, r_t ∈ R^k, be a set of N training samples. We consider training a multilayer perceptron as shown in Figure 1. We consider a general setting where the transfer functions at each stage are denoted by g, i.e.,
z_h^t = g(a_h^t)  and  y_i^t = g(a_i^t) ,
where a_h^t = ∑_j w_{h,j} x_j^t and a_i^t = ∑_h v_{i,h} z_h^t respectively denote the input activations for hidden node h and output node i. Further, let L(·,·) be the loss function, so that the learning focuses on minimizing:
E(W, V | Z) = ∑_{t=1}^N ∑_{i=1}^k L(r_i^t, y_i^t) .
(a) (10 points) Show that the stochastic gradient descent update for v_{i,h} is of the form v_{i,h}^new = v_{i,h}^old + ∆v_{i,h}, with the update
∆v_{i,h} = η ∆_i^t z_h^t , where ∆_i^t = −(∂L(r_i^t, y_i^t)/∂y_i^t) g′(a_i^t) . (1)
(b) (20 points) Show that the stochastic gradient descent update for w_{h,j} is of the form w_{h,j}^new = w_{h,j}^old + ∆w_{h,j}, with the update
∆w_{h,j} = η ∆_h^t x_j^t , where ∆_h^t = (∑_{i=1}^k ∆_i^t v_{i,h}) g′(a_h^t) . (2)
Figure 1: Two-layer perceptron.
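To see the shape of updates (1) and (2) in code (a sketch only, assuming a sigmoid transfer g and the squared-error loss from the extra credit problem; all names are illustrative, not a prescribed implementation):

```python
import numpy as np

# Illustrative single-sample SGD step for the two-layer perceptron, assuming
# g is the sigmoid and L(r, y) = (r - y)^2.
def g(a):
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):
    s = g(a)
    return s * (1.0 - s)

def sgd_step(W, V, x, r, eta=0.1):
    # Forward pass: a_h = sum_j W[h, j] x_j, z_h = g(a_h); a_i = sum_h V[i, h] z_h
    a_h = W @ x
    z = g(a_h)
    a_i = V @ z
    y = g(a_i)
    # Backward pass, matching updates (1) and (2):
    delta_i = 2.0 * (r - y) * g_prime(a_i)      # Delta_i^t for L = (r - y)^2
    delta_h = (V.T @ delta_i) * g_prime(a_h)    # Delta_h^t
    V_new = V + eta * np.outer(delta_i, z)      # Delta v_{i,h} = eta Delta_i^t z_h^t
    W_new = W + eta * np.outer(delta_h, x)      # Delta w_{h,j} = eta Delta_h^t x_j^t
    return W_new, V_new
```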
Programming assignment:
The next problem involves programming. For Question 3, we will be using the 2-class classification datasets Boston50 and Boston75. In particular, we will develop code for 2-class Support Vector Machines (SVMs) using gradient descent. The goal will be to modify your code for MyLogisticReg2 from HW3.
3. (40 points) We will develop code for 2-class SVMs with parameters (w, w_0) where w ∈ R^d, w_0 ∈ R. Assume a given dataset {(x_t, y_t), t = 1, ..., N}, where x_t ∈ R^d and y_t ∈ {−1, 1}. Recall from our discussion in class that training SVMs involves minimizing the following objective function:
∑_{t=1}^N max(0, 1 − y_t(w^T x_t + w_0)) + (λ/2) ‖w‖_2^2 . (3)
We will use λ = 5 in this assignment.
For reference, compare the objective function to that of regularized logistic regression, which you recently worked on as part of HW3:
∑_{t=1}^N log(1 + exp(−y_t(w^T x_t + w_0))) + (λ/2) ‖w‖_2^2 , (4)
where we had used λ = 0 for the HW3 code.
We will develop code for MySVM2 with corresponding MySVM2.fit(X,y) and MySVM2.predict(X) functions. Parameters for the model can be initialized following what you had done for MyLogisticReg2. In the fit function, the parameters will be estimated using mini-batch stochastic gradient descent with different mini-batch sizes m ≤ n. In particular, you will modify your MyLogisticReg2 code by using gradients for the SVM objective in (3) instead of the logistic regression objective in (4). Further, you will have to add the mini-batch stochastic gradient descent (SGD) functionality which, for a pre-specified mini-batch size m, picks m unique points at random to do the gradient descent in each iteration. We will run experiments with different values of m.
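A minimal sketch of one such mini-batch step, under the assumption that (3) is the hinge-loss-plus-regularizer objective reconstructed above; the function name and rescaling choice are ours, not a prescribed implementation:

```python
import numpy as np

# One mini-batch SGD step for the SVM objective in (3); a sketch, assuming
# (3) is sum_t max(0, 1 - y_t (w^T x_t + w0)) + (lam/2) ||w||^2.
def svm_minibatch_step(w, w0, X, y, m, eta=0.01, lam=5.0, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(y), size=m, replace=False)  # m unique points at random
    Xb, yb = X[idx], y[idx]
    active = yb * (Xb @ w + w0) < 1                  # points with nonzero hinge loss
    scale = len(y) / m                               # rescale mini-batch sum to full sum
    grad_w = -scale * (yb[active, None] * Xb[active]).sum(axis=0) + lam * w
    grad_w0 = -scale * yb[active].sum()
    return w - eta * grad_w, w0 - eta * grad_w0
```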
We will compare the performance of MySVM2 for different values of mini-batch size m with LogisticRegression on two datasets: Boston50 and Boston75. Recall that Boston has 506 data points, and 5-fold cross-validation leaves n ≈ 400 points for training in each fold. For mini-batch SGD, we will consider three different values of m:
(i) m = 40, which is ≈ 10% of the training data in each fold under 5-fold cross-validation;
(ii) m = 200, which is ≈ 50% of the training data in each fold; and
(iii) m = n, which is the full training data in each fold.
Note that m = n uses the full dataset (available for that fold) in each iteration and hence corresponds to the usual gradient descent.
Using my_cross_val with 5-fold cross-validation, report the error rates in each fold as well as the mean and standard deviation of error rates across all folds for the four methods: MySVM2 with m = 40, m = 200, and m = n, and LogisticRegression, applied to the two 2-class classification datasets: Boston50 and Boston75.
You will have to submit (a) code and (b) summary of results:
(a) Code: You will have to submit code for MySVM2() as well as a wrapper code q3().
For developing MySVM2(), you are encouraged to consult the code for MyLogisticReg2() from HW3. You need to make sure you have __init__, fit, and predict implemented in MySVM2. __init__(d, m) will initialize the parameters and will take the data dimensionality d and mini-batch size m as input. You can add additional inputs such as the step size or the convergence threshold. fit(X, y) will take the data features X and labels y and will use mini-batch SGD to estimate the parameters w, w_0. predict(X) will take a feature matrix corresponding to the test set and return the predicted labels. Your class MySVM2() will NOT inherit any base class in sklearn.
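A skeletal outline consistent with these requirements (a sketch only: the step size, iteration count, and initialization below are placeholder choices, not requirements):

```python
import numpy as np

class MySVM2:
    """Sketch of the required interface; hyperparameter defaults are placeholders."""

    def __init__(self, d, m, eta=0.01, max_iter=1000, lam=5.0):
        self.m = m                  # mini-batch size
        self.eta = eta              # step size (an assumption; tune it yourself)
        self.max_iter = max_iter    # or replace with a convergence threshold
        self.lam = lam              # lambda = 5 per the assignment
        self.w = np.zeros(d)        # one possible initialization
        self.w0 = 0.0

    def fit(self, X, y):
        rng = np.random.default_rng()
        for _ in range(self.max_iter):
            # m unique points at random (capped by the fold's training size)
            idx = rng.choice(len(y), size=min(self.m, len(y)), replace=False)
            Xb, yb = X[idx], y[idx]
            active = yb * (Xb @ self.w + self.w0) < 1
            grad_w = -(yb[active, None] * Xb[active]).sum(axis=0) + self.lam * self.w
            grad_w0 = -yb[active].sum()
            self.w -= self.eta * grad_w
            self.w0 -= self.eta * grad_w0
        return self

    def predict(self, X):
        return np.where(X @ self.w + self.w0 >= 0, 1, -1)
```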
The wrapper code (main file) has no input; it is used to prepare the datasets and make calls to my_cross_val(method, X, y, k) to generate the error rate results for each dataset and each method. The code for my_cross_val(method, X, y, k) must be yours (e.g., code you wrote in HW1, with modifications as needed), and you cannot use cross_val_score() from sklearn. The results should be printed to the terminal (not written to an additional file in the folder). Make sure the calls to my_cross_val(method, X, y, k) are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used:
i. MySVM2 with m = 40 for Boston50;
ii. MySVM2 with m = 200 for Boston50;
iii. MySVM2 with m = n for Boston50;
iv. LogisticRegression for Boston50;
v. MySVM2 with m = 40 for Boston75;
vi. MySVM2 with m = 200 for Boston75;
vii. MySVM2 with m = n for Boston75;
viii. LogisticRegression for Boston75.
*For the wrapper code, you need to make a q3.py file, and one should be able to run your code by calling “python q3.py” in a command-line window.
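One possible shape for q3.py (a sketch: my_cross_val, MySVM2, and the Boston50/Boston75 preparation helpers are assumed to come from your own code for this and earlier homeworks, and the module names below are made up):

```python
from sklearn.linear_model import LogisticRegression

from mysvm2 import MySVM2                                # your MySVM2 implementation
from my_cross_val import my_cross_val                    # your cross-validation code
from prepare_data import load_boston50, load_boston75    # hypothetical helpers

def q3():
    for name, (X, y) in [("Boston50", load_boston50()),
                         ("Boston75", load_boston75())]:
        d, n = X.shape[1], X.shape[0]
        methods = [("MySVM2 with m = 40", MySVM2(d, 40)),
                   ("MySVM2 with m = 200", MySVM2(d, 200)),
                   ("MySVM2 with m = n", MySVM2(d, n)),
                   ("LogisticRegression", LogisticRegression())]
        for label, method in methods:
            print(f"{label} for {name}:")
            my_cross_val(method, X, y, 5)   # prints per-fold error rates

if __name__ == "__main__":
    q3()
```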
(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 5 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (4 tables in total). Each column of the table represents a fold; add two columns at the end to show the overall mean error rate and standard deviation over the k folds. For example:
Error rates for MySVM2 with m = 40 for Boston50

Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Mean | SD
   #   |    #   |    #   |    #   |    #   |   #  |  #
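For the last two columns, something along these lines suffices (a sketch with made-up numbers; note that np.std computes the population standard deviation by default, so pass ddof=1 if you prefer the sample version):

```python
import numpy as np

# Per-fold error rates as collected from my_cross_val (illustrative values)
rates = np.array([0.18, 0.22, 0.20, 0.17, 0.23])
row = " | ".join(f"{r:.4f}" for r in rates)
print(f"{row} | {rates.mean():.4f} | {rates.std():.4f}")
```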
Additional instructions:
- Code can only be written in Python; no other programming languages will be accepted.
- One should be able to execute your code for Q3 directly from the command prompt (e.g., “python q3.py”) without needing to run a Python interactive shell first.
- Test your code yourself before submission and suppress any warning messages that may be printed.
- Your code must run on a CSE lab machine (e.g., csel-kh1260-01.cselabs.umn.edu).
- Specify the full Python version you are using, as well as instructions on how to run your program, in the README file (which must be readable through a text editor such as Notepad).
- Information on the size of the datasets, including the number of data points, the dimensionality of features, and the number of classes, can be readily extracted from the datasets in scikit-learn.
- Each function must take its inputs in the order specified in the problem and display its output via the terminal or as specified.
- For Q3, you can submit additional files/functions (as needed) which will be used by the main file.
- Please put comments in your code so that one can follow the key parts and steps.
Extra Credit Problem:
EC1 (15 points) Consider Problem 2 with specific choices of the activation function g(a). We will assume L(r_i^t, y_i^t) = (r_i^t − y_i^t)^2.
(a) (5 points) Let g(a) = max(0, a). What is the gradient g′(a)? What does the update in (1) look like with these specific choices of g(·) and L(·,·)? Clearly explain your answer and show details.
(b) (5 points) For some α ∈ [0,1], let
g(a) = max(0, a) + α min(0, a) . (5)
What is the gradient g′(a) in terms of α? Clearly explain your answer and show details.
(c) (5 points) For the activation function in (5), is there a specific choice of α ∈ [0,1] which makes the two-layer perceptron effectively a linear model, i.e., the predictions y_i^t are a linear function of the inputs x_j^t? Clearly explain your answer and show details.
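For intuition, the activation in (5) is commonly known as the leaky ReLU. A small sketch of g and its gradient (evaluating the special cases α = 0 and α = 1 is a good starting point for parts (a) and (c)):

```python
import numpy as np

# The activation in (5): g(a) = max(0, a) + alpha * min(0, a).
def g(a, alpha):
    return np.maximum(0.0, a) + alpha * np.minimum(0.0, a)

def g_prime(a, alpha):
    # 1 for a > 0, alpha for a < 0; any value in [alpha, 1] is a valid
    # subgradient at a = 0 (we pick alpha there for concreteness)
    return np.where(a > 0, 1.0, alpha)

a = np.linspace(-2, 2, 5)
print(g(a, 0.0))    # alpha = 0 recovers max(0, a), the ReLU from part (a)
print(g(a, 1.0))    # alpha = 1: compare the output with a itself
```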