Introduction To Machine Learning
1. (25 points) Consider doing least squares regression based on a training set Z_train = {(x_t, r_t), t = 1, ..., N}, where x_t ∈ R and r_t ∈ R.
(i) (10 points) Consider fitting a linear model of the form
g1(x) = w1x + w0 ,
with unknown parameters w1, w0 ∈ R, which are selected so as to minimize the following empirical loss:

    Σ_{t=1}^{N} (r_t − w1 x_t − w0)^2 .
Derive the optimal values of (w1, w0), clearly showing all steps of the derivation.
(ii) (10 points) Consider fitting a polynomial model of the form
g2(x) = v2 x^2020 + v1 x + v0 ,
with unknown parameters v2, v1, v0 ∈ R, which are selected so as to minimize the following empirical loss:

    Σ_{t=1}^{N} (r_t − v2 x_t^2020 − v1 x_t − v0)^2 .
Derive the optimal values of v2, v1, v0, clearly showing all steps of the derivation.
(iii) (5 points) For a given training set Z_train, let (w1*, w0*) be the optimal values of (w1, w0) in (i) above, and let (v2*, v1*, v0*) be the optimal values of (v2, v1, v0) in (ii) above. Professor Gopher claims that the following is true for any given Z_train:

    Σ_{t=1}^{N} (r_t − v2* x_t^2020 − v1* x_t − v0*)^2 ≤ Σ_{t=1}^{N} (r_t − w1* x_t − w0*)^2 .

Is Professor Gopher's claim correct? Clearly explain your answer.
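As a numerical sanity check for the hand derivation in part (i) (this is only a check on synthetic, made-up data, not the requested derivation), the closed-form least-squares solution can be verified with NumPy:

```python
import numpy as np

# Sanity check for part (i), not the requested hand derivation.
# Synthetic data: r = 3*x + 1.5 plus small noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
r = 3.0 * x + 1.5 + 0.1 * rng.standard_normal(50)

# Textbook least-squares formulas obtained by setting the gradient
# of sum_t (r_t - w1*x_t - w0)^2 to zero:
w1 = np.sum((x - x.mean()) * (r - r.mean())) / np.sum((x - x.mean()) ** 2)
w0 = r.mean() - w1 * x.mean()

# Cross-check against the normal-equation solver.
W, *_ = np.linalg.lstsq(np.column_stack([x, np.ones_like(x)]), r, rcond=None)
assert np.allclose([w1, w0], W)
```

With noiseless data the recovered (w1, w0) would match the generating coefficients exactly; here they should be close to (3.0, 1.5).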
2. (15 points) Consider the following 4 × 4 matrix:
        [ 1   1   1    1 ]
    A = [ 1   2   4    8 ] .
        [ 1   3   9   27 ]
        [ 1   4  16   64 ]
(i) (5 points) What are the values of tr(A), tr(A^T), tr(A^T A), and tr(A A^T)?
(ii) (5 points) From a geometric perspective, explain how the absolute value of the determinant of A, |det(A)|, can be computed.
(iii) (5 points) Are the rows of A linearly independent? Clearly explain your answer.
(For this problem, you can use Python libraries to arrive at your answer. If you do, clearly explain what you did and why. There is also a way to arrive at the answer without using Python libraries.)
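If you do use Python for this problem, a NumPy check might look like the following (the trace and determinant facts it relies on are standard identities):

```python
import numpy as np

A = np.array([[1, 1, 1, 1],
              [1, 2, 4, 8],
              [1, 3, 9, 27],
              [1, 4, 16, 64]], dtype=float)

print(np.trace(A), np.trace(A.T))            # equal, since tr(A) = tr(A^T)
print(np.trace(A.T @ A), np.trace(A @ A.T))  # equal: the sum of all squared entries of A
print(abs(np.linalg.det(A)))                 # volume of the parallelepiped spanned by the rows
print(np.linalg.matrix_rank(A))              # full rank would mean the rows are linearly independent
```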
Programming assignments: The next two problems involve programming. We will be considering three datasets (derived from two available datasets) for these assignments:
(a) Boston: The Boston housing dataset comes pre-packaged with scikit-learn. The dataset has 506 points, 13 features, and 1 target (response) variable. You can find more information about the dataset here:
https://github.com/rupakc/UCI-Data-Analysis/tree/master/Boston Housing Dataset/Boston Housing
While the original dataset is for a regression problem, we will create two classification datasets for the homework. Note that you only need to work with the response r to create these classification datasets.
i. Boston50: Let τ50 be the median (50th percentile) over all r (response) values. Create a 2-class classification problem such that y = 1 if r ≥ τ50 and y = 0 if r < τ50. By construction, note that the class priors will be approximately 50% (y = 1) and 50% (y = 0).
ii. Boston75: Let τ75 be the 75th percentile over all r (response) values. Create a 2-class classification problem such that y = 1 if r ≥ τ75 and y = 0 if r < τ75. By construction, note that the class priors will be approximately 25% (y = 1) and 75% (y = 0).
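One way to construct the Boston50/Boston75 labels is to threshold the response at the appropriate percentile. The sketch below uses a placeholder response vector, since recent scikit-learn releases no longer ship the Boston dataset; with the real data, r would be the housing target values:

```python
import numpy as np

def make_percentile_labels(r, pct):
    """Return y = 1 where r >= the pct-th percentile of r, else y = 0."""
    tau = np.percentile(r, pct)
    return (r >= tau).astype(int)

# Placeholder responses standing in for the housing targets.
r = np.arange(100, dtype=float)
y50 = make_percentile_labels(r, 50)   # Boston50-style labels, priors ~ 50/50
y75 = make_percentile_labels(r, 75)   # Boston75-style labels, priors ~ 25/75
```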
(b) Digits: The Digits dataset comes prepackaged with scikit-learn. The dataset has 1797 points, 64 features, and 10 classes corresponding to ten numbers 0,1,...,9. The dataset was (likely) created from the following dataset:
http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
The 2-class classification datasets Boston50 and Boston75, and the 10-class classification dataset Digits, will be used in the following two problems.
3. (30 points) We will consider three methods from scikit-learn: LinearSVC, SVC, and LogisticRegression. Use the following parameters for the different methods mentioned:
LinearSVC: max_iter=2000
SVC: gamma='scale', C=10
LogisticRegression: penalty='l2', solver='lbfgs', multi_class='multinomial', max_iter=5000
(i) (15 points) Develop code for my_cross_val(method, X, y, k), which performs k-fold cross-validation on (X, y) using method, and returns the error rate in each fold. Using my_cross_val, report the error rates in each fold as well as the mean and standard deviation of error rates across folds for the three methods: LinearSVC, SVC, and LogisticRegression, applied to the three classification datasets: Boston50, Boston75, and Digits.
You will have to submit (a) code and (b) a summary of results for my_cross_val:
(a) Code: You will have to submit code for my_cross_val(method, X, y, k) (main file) as well as a wrapper code q3i().
The main file has input: (1) method, which specifies the (class) name of one of the three classification methods under consideration, (2) X, y, which is the data for the 2-class or 10-class classification problem, (3) k, the number of folds for cross-validation; and output: (1) the test set error rates for each of the k folds.
The wrapper code has no input and is used to prepare the datasets and make calls to my_cross_val(method, X, y, k) to generate the results for each dataset and each method. Make sure the calls to my_cross_val(method, X, y, k) are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used:
1. LinearSVC with Boston50; 2. LinearSVC with Boston75; 3. LinearSVC with Digits,
4. SVC with Boston50; 5. SVC with Boston75; 6. SVC with Digits,
7. LogisticRegression with Boston50; 8. LogisticRegression with Boston75;
9. LogisticRegression with Digits.
For example, the first call to my_cross_val(method, X, y, k) with k = 10 should result in the following output:
Error rates for LinearSVC with Boston50:
Fold 1: ###
Fold 2: ###
...
Fold 10: ###
Mean: ###
Standard Deviation: ###
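A minimal sketch of what my_cross_val could look like is below. It assumes method is a scikit-learn-style estimator class (instantiated fresh in each fold); the shuffling and fold-splitting strategy shown is one reasonable choice, not a required one:

```python
import numpy as np

def my_cross_val(method, X, y, k):
    """Return an array of test set error rates, one per fold."""
    indices = np.random.permutation(X.shape[0])   # shuffle before splitting
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        clf = method()                            # fresh model per fold
        clf.fit(X[train], y[train])
        errors.append(np.mean(clf.predict(X[test]) != y[test]))
    return np.array(errors)
```

The caller can then print each fold's error rate along with the mean and standard deviation of the returned array.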
(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 10 folds, the mean error rate over the k folds, and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (9 tables in total). Include a column in the table for each fold, and add two columns at the end to show the overall mean error rate and standard deviation over the k folds. For example:
Error rates for LinearSVC with Boston50
F1   F2   F3   F4   F5   F6   F7   F8   F9   F10  Mean  SD
#    #    #    #    #    #    #    #    #    #    #     #
(ii) (15 points) Develop code for my_train_test(method, X, y, π, k), which performs random splits on the data (X, y) so that a fraction π ∈ [0, 1] of the data is used for training with method and the rest is used for testing; the process is repeated k times, after which the code returns the error rate for each such train-test split. Using my_train_test with π = 0.75 and k = 10, report the mean and standard deviation of the error rates for the three methods: LinearSVC, SVC, and LogisticRegression, applied to the three classification datasets: Boston50, Boston75, and Digits.
You will have to submit (a) code and (b) a summary of results for my_train_test:
(a) Code: You will have to submit code for my_train_test(method, X, y, π, k) (main file) as well as a wrapper code q3ii().
The main file has input: (1) method, which specifies the (class) name of one of the three classification methods under consideration, (2) X, y, which is the data for the 2-class or 10-class classification problem, (3) π, the fraction of data chosen randomly to be used for training, (4) k, the number of times the train-test split will be repeated; and output: (1) the test set error rates for each of the k train-test splits, printed to the terminal.
The wrapper code has no input and is used to prepare the datasets and make calls to my_train_test(method, X, y, π, k) to generate the results for each dataset and each method (9 combinations in total). Make sure the calls to my_train_test(method, X, y, π, k) are made in the following order, and add a print to the terminal before each call to show which method and dataset is being used:
1. LinearSVC with Boston50; 2. LinearSVC with Boston75; 3. LinearSVC with Digits,
4. SVC with Boston50; 5. SVC with Boston75; 6. SVC with Digits,
7. LogisticRegression with Boston50; 8. LogisticRegression with Boston75;
9. LogisticRegression with Digits.
(b) Summary of results: For each dataset and each method, report the test set error rates for each of the k = 10 runs with π = 0.75, the mean error rate over the k runs, and the standard deviation of the error rates over the k runs. Make a table to present the results for each method and each dataset (9 tables in total). Include a column in the table for each run, and add two columns at the end to show the overall mean error rate and standard deviation over the k runs.
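A corresponding sketch for my_train_test, under the same assumption as before (method is a scikit-learn-style estimator class, instantiated fresh for each split):

```python
import numpy as np

def my_train_test(method, X, y, pi, k):
    """Return an array of test set error rates over k random train-test splits."""
    n = X.shape[0]
    n_train = int(round(pi * n))          # pi fraction of points for training
    errors = []
    for _ in range(k):
        idx = np.random.permutation(n)    # fresh random split each repetition
        train, test = idx[:n_train], idx[n_train:]
        clf = method()
        clf.fit(X[train], y[train])
        errors.append(np.mean(clf.predict(X[test]) != y[test]))
    return np.array(errors)
```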
4. (30 points) This problem considers a preliminary exercise in 'feature engineering' with a focus on the Digits dataset. Represented as (X, y), the Digits dataset has X ∈ R^{1797×64}, i.e., 1797 training points, each having 64 features, and y ∈ {0, 1, ..., 9}^{1797}, i.e., 1797 training labels with each y_i ∈ {0, 1, ..., 9}. We will consider three methods from scikit-learn: LinearSVC, SVC, and LogisticRegression for this problem. Use the following parameters for the different methods mentioned:
LinearSVC: max_iter=2000
SVC: gamma='scale', C=10
LogisticRegression: penalty='l2', solver='lbfgs', multi_class='multinomial', max_iter=5000
(i) (15 points) For the Digits dataset, starting with X ∈ R^{1797×64}, you will create a new feature representation X̃1 ∈ R^{1797×32} as follows: Construct a (random) matrix G ∈ R^{64×32} where each element g_ij ∼ N(0, 1), i.e., is sampled independently from a univariate normal distribution, and then compute X̃1 = XG. Using (X̃1, y), perform 10-fold cross-validation using the three methods: LinearSVC, SVC, and LogisticRegression, and report the mean and the standard deviation of the 10-fold test set error rates. The creation of X̃1 will be done based on a function rand_proj(X, d), where d = 32 for this problem, and the function will return X̃1.
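The random projection itself is essentially a one-liner in NumPy; a plausible rand_proj is:

```python
import numpy as np

def rand_proj(X, d):
    """Project X onto d random Gaussian directions: X_tilde = X @ G."""
    G = np.random.default_rng().standard_normal((X.shape[1], d))  # g_ij ~ N(0, 1)
    return X @ G
```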
(ii) (15 points) For the Digits dataset, starting with X ∈ R^{1797×64}, you will create a new feature representation X̃2 ∈ R^{1797×2144} as follows: For any training point x_i ∈ R^64, let its elements be x_ij, j = 1, ..., 64. The new feature vector x̃_i ∈ R^{2144} will include all the original features x_ij, j = 1, ..., 64, the squares of the original features x_ij^2, j = 1, ..., 64, and the products of all pairs of original features x_ij x_ij', j < j', j = 1, ..., 64, j' = j + 1, ..., 64. You should verify that the new x̃_i ∈ R^{2144} and hence X̃2 ∈ R^{1797×2144}. Using (X̃2, y), perform 10-fold cross-validation using the three methods: LinearSVC, SVC, and LogisticRegression, and report the mean and the standard deviation of the 10-fold test set error rates. The creation of X̃2 will be done based on a function quad_proj(X), and the function will return X̃2.
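A possible quad_proj stacks the original features, their squares, and the strictly upper-triangular pairwise products, giving 64 + 64 + 64·63/2 = 2144 columns:

```python
import numpy as np

def quad_proj(X):
    """Return [X, X^2, all pairwise products x_ij * x_ij' with j < j']."""
    n, p = X.shape
    # For each column j, multiply it against all later columns j' > j.
    cross = [X[:, j:j + 1] * X[:, j + 1:] for j in range(p - 1)]
    return np.hstack([X, X ** 2] + cross)
```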
You will have to submit (a) code and (b) summary of results for all three parts:
(a) Code: You will have to submit code for rand_proj(X, d) and quad_proj(X), as well as a wrapper code q4().
rand_proj(X, d) has input: (1) X, which is the data (features) for the classification problem, (2) d, the dimensionality of the projected features; and output: (1) X̃ ∈ R^{1797×d}, the new data for the problem. This output array does not need to be printed to the terminal. quad_proj(X) has input: X, which is the data (features) for the classification problem; and output: (1) X̃2, the new data with all linear and quadratic combinations of features as described above. This output array does not need to be printed to the terminal.
The wrapper code has no input and uses the above functions to execute all the classification exercises outlined in (i) and (ii) above, printing the test set error rates for each of the k folds to the terminal. Make sure the exercises are executed in the following order, and add a print to the terminal before each execution to show which method and dataset is being used:
1. LinearSVC with X̃1; 2. LinearSVC with X̃2,
3. SVC with X̃1; 4. SVC with X̃2,
5. LogisticRegression with X̃1; 6. LogisticRegression with X̃2.
(b) Summary of results: For each dataset, i.e., X̃1 and X̃2, and each method, report the mean error rate over the k folds and the standard deviation of the error rates over the k folds. Make a table to present the results for each method and each dataset (6 tables in total). Include a column in the table for each fold, and add two columns at the end to show the overall mean error rate and standard deviation over the k folds.