Instructions:
• Only electronic submissions will be accepted. Your main PDF writeup must be typeset in LaTeX (please also refer to the “Additional Instructions” below).
• The PDF writeup containing your solution has to be submitted via Gradescope https://www.gradescope.com/.
• We have created your Gradescope account (you should have received the notification). Please use your IITK CC ID (not any other email ID) to login. Use the “Forgot Password” option to set your password.
Additional Instructions
• We have provided a LaTeX template file hw1sol.tex to help typeset your PDF writeup. There is also a style file ml.sty that contains shortcuts for many useful LaTeX commands, such as boldfaced/calligraphic fonts for letters, various mathematical/Greek symbols, and more. Use of these shortcuts is recommended (but not necessary).
• Your answer to every question should begin on a new page. The provided template is designed to do this automatically. However, if it fails to do so, use the \clearpage command in LaTeX before starting the answer to a new question to enforce this.
• While submitting your assignment on the Gradescope website, you will have to specify on which page(s) question 1 is answered, on which page(s) question 2 is answered, etc. To do this properly, first ensure that the answer to each question starts on a different page.
• Be careful to flush all your floats (figures, tables) corresponding to question n before starting the answer to question n + 1; otherwise, while grading, we might miss important parts of your answers.
• Your solutions must appear in proper order in the PDF file, i.e., the solution to question n must be complete in the PDF file (including all plots, tables, proofs, etc.) before you present the solution to question n + 1.
(Another Simple Classifier) Consider a classification model where we are given training data from K classes. Each input xn ∈ RD, and each class c is defined by two parameters: wc ∈ RD and a D × D positive definite (PD) matrix Mc, for c = 1,2,...,K. Assume Nc denotes the number of training examples from class c. Suppose we estimate wc and Mc by solving the following optimization problem
$$(\hat{w}_c, \hat{M}_c) = \arg\min_{w_c,\, M_c}\; \frac{1}{N_c} \sum_{n:\, y_n = c} (x_n - w_c)^\top M_c\, (x_n - w_c) \;-\; \log|M_c|$$
(note that, in the above objective, the log|Mc| term ensures positive definiteness of Mc because the determinant of a PD matrix is always positive)
For the given objective/loss function, find the optimal values of wc and Mc. Also, what will this model reduce to as a special case when Mc is an identity matrix?
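As an optional numerical sanity check for your derivation, here is a minimal sketch of how the per-class objective above could be evaluated, e.g., to compare your closed-form answer against a numerical optimizer. The variable names X_c, w_c, M_c are assumptions for illustration, not part of the assignment:

```python
import numpy as np

def class_objective(X_c, w_c, M_c):
    """(1/N_c) * sum_n (x_n - w_c)^T M_c (x_n - w_c) - log|M_c|,
    where X_c is the N_c x D matrix of training inputs from class c."""
    diffs = X_c - w_c                                   # (N_c, D) residuals x_n - w_c
    quad = np.einsum('nd,de,ne->n', diffs, M_c, diffs)  # per-example quadratic forms
    sign, logdet = np.linalg.slogdet(M_c)               # numerically stable log|M_c|
    return quad.mean() - logdet
```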
(Consistent or Not?) An important notion for a classifier is that of consistency. A classification algorithm is said to be consistent if, whenever it has access to infinite amounts of training data, its error rate approaches the optimal error rate (a.k.a. Bayes optimal). Consider the noise-free setting (i.e., every training input is labeled correctly). Here, the Bayes optimal error rate is zero. Is the one-nearest-neighbor algorithm consistent in this setting? Briefly justify your answer in 100 words or less.
(Linear Regression viewed as Nearest Neighbors) Show that, for the unregularized linear regression model, where the solution is $\hat{w} = (X^\top X)^{-1} X^\top y$, the prediction at a test input x∗ can be written as a weighted sum of all the training responses, i.e.,
$$f(x_*) = \sum_{n=1}^{N} w_n\, y_n$$
Give the expression for the weights wn’s in this case and briefly discuss (<50 words) in what way these weights are different from the weights in a weighted version of K nearest neighbors where each wn typically is the inverse distance of x∗ from the training input xn. Note: You do not need to give a very detailed expression for wn (if it makes algebra messy) but you must give a precise meaning as to what wn depends on and how it is different from the weights in the weighted K nearest neighbors.
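For reference, a minimal sketch of the inverse-distance weights used by the weighted K nearest neighbors model mentioned above (this is only the baseline you are asked to contrast against, not the expression you are asked to derive). The names X, x_star, and K are assumptions for illustration:

```python
import numpy as np

def weighted_knn_weights(X, x_star, K=5, eps=1e-12):
    """Inverse-distance weights over the N training inputs: the K nearest
    neighbors of x_star get weight 1/distance, all other inputs get weight 0."""
    dists = np.linalg.norm(X - x_star, axis=1)   # distance of x_star from each x_n
    nearest = np.argsort(dists)[:K]              # indices of the K nearest neighbors
    w = np.zeros(X.shape[0])
    w[nearest] = 1.0 / (dists[nearest] + eps)    # inverse-distance weights
    return w / w.sum()                           # normalized to sum to 1
```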
(Feature Masking as Regularization) Consider the linear regression model learned by minimizing the squared loss function $L(w) = \sum_{n=1}^{N} (y_n - w^\top x_n)^2$. Suppose we decide to mask out or “drop” each feature xnd of each input xn ∈ RD, independently, with probability 1 − p (equivalently, retaining the feature with probability p). Masking or dropping out basically means that we will set the feature xnd to 0 with probability 1 − p. Essentially, it would be equivalent to replacing each input xn by x˜n = xn ◦ mn, where ◦ denotes the elementwise product and mn denotes the D × 1 binary mask vector with mnd ∼ Bernoulli(p) (mnd = 1 means the feature xnd was retained; mnd = 0 means the feature xnd was masked/zeroed).
Let us now define a new loss function using these masked inputs as follows: $\tilde{L}(w) = \sum_{n=1}^{N} (y_n - w^\top \tilde{x}_n)^2$. Show that minimizing the expected value of this new loss function (where the expectation is taken since the mask vectors mn are random) is equivalent to minimizing a regularized loss function. Clearly write down the expression of this regularized loss function.
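For concreteness, a small sketch of how the masked inputs x˜n = xn ◦ mn could be generated (this only illustrates the masking setup, not the derivation you are asked to do; the names N, D, p and the random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, p = 5, 4, 0.8                      # p = probability of retaining a feature
X = rng.normal(size=(N, D))              # training inputs, one row per x_n
M = rng.binomial(1, p, size=(N, D))      # mask vectors m_n, each m_nd ~ Bernoulli(p)
X_masked = X * M                         # masked inputs: x_n elementwise-multiplied by m_n
```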
The data for this problem can be downloaded from tinyurl.com/cs771-a23-hw1dat (caution: the data is a couple of hundred MBs; there is also a README file that contains the description of the data). In this dataset, each input represents an image of an animal and the output is its class (which animal it is). The dataset has a total of 50 classes and the number of features for each input is 4096 (note: these features were extracted by a deep learning model and you don’t need to perform any other preprocessing of features for this problem).
However, we are going to give a small twist to the basic prototype based classification problem. The training set provided to you has examples from only 40 of the classes. We will refer to these 40 classes as “seen classes” (we have training data for these classes) and the remaining 10 classes as “unseen classes” (we do not have training data for these classes). The test inputs will be only from these 10 unseen classes.
Recall that prototype based classification requires computing the mean of each class. While computing the means of the 40 seen classes is easy (since we have training data from these classes), what we actually need are the means of the remaining 10 classes (since these are our test classes). How do we get these means?
• Method 1: Model the mean µc ∈ R4096 of each unseen class c (where c = 41,...,50) as a convex combination of the means µ1,...,µ40 of the 40 seen classes (again note that µ1,...,µ40 can be computed easily since we do have training data from these 40 classes)
$$\mu_c = \sum_{k=1}^{40} s_{c,k}\, \mu_k, \qquad c = 41, \ldots, 50$$
where sc = [sc,1, sc,2, ..., sc,40] is a vector of similarities of the unseen class c with each of the 40 seen classes. Here each similarity is defined as the inner product of the class attribute vectors of the two classes, e.g., $s_{c,k} = a_c^\top a_k$ is the similarity between classes c and k, where ac denotes the class attribute vector of class c (see Method 2 below). Note: We will also normalize the vector sc as $s_{c,k} \leftarrow s_{c,k} / \sum_{k'=1}^{40} s_{c,k'}$ so that it sums to 1 (and thus can be used as weights in the convex combination defined above). This procedure will give us the means of the 10 unseen classes and then you can apply the prototype based classifier to predict the label of each test input. (A brief code sketch covering both methods is given after the description of Method 2 below.)
• Method 2: Train a linear model that can predict the mean µc ∈ R4096 of any class using its class attribute vector ac ∈ R85. We can train this linear model using the attribute vectors and means of the 40 seen classes, i.e., $\{(a_c, \mu_c)\}_{c=1}^{40}$, as our training data and then apply it to predict µc for each unseen class using its class attribute vector ac. Note that this can be posed as a multi-output regression problem $\mu_c = W^\top a_c$, where ac ∈ R85 is the input, µc ∈ R4096 is the vector-valued regression output, and W is an 85 × 4096 matrix of weights that need to be learned. The solution to this multi-output regression problem is $W = (A_s^\top A_s + \lambda I)^{-1} A_s^\top M_s$, where $A_s$ is the 40 × 85 matrix containing the class attribute vectors of the 40 seen classes, and $M_s$ is the 40 × 4096 matrix containing the means of the 40 seen classes. Once you have W, you can compute the mean µc of any unseen class as $W^\top a_c$, where ac is its class attribute vector. This procedure will give us the means of the 10 unseen classes and then you can apply the prototype based classifier to predict the label of each test input.
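Purely for illustration, here is a minimal sketch of how the unseen-class means could be computed under both methods. The variable names (X_seen, y_seen for the seen-class training data, A_seen and A_unseen for the 40 × 85 and 10 × 85 class-attribute matrices, lam for λ) and the assumption that the seen classes are labeled 0,...,39 are mine, not part of the provided data format; refer to the provided README/pseudo-code for the actual file names and layout:

```python
import numpy as np

def seen_class_means(X_seen, y_seen, num_seen=40):
    """40 x 4096 matrix whose k-th row is the mean of seen class k
    (assumes seen classes are labeled 0, ..., num_seen - 1 in y_seen)."""
    return np.stack([X_seen[y_seen == k].mean(axis=0) for k in range(num_seen)])

def unseen_means_method1(Ms, A_seen, A_unseen):
    """Method 1: each unseen-class mean is a convex combination of the seen-class
    means, weighted by normalized attribute-vector similarities."""
    S = A_unseen @ A_seen.T                   # 10 x 40 inner-product similarities s_{c,k}
    S = S / S.sum(axis=1, keepdims=True)      # normalize each s_c so it sums to 1
    return S @ Ms                             # 10 x 4096 matrix of unseen-class means

def unseen_means_method2(Ms, A_seen, A_unseen, lam):
    """Method 2: multi-output ridge regression from class attributes to class means."""
    W = np.linalg.solve(A_seen.T @ A_seen + lam * np.eye(A_seen.shape[1]),
                        A_seen.T @ Ms)        # 85 x 4096 weight matrix
    return A_unseen @ W                       # 10 x 4096 matrix of unseen-class means
```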
Your task is to implement both methods. In your main write-up, report the test-set classification accuracy for each method by comparing the respective model’s predictions with the provided ground-truth labels (accuracy is the percentage of test inputs with correctly predicted classes). For Method 2, try λ = 0.01, 0.1, 1, 10, 20, 50, 100 and report the test accuracy for each value of λ. Which value of λ gives the best test set accuracy?
Note: In the dataset folder, I’ve also provided a brief pseudo-code (or rather a summary) of the procedure to follow for the implementation (since this is the very first assignment, I felt it would be helpful for you all).
Important: To predict the class of a test input, you should only compute its distances from the means of 10 unseen classes, not from the means of the seen classes (since the test data only consists of inputs from the unseen classes). The other setting where the test data can consist of both seen and unseen class inputs is also possible but we are not considering it here.
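A minimal sketch of the corresponding prediction/evaluation step, again with assumed variable names (X_test, y_test for the test inputs and ground-truth labels, mu_unseen for the 10 × 4096 matrix of unseen-class means from either method) and assuming the 10 unseen classes are encoded as 0,...,9 in y_test; adjust to the label encoding described in the README. For Method 2, this step would simply be repeated once per value of λ:

```python
import numpy as np

def predict_and_score(X_test, y_test, mu_unseen):
    """Assign each test input to the nearest of the 10 unseen-class means; return accuracy (%)."""
    # Squared Euclidean distances via ||x||^2 - 2 x.mu + ||mu||^2 (avoids a large 3-D array)
    dists = ((X_test ** 2).sum(axis=1, keepdims=True)
             - 2.0 * X_test @ mu_unseen.T
             + (mu_unseen ** 2).sum(axis=1))
    y_pred = dists.argmin(axis=1)                 # index of the nearest unseen-class mean
    return 100.0 * np.mean(y_pred == y_test)      # accuracy in percent
```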
Deliverables: The code for each method, submitted as separate files. For Method 1, name the file convex.py; likewise, for Method 2, name the file regress.py. Your code should be easily readable; please use comments wherever necessary. Also include a README file that briefly describes how to run the code. All the code and the README should be zipped together and submitted as a single file named yourrollnumber.zip. Please DO NOT submit the data provided.