• Only electronic submissions will be accepted. Your main PDF writeup must be typeset in LaTeX (please also refer to the “Additional Instructions” below).
• The PDF writeup containing your solution has to be submitted via Gradescope: https://www.gradescope.com/.
• We have created your Gradescope account (you should have received the notification). Please use your IITK CC ID (not any other email ID) to log in. Use the “Forgot Password” option to set your password.
Additional Instructions
• We have provided a LaTeX template file hw2sol.tex to help typeset your PDF writeup. There is also a style file ml.sty that contains shortcuts for many useful LaTeX commands, such as boldface/calligraphic fonts for letters, various mathematical/Greek symbols, and others. Use of these shortcuts is recommended (but not necessary).
• Your answer to every question should begin on a new page. The provided template is designed to do this automatically. However, if it fails to do so, use the \clearpage command in LaTeX before starting the answer to a new question to enforce this.
• While submitting your assignment on the Gradescope website, you will have to specify which page(s) answer question 1, which page(s) answer question 2, and so on. To do this properly, first ensure that the answer to each question starts on a different page.
• Be careful to flush all your floats (figures, tables) corresponding to question n before starting the answer to question n + 1; otherwise, while grading, we might miss important parts of your answers.
• Your solutions must appear in proper order in the PDF file, i.e., the solution to question n must be complete in the PDF file (including all plots, tables, proofs, etc.) before you present the solution to question n + 1.
(SGD for K-means Objective) Recall the K-means objective function: L = Σ_{n=1}^{N} Σ_{k=1}^{K} znk ||xn − μk||^2, where znk ∈ {0, 1} denotes whether point xn is assigned to cluster k, and μk is the mean of cluster k.
As we have seen, the K-means algorithm minimizes this objective by taking a greedy iterative approach of assigning each point to its closest center (finding the znk’s) and updating the cluster means μk. The standard K-means algorithm is a batch algorithm and uses all the data in every iteration. It can, however, be made online by taking a random example xn at a time, and then (1) assigning xn “greedily” to the “best” cluster, and (2) updating the cluster means using SGD on the objective L. Assuming you have initialized the cluster means randomly and are reading one data point xn at a time (a sketch of the overall online loop is given after these questions),
• How would you solve step 1?
• What will be the SGD-based cluster mean update equations for step 2? Intuitively, why does the update equation make sense?
• Note that the SGD update requires a step size. For your derived SGD update, suggest a good choice of the step size (and mention why you think it is a good choice).
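For reference, here is a minimal sketch of the overall online loop. It is an illustration under one common set of choices, not the required answer: the greedy assignment, the SGD-style mean update, and the per-cluster step-size schedule eta = 1/counts[k] shown here are one standard instantiation, and the function name and parameters are ours.

    import numpy as np

    def online_kmeans(X, K, n_passes=5, seed=0):
        # Online (SGD-based) K-means: process one randomly drawn point at a time.
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)]  # random initial means
        counts = np.zeros(K)                               # points seen per cluster
        for _ in range(n_passes):
            for n in rng.permutation(len(X)):
                x = X[n]
                # Step 1: greedily assign x to the closest current mean
                k = int(np.argmin(((mu - x) ** 2).sum(axis=1)))
                # Step 2: SGD step on L w.r.t. mu_k; eta = 1/counts[k] is one
                # common schedule (each mean becomes a running average of its points)
                counts[k] += 1
                eta = 1.0 / counts[k]
                mu[k] += eta * (x - mu[k])
        return mu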
(An Ideal Projection) Suppose we are given some labeled training data {xn, yn} with inputs xn ∈ RD and labels yn ∈ {−1,+1}. We want to project the inputs into one dimension using a projection direction given by a vector w ∈ RD such that, after the projection, the distance between the means of the inputs from the two classes becomes as large as possible, and the inputs within each class become as close to each other as possible. Write down the objective/loss function that will achieve this and briefly justify your reasoning. You don’t need to solve it.
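To make the two desiderata concrete, the sketch below computes, for a candidate direction w, the squared separation of the projected class means and the within-class spread of the projections. How to combine these two quantities into a single objective is exactly what the question asks you to write down; the function name and variables here are illustrative only.

    import numpy as np

    def projection_stats(X, y, w):
        # Project the inputs onto w and measure the two quantities of interest.
        z = X @ w                                      # 1-D projections
        z_pos, z_neg = z[y == +1], z[y == -1]
        mean_gap = (z_pos.mean() - z_neg.mean()) ** 2  # between-class separation
        spread = z_pos.var() + z_neg.var()             # within-class closeness
        return mean_gap, spread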
(Eigenchangers!) Suppose we wish to do PCA for an N × D matrix X and assume D > N. The traditional way to do PCA is to compute the eigenvectors of the covariance matrix S = (1/N) X^T X (assuming centered data). Show that, if someone instead gives you an eigenvector v ∈ RN of the matrix (1/N) XX^T, you can use it to get an eigenvector u ∈ RD of S. What is the advantage of this way of obtaining the eigenvectors of S?
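Once you have derived the mapping from v to u, it is easy to check numerically. The sketch below assumes the standard construction u ∝ X^T v (which is what the derivation leads to) and verifies the eigenvector property on random centered data.

    import numpy as np

    # Numerical check: an eigenvector v of (1/N) X X^T yields an
    # eigenvector of S = (1/N) X^T X via u = X^T v (then normalize).
    rng = np.random.default_rng(0)
    N, D = 20, 100                      # D > N, as in the problem
    X = rng.standard_normal((N, D))
    X -= X.mean(axis=0)                 # center the data

    small = (X @ X.T) / N               # N x N matrix
    S = (X.T @ X) / N                   # D x D covariance matrix

    lam, V = np.linalg.eigh(small)
    v = V[:, -1]                        # top eigenvector of the small matrix
    u = X.T @ v
    u /= np.linalg.norm(u)

    # S u should equal lam[-1] * u (same eigenvalue as for the small matrix)
    print(np.allclose(S @ u, lam[-1] * u))  # True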
(Latent Variable Models for Supervised Learning) Consider learning a regression model given training data {(xn, yn)}, n = 1, ..., N, with xn ∈ RD and yn ∈ R. Let us give a small twist to the standard probabilistic linear model for regression that we have seen in class. In particular, we will introduce a latent variable zn with each training example (xn, yn), where zn ∈ {1, 2, ..., K} denotes the cluster to which xn belongs, and assume a multinoulli prior on zn, i.e., p(zn) = multinoulli(zn|π1, ..., πK).
The model also has K weight vectors W = [w1, w2, ..., wK]. Given the cluster id zn = k, we will assume that p(yn|xn, zn = k, W) = N(yn|wk^T xn, β−1). Note that although there are K weight vectors in the overall model, only one of them is being used to model yn, depending on the value of zn. Also note that the model for the responses yn is still discriminative, since the inputs are not being modeled.
The latent variables are Z = (z1,...,zN) and the global parameters are Θ = {(w1,...,wK),(π1,...,πK)}.
(1) Give a brief explanation (max. 5 sentences) of what the above model is doing and why you might want to use it instead of the standard probabilistic linear model, which uses a single weight vector w and models each response as p(yn|xn, w) = N(yn|w^T xn, β−1).
(2) Derive an ALT-OPT algorithm to estimate Z and (MLE of) Θ, and clearly write down each step’s update equations. For Z, you must give the update equation for each individual latent variable zn,n = 1,...,N.
Likewise, for Θ, you must give the update equation for each wk, k = 1, ..., K, and πk, k = 1, ..., K. Also, what will be the update of each zn if πk = 1/K, ∀k? Give a brief intuitive explanation (max 1-2 sentences) of what these updates are doing.
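For intuition about the overall structure (not a substitute for your derivation), here is a minimal sketch of the ALT-OPT loop for this model: the z-step picks the best cluster for each point, and the Θ-step does per-cluster least squares for the wk’s and cluster fractions for the πk’s, which is one standard instantiation of the updates you are asked to derive.

    import numpy as np

    def alt_opt(X, y, K, n_iters=50, beta=1.0, seed=0):
        # ALT-OPT for the mixture-of-linear-regressions model:
        # alternate between hard assignments Z and MLE of Theta = (W, pi).
        N, D = X.shape
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((K, D))
        pi = np.full(K, 1.0 / K)
        for _ in range(n_iters):
            # z-step: z_n = argmax_k [ log pi_k + log N(y_n | w_k^T x_n, 1/beta) ]
            scores = np.log(pi)[None, :] - 0.5 * beta * (y[:, None] - X @ W.T) ** 2
            z = scores.argmax(axis=1)
            # Theta-step: least squares per cluster; pi_k = fraction in cluster k
            for k in range(K):
                idx = (z == k)
                if idx.any():
                    W[k] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
                pi[k] = max(idx.mean(), 1e-12)  # clamp to avoid log(0)
        return z, W, pi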
(Programming Problem: Part 1) For this problem, your task will be to implement kernel ridge regression (see the solution to mid-sem exam question 6) and ridge regression with landmark-based features (let’s call the latter simply “landmark-ridge”), and train and test these two models on the provided dataset (already split into training and test). For both models, you have to use the RBF kernel k(xn, xm) = exp(−γ||xn − xm||^2) with bandwidth parameter γ = 0.1. The training and test datasets are given in files ridgetrain.txt and ridgetest.txt. In each file, each line is an example, with the first number being the input (a single feature) and the second number being the output.
You need to do the following:
1. For kernel ridge regression, train the model with the regularization hyperparameter λ = 0.1 and use the learned model to predict the outputs for the test data. Compare the model’s predictions with the true test outputs (given to you) by plotting them on a graph. In particular, plot all the test inputs and the corresponding true outputs (x, y) on a 2D plot in blue, and likewise all the test inputs and corresponding predicted outputs in red. Repeat this exercise for λ = 1, 10, 100. What do you observe from the plots? For each case, also report the root-mean-squared error (RMSE) on the test data, defined as the square root of the mean of the squared differences between the true and predicted outputs.
2. For landmark-ridge, you first need to extract landmark-based features using the RBF kernel; you will then use the data with these features to train a linear ridge regression model. Again, keep the regularization hyperparameter fixed at λ = 0.1. Try L = 2, 5, 20, 50, 100 uniformly randomly chosen landmark points from the training set, train the model for each case, and compute the predictions on the test data. Plot the results for each case just like you did in part 1 (but only for λ = 0.1). What do you observe from the plots? What’s the RMSE in each case? What value of L seems good enough? (A minimal sketch of both models is given after this list.)
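For reference, a minimal sketch of both models. The variable names and structure are ours; the closed forms used, α = (K + λI)^{-1} y for kernel ridge and standard ridge on the L-dimensional landmark features, are the usual ones. Inputs are expected as 2-D arrays, so reshape the single-feature data with X.reshape(-1, 1).

    import numpy as np

    def rbf_kernel(A, B, gamma=0.1):
        # k(a, b) = exp(-gamma * ||a - b||^2) for all pairs (rows of A, rows of B)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)

    def kernel_ridge(X_tr, y_tr, X_te, lam=0.1, gamma=0.1):
        # Kernel ridge regression: alpha = (K + lam*I)^{-1} y
        K = rbf_kernel(X_tr, X_tr, gamma)
        alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
        return rbf_kernel(X_te, X_tr, gamma) @ alpha   # test predictions

    def landmark_ridge(X_tr, y_tr, X_te, L=20, lam=0.1, gamma=0.1, seed=0):
        # Ridge regression on L-dimensional landmark (RBF) features
        rng = np.random.default_rng(seed)
        Z = X_tr[rng.choice(len(X_tr), size=L, replace=False)]  # random landmarks
        Phi_tr = rbf_kernel(X_tr, Z, gamma)
        Phi_te = rbf_kernel(X_te, Z, gamma)
        w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(L), Phi_tr.T @ y_tr)
        return Phi_te @ w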
(Programming Problem: Part 2) For this problem, your task will be to implement the K-means clustering algorithm and try it on a provided toy (but interesting) dataset (kmeansdata.txt) consisting of points in two dimensions. The provided dataset also has 2 clusters (so you would use K = 2). However, the data is such that the standard K-means will NOT work well since the clusters are not spherical and not linearly separable (you can check this by plotting the data in 2D using a scatter plot). You will consider two ways to handle this issue.
1. Using Hand-crafted Features: Propose a feature transformation of the original data such that K-means will be able to learn the correct clusters! Before proposing the transformation, plot the original data to see what transformation(s) might work. Apply your K-means implementation on this transformed version of the data to verify that your transformation works. Plot your obtained clustering results by showing points (in the original 2D space) in cluster 1 in red and points in cluster 2 in green.
2. Using Kernels: Although you can kernelize K-means via the kernel K-means algorithm, you will try something else. You will use the landmark-based approach to extract good features and run your implementation of standard K-means on these features. The kernel that you will use for the landmark-based approach is the RBF kernel k(xn, xm) = exp(−γ||xn − xm||^2) with bandwidth parameter γ = 0.1. We are going
Important Note: To avoid randomness in cluster mean initialization, we will choose the first two points in the dataset (the first two lines in the provided dataset) as the initial cluster centers. (A reference K-means sketch using this initialization is given below.)
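A minimal sketch of standard (batch) K-means with the fixed initialization described in the note; you would call it on the transformed data (part 1) or on the landmark features (part 2).

    import numpy as np

    def kmeans(X, K=2, n_iters=100):
        # Standard batch K-means; initializes centers with the first K points,
        # as required by the note above.
        X = np.asarray(X, dtype=float)
        mu = X[:K].copy()                     # first K points as initial centers
        for _ in range(n_iters):
            # assign each point to its nearest center
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
            z = d2.argmin(axis=1)
            # recompute each center as the mean of its assigned points
            new_mu = np.array([X[z == k].mean(axis=0) if (z == k).any() else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):       # converged
                break
            mu = new_mu
        return z, mu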
(Programming Problem: Part 3) You are provided a subset of the MNIST digits data (as a pickle file) with 10,000 examples from digits 0-9 (10 classes). Loading the pickle file will show that the X and Y fields contain the digit input features (784-dimensional) and the labels (0-9), respectively, of the 10,000 examples.
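A minimal loading sketch; the filename mnist_small.pkl and the exact container (a dict with 'X' and 'Y' keys) are assumptions based on the description above, so adapt them to the provided file.

    import pickle
    import numpy as np

    # Illustrative loader; the filename and the container type (a dict with
    # 'X' and 'Y' keys) are assumptions based on the problem description.
    with open('mnist_small.pkl', 'rb') as f:
        data = pickle.load(f)

    X = np.asarray(data['X'])   # (10000, 784) digit features
    Y = np.asarray(data['Y'])   # (10000,) labels in 0-9
    print(X.shape, Y.shape)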
Deliverables: For each of the three parts, you should prepare separate Python notebooks, zip them together in a single zip file (named yourrollnumber-problem-5.zip), and upload it at the following Dropbox link: https://www.dropbox.com/request/YKHU6d5RX4nSXDdNV6gd. In the main PDF writeup, write about your observations for each of the three parts and include the plots in the PDF.