$30
1. Linear Regression
In this problem, we will implement least squares linear regression to predict density of wine based on its acidity. Recall that the error metric for least squares is given by:
where hθ(x) = θTx and all the symbols are as discussed in the class. The files linearX.csv and linearY.csv contain the acidity of the wine (x(i)’s, x(i) ∈ R) and its density (y(i)’s, y(i) ∈ R), respectively, with one training example per row. We will implement least squares linear regression to learn the relationship between x(i)’s and y(i)’s.
(a) Implement batch gradient descent method for optimizing J(θ). Choose an appropriate learning rate and the stopping criteria (as a function of the change in the value of J(θ)). You can initialize the parameters as θ = ~0 (the vector of all zeros). Do not forget to include the intercept term. Report your learning rate, stopping criteria and the final set of parameters obtained by your algorithm.
(b) (3 points) Plot the data on a two-dimensional graph and plot the hypothesis function learned by your algorithm in the previous part.
(c) Draw a 3-dimensional mesh showing the error function (J(θ)) on z-axis and the parameters in the x − y plane. Display the error value using the current set of parameters at each iteration of the gradient descent. Include a time gap of 0.2 seconds in your display for each iteration so that the change in the function value can be observed by the human eye.
(d) Repeat the part above for drawing the contours of the error function at each iteration of the gradient descent. Once again, chose a time gap of 0.2 seconds so that the change be perceived by the human eye.(Note here plot will be 2-D)
(e) Repeat the part above (i.e. draw the contours at each learning iteration) for the step size values of η = {0.001,0.025,0.1}. What do you observe? Comment.
2. Sampling and Stochastic Gradient Descent
In this problem, we will introduce the idea of sampling by adding Gaussian noise to the prediction of a hypothesis and generate synthetic training data. Consider a given hypothesis hθ (i.e. known θ0,θ1,θ2) for
a data point . Note that x0 = 1 is the intercept term.
y = hθ(x) = θ0 + θ1x1 + θ2x2 Adding Gaussian noise, equation becomes
where
To gain deeper understanding behind Stochastic Gradient Descent (SGD), we will use the SGD algorithm to learn the original hypothesis from the data generated using sampling, for varying batch sizes. We will implement the version where we make a complete pass through the data in a round robin fashion (after initially shuffling the examples). If there are r examples in each batch, then there is a total of batches assuming m training examples. For the batch number ), the set of examples is given as:
{x(i1),x(i2),··· ,x(ir)} where ik = (b−1)r +k. The Loss function computed over these r examples is given as:
(a) (4 points) Sample 1 million data points taking values of 4) and x2 ∼
N(−1,4) independently, and noise variance in y, σ2 = 2.
(b) (6 points) Implement Stochastic gradient descent method for optimizing J(θ). Relearn
using sampled data points of part a) keeping everything same except the batch size. Keep η = 0.001 and initialize ∀j θj = 0. Report the θ learned each time separately for values of batch size r = {1,100,10000,1000000}. Carefully decide your convergence criteria in each case. Make sure to watch the online video posted on the course website for deciding the convergence of SGD algorithm.
(c) (6 points) Do different algorithms in the part above (for varying values of r) converge to the same parameter values? How much different are these from the parameters of the original hypothesis from which the data was generated? Comment on the relative speed of convergence and also on number of iterations in each case. Next, for each of learned models above, report the error on a new test data of 10,000 samples provided in the file named q2test.csv. Note that this test set was generated using the same sampling procedure as described in part (a) above. Also, compute the test error with respect to the prediction of the original hypothesis, and compare with the error obtained using learned hypothesis in each case. Comment.
(d) (4 points) In the 3 dimensional parameter space(θj on each axis), plot the movement of θ as the parameters are updated (until convergence) for varying batch sizes. How does the (shape of) movement compare in each case? Does it make intuitive sense? Argue.
3. (15 points) Logistic Regression
Consider the log-likelihood function for logistic regression:
For the following, you will need to calculate the value of the Hessian H of the above function.
(a) The files logisticX.csv and logisticY.csv contain the inputs (x(i) ∈ R2) and outputs (y(i) ∈ {0,1}) respectively for a binary classification problem, with one training example per row. Implement[1] Newton’s method for optimizing L(θ), and apply it to fit a logistic regression model to the data. Initialize Newton’s method with θ = ~0 (the vector of all zeros). What are the coefficients θ resulting from your fit? (Remember to include the intercept term.)
(b) Plot the training data (your axes should be x1 and x2 , corresponding to the two coordinates of the inputs, and you should use a different symbol for each point plotted to indicate whether that example had label 1 or 0). Also plot on the same figure the decision boundary fit by logistic regression. (i.e., this should be a straight line showing the boundary separating the region where h(x) > 0.5 from where h(x) ≤ 0.5.)
4. Gaussian Discrmimant Analysis
In this problem, we will implement GDA for separating out salmons from Alaska and Canada. Each salmon is represented by two attributes x1 and x2 depicting growth ring diameters in 1) fresh water, 2) marine water, respectively. File q4x.dat stores the two attribute values with one entry on each row. File q4y.dat contains the target values (y(i)’s ∈ {Alaska, Canada}) on respective rows.
(a) Implement Gaussian Discriminant Analysis using the closed form equations described in class. Assume that both the classes have the same co-variance matrix i.e. Σ0 = Σ1 = Σ. Report the values of the means, µ0 and µ1, and the co-variance matrix Σ.
(b) Plot the training data corresponding to the two coordinates of the input features, and you should use a different symbol for each point plotted to indicate whether that example had label Canada or Alaska.
(c) Describe the equation of the boundary separating the two regions in terms of the parameters µ0,µ1 and Σ. Recall that GDA results in a linear separator when the two classes have identical covariance matrix. Along with the data points plotted in the part above, plot (on the same figure) decision boundary fit by GDA.
(d) In general, GDA allows each of the target classes to have its own covariance matrix. This results (in general) results in a quadratic boundary separating the two class regions. In this case, the maximum-likelihood estimate of the co-variance matrix Σ0 can be derived using the equation:
(1)
And similarly, for Σ1. The expressions for the means remain the same as before. Implement GDA for the above problem in this more general setting. Report the values of the parameter estimates i.e. µ0, µ1, Σ0, Σ1.
(e) Describe the equation for the quadratic boundary separating the two regions in terms of the parameters µ0, µ1 and Σ0, Σ1. On the graph plotted earlier displaying the data points and the linear separating boundary, also plot the quadratic boundary obtained in the previous step.
(f) Carefully analyze the linear as well as the quadratic boundaries obtained. Comment on your observations.
[1] Write your own version, and do not call a built-in library function.