EE270 Assignment 2: Large Scale Matrix Computation, Optimization and Learning

1.    Singular Value Decomposition (SVD) for Compression 
Singular value decomposition (SVD) factorizes an m × n matrix X as X = UΣV^T, where U ∈ R^{m×m} with U^T U = U U^T = I, Σ ∈ R^{m×n} contains non-increasing non-negative values along its diagonal and zeros elsewhere, and V ∈ R^{n×n} with V^T V = V V^T = I. Hence the matrix can be represented as

$$X = \sum_{i=1}^{\min(m,n)} \sigma_i u_i v_i^T,$$

where u_i denotes the ith column of U, v_i denotes the ith column of V, and σ_i denotes the ith diagonal entry of Σ. Download the image[1] and load it as the matrix X (in greyscale).

(a)     Perform SVD on this matrix and zero out all but the top k singular values to form an approximation X̃. Specifically, compute

$$\tilde X = \sum_{i=1}^{k} \sigma_i u_i v_i^T,$$

display the resulting approximation X̃ as an image, and report the resulting approximation error.

(b)    How many numbers do you need to describe the approximation X̃ for k ∈ {2, 10, 40}?

Hint: Matlab can load the greyscale image using

X = double(rgb2gray(imread('harvey-saturday-goes7am.jpg')))

and display the image using imagesc().

In Python 2.7 you can use:

import numpy, Image
X = numpy.asarray(Image.open('harvey.jpg').convert('L'))
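
If you are working in Python 3 rather than Python 2.7, a minimal sketch of parts (a) and (b) might look like the following; the use of Pillow and matplotlib, the local filename, and the choice of the relative Frobenius norm as the reported error are all assumptions to adjust as needed.

```python
# Sketch: rank-k SVD approximation of the greyscale image (Python 3,
# assuming numpy, Pillow and matplotlib are installed and the JPEG is local).
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

X = np.asarray(Image.open('harvey-saturday-goes7am.jpg').convert('L'), dtype=float)
U, s, Vt = np.linalg.svd(X, full_matrices=False)      # thin SVD of the image

for k in (2, 10, 40):
    Xk = (U[:, :k] * s[:k]) @ Vt[:k, :]               # keep the top-k singular triplets
    rel_err = np.linalg.norm(X - Xk, 'fro') / np.linalg.norm(X, 'fro')
    storage = k * (X.shape[0] + X.shape[1] + 1)       # numbers needed for part (b)
    print(f'k={k}: relative Frobenius error {rel_err:.4f}, {storage} numbers stored')
    plt.figure(); plt.imshow(Xk, cmap='gray'); plt.title(f'k = {k}')
plt.show()
```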

2.     Singular Value Decomposition and Subspaces (20 pts)
Let A be an m × n singular matrix of rank r with SVD

$$
A = U\Sigma V^T
  = \begin{bmatrix} \hat U & \tilde U \end{bmatrix}
    \begin{bmatrix} \hat\Sigma & 0 \\ 0 & 0 \end{bmatrix}
    \begin{bmatrix} \hat V & \tilde V \end{bmatrix}^T,
\qquad
\hat\Sigma = \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{bmatrix},
$$
where σ_1 ≥ ··· ≥ σ_r > 0, Û consists of the first r columns of U, Ũ consists of the remaining m − r columns of U, V̂ consists of the first r columns of V, and Ṽ consists of the remaining n − r columns of V. Give bases for the spaces range(A), null(A), range(A^T), and null(A^T) in terms of the components of the SVD of A, with a brief justification.

3.    Least Squares (20 pts)
Consider the least squares problem min_x ||b − Ax||_2. Which of the following statements are necessarily true?

(a)    If x is a solution to the least squares problem, then Ax = b.

(b)    If x is a solution to the least squares problem, then the residual vector r = b − Ax is in the nullspace of A^T.

(c)    The solution is unique.

(d)    A solution may not exist.

(e)    None of the above.

4.    Binary Classification via Regression
Download the MNIST dataset from http://yann.lecun.com/exdb/mnist/.

In binary classification, we restrict Y to take on only two values. Suppose Y ∈ {0, 1}. Now let us use least squares linear regression for classification. Consider the classification problem of recognizing whether a digit is a 2 using linear regression: let Y = 1 for all the 2s in the training set, and Y = 0 for all other digits.

Build a linear classifier by minimizing the regularized least squares objective

$$\hat w = \arg\min_{w} \; \sum_{i=1}^{n} (w^T x_i - y_i)^2 + \lambda \|w\|_2^2$$

using the training set {(x_1, y_1), ..., (x_n, y_n)}. Use appropriate regularization (λ > 0) as needed. Let x_i be the vector of pixel values of the image.

For the purpose of classification, we can label a digit with feature vector z as a 2 if ŵ^T z is larger than some threshold.

(a)    Based on your training set, choose a reasonable threshold for classification. What is your 0/1 loss on the training set? What is your square loss on the training set?

(b)    On the test set, what is the 0/1 loss? What is the square loss?
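
A minimal sketch of parts (a) and (b), assuming the MNIST images and labels have already been loaded into arrays X_train, X_test (one flattened image per row) and d_train, d_test (digit labels 0 to 9); the threshold of 0.5, the value of λ, and the closed-form ridge solve are illustrative choices, not the required method.

```python
# Sketch: "is this digit a 2?" via ridge-regularized least squares regression.
# Assumes X_train, X_test are (n, 784) float arrays and d_train, d_test hold
# the digit labels; loading the MNIST IDX files is omitted here.
import numpy as np

def fit_ridge(X, y, lam):
    d = X.shape[1]
    # Normal equations for  sum_i (w^T x_i - y_i)^2 + lam * ||w||^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def report_losses(X, digits, w, threshold=0.5):
    y = (digits == 2).astype(float)               # Y = 1 for the 2s, Y = 0 otherwise
    scores = X @ w
    zero_one = np.mean((scores > threshold).astype(float) != y)
    square = np.mean((scores - y) ** 2)
    return zero_one, square

lam = 1e-4                                        # assumed value; tune as needed
w_hat = fit_ridge(X_train, (d_train == 2).astype(float), lam)
print('train (0/1, square):', report_losses(X_train, d_train, w_hat))
print('test  (0/1, square):', report_losses(X_test, d_test, w_hat))
```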

5.    Leave-One-Out Cross Validation (LOOCV)
Assume that there are n training examples, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where each input data point x_i has d real-valued features. The goal of regression is to learn to predict y_i from x_i. The linear regression model assumes that the output y is a linear combination of the input features plus Gaussian noise, with weights given by w.

The maximum likelihood estimate of the model parameters w (which also happens to minimize the sum of squared prediction errors) is given by the normal equations we saw in class: ŵ = (X^T X)^{-1} X^T y. Here, the rows of the matrix X are the training examples. Define ŷ to be the vector of predictions using ŵ if we were to plug in the original training set X:

$$\hat y = X\hat w = X(X^T X)^{-1} X^T y = H y,$$

where we define H = X(X^T X)^{-1} X^T (H is often called the hat matrix). Show that H is a projection matrix.

As mentioned above, ŵ also minimizes the sum of squared errors:

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat y_i)^2.$$

The Leave-One-Out Cross Validation score is defined as:

$$\mathrm{LOOCV} = \sum_{i=1}^{n} \bigl(y_i - \hat y_i^{(-i)}\bigr)^2,$$

where ŷ^{(−i)} is the estimator of y after removing the ith observation from the training set, i.e., it minimizes

$$\sum_{j \neq i} \bigl(y_j - \hat y_j^{(-i)}\bigr)^2.$$

(a)    What is the time complexity of computing the LOOCV score naively? (The naive algorithm is to loop through each point, performing a regression on the n−1 remaining points at each iteration.)

Hint: The complexity of matrix inversion using classical methods is O(k^3) for a k × k matrix.

(b)    Write ŷ_i in terms of H and y.

(c)    Show that ŷ^{(−i)} is also the estimator which minimizes SSE for Z, where

$$Z_j = \begin{cases} y_j, & j \neq i, \\ \hat y_i^{(-i)}, & j = i. \end{cases}$$

(d)    Write ŷ_i^{(-i)} in terms of H and Z. By definition, ŷ_i^{(-i)} = Z_i, but give an answer that is analogous to (b).

(e)    Show that ŷ_i^{(-i)} = ŷ_i − H_ii y_i + H_ii ŷ_i^{(-i)}, where H_ii denotes the ith element along the diagonal of H.

(f)     Show that

$$\mathrm{LOOCV} = \sum_{i=1}^{n} \left( \frac{y_i - \hat y_i}{1 - H_{ii}} \right)^2.$$

What is the algorithmic complexity of computing the LOOCV score using this formula? Note: We see from this formula that the diagonal elements of H somehow indicate the impact that each particular observation has on the result of the regression.
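
As a sanity check on the shortcut in part (f), the following sketch compares it against the naive leave-one-out loop on synthetic data; the data-generation choices here are arbitrary.

```python
# Sketch: check the LOOCV shortcut formula of part (f) against the naive
# leave-one-out loop on a small synthetic regression problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y

# Shortcut: LOOCV = sum_i ((y_i - yhat_i) / (1 - H_ii))^2 -- one regression total.
loocv_fast = np.sum(((y - y_hat) / (1.0 - np.diag(H))) ** 2)

# Naive: n regressions, each on the n-1 remaining points (roughly
# O(n^2 d^2 + n d^3) overall, versus a single O(n d^2 + d^3) fit above).
loocv_naive = 0.0
for i in range(n):
    mask = np.arange(n) != i
    w_i = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    loocv_naive += (y[i] - X[i] @ w_i) ** 2

print(loocv_fast, loocv_naive)                 # the two scores should agree
```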

6.    Sketching Least Squares
Here, we will consider the empirical performance of random sampling and random projection algorithms for approximating least-squares.

Let A be an n × d matrix, let b be an n-vector, and consider approximating the solution to min_x ||Ax − b||_2. Generate the matrix A from each of the three classes of distributions introduced below.

•    Generate a matrix A from the multivariate normal N(1_d, Σ), where the (i, j)th element of Σ is Σ_ij = 2 × 0.5^{|i−j|}. (Refer to this as GA data.)

•    Generate a matrix A from the multivariate t-distribution with 3 degrees of freedom and covariance matrix Σ as before. (Refer to this as T3 data.)

•    Generate a matrix A from the multivariate t-distribution with 1 degree of freedom and covariance matrix Σ as before. (Refer to this as T1 data.)

To start, consider matrices of size n × d equal to 500 × 50.
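
One possible way to generate the three data classes is sketched below; the multivariate t rows are built as a Gaussian divided by a scaled chi-square (for T1 the matrix Σ acts as a scale matrix, since a 1-degree-of-freedom t has no finite covariance), and the mean vector 1_d and the right-hand side b are assumptions.

```python
# Sketch: generate GA, T3 and T1 design matrices and a right-hand side b.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50
idx = np.arange(d)
Sigma = 2.0 * 0.5 ** np.abs(idx[:, None] - idx[None, :])    # Sigma_ij = 2 * 0.5^|i-j|

def gaussian_A(n, d):
    return rng.multivariate_normal(np.ones(d), Sigma, size=n)       # GA data

def student_t_A(n, d, df):
    # Multivariate t: Gaussian rows divided by sqrt(chi2/df), row by row.
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    chi = rng.chisquare(df, size=n)
    return np.ones(d) + Z / np.sqrt(chi / df)[:, None]              # mean 1_d assumed

A_ga, A_t3, A_t1 = gaussian_A(n, d), student_t_A(n, d, 3), student_t_A(n, d, 1)
b = rng.standard_normal(n)                         # assumed right-hand side
x_star = np.linalg.lstsq(A_ga, b, rcond=None)[0]   # exact LS solution, for reference
```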

(a)    First, for each matrix, consider approximating the solution by randomly sampling a small number r of rows/elements (i.e., constraints of the overconstrained least-squares problem) in one of three ways: uniformly at random; according to an importance sampling distribution that is proportional to the squared Euclidean norms of the rows of A; and according to an importance sampling distribution that is proportional to the leverage scores of A. In each case, plot the error as a function of the number r of samples, paying particular attention to two regimes: r = d, d+1, d+2, ..., 2d; and r = 2d, 3d, .... Show that the behavior of these three procedures is most similar for GA data, intermediate for T3 data, and most different for T1 data; explain why, commenting on the similarities and differences.
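
A sketch of the three sampling schemes in part (a), reusing A_ga, b, d and x_star from the data-generation sketch above; the error is taken here to be the relative Euclidean error of the sampled solution against the full least-squares solution, which is one reasonable choice.

```python
# Sketch: sample-and-solve least squares under uniform, row-norm and
# leverage-score sampling distributions.
import numpy as np

def sampling_probs(A):
    n = A.shape[0]
    uniform = np.full(n, 1.0 / n)
    row = np.sum(A ** 2, axis=1); row = row / row.sum()       # squared row norms
    Q, _ = np.linalg.qr(A)                                    # thin QR of A
    lev = np.sum(Q ** 2, axis=1); lev = lev / lev.sum()       # leverage scores
    return {'uniform': uniform, 'row norms': row, 'leverage': lev}

def sampled_solution(A, b, r, p, rng):
    idx = rng.choice(A.shape[0], size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])                             # importance-sampling rescaling
    return np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)[0]

rng = np.random.default_rng(1)
rs = list(range(d, 2 * d + 1, 5)) + [2 * d, 3 * d, 4 * d]     # the two regimes, coarsely
for name, p in sampling_probs(A_ga).items():
    errs = [np.linalg.norm(sampled_solution(A_ga, b, r, p, rng) - x_star)
            / np.linalg.norm(x_star) for r in rs]
    print(name, np.round(errs, 3))                            # plot errs against rs in practice
```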

(b)    Next, for each matrix, consider approximating the solution by randomly projecting rows/elements (i.e., constraints of the overconstrained least-squares problem) in one of two ways: a random projection matrix in which each entry is i.i.d. {±1}, with appropriate variance; and a random projection matrix in which each entry is i.i.d. Gaussian, with appropriate variance. In each case, plot the error as a function of the number of samples, i.e., the number of dimensions onto which the data are projected, paying particular attention to two regimes: r = d, d+1, d+2, ..., 2d; and r = 2d, 3d, .... Describe and explain similarities and differences between these two procedures for GA data, T1 data, and T3 data.
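
A corresponding sketch for part (b), again reusing A_ga, b, d and x_star from above; each r × n projection is scaled so its entries have variance 1/r, which is one way to choose the "appropriate variance".

```python
# Sketch: sketch-and-solve least squares with dense {+1, -1} and Gaussian
# random projections.
import numpy as np

def projected_solution(A, b, r, kind, rng):
    n = A.shape[0]
    if kind == 'sign':
        S = rng.choice([-1.0, 1.0], size=(r, n)) / np.sqrt(r)   # i.i.d. +/- 1 entries
    else:
        S = rng.standard_normal((r, n)) / np.sqrt(r)            # i.i.d. Gaussian entries
    return np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

rng = np.random.default_rng(2)
rs = list(range(d, 2 * d + 1, 5)) + [2 * d, 3 * d, 4 * d]
for kind in ('sign', 'gaussian'):
    errs = [np.linalg.norm(projected_solution(A_ga, b, r, kind, rng) - x_star)
            / np.linalg.norm(x_star) for r in rs]
    print(kind, np.round(errs, 3))
```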

(c)    Finally, for each matrix, consider approximating the solution by randomly projecting rows/elements (i.e., constraints of the overconstrained least-squares problem) with sparse projection matrices. In particular, consider a random projection matrix in which each entry is i.i.d. either 0 or Gaussian, where the probability of 0 is q and the probability of Gaussian is 1 − q. (Remember to rescale the variance of the Gaussian appropriately, depending on q.) For q varying from 0 to 1, in sufficiently small increments, plot the error for solving the least-squares problem. Describe and explain how this varies as a function of the number of samples, i.e., the number of dimensions onto which the data are projected, paying particular attention to two regimes: r = d, d+1, d+2, ..., 2d; and r = 2d, 3d, .... Describe and explain similarities and differences between these three procedures for GA data, T1 data, and T3 data.
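
For part (c), a sparse projection can be sketched as below: each entry is zero with probability q and Gaussian otherwise, with the Gaussian variance inflated to 1/(r(1 − q)) so that every entry still has variance 1/r; the fixed sketch size r = 2d is only an illustrative choice.

```python
# Sketch: sparse random projections with a varying zero-probability q.
import numpy as np

def sparse_projected_solution(A, b, r, q, rng):
    n = A.shape[0]
    keep = rng.random((r, n)) >= q                      # entry is nonzero with prob. 1 - q
    S = keep * rng.standard_normal((r, n)) / np.sqrt(r * (1.0 - q))
    return np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]

rng = np.random.default_rng(3)
r = 2 * d
for q in np.arange(0.0, 1.0, 0.1):
    x_q = sparse_projected_solution(A_ga, b, r, q, rng)
    rel = np.linalg.norm(x_q - x_star) / np.linalg.norm(x_star)
    print(f'q = {q:.1f}: relative error {rel:.3f}')
```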

Next, we examine how these methods behave for larger problems. To do so, we will work with dense random projection algorithms in which the projection matrix consists of i.i.d. {±1} random variables, with appropriate variance. Fix a value of d, and let n increase from roughly 2d to roughly 100d. The exact values of d and n will depend on your machine, your computational environment, etc.; choose them so that you get reasonably good low-precision approximate solutions to the original least-squares problem. (You should expect d ≈ 500 to work; and if you can't do the full plot up to 100d, don't worry, since the point is to reach problem sizes large enough to illustrate the phenomena below.)

(d)    Plot the running time of the random projection algorithm versus the running time of solving the problem with a call to a QR decomposition routine provided by your system, as well as the running time of solving the problem with a call to an SVD routine provided by your system. Illustrate that for smaller problems the random projection methods are not faster, but that for larger problems the random projection methods are slightly faster and/or can be used to solve larger problems than QR or SVD. Explain why this is the case.
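
A timing sketch for part (d); the problem sizes, the sketch size r = 2d, and the use of numpy's QR and SVD-based lstsq as the baseline "system routines" are assumptions to adapt to your environment.

```python
# Sketch: compare wall-clock time of a dense {+1, -1} sketch-and-solve against
# QR- and SVD-based least-squares solves as n grows with d fixed.
import time
import numpy as np

def timed(fn, *args):
    t0 = time.perf_counter()
    fn(*args)
    return time.perf_counter() - t0

def solve_qr(A, b):
    Q, R = np.linalg.qr(A)                       # x = R^{-1} Q^T b
    return np.linalg.solve(R, Q.T @ b)

def solve_svd(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]  # SVD-based LAPACK driver (gelsd)

def solve_sketched(A, b, r, rng):
    S = rng.choice([-1.0, 1.0], size=(r, A.shape[0])) / np.sqrt(r)
    return solve_qr(S @ A, S @ b)

rng = np.random.default_rng(4)
d = 500
for n in (2 * d, 10 * d, 50 * d):                # push toward ~100d if memory allows
    A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
    print(n, timed(solve_qr, A, b), timed(solve_svd, A, b),
          timed(solve_sketched, A, b, 2 * d, rng))
```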


 
[1] https://www.nasa.gov/sites/default/files/thumbnails/image/harvey-saturday-goes7am.jpg

(NOAA’s GOES-East satellite’s image of Hurricane Harvey in the western Gulf of Mexico on Aug. 26, 2017. Credit: NASA/NOAA GOES Project)
