EE270 Assignment 3: Large Scale Matrix Computation, Optimization and Learning

1.    Logistic Regression
In this question, we will study logistic regression, which is a popular method in machine learning. We let w ∈ R^p be our decision variable; w represents weights for the columns of the n × p data matrix X, where the ith row of X is x_i^T. Let

σ(a) = 1 / (1 + e^{−a})

be the sigmoid/logistic function. This function is non-convex; however, −log(σ(a)) is convex, which arises in maximum likelihood estimation as we describe next.

We assume that we collect data xi, and a binary response variable yi, which is the label, where

y_i = +1 with probability σ(w^T x_i), and y_i = −1 with probability 1 − σ(w^T x_i).        (1)

We can write the above in the compact form p(y_i | w, x_i) = σ(y_i w^T x_i), since σ(a) = 1 − σ(−a). If we collect the observations {(x_i, y_i)}_{i=1}^n and assume they are independent, then the probability of observing this outcome is ∏_{i=1}^n σ(y_i w^T x_i), and so the negative log-likelihood, which we will use as our objective function, is

ℓ(w) = −∑_{i=1}^n log σ(y_i w^T x_i) = ∑_{i=1}^n log(1 + exp(−y_i w^T x_i)).

Note that minimizing the negative log-likelihood is the same as maximizing the likelihood. This corresponds to maximum likelihood estimation of the parameter w. Once w is identified, we can use (1) to infer the label of a test data point x.
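For the implementation questions below, it helps to evaluate ℓ(w) in a numerically stable way. A minimal sketch, assuming NumPy and the {−1, +1} label convention above; the function name is a placeholder:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Negative log-likelihood l(w) = sum_i log(1 + exp(-y_i * w^T x_i)).

    X is the n x p data matrix, y is a vector with entries in {-1, +1}.
    np.logaddexp(0, z) computes log(1 + exp(z)) without overflow.
    """
    margins = y * (X @ w)                      # y_i * w^T x_i for every example
    return np.sum(np.logaddexp(0.0, -margins))
```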

(a)     Derive the gradient ∇ℓ(w).

(b)    Derive the Hessian ∇²ℓ(w).

(c)     Is the cost function ℓ(w) convex?

2.    Logistic Regression for Spam E-mail Classification (20 pts)
Download the email spam data set, which is available at https://github.com/probml/pmtk3/tree/master/data/spamData in Matlab and plain text format. This set consists of 4601 email messages, from which 57 features have been extracted. These are as follows:

•    48 features, in [0, 100], giving the percentage of words in a given message which match a given word on the list. The list contains words such as “business”, “free”, “george”, etc.

•    6 features, in [0, 100], giving the percentage of characters in the email that match a given character on the list. The characters are ; ( [ ! $ #

•    Feature 55: The average length of an uninterrupted sequence of capital letters (max is 40.3, mean is 4.9)

•    Feature 56: The length of the longest uninterrupted sequence of capital letters (max is 45, mean is 52.6)

•    Feature 57: The sum of the lengths of uninterrupted sequences of capital letters (max is 25.6, mean is 282.2)

Load the data set using the provided link. In the Matlab version, there is a training set of size 3065 and a test set of size 1536. In the plain text version there is no predefined training/test split; you can randomly shuffle the data to pick a training set of size 3065 and use the rest for testing (or you can use the Matlab data along with scipy.io.loadmat).

There are different ways to pre-process the data, e.g., standardizing the columns of X. For this problem, transform the features using log(x_ij + 0.1). One could also add some regularization to the loss function, which can improve generalization, but this is not necessary. Also note that you need to transform the labels from {0, 1} to {−1, +1}.
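A possible loading and pre-processing sketch in Python; the file path and the variable names Xtrain, ytrain, Xtest, ytest inside the .mat file are assumptions to be checked against the downloaded data:

```python
import numpy as np
from scipy.io import loadmat

data = loadmat("spamData.mat")                       # path is an assumption
Xtrain, Xtest = data["Xtrain"], data["Xtest"]        # variable names assumed
ytrain = data["ytrain"].ravel().astype(float)
ytest = data["ytest"].ravel().astype(float)

# Feature transform suggested in the problem: log(x_ij + 0.1).
Xtrain = np.log(Xtrain + 0.1)
Xtest = np.log(Xtest + 0.1)

# Map the labels from {0, 1} to {-1, +1}.
ytrain = 2 * ytrain - 1
ytest = 2 * ytest - 1
```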

(a)     Run gradient descent with a fixed step-size. Plot the value of the cost function at each iteration and find a reasonable step-size for fast convergence.
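A minimal fixed-step gradient descent loop, assuming callables loss(w) and grad(w) (the latter implementing the gradient derived in Problem 1(a)); all names are placeholders:

```python
def gradient_descent(loss, grad, w0, step_size, num_iters):
    """Fixed step-size gradient descent; records the cost at every iterate."""
    w = w0.copy()
    history = []
    for _ in range(num_iters):
        history.append(loss(w))
        w = w - step_size * grad(w)
    return w, history
```

The recorded history can be passed directly to a plotting library for the requested cost-versus-iteration plot.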

(b)    Repeat the previous part using gradient descent with momentum.
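The problem does not specify which momentum scheme to use; a heavy-ball variant is one option. A sketch, reusing the placeholder callables from the previous part and a momentum parameter beta of your choosing:

```python
def gradient_descent_momentum(loss, grad, w0, step_size, beta, num_iters):
    """Heavy-ball update: w_{k+1} = w_k - t * grad(w_k) + beta * (w_k - w_{k-1})."""
    w, w_prev = w0.copy(), w0.copy()
    history = []
    for _ in range(num_iters):
        history.append(loss(w))
        w_next = w - step_size * grad(w) + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w, history
```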

(c)     Implement gradient descent with Armijo line search. Assume that we are at the point x_k and have a search direction p_k (for gradient descent, p_k = −∇f(x_k)). Then the Armijo line search procedure is:

•    Pick an initial step-size t

•    Initialize the parameters 0 < ρ < 1 and 0 < c < 1 (typical values are c = 10^−4 and ρ = 0.9)

•    While f(x_k + t p_k) > f(x_k) + c t ∇f(x_k)^T p_k, set t ← ρt

•    Terminate the procedure when f(x_k + t p_k) ≤ f(x_k) + c t ∇f(x_k)^T p_k

The test in the while loop is the Armijo condition. If p_k = −∇f(x_k), then the step is accepted when f(x_k − t∇f(x_k)) ≤ f(x_k) − c t ‖∇f(x_k)‖_2^2. In general, the second term is negative as long as p_k is a descent direction. One can prove that this line search procedure terminates.

Find a good estimate for the initial step-size by trial and error. A simple idea is to reuse the final step-size accepted at the previous iteration, but this can be unnecessarily small; if you do this, increase that step-size by a factor of 2 before starting the backtracking.
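A direct transcription of the backtracking procedure above into Python; f and grad_f are assumed callables, and this is only a sketch:

```python
def armijo_step(f, grad_f, x, p, t_init, rho=0.9, c=1e-4):
    """Backtrack t until f(x + t p) <= f(x) + c t grad_f(x)^T p holds."""
    fx, g = f(x), grad_f(x)
    t = t_init
    while f(x + t * p) > fx + c * t * (g @ p):
        t *= rho
    return t
```

Each outer gradient-descent iteration can then start the backtracking from, say, twice the step-size accepted at the previous iteration, as suggested above.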

3.    Newton’s Method is Affine Invariant (20 pts)
In this question, we will prove the affine invariance of Newton's method. Let f : R^n → R be a convex function. Consider an affine transformation y → Ay + b, where A ∈ R^{n×n} is invertible and b ∈ R^n. Define the function g : R^n → R by g(y) = f(Ay + b). Denote by x(k) the kth iterate of Newton's method performed on f, and by y(k) the kth iterate of Newton's method performed on g.
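Hint: by the chain rule, ∇g(y) = A^T ∇f(Ay + b) and ∇²g(y) = A^T ∇²f(Ay + b) A; these identities may be useful in both parts.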

(a)     Show that if x(k) = Ay(k) + b, then x(k+1) = Ay(k+1) + b.

(b)    Show that Newton's decrement does not depend on the coordinates, i.e., show that λ(x(k)) = λ(y(k)), where λ(x) = (∇f(x)^T ∇²f(x)^{−1} ∇f(x))^{1/2} (and λ(y(k)) is defined analogously with g in place of f).

Together, this implies that Newton’s method is affine invariant. As an important consequence, Newton’s method cannot be improved by a change of coordinates, unlike gradient descent.

4.    Newton’s Method for Convex Optimization
(a)     Implement Newton’s method for the logistic regression problem in Problem 1. Plot the value of the cost function at each iteration and find a reasonable step-size for fast convergence.
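A sketch of the Newton iteration with a fixed step-size, assuming placeholder callables loss(w), grad(w), and hess(w) for the expressions derived in Problem 1:

```python
import numpy as np

def newton_method(loss, grad, hess, w0, step_size, num_iters):
    """Newton's method with a fixed step-size; records the cost at every iterate."""
    w = w0.copy()
    history = []
    for _ in range(num_iters):
        history.append(loss(w))
        d = np.linalg.solve(hess(w), grad(w))   # Newton direction
        w = w - step_size * d
    return w, history
```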

(b)    Implement randomized Newton's method with the uniform sampling sketch, i.e., sampling rows of H^{1/2} uniformly at random, where H and H^{1/2} denote the Hessian and its square root, respectively. Plot the value of the cost function at each iteration and find a reasonable step-size and sketch-size for fast convergence.
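One possible reading of the uniform sampling sketch, assuming a placeholder callable sqrt_hess(w) that returns the n × p matrix H^{1/2} (one row per data point, so that H = (H^{1/2})^T H^{1/2}); sampling with replacement and the small ridge term are my own choices for a simple, stable sketch, not requirements of the problem:

```python
import numpy as np

def randomized_newton(loss, grad, sqrt_hess, w0, step_size, sketch_size,
                      num_iters, seed=0):
    """Newton-type step using a uniform row-sampling sketch of H^{1/2}."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    history = []
    for _ in range(num_iters):
        history.append(loss(w))
        R = sqrt_hess(w)                               # n x p square-root factor
        n = R.shape[0]
        idx = rng.integers(0, n, size=sketch_size)     # uniform sampling with replacement
        RS = np.sqrt(n / sketch_size) * R[idx]         # scaled sampled rows
        H_approx = RS.T @ RS + 1e-8 * np.eye(w.size)   # small ridge for numerical stability
        w = w - step_size * np.linalg.solve(H_approx, grad(w))
    return w, history
```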

5.    Fast Johnson-Lindenstrauss Transform (FJLT) using Hadamard Matrices
(a)     Construct a 128 × 1024 FJLT matrix as follows:

Set m = 128 and n = 1024

Define H_1 = [1 1; 1 −1]

Construct H_10 ∈ R^{1024×1024} recursively via H_k = [H_{k−1} H_{k−1}; H_{k−1} −H_{k−1}] for k = 2, …, 10

Generate D as an n×n diagonal matrix of uniformly random ±1 variables (Rademacher distribution)

Generate an m × n uniform sub-sampling matrix P scaled with √(n/m) (uniform sampling sketch)

Form the FJLT matrix S = (1/√n) P H_10 D.
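A minimal NumPy sketch of this construction; the random seed, the choice to sample rows without replacement, and the scaling constants follow the steps above and should be checked against part (b):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 128, 1024

# Hadamard matrix H_10 via H_k = [H_{k-1} H_{k-1}; H_{k-1} -H_{k-1}], starting from H_1.
H = np.array([[1.0, 1.0], [1.0, -1.0]])
for _ in range(9):                                 # 2^10 = 1024
    H = np.block([[H, H], [H, -H]])

# Diagonal matrix D of random signs (Rademacher).
D = np.diag(rng.choice([-1.0, 1.0], size=n))

# m x n uniform sub-sampling matrix P, scaled by sqrt(n/m).
rows = rng.choice(n, size=m, replace=False)        # without replacement (a choice)
P = np.zeros((m, n))
P[np.arange(m), rows] = np.sqrt(n / m)

# FJLT matrix S = (1/sqrt(n)) P H D.
S = (1.0 / np.sqrt(n)) * P @ H @ D

# Part (b): S S^T should be a multiple of the identity; here S S^T = (n/m) I.
print(np.allclose(S @ S.T, (n / m) * np.eye(m)))
```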

(b)    Verify that S S^T is a multiple of the identity. Scale S appropriately if needed to obtain S S^T = I.

(c)     Generate a data matrix A of size 1024 × 10 using i.i.d. standard Gaussian variables. Plot the singular values of A and singular values of SA.

(d)    In part (c), SA is a Johnson-Lindenstrauss embedding of 10 vectors (the 1024-dimensional column vectors of A) into dimension 128. Verify that the pairwise distances are approximately preserved, i.e., that there exists an ε ∈ (0, 1) such that

(1 − ε) ‖A_i − A_j‖_2^2 ≤ ‖S A_i − S A_j‖_2^2 ≤ (1 + ε) ‖A_i − A_j‖_2^2   for all i, j,                      (2)

where A_i is the i-th column of A. Find ε*, the smallest value of ε that satisfies (2) for a single realization of the random construction. Note that the matrices D and P are constructed randomly, while H is deterministic. What are the minimum, maximum, and mean values of ε* over 100 random realizations of the construction?

Hint: The JL embedding property in (2) specifies 2·(d choose 2) linear inequalities in ε, of the form

ε ≥ ‖S A_i − S A_j‖_2^2 / ‖A_i − A_j‖_2^2 − 1   and   ε ≥ 1 − ‖S A_i − S A_j‖_2^2 / ‖A_i − A_j‖_2^2,

hence the smallest feasible value is ε* = max_{i<j} | ‖S A_i − S A_j‖_2^2 / ‖A_i − A_j‖_2^2 − 1 |.
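One way to compute ε* for a single realization, assuming the matrices A and S from the earlier parts are available; the function name is a placeholder and the double loop is just the hint made explicit:

```python
import numpy as np

def smallest_epsilon(A, S):
    """Smallest eps with (1-eps)||Ai-Aj||^2 <= ||S(Ai-Aj)||^2 <= (1+eps)||Ai-Aj||^2."""
    d = A.shape[1]
    eps_star = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            diff = A[:, i] - A[:, j]
            ratio = np.linalg.norm(S @ diff) ** 2 / np.linalg.norm(diff) ** 2
            eps_star = max(eps_star, abs(ratio - 1.0))
    return eps_star
```

Repeating this over 100 fresh draws of D and P (H stays fixed) gives the requested minimum, maximum, and mean of ε*.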

(e)     Generate a data matrix A of size 1024 × 10, and a vector b of size 1024 × 1 using i.i.d. standard Gaussian variables. Solve the least squares problem

x* = argmin_x ‖Ax − b‖_2^2.

Apply the FJLT to A and b as SA and Sb. Solve the sketched least squares problem

x̃ := argmin_x ‖SAx − Sb‖_2^2.

Find the Euclidean distance between the solutions, i.e., ‖x̃ − x*‖_2, the distance between their predictions, i.e., ‖Ax̃ − Ax*‖_2, and the approximation ratio ‖Ax̃ − b‖_2^2 / ‖Ax* − b‖_2^2 of the objective value.
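A sketch of this part in NumPy, assuming the FJLT matrix S from part (a) is already in scope:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 10))
b = rng.standard_normal(1024)

# Exact and sketched least-squares solutions.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
x_tilde, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

print(np.linalg.norm(x_tilde - x_star))              # distance between solutions
print(np.linalg.norm(A @ x_tilde - A @ x_star))      # distance between predictions
print(np.linalg.norm(A @ x_tilde - b) ** 2
      / np.linalg.norm(A @ x_star - b) ** 2)         # approximation ratio of the objective
```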
