Exercise 1: Lagrange Multipliers (10+10 P)
Let x1, ..., xN ∈ R^d be a dataset of N data points. We consider the objective function
J(θ) = ∑_{k=1}^N ||θ − xk||²
to be minimized with respect to the parameter θ ∈ R^d. In the absence of constraints, the parameter θ that minimizes this objective is the empirical mean m. However, this is generally no longer the case when the parameter θ is constrained.
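As a brief reminder of why the unconstrained minimizer is the empirical mean, setting the gradient of J to zero yields:

\nabla_\theta J(\theta) = \sum_{k=1}^{N} 2(\theta - x_k) = 0
\quad\Longrightarrow\quad
\theta = \frac{1}{N}\sum_{k=1}^{N} x_k = m .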
(a) Using the method of Lagrange multipliers, find the parameter θ that minimizes J(θ) subject to the constraint θ^⊤b = 0, where b is some unit vector in R^d. Give a geometrical interpretation of your solution.
(b) Using the same method, find the parameter θ that minimizes J(θ) subject to ||θ − c||² = 1, where c is a vector in R^d different from m. Give a geometrical interpretation of your solution.
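One possible way to set up the Lagrangians for parts (a) and (b), with λ denoting the Lagrange multiplier in each case (this is only a starting point, not a full solution):

\mathcal{L}_{(a)}(\theta,\lambda) = \sum_{k=1}^{N}\|\theta - x_k\|^2 + \lambda\,\theta^\top b
\qquad\qquad
\mathcal{L}_{(b)}(\theta,\lambda) = \sum_{k=1}^{N}\|\theta - x_k\|^2 + \lambda\big(\|\theta - c\|^2 - 1\big)

Setting the derivatives with respect to θ and λ to zero yields the stationarity and feasibility conditions to be solved in each case.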
Exercise 2: Principal Component Analysis (10+10 P)
We consider a dataset x1, ..., xN ∈ R^d. Principal component analysis searches for a unit vector u ∈ R^d such that projecting the data onto that vector produces a distribution with maximum variance. Such a vector can be found by solving the optimization problem:
argmax_u (1/N) ∑_{k=1}^N (u^⊤xk − u^⊤m)²    with    ||u||² = 1
(a) Show that the problem above can be rewritten as
argmax_u u^⊤Su    with    ||u||² = 1
where S = ∑_{k=1}^N (xk − m)(xk − m)^⊤ is the scatter matrix, and m is the empirical mean.
(b) Show using the method of Lagrange multipliers that the problem above can be reformulated as solving the eigenvalue problem
Su = λu
and retaining the eigenvector u associated with the largest eigenvalue λ.
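For intuition, here is a small numerical illustration of this result in numpy: the leading eigenvector of the scatter matrix attains a larger value of u^⊤Su than random unit vectors. The toy dataset and all variable names below are illustrative only.

import numpy as np

# Toy dataset: N = 100 points in R^2 with anisotropic spread (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([3.0, 0.5])

m = X.mean(axis=0)                  # empirical mean
S = (X - m).T @ (X - m)             # scatter matrix (d x d)

# Leading eigenvector of S (numpy.linalg.eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]

# Compare u^T S u at the leading eigenvector with its value at random unit vectors
best_random = 0.0
for v in rng.normal(size=(1000, 2)):
    u = v / np.linalg.norm(v)
    best_random = max(best_random, u @ S @ u)

print(u1 @ S @ u1, ">=", best_random)   # the left-hand value equals the largest eigenvalue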
Exercise 3: Bounds on Eigenvalues (5+5+5+5 P)
Let λ1 denote the largest eigenvalue of the matrix S. The eigenvalue λ1 quantifies the variance of the data when projected onto the first principal component. Because computing it exactly can be expensive, we study how λ1 can be bounded using only the diagonal elements of the matrix S.
(a) Show that ∑_i Sii (the trace of the matrix S) is an upper bound on the eigenvalue λ1.
(b) State the conditions on the data for which the upper bound is tight.
(c) Show that max_i Sii is a lower bound on the eigenvalue λ1.
(d) State the conditions on the data for which the lower bound is tight.
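A quick numerical sanity check of the two bounds, under the reading above that the upper bound is the trace of S and the lower bound its largest diagonal element (the dataset and names below are illustrative):

import numpy as np

# Illustrative dataset: N = 50 points in R^5
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))

m = X.mean(axis=0)
S = (X - m).T @ (X - m)                  # scatter matrix

lam1 = np.linalg.eigvalsh(S)[-1]         # largest eigenvalue of S
print(S.diagonal().max(), "<=", lam1, "<=", np.trace(S))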
Exercise 4: Iterative PCA (10 P)
When performing principal component analysis, computing the full eigendecomposition of the scatter matrix S is typically slow, and we are often only interested in the first few principal components. An efficient procedure for finding the first principal component is power iteration. It starts with a random unit vector w^(0) ∈ R^d, and iteratively applies the parameter update
w^(t+1) = Sw^(t) / ||Sw^(t)||
until some convergence criterion is met. Here, we would like to show the exponential convergence of power iteration. For this, we look at the error terms
Ek(w) = |w^⊤uk| / |w^⊤u1|    with k = 2, ..., d,
where u1, ..., ud are the eigenvectors of S (sorted by decreasing eigenvalue),
and observe that they should all converge to zero as w approaches the eigenvector u1 and becomes orthogonal to other eigenvectors.
(a) Show that Ek(w^(T)) = |λk/λ1|^T · Ek(w^(0)), i.e. the convergence of the algorithm is exponential in the number of time steps T. (Hint: to show this, it is useful to rewrite the scatter matrix in terms of its eigenvalues and eigenvectors, i.e. S = ∑_{i=1}^d λi ui ui^⊤.)
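A small numerical illustration (not part of the required derivation) of the exponential decay of the error terms under the update rule above; the dataset and variable names are made up for this sketch.

import numpy as np

# Illustrative dataset with clearly separated variances along the axes
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) * np.array([4.0, 2.0, 1.0, 0.5])

m = X.mean(axis=0)
S = (X - m).T @ (X - m)                  # scatter matrix

lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]           # eigenpairs sorted by decreasing eigenvalue

def errors(w):
    # E_k(w) = |w^T u_k| / |w^T u_1| for k = 2, ..., d
    proj = np.abs(U.T @ w)
    return proj[1:] / proj[0]

w = rng.normal(size=4)
w /= np.linalg.norm(w)
for t in range(8):
    print("t =", t, "errors:", errors(w))   # each term shrinks roughly by |lambda_k / lambda_1| per step
    w = S @ w
    w /= np.linalg.norm(w)                  # power iteration update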
Exercise 5: Programming (30 P)
Download the programming files on ISIS and follow the instructions.
Principal Component Analysis
In this exercise, we will experiment with two different techniques to compute the PCA components of a dataset:
Singular Value Decomposition (SVD)
Power Iteration: A technique that iteratively optimizes the PCA objective.
We consider a random subset of the Labeled Faces in the Wild (LFW) dataset, readily accessible from sklearn, and we apply some basic preprocessing to discount strong variations of luminosity and contrast.
Two functions are provided for your convenience in utils.py, which is included in the zip archive:
utils.scatterplot produces a scatter plot from a two-dimensional dataset.
utils.render takes an array of data points or objects of similar shape, and renders them in the IPython notebook.
Some demo code that makes use of these functions is given below.
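A minimal sketch of what such demo code might look like is given here, assuming sklearn's fetch_lfw_people for the data, a single-array calling convention for utils.scatterplot and utils.render, and a simple per-image standardization as the luminosity/contrast preprocessing (all of these are assumptions, not the original code).

import numpy as np
from sklearn.datasets import fetch_lfw_people
import utils

faces = fetch_lfw_people()                        # Labeled Faces in the Wild, via sklearn
X = faces.data.astype(np.float64)                 # one flattened image per row, shape (N, d)

# Random subset of the data (subset size is an assumption)
idx = np.random.default_rng(0).choice(len(X), size=500, replace=False)
X = X[idx]

# Assumed preprocessing: remove per-image mean brightness and rescale contrast
X = X - X.mean(axis=1, keepdims=True)
X = X / (X.std(axis=1, keepdims=True) + 1e-9)

utils.render(X[:10])                              # render a few preprocessed faces
utils.scatterplot(X[:, :2])                       # scatter plot of two individual input dimensions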
PCA with Singular Value Decomposition (15 P)
Principal components can be found by computing a singular value decomposition. Specifically, we assume a matrix X̄ whose columns contain the data points represented as vectors, and where the data points have been centered (i.e. the mean of the dataset has been subtracted from each of them). The matrix X̄ is of size d × N, where d is the number of input features and N is the number of data points. This matrix, or more specifically the rescaled matrix Z = N^(−1/2) X̄, is then decomposed using singular value decomposition:
UΛV^⊤ = Z
The first k principal components can then be found in the first k columns of the matrix U.
Tasks:
Compute the principal components of the data using the function numpy.linalg.svd .
Measure the computational time required to find the principal components. Use the function time.time() for that purpose. Do not include in your estimate the computation overhead caused by loading the data, plotting and rendering.
Plot the projection of the dataset on the first two principal components using the function utils.scatterplot .
Visualize the 60 leading principal components using the function utils.render .
Note that if the algorithm runs for more than 3 minutes, there may be some error in your implementation.
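One possible way to carry out these tasks is sketched below. It reuses the preprocessed matrix X from the demo sketch above and assumes the same single-array convention for the utils functions; the rescaling by 1/√N follows the definition of Z above.

import time
import numpy as np
import utils

Xbar = (X - X.mean(axis=0)).T                 # centered data, columns are data points (d x N)
Z = Xbar / np.sqrt(Xbar.shape[1])             # rescaled matrix Z = N^(-1/2) Xbar

start = time.time()
U, svals, Vt = np.linalg.svd(Z, full_matrices=False)
print("Time: %.3f seconds" % (time.time() - start))

utils.scatterplot(Xbar.T @ U[:, :2])          # projection on the first two principal components
utils.render(U[:, :60].T)                     # the 60 leading principal components as images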
Looking at the scatter plot, we observe that much more variance is captured by the first two principal components than by the individual input dimensions plotted earlier. Looking at the principal components themselves, rendered as images, we can see that the first principal components correspond to low-frequency filters that select for coarse features, while the following components capture progressively higher-frequency information and become increasingly noisy.
PCA with Power Iteration (15 P)
The SVD-based PCA procedure above is quite expensive to compute. The power iteration algorithm instead looks only for the first principal component and finds it using an iterative procedure. It starts with an initial weight vector w ∈ R^d, and repeatedly applies the update rule w ← Sw / ||Sw||,
where S is the covariance matrix defined as S = (1/N) X̄X̄^⊤. As for standard PCA, the objective that iterative PCA optimizes is J(w) = w^⊤Sw subject to the unit norm constraint on w. We can therefore keep track of the progress of the algorithm after each iteration.
Tasks:
Implement the power iteration algorithm. As a stopping criterion, stop when the value of J(w) increases by less than 0.01 between two consecutive iterations.
Print the value of the objective function J(w) at each iteration.
Measure the time taken to find the principal component.
Visualize the eigenvector w obtained at convergence using the function utils.render.
Note that if the algorithm runs for more than 1 minute, there may be some error in your implementation.
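A sketch of one possible implementation, reusing the centered matrix Xbar from above; the stopping threshold of 0.01 follows the task description, and the call to utils.render again assumes a single array argument.

import time
import numpy as np
import utils

N = Xbar.shape[1]
S = (Xbar @ Xbar.T) / N                       # covariance matrix S = (1/N) Xbar Xbar^T

start = time.time()
w = np.random.normal(size=S.shape[0])
w /= np.linalg.norm(w)

J_old = w @ S @ w
for t in range(1000):
    w = S @ w
    w /= np.linalg.norm(w)                    # power iteration update
    J_new = w @ S @ w
    print("iteration %d  J(w) = %.3f" % (t, J_new))
    if J_new - J_old < 0.01:                  # stop once J(w) increases by less than 0.01
        break
    J_old = J_new

print("Time: %.3f seconds" % (time.time() - start))
utils.render(w[None, :])                      # render the leading eigenvector as an image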
In [5]:
iteration 0   J(w) = 0.656
iteration 1   J(w) = 126.401
iteration 2   J(w) = 207.474
iteration 3   J(w) = 221.764
iteration 4   J(w) = 231.388
iteration 5   J(w) = 241.061
iteration 6   J(w) = 250.240
iteration 7   J(w) = 257.989
iteration 8   J(w) = 263.840
iteration 9   J(w) = 267.884
iteration 10  J(w) = 270.508
iteration 11  J(w) = 272.142
iteration 12  J(w) = 273.133
iteration 13  J(w) = 273.725
iteration 14  J(w) = 274.075
iteration 15  J(w) = 274.281
iteration 16  J(w) = 274.402
iteration 17  J(w) = 274.473
iteration 18  J(w) = 274.514
iteration 19  J(w) = 274.538
iteration 20  J(w) = 274.552
iteration 21  J(w) = 274.561
iteration 22  J(w) = 274.565
iteration 23  J(w) = 274.568
iteration 24  J(w) = 274.570
Time: 5.962 seconds
We observe that the computation time has decreased significantly compared to the SVD-based approach, and the performance gap grows as the number of dimensions and data points increases. We can also observe that the principal component obtained is the same (up to a possible sign flip) as the one obtained with the SVD algorithm.