Exercise 1: Symmetries in LLE
The Locally Linear Embedding (LLE) method takes as input a collection of data points $\vec{x}_1, \dots, \vec{x}_N \in \mathbb{R}^d$ and embeds them in some low-dimensional space. LLE operates in two steps, the first of which consists of minimizing the objective
$E(w) = \sum_{i=1}^{N} \Big\lVert \vec{x}_i - \sum_{j} w_{ij}\, \vec{x}_j \Big\rVert^2$
where $w$ is a collection of reconstruction weights subject to the constraint $\forall i: \sum_j w_{ij} = 1$, and where $\sum_j$ runs over the $K$ nearest neighbors of the data point $\vec{x}_i$. The solution that minimizes the LLE objective can be shown to be invariant to various transformations of the data.
Show that invariance holds in particular for the following transformations:
(a) Replacement of all $\vec{x}_i$ with $\alpha \vec{x}_i$, for an $\alpha \in \mathbb{R}_{+} \setminus \{0\}$,
(b) Replacement of all $\vec{x}_i$ with $\vec{x}_i + \vec{v}$, for a vector $\vec{v} \in \mathbb{R}^d$,
(c) Replacement of all $\vec{x}_i$ with $U \vec{x}_i$, where $U$ is an orthogonal $d \times d$ matrix.
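As a quick numerical sanity check of these three invariances (not a substitute for the proofs), one can compute the optimal weights of a point before and after each transformation. The sketch below assumes NumPy, uses synthetic data, and relies on the closed form derived in Exercise 2; the helper `lle_weights` is ours.

import numpy as np

def lle_weights(X, i, K):
    """Optimal reconstruction weights of X[i] from its K nearest neighbours."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    nbrs = np.argsort(d2)[1:K + 1]            # K nearest neighbours (excluding i)
    G = X[i] - X[nbrs]                        # K x d matrix of differences x_i - eta_k
    w = np.linalg.solve(G @ G.T, np.ones(K))  # solve C w = 1 (C nonsingular here since K <= d)
    return w / w.sum()                        # rescale so the weights sum to 1

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                 # synthetic data: N = 50 points in R^10
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))  # a random orthogonal matrix
v = rng.normal(size=10)                       # a random translation vector

w0 = lle_weights(X, i=0, K=5)
for Xt in (2.5 * X, X + v, X @ Q.T):          # (a) scaling, (b) translation, (c) rotation
    print(np.allclose(w0, lle_weights(Xt, i=0, K=5)))   # expected: True, True, True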
Exercise 2: Closed form for LLE
In the following, we would like to show that the optimal weights w have an explicit analytic solution. For this, we first observe that the objective function can be decomposed as a sum of as many subobjectives as there are data points:
$E(w) = \sum_{i=1}^{N} E_i(w) \quad \text{with} \quad E_i(w) = \Big\lVert \vec{x}_i - \sum_{j} w_{ij}\, \vec{x}_j \Big\rVert^2$
Furthermore, because each subobjective depends on different parameters, they can be optimized independently. We consider one such subobjective and, for simplicity of notation, rewrite it as
$E(w) = \Big\lVert \vec{x} - \sum_{k=1}^{K} w_k\, \vec{\eta}_k \Big\rVert^2$
where $\vec{x}$ is the current data point (we have dropped the index $i$), where $\eta$ is the $K \times d$ matrix whose rows $\vec{\eta}_1, \dots, \vec{\eta}_K$ are the $K$ nearest neighbors of $\vec{x}$, and where $w$ is the vector of size $K$ containing the weights to optimize, subject to the constraint $\sum_{k=1}^{K} w_k = 1$.
(a) Prove that the optimal weights for $\vec{x}$ are found by solving the following optimization problem:
$\min_{w} \; w^\top C w \quad \text{subject to} \quad w^\top \mathbf{1} = 1,$
where $C = (\mathbf{1}\vec{x}^\top - \eta)(\mathbf{1}\vec{x}^\top - \eta)^\top$ is the covariance matrix associated to the data point $\vec{x}$ and $\mathbf{1}$ is a vector of ones of size $K$.
(b) Show using the method of Lagrange multipliers that the solution of the optimization problem found in (a) is given analytically as
$w = \dfrac{C^{-1}\mathbf{1}}{\mathbf{1}^\top C^{-1}\mathbf{1}}.$
(c) Show that the optimal $w$ can be equivalently found by solving the equation $Cw = \mathbf{1}$ and then rescaling $w$ such that $w^\top \mathbf{1} = 1$.
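The following small NumPy sketch illustrates (a)-(c) on a synthetic neighbourhood (the point $\vec{x}$ and the matrix $\eta$ are random stand-ins, not part of the exercise): it builds $C$, evaluates the closed-form weights, and checks that solving $Cw = \mathbf{1}$ followed by rescaling yields the same result and a lower objective than other feasible weight vectors.

import numpy as np

rng = np.random.default_rng(1)
K, d = 4, 10
x = rng.normal(size=d)                    # the current data point
eta = rng.normal(size=(K, d))             # its K nearest neighbours, one per row
ones = np.ones(K)

M = np.outer(ones, x) - eta               # 1 x^T - eta, shape K x d
C = M @ M.T                               # covariance matrix of the neighbourhood

# (b) closed form from the Lagrangian: w = C^{-1} 1 / (1^T C^{-1} 1)
Cinv1 = np.linalg.solve(C, ones)
w_closed = Cinv1 / (ones @ Cinv1)

# (c) solve C w = 1, then rescale so that w^T 1 = 1
w = np.linalg.solve(C, ones)
w = w / w.sum()

print(np.allclose(w_closed, w))           # True: both routes agree
print(np.isclose(w @ C @ w,               # (a) the quadratic form equals the
      np.sum((x - eta.T @ w) ** 2)))      #     reconstruction error ||x - sum_k w_k eta_k||^2
for _ in range(5):                        # w attains a lower objective than
    v = rng.normal(size=K); v /= v.sum()  # other weights satisfying v^T 1 = 1
    print(w @ C @ w <= v @ C @ v + 1e-12) # True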
Exercise 3: SNE and Kullback-Leibler Divergence
SNE is an embedding algorithm that operates by minimizing the Kullback-Leibler divergence between two discrete probability distributions $p$ and $q$ representing the input space and the embedding space respectively. In ‘symmetric SNE’, these discrete distributions assign to each pair of data points $(i,j)$ in the dataset the probability scores $p_{ij}$ and $q_{ij}$ respectively, which quantify how close the two data points are in the input and embedding spaces. Once the probability functions are defined, the embedding algorithm proceeds by optimizing the function
$\mathrm{KL}(p \,\|\, q) = \sum_{ij} p_{ij} \log \dfrac{p_{ij}}{q_{ij}}$
where $p$ and $q$ are subject to the constraints $\sum_{ij} p_{ij} = 1$ and $\sum_{ij} q_{ij} = 1$. Specifically, the algorithm minimizes this divergence with respect to $q$, which is itself a function of the coordinates in the embedding space. Optimization is typically performed using gradient descent.
In this exercise, we derive the gradient of the Kullback-Leibler divergence, first with respect to the probability scores $q_{ij}$, and then with respect to the embedding coordinates of which $q_{ij}$ is a function.
(a) Show that
$\dfrac{\partial \,\mathrm{KL}(p\,\|\,q)}{\partial q_{ij}} = -\dfrac{p_{ij}}{q_{ij}}. \qquad (1)$
(b) The probability matrix $q$ is now reparameterized using a ‘softargmax’ function:
$q_{ij} = \dfrac{\exp(z_{ij})}{\sum_{kl} \exp(z_{kl})}.$
The new variables $z_{ij}$ can be interpreted as unnormalized log-probabilities. Show that
$\dfrac{\partial \,\mathrm{KL}(p\,\|\,q)}{\partial z_{ij}} = q_{ij} - p_{ij}. \qquad (2)$
(c) Explain which of the two gradients, (1) or (2), is the more appropriate for practical use in a gradient descent algorithm. Motivate your choice, first in terms of the stability or boundedness of the gradient, and second in terms of the ability to maintain a valid probability distribution during training.
(d) The scores $z_{ij}$ are now reparameterized as
$z_{ij} = -\lVert \vec{y}_i - \vec{y}_j \rVert^2$
where the coordinates $\vec{y}_i, \vec{y}_j \in \mathbb{R}^h$ of the data points in the embedding space now appear explicitly. Show using the chain rule for derivatives that
$\dfrac{\partial \,\mathrm{KL}(p\,\|\,q)}{\partial \vec{y}_i} = 4 \sum_{j} (p_{ij} - q_{ij})(\vec{y}_i - \vec{y}_j).$
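As a numerical cross-check of the results in (b) and (d) (a sketch only, assuming NumPy; the symmetric scores $p$ below are arbitrary and all helper names are ours), one can compare the analytic gradient with central finite differences of the Kullback-Leibler divergence:

import numpy as np

rng = np.random.default_rng(0)
N, h = 6, 2
Y = rng.normal(size=(N, h))                 # embedding coordinates y_1, ..., y_N

P = rng.random((N, N))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()                                # arbitrary symmetric p with sum_ij p_ij = 1

def q_of(Y):
    """q_ij = softargmax of z_ij = -||y_i - y_j||^2 over all pairs i != j."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    E = np.exp(-D2)
    np.fill_diagonal(E, 0.0)
    return E / E.sum()

def kl(Y):
    Q = q_of(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

# analytic gradient from (d): dKL/dy_i = 4 sum_j (p_ij - q_ij)(y_i - y_j)
Q = q_of(Y)
grad = 4.0 * np.einsum('ij,ijk->ik', P - Q, Y[:, None, :] - Y[None, :, :])

# central finite differences of KL(p||q) with respect to each coordinate of Y
eps = 1e-6
num = np.zeros_like(Y)
for i in range(N):
    for a in range(h):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, a] += eps
        Ym[i, a] -= eps
        num[i, a] = (kl(Yp) - kl(Ym)) / (2 * eps)

print(np.max(np.abs(grad - num)))           # close to zero: the two gradients agree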
Exercise 4: Programming
Download the programming files on ISIS and follow the instructions.
Exercise sheet 1 (programming) Machine Learning 2
Implementing Locally Linear Embedding
In this programming homework we will implement locally linear embedding (LLE) and experiment with it on the Swiss roll dataset. In particular, we will analyze the effects of the neighbourhood size and of the noise level on the quality of the embedding.
Although the dataset is three-dimensional, the points lie on a two-dimensional manifold. The goal of embedding algorithms is to extract this underlying structure, in this case by unrolling the Swiss roll into a two-dimensional Euclidean space.
In the following, we consider a simple implementation of LLE. You are required to complete the code by writing the portion where the optimal reconstruction weights are extracted. (Hint: during the computation, you need to solve an equation of the type $Cw = \mathbf{1}$, where $\mathbf{1}$ is the column vector $(1, 1, \dots, 1)^\top$. In case $k > d$, i.e. when the size of the neighbourhood is larger than the number of dimensions of the input space, it is necessary to regularize the matrix $C$. You can do this by adding positive terms on the diagonal; a good starting point is 0.05.)
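A possible way to fill in the missing portion is sketched below (assuming NumPy; the function name `reconstruction_weights` and its argument layout are ours and will differ from the provided skeleton):

import numpy as np

def reconstruction_weights(X, i, ind, reg=0.05):
    """Optimal LLE reconstruction weights of X[i] from its neighbours X[ind]."""
    K = len(ind)
    G = X[i] - X[ind]                      # K x d matrix of differences x_i - eta_k
    C = G @ G.T                            # local covariance matrix
    if K > X.shape[1]:                     # k > d: C is singular, regularize it
        C = C + reg * np.eye(K)            # positive terms on the diagonal, as in the hint
    w = np.linalg.solve(C, np.ones(K))     # solve C w = 1
    return w / w.sum()                     # rescale so that the weights sum to 1

Whether the diagonal term is used as a plain constant, as in the hint, or scaled by the trace of $C$ is a design choice; 0.05 is the suggested starting point.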
You can now test your implementation on the swiss roll dataset and vary the noise in the data and the parameter k of the LLE algorithm. Results are shown below:
It can be observed that the parameter $k$ must be carefully tuned: large enough to provide sufficiently many neighbors for a stable weight computation, but not so large that the neighborhoods stop being locally linear. We can further observe that LLE recovers the underlying structure well as long as the noise in the data remains sufficiently low.