Machine Learning 2 – Exercise Sheet 1

Exercise 1: Symmetries in LLE 
The Locally Linear Embedding (LLE) method takes as input a collection of data points ~x1,...,~xN ∈ R^d and embeds them in some low-dimensional space. LLE operates in two steps, with the first step consisting of minimizing the objective

E(w) = \sum_{i=1}^{N} \Big\| \vec{x}_i - \sum_{j} w_{ij}\, \vec{x}_j \Big\|^2

where w is a collection of reconstruction weights subject to the constraint ∀i : Σj wij = 1, and where Σj sums over the K nearest neighbors of the data point ~xi. The solution that minimizes the LLE objective can be shown to be invariant to various transformations of the data.

Show that invariance holds in particular for the following transformations:

(a)    Replacement of all ~xi with α~xi, for an α ∈ R+ \ {0},

(b)    Replacement of all ~xi with ~xi +~v, for a vector ~v ∈ Rd,

(c)    Replacement of all ~xi with U~xi, where U is an orthogonal d × d matrix.
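As an illustration of the kind of argument these proofs require, here is a minimal sketch of case (b); it uses nothing beyond the weight constraint Σj wij = 1:

\Big\| (\vec{x}_i + \vec{v}) - \sum_j w_{ij}\,(\vec{x}_j + \vec{v}) \Big\|^2
= \Big\| \vec{x}_i - \sum_j w_{ij}\,\vec{x}_j + \Big(1 - \sum_j w_{ij}\Big)\vec{v} \Big\|^2
= \Big\| \vec{x}_i - \sum_j w_{ij}\,\vec{x}_j \Big\|^2 .

Every term of E(w) is unchanged, hence so is the minimizer. Cases (a) and (c) can be handled analogously, using ‖αz‖² = α²‖z‖² (a positive constant factor does not change the argmin) and ‖Uz‖ = ‖z‖ for orthogonal U.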

Exercise 2: Closed form for LLE 
In the following, we would like to show that the optimal weights w have an explicit analytic solution. For this, we first observe that the objective function can be decomposed as a sum of as many subobjectives as there are data points:

E(w) = \sum_{i=1}^{N} E_i(w) \qquad \text{with} \qquad E_i(w) = \Big\| \vec{x}_i - \sum_{j} w_{ij}\, \vec{x}_j \Big\|^2

Furthermore, because each subobjective depends on different parameters, they can be optimized independently. We consider one such subobjective and, for simplicity of notation, rewrite it as:

E(w) = \Big\| \vec{x} - \sum_{k=1}^{K} w_k\, \vec{\eta}_k \Big\|^2 = \big\| \vec{x} - \eta^\top w \big\|^2

where ~x is the current data point (we have dropped the index i), where η = (~η1,...,~ηK) is a matrix of size K × d containing the K nearest neighbors of ~x, and w is the vector of size K containing the weights to optimize, subject to the constraint w^⊤1 = 1 (i.e. Σk wk = 1).

(a)    Prove that the optimal weights for ~x are found by solving the following optimization problem:

\min_{w}\; w^\top C\, w \qquad \text{subject to} \qquad w^\top \mathbf{1} = 1,

where C = (1~x^⊤ − η)(1~x^⊤ − η)^⊤ is the covariance matrix associated with the data point ~x and 1 is a vector of ones of size K.
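One possible route for this derivation, sketched here: using the constraint w^⊤1 = 1, the reconstruction error becomes a linear function of w,

\vec{x} - \eta^\top w = (\mathbf{1}^\top w)\,\vec{x} - \eta^\top w = (\mathbf{1}\vec{x}^\top - \eta)^\top w ,
\qquad\text{so}\qquad
E(w) = \big\| (\mathbf{1}\vec{x}^\top - \eta)^\top w \big\|^2 = w^\top C\, w .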

(b)    Show using the method of Lagrange multipliers that the minimum of the optimization problem found in (a) is given analytically as:

w = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^\top C^{-1}\mathbf{1}} \,.
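A sketch of the Lagrange-multiplier computation leading to this expression (assuming C is invertible):

L(w,\lambda) = w^\top C\, w - \lambda\,(w^\top \mathbf{1} - 1),
\qquad
\nabla_w L = 2\,C w - \lambda\,\mathbf{1} = 0
\;\Rightarrow\;
w = \tfrac{\lambda}{2}\, C^{-1}\mathbf{1} .

Substituting this into the constraint w^⊤1 = 1 fixes λ/2 = 1/(1^⊤ C^{-1} 1), which gives the stated solution.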

(c)    Show that the optimal w can be equivalently found by solving the equation Cw = 1 and then rescaling w such that w^⊤1 = 1.
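The equivalence claimed in (c) can also be checked numerically. The following is a small illustration in Python (NumPy assumed), not part of the required proof:

import numpy as np

rng = np.random.default_rng(0)
K = 5
A = rng.standard_normal((K, K))
C = A @ A.T + 0.05 * np.eye(K)      # a random symmetric positive definite matrix playing the role of C
ones = np.ones(K)

# Closed form from (b): w = C^{-1} 1 / (1^T C^{-1} 1)
w_closed = np.linalg.solve(C, ones)
w_closed = w_closed / (ones @ w_closed)

# Procedure from (c): solve C w = 1, then rescale so that w^T 1 = 1
w = np.linalg.solve(C, ones)
w = w / w.sum()

print(np.allclose(w_closed, w))     # True: both procedures give the same weights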

Exercise 3: SNE and Kullback-Leibler Divergence 
SNE is an embedding algorithm that operates by minimizing the Kullback-Leibler divergence between two discrete probability distributions p and q representing the input space and the embedding space respectively. In ‘symmetric SNE’, these discrete distributions assign to each pair of data points (i,j) in the dataset the probability scores pij and qij respectively, corresponding to how close the two data points are in the input and embedding spaces. Once the exact probability functions are defined, the embedding algorithm proceeds by optimizing the function:

\mathrm{KL}(p\,\|\,q) = \sum_{ij} p_{ij} \log\frac{p_{ij}}{q_{ij}}

where p and q are subject to the constraints Σij pij = 1 and Σij qij = 1. Specifically, the algorithm minimizes this divergence with respect to q, which itself is a function of the coordinates in the embedded space. Optimization is typically performed using gradient descent.

In this exercise, we derive the gradient of the Kullback-Leibler divergence, first with respect to the probability scores qij, and then with respect to the embedding coordinates of which qij is a function.

(a)    Show that

\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial q_{ij}} = -\frac{p_{ij}}{q_{ij}} \,.                                                                                             (1)
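A sketch of the computation behind (1), treating the qij as free variables:

\mathrm{KL}(p\,\|\,q) = \sum_{kl} p_{kl}\log p_{kl} \;-\; \sum_{kl} p_{kl}\log q_{kl}
\quad\Rightarrow\quad
\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial q_{ij}} = -\,\frac{p_{ij}}{q_{ij}} ,

since only the term with indices (k,l) = (i,j) depends on qij.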

(b)    The probability matrix q is now reparameterized using a ‘softargmax’ function:

q_{ij} = \frac{\exp(z_{ij})}{\sum_{kl} \exp(z_{kl})}

The new variables zij can be interpreted as unnormalized log-probabilities. Show that

\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial z_{ij}} = q_{ij} - p_{ij} \,.                                                                                        (2)
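A sketch of the chain-rule computation behind (2), using ∂ log qkl / ∂zij = δ(kl),(ij) − qij for the softargmax and Σkl pkl = 1:

\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial z_{ij}}
= -\sum_{kl} p_{kl}\,\frac{\partial \log q_{kl}}{\partial z_{ij}}
= -\sum_{kl} p_{kl}\,\big(\delta_{(kl),(ij)} - q_{ij}\big)
= -\,p_{ij} + q_{ij}\sum_{kl} p_{kl}
= q_{ij} - p_{ij} .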

(c)    Explain which of the two gradients, (1) or (2), is the most appropriate for practical use in a gradient descent algorithm. Motivate your choice, first in terms of the stability or boundedness of the gradient, and second in terms of the ability to maintain a valid probability distribution during training.

(d)    The scores zij are now reparameterized as

z_{ij} = -\|\vec{y}_i - \vec{y}_j\|^2

where the coordinates ~yi, ~yj ∈ R^h of the data points in the embedded space now appear explicitly. Show using the chain rule for derivatives that

\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial \vec{y}_i} = 4 \sum_{j} (p_{ij} - q_{ij})\,(\vec{y}_i - \vec{y}_j) \,.
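A sketch of the final chain-rule step: combining (2) with ∂zij/∂~yi = −2(~yi − ~yj) and the symmetry pij = pji, qij = qji,

\frac{\partial\,\mathrm{KL}(p\,\|\,q)}{\partial \vec{y}_i}
= \sum_{j}\Big(\frac{\partial\,\mathrm{KL}}{\partial z_{ij}} + \frac{\partial\,\mathrm{KL}}{\partial z_{ji}}\Big)\,\big(-2\,(\vec{y}_i - \vec{y}_j)\big)
= -4\sum_{j}\,(q_{ij} - p_{ij})\,(\vec{y}_i - \vec{y}_j)
= 4\sum_{j}\,(p_{ij} - q_{ij})\,(\vec{y}_i - \vec{y}_j) .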

Exercise 4: Programming 
Download the programming files on ISIS and follow the instructions.

Exercise sheet 1 (programming) – Machine Learning 2

 

Implementing Locally Linear Embedding


 
 
 

In this programming homework we will implement locally linear embedding (LLE) and experiment with it on the swiss roll dataset. In particular, the effects of neighbourhood size and noise on the quality of the embedding will be analyzed.

 

 

Although the dataset is in three dimensions, the points follow a two-dimensional structure. The goal of embedding algorithms is to extract this underlying structure, in this case, unrolling the swiss roll into a two-dimensional Euclidean space.
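For illustration, and assuming scikit-learn and matplotlib are available (the actual ISIS files may generate the data differently), a swiss roll dataset can be produced as follows:

import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll

# Generate noisy 3-D points lying on a 2-D manifold (the swiss roll).
X, t = make_swiss_roll(n_samples=1000, noise=0.1, random_state=0)

# X has shape (1000, 3); t is the position along the roll, useful for coloring the points.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=t, s=5)
ax.set_title("Swiss roll dataset")
plt.show()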

In the following, we consider a simple implementation of LLE. You are required to complete the code by writing the portion where the optimal reconstruction weights are extracted. (Hint: During computation, you need to solve an equation of the type Cw = 1, where 1 is a column vector (1,1,...,1). In case k > d, i.e. when the size of the neighbourhood is larger than the number of dimensions of the input space, it is necessary to regularize the matrix C. You can do this by adding positive terms to the diagonal. A good starting point is 0.05.)
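A minimal sketch of this missing step, under the assumptions of the hint above; the function name and arguments are illustrative and not those of the ISIS code:

import numpy as np

def reconstruction_weights(x, Eta, reg=0.05):
    """Optimal LLE reconstruction weights for one data point.

    x   : vector of size d, the current data point
    Eta : matrix of size K x d whose rows are the K nearest neighbors of x
    reg : positive value added to the diagonal of C (needed when K > d)
    """
    K = Eta.shape[0]
    G = x[None, :] - Eta                # rows are (x - eta_k)
    C = G @ G.T + reg * np.eye(K)       # regularized local covariance matrix
    w = np.linalg.solve(C, np.ones(K))  # solve C w = 1
    return w / w.sum()                  # rescale so that w^T 1 = 1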

You can now test your implementation on the swiss roll dataset and vary the noise in the data and the parameter k of the LLE algorithm. Results are shown below:

[Figure: two-dimensional LLE embeddings of the swiss roll for several values of k and of the noise level]

It can be observed that the parameter k must be carefully tuned: it needs to be large enough to provide sufficiently many neighbors for stability, but not too large. We can further observe that LLE works well as long as the noise in the data remains low enough.
