Exercise 1: Maximum Entropy Distributions (5+10+5 P)
Consider a discrete random variable $X$ defined over the natural numbers $\mathbb{N}$, each value $k$ occurring with probability $\Pr(X = k)$. The entropy $H(X)$ is given by
$$ H(X) = -\sum_{k \in \mathbb{N}} \Pr(X = k) \log \Pr(X = k). $$
We would like to find the probability distribution that maximizes the entropy $H(X)$ under the expectation constraint
$$ \mathrm{E}[X] = \sum_{k \in \mathbb{N}} k \cdot \Pr(X = k) = 1, $$
and forcing probabilities to be strictly positive, i.e. $\Pr(X = k) > 0$ for all $k \in \mathbb{N}$. The latter can be enforced using the reparameterization $\Pr(X = k) = \exp(s_k)$. To enforce that probability values sum to one, we can impose the additional constraint
$$ \sum_{k \in \mathbb{N}} \exp(s_k) = 1. $$
The problem of finding the maximum-entropy distribution can therefore be modeled as a constrained optimization problem, and such problems can be solved using the method of Lagrange multipliers.
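As a brief reminder, the method of Lagrange multipliers associates with a generic problem $\max_\theta f(\theta)$ subject to $g_1(\theta) = 0$ and $g_2(\theta) = 0$ the Lagrange function
$$ \mathcal{L}(\theta, \lambda_1, \lambda_2) = f(\theta) + \lambda_1\, g_1(\theta) + \lambda_2\, g_2(\theta), $$
whose stationary points with respect to $\theta$, $\lambda_1$ and $\lambda_2$ are the candidate solutions of the constrained problem.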
(a) Write the Lagrange function $\mathcal{L}((s_k)_{k \in \mathbb{N}}, \lambda_1, \lambda_2)$ corresponding to the constrained optimization problem above, where $\lambda_1, \lambda_2 \in \mathbb{R}$ are used to incorporate the sum-to-one and the expectation constraints respectively.
(b) Show that the probability distribution that maximizes the objective $H(X)$ has the form:
$$ \Pr(X = k) = \alpha \cdot \beta^k $$
(c) Show that $\alpha = 0.5$ and $\beta = 0.5$, i.e. the probability distribution is geometric with failure and success probabilities $0.5$, and $k$ denoting the number of failures before success. (Hint: you can make use of the geometric series and Gabriel's staircase series.)
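For reference, the two series mentioned in the hint are, for $|r| < 1$,
$$ \sum_{k=0}^{\infty} r^k = \frac{1}{1-r} \qquad \text{and} \qquad \sum_{k=0}^{\infty} k\, r^k = \frac{r}{(1-r)^2}. $$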
Exercise 2: Independent Components in Two Dimensions (5+10+10 P)
High entropy can be seen as the result of superposing many independent low-entropy sources. Independent component analysis (ICA) aims to recover the independent sources from the data by finding projections of the data that have low entropy, i.e. that diverge the most from a Gaussian probability distribution.
In the following, we consider a simple two-dimensional problem where we are given a joint probability distribution p(x,y) = p(x)p(y|x) with
,
where $\delta(\cdot)$ denotes the Dirac delta function. A useful property of linear component analysis for two-dimensional probability distributions is that the set of all possible directions to look for in $\mathbb{R}^2$ is directly given by
$$ \{ (\cos(\theta), \sin(\theta)) \,:\, \theta \in [0, 2\pi[ \, \}. $$
The projection of the random vector $(x,y)$ on a particular component can therefore be expressed as a function of $\theta$: $z(\theta) = x\cos(\theta) + y\sin(\theta)$.
As a result, ICA in the two-dimensional space is reduced to finding the values of the parameter $\theta \in [0, 2\pi[$ that maximize a certain objective $J(z(\theta))$.
(a) Sketch the joint probability distribution $p(x,y)$, along with the projections $z(\theta)$ of this distribution for angles $\theta = 0$, $\theta = \pi/8$ and $\theta = \pi/4$.
(b) Find the principal components of $p(x,y)$. That is, find the values of the parameter $\theta \in [0, 2\pi[$ that maximize the variance of the projected data $z(\theta)$.
(c) Find the independent components of $p(x,y)$, more specifically, find the values of the parameter $\theta \in [0, 2\pi[$ that maximize the non-Gaussianity of $z(\theta)$. We use as a measure of non-Gaussianity the excess kurtosis defined as
$$ \mathrm{kurt}[z] = \frac{\mathrm{E}\big[(z - \mathrm{E}[z])^4\big]}{(\mathrm{Var}[z])^2} - 3. $$
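Purely as an illustration of how the problem reduces to a one-dimensional search over $\theta$, the following numpy sketch scans a grid of angles and evaluates both criteria on samples. The sample-generating lines are hypothetical placeholders and do not reproduce the distribution defined above.

```python
import numpy as np

def excess_kurtosis(z):
    """Excess kurtosis E[(z - E[z])^4] / Var[z]^2 - 3, along the last axis."""
    zc = z - z.mean(axis=-1, keepdims=True)
    return (zc ** 4).mean(axis=-1) / ((zc ** 2).mean(axis=-1)) ** 2 - 3.0

# Hypothetical samples (x, y); replace with samples drawn from the p(x,y)
# given on the exercise sheet. They only serve to illustrate the scan.
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=10000)
y = x + 0.1 * rng.choice([-1.0, 1.0], size=10000)

thetas = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
Z = np.outer(np.cos(thetas), x) + np.outer(np.sin(thetas), y)   # z(theta) for all angles

theta_pca = thetas[np.argmax(Z.var(axis=-1))]                   # (b): maximize variance
theta_ica = thetas[np.argmax(np.abs(excess_kurtosis(Z)))]       # (c): maximize |excess kurtosis|
```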
Exercise 3: Deriving a Special Case of FastICA (5+5+5+10 P)
We consider our input data $x$ to be in $\mathbb{R}^d$ and coming from some distribution $p(x)$. We assume we have applied as an initial step the centering and whitening procedures so that:
$$ \mathrm{E}[x] = 0 \qquad \text{and} \qquad \mathrm{E}[xx^\top] = I. $$
To extract an independent component, we would like to find a unit vector $w$ (i.e. $\|w\|_2 = 1$) such that the excess kurtosis of the projected data
$$ \mathrm{kurt}[w^\top x] = \frac{\mathrm{E}\big[(w^\top x - \mathrm{E}[w^\top x])^4\big]}{(\mathrm{Var}[w^\top x])^2} - 3 $$
is maximized.
(a) Show that for any $w$ subject to $\|w\|_2 = 1$, the projection $z = w^\top x$ has mean $0$ and variance $1$.
(b) Show using the method of Lagrange multipliers that the projection $w$ that maximizes $\mathrm{kurt}[w^\top x]$ is a solution of the equation
$$ \lambda w = \mathrm{E}[x \cdot (w^\top x)^3] $$
where $\lambda$ can be resolved by enforcing the constraint $\|w\|_2 = 1$.
(c) The solution of the previous equation cannot be found analytically, and must instead be found iteratively, e.g. using Newton's method. Newton's method assumes that the equation is given in the form $F(w) = 0$ and then defines the iteration $w^+ = w - J(w)^{-1} F(w)$, where $w^+$ denotes the next iterate and where $J$ is the Jacobian associated with the function $F$. Show that the Newton iteration in our case takes the form:
$$ w^+ = w - \left(\mathrm{E}\big[3\,xx^\top (w^\top x)^2 - \lambda I\big]\right)^{-1} \cdot \left(\mathrm{E}[x \cdot (w^\top x)^3] - \lambda w\right) $$
(d) Show that when making the decorrelation approximation $\mathrm{E}[xx^\top (w^\top x)^2] = \mathrm{E}[xx^\top] \cdot \mathrm{E}[(w^\top x)^2]$, the Newton iteration further reduces to
$$ w^+ = \gamma \cdot \big(\mathrm{E}[x \cdot (w^\top x)^3] - 3w\big) $$
where $\gamma$ is some constant factor. (The constant factor does not need to be resolved since we subsequently apply the normalization step $w^+ \leftarrow w^+ / \|w^+\|$.)
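For illustration, a minimal numpy sketch of the simplified update from part (d), assuming `X` is an $N \times d$ array of centered and whitened samples (the names `fixed_point_update` and `X` are illustrative, not taken from the exercise):

```python
import numpy as np

def fixed_point_update(w, X):
    """One iteration of w+ ~ E[x (w^T x)^3] - 3 w, followed by normalization.
    X: (N, d) array of centered, whitened samples; w: vector of length d."""
    z = X @ w                                                # projections w^T x
    w_new = (X * (z ** 3)[:, None]).mean(axis=0) - 3.0 * w   # E[x (w^T x)^3] - 3 w
    return w_new / np.linalg.norm(w_new)                     # normalization absorbs the constant
```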
Exercise 4: Programming (30 P)
Download the programming files on ISIS and follow the instructions.
Exercise sheet 3 (programming) Machine Learning 2
Independent Component Analysis
In this exercise, you will implement an ICA algorithm similar to the FastICA method described in the paper "A. Hyvärinen and E. Oja. 2000. Independent component analysis: algorithms and applications" linked from ISIS, and apply it to model the independent components of a distribution of image patches.
The data is then centered and standardized.
Whitening (10 P)
A precondition for applying the ICA procedure is that the input data has variance $1$ under any projection. This can be achieved by whitening, which is a transformation $\mathcal{W}:\mathbb{R}^d \to \mathbb{R}^d$ with $z = \mathcal{W}(x)$ such that $\mathrm{E}[zz^\top] = I$.
A simple procedure for whitening a collection of data points $x_1,\dots,x_N$ (assumed to be centered) first computes the PCA components $u_1,\dots,u_d$ of the data and then applies the following three consecutive steps:
1. project the data on the PCA components i.e. $p_{n,i} = x_n^\top u_i$.
2. divide the projected data by the standard deviation in PCA space, i.e. $\tilde{p}_{n,i} = p_{n,i} / \text{std}(p_{:,i})$
3. backproject to the input space $z_n = \sum_i \tilde{p}_{n,i} u_i$.
Task:
Implement this whitening procedure. In particular, write a function that receives the input data matrix and returns the matrix containing all whitened data points (one possible approach is sketched below).
For efficiency, your whitening procedure should be implemented in matrix form.
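A minimal sketch of one way to implement this, assuming `X` is an $N \times d$ array of already centered data points (the function name and signature are illustrative, not prescribed by the exercise):

```python
import numpy as np

def whiten(X, eps=1e-9):
    """Whiten centered data X (shape N x d) so that the whitened data has
    (approximately) identity covariance."""
    C = (X.T @ X) / len(X)              # empirical covariance matrix, shape (d, d)
    lam, U = np.linalg.eigh(C)          # PCA variances and components (as columns)
    P = X @ U                           # step 1: project on the PCA components
    P = P / np.sqrt(lam + eps)          # step 2: divide by the std in PCA space
    return P @ U.T                      # step 3: backproject to the input space
```

Since the data is centered, the standard deviation of the $i$-th projected coordinate equals $\sqrt{\lambda_i}$, which is why dividing by $\sqrt{\lambda_i}$ implements step 2; the small `eps` only guards against division by zero.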
Finally, to get a visual picture of what will enter into our ICA algorithm, the whitened data can be visualized in the same way as the original input data.
We observe that all high contrasts and spatial correlations have been removed after whitening. Remaining patterns include high-frequency textures and oriented edges of different colors.
Implementing ICA (20 P)
We would now like to learn $h=64$ independent components of the distribution of whitened image patches. For this, we adopt a procedure similar to the FastICA procedure described in the paper above. In particular, we start with random weights $w_1,\dots,w_h \in \mathbb{R}^d$ and iterate multiple times the sequence of operations:
1. $\forall_{i=1}^h~w_i = \mathbb{E}[x \cdot g(w_i^\top x)] - w_i \cdot \mathbb{E}[ g'(w_i^\top x)]$
2. $w_1,\dots,w_h=\text{decorrelate}\{w_1,\dots,w_h\}$
where $\mathbb{E}[\cdot]$ denotes the expectation with respect to the data distribution.
The first step increases the non-Gaussianity of the projected data. Here, we will make use of the nonquadratic function $G(x) = \frac1a \log \cosh (a x)$ with $a=1.5$. This function admits as its derivative the function $g(x) = \tanh(a x)$, and as its second derivative the function $g'(x) = a \cdot (1-\tanh^2(a x))$.
The second step enforces that the learned projections are decorrelated, i.e. $w_i^\top w_j = 1_{i=j}$. It will be implemented by calling in an appropriate manner the whitening procedure which we have already implemented to decorrelate the different input dimensions.
This procedure maximizes the non-Gaussianity of the projected data as measured by the objective:
$$ J(w) = \sum_{i=1}^h (\mathbb{E}[G(w_i^\top x)] - \mathbb{E}[G(\varepsilon)])^2 \qquad \text{where} \quad \varepsilon \sim \mathcal{N}(0,1). $$
Task:
Implement the ICA procedure described above, run it for 200 iterations, and print the value of the objective function every 25 iterations.
In order to keep the learning procedure computationally affordable, the code must be vectorized; in particular, make use of numpy matrix multiplications instead of loops whenever possible (a rough outline is sketched below).
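A rough, vectorized outline of one possible implementation, assuming `Z` is the $N \times d$ matrix of whitened patches (all names are illustrative; the decorrelation step is written here as explicit symmetric decorrelation $W \leftarrow (WW^\top)^{-1/2} W$, which is what reusing the whitening procedure on the weight matrix amounts to up to a constant factor):

```python
import numpy as np

a = 1.5
g  = lambda u: np.tanh(a * u)                    # g  = G'
dg = lambda u: a * (1.0 - np.tanh(a * u) ** 2)   # g' = G''
G  = lambda u: np.log(np.cosh(a * u)) / a

def decorrelate(W):
    """Symmetric decorrelation W <- (W W^T)^{-1/2} W of the weight vectors."""
    lam, U = np.linalg.eigh(W @ W.T)
    return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T @ W

def ica(Z, h=64, iters=200):
    """FastICA-like iteration on whitened data Z (shape N x d)."""
    N, d = Z.shape
    rng = np.random.default_rng(0)
    W = decorrelate(rng.normal(size=(h, d)))          # random, decorrelated init
    EG_gauss = G(rng.normal(size=100000)).mean()      # Monte Carlo estimate of E[G(eps)]
    for it in range(1, iters + 1):
        Y = Z @ W.T                                   # projections w_i^T x, shape (N, h)
        W = (g(Y).T @ Z) / N - dg(Y).mean(axis=0)[:, None] * W   # step 1
        W = decorrelate(W)                            # step 2
        if it % 25 == 0:
            J = np.sum((G(Z @ W.T).mean(axis=0) - EG_gauss) ** 2)
            print(f"it = {it:3d}  J(W) = {J:.2f}")
    return W
```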
it = 75   J(W) = 1.96
it = 100  J(W) = 2.03
it = 125  J(W) = 2.07
it = 150  J(W) = 2.09
it = 175  J(W) = 2.10
it = 200  J(W) = 2.12
Because the learned ICA components live in a space of the same dimension as the input data, they can also be visualized as image patches.
We observe that an interesting decomposition appears, composed of frequency filters, edge filters and localized texture filters. The decomposition further aligns with specific directions of the RGB space, specifically yellow/blue and red/cyan.
To verify that strongly non-Gaussian components have been learned, we build a histogram of projections on the various ICA components and compare it to histograms for random projections.
We observe that the ICA projections have much heavier tails. This is a typical characteristic of independent components of a data distribution.