Machine Learning 2 - Exercise Sheet 3

Exercise 1: Maximum Entropy Distributions (5+10+5 P)
Consider a discrete random variable X defined over the natural numbers $\mathbb{N} = \{0, 1, 2, \dots\}$, each of them occurring with respective probability $\Pr(X = k)$. The entropy $H(X)$ is given by

$$H(X) = -\sum_{k \in \mathbb{N}} \Pr(X = k) \log \Pr(X = k).$$

We would like to find the probability distribution that maximizes the entropy $H(X)$ under the expectation constraint

$$\mathbb{E}[X] = \sum_{k \in \mathbb{N}} k \cdot \Pr(X = k) = 1,$$

and forcing probabilities to be strictly positive, i.e. $\Pr(X = k) > 0$ for all $k \in \mathbb{N}$. The latter can be enforced using the reparameterization $\Pr(X = k) = \exp(s_k)$. To enforce that probability values sum to one, we can impose the additional constraint

$$\sum_{k \in \mathbb{N}} \exp(s_k) = 1.$$

The problem of finding the maximum-entropy distribution can therefore be modeled as a constrained optimization problem, and such problems can be solved using the method of Lagrange multipliers.

(a)    Write the Lagrange function $\mathcal{L}((s_k)_{k \in \mathbb{N}}, \lambda_1, \lambda_2)$ corresponding to the constrained optimization problem above, where $\lambda_1, \lambda_2 \in \mathbb{R}$ are used to incorporate the sum-to-one and the expectation constraints respectively.

(b)    Show that the probability distribution that maximizes the objective $H(X)$ has the form:

$$\Pr(X = k) = \alpha \cdot \beta^k.$$

(c)    Show that $\alpha = 0.5$ and $\beta = 0.5$, i.e. the probability distribution is geometric with failure and success probabilities $0.5$, and $k$ denoting the number of failures before success. (Hint: you can make use of the geometric series and Gabriel's staircase series.)
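For reference, the two series mentioned in the hint are, for $|r| < 1$:

$$\sum_{k=0}^{\infty} r^k = \frac{1}{1-r} \qquad \text{and} \qquad \sum_{k=0}^{\infty} k\, r^k = \frac{r}{(1-r)^2}.$$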

Exercise 2: Independent Components in Two Dimensions (5+10+10 P)
High entropy can be seen as the result of superposing many independent low-entropy sources. Independent component analysis (ICA) aims to recover the independent sources from the data by finding projections of the data that have low entropy, i.e. that diverge the most from a Gaussian probability distribution.

In the following, we consider a simple two-dimensional problem where we are given a joint probability distribution p(x,y) = p(x)p(y|x) with

 ,

where $\delta(\cdot)$ denotes the Dirac delta function. A useful property of linear component analysis for two-dimensional probability distributions is that the set of all possible directions to look for in $\mathbb{R}^2$ is directly given by

$$\{(\cos\theta, \sin\theta) : \theta \in [0, 2\pi[\,\}.$$

The projection of the random vector $(x,y)$ on a particular component can therefore be expressed as a function of $\theta$: $z(\theta) = x\cos(\theta) + y\sin(\theta)$.

As a result, ICA in the two-dimensional space is reduced to finding the values of the parameter $\theta \in [0, 2\pi[$ that maximize a certain objective $J(z(\theta))$.

(a)    Sketch the joint probability distribution $p(x,y)$, along with the projections $z(\theta)$ of this distribution for the angles $\theta = 0$, $\theta = \pi/8$ and $\theta = \pi/4$.

(b)    Find the principal components of $p(x,y)$. That is, find the values of the parameter $\theta \in [0, 2\pi[$ that maximize the variance of the projected data $z(\theta)$.

(c)    Find the independent components of $p(x,y)$, more specifically, find the values of the parameter $\theta \in [0, 2\pi[$ that maximize the non-Gaussianity of $z(\theta)$. We use as a measure of non-Gaussianity the excess kurtosis defined as

$$\mathrm{kurt}[z] = \frac{\mathbb{E}[(z - \mathbb{E}[z])^4]}{(\mathrm{Var}[z])^2} - 3.$$
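As a quick sanity check on this definition, the excess kurtosis of a sample can be estimated in a few lines of numpy. The sketch below assumes a one-dimensional array of projected samples; the function name `excess_kurtosis` is ours, not part of the exercise code:

```python
import numpy as np

def excess_kurtosis(z):
    # Sample estimate of kurt[z] = E[(z - E[z])^4] / Var[z]^2 - 3.
    z = np.asarray(z, dtype=float)
    zc = z - z.mean()                      # center the sample
    return (zc**4).mean() / (zc**2).mean()**2 - 3.0

# Example: a Gaussian sample has excess kurtosis close to 0,
# a uniform sample has negative excess kurtosis (about -1.2).
rng = np.random.default_rng(0)
print(excess_kurtosis(rng.standard_normal(100000)))
print(excess_kurtosis(rng.uniform(-1, 1, 100000)))
```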

Exercise 3: Deriving a Special Case of FastICA (5+5+5+10 P)
We consider our input data $x$ to be in $\mathbb{R}^d$ and coming from some distribution $p(x)$. We assume we have applied as an initial step the centering and whitening procedures so that:

$$\mathbb{E}[x] = 0 \qquad \text{and} \qquad \mathbb{E}[xx^\top] = I.$$

To extract an independent component, we would like to find a unit vector $w$ (i.e. $\|w\|_2 = 1$) such that the excess kurtosis of the projected data

$$\mathrm{kurt}[w^\top x] = \frac{\mathbb{E}\big[(w^\top x - \mathbb{E}[w^\top x])^4\big]}{(\mathrm{Var}[w^\top x])^2} - 3$$

is maximized.

(a)     Show that for any $w$ subject to $\|w\|_2 = 1$, the projection $z = w^\top x$ has mean $0$ and variance $1$.
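For reference, both properties follow in one line from the centering and whitening assumptions stated above:

$$\mathbb{E}[z] = w^\top \mathbb{E}[x] = 0, \qquad \mathrm{Var}[z] = \mathbb{E}[(w^\top x)^2] = w^\top \mathbb{E}[xx^\top]\, w = w^\top w = 1.$$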

(b)    Show using the method of Lagrange multipliers that the projection $w$ that maximizes $\mathrm{kurt}[w^\top x]$ is a solution of the equation

$$\lambda w = \mathbb{E}[x \cdot (w^\top x)^3],$$

        where $\lambda$ can be resolved by enforcing the constraint $\|w\|_2 = 1$.

(c)     The solution of the previous equation cannot be found analytically, and must instead be obtained iteratively using e.g. Newton's method. Newton's method assumes that the equation is given in the form $F(w) = 0$ and then defines the iteration as $w^+ = w - J(w)^{-1} F(w)$, where $w^+$ denotes the next iterate and where $J$ is the Jacobian associated to the function $F$. Show that the Newton iteration in our case takes the form:

$$w^+ = w - \big(\mathbb{E}[3\, x x^\top (w^\top x)^2] - \lambda I\big)^{-1} \cdot \big(\mathbb{E}[x \cdot (w^\top x)^3] - \lambda w\big)$$

(d)    Show that when making the decorrelation approximation $\mathbb{E}[x x^\top (w^\top x)^2] = \mathbb{E}[x x^\top] \cdot \mathbb{E}[(w^\top x)^2]$, the Newton iteration further reduces to

$$w^+ = \gamma \cdot \big(\mathbb{E}[x \cdot (w^\top x)^3] - 3w\big)$$

where $\gamma$ is some constant factor. (The constant factor does not need to be resolved since we subsequently apply the normalization step $w^+ \leftarrow w^+ / \|w^+\|$.)
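As an illustration of the resulting update rule, here is a minimal numpy sketch of one such iteration, assuming whitened data stored row-wise in a matrix `X` of shape `(N, d)`; the function name `fastica_step` is ours:

```python
import numpy as np

def fastica_step(w, X):
    # One iteration of w+ = E[x (w^T x)^3] - 3 w, followed by renormalization.
    z = X @ w                              # projections w^T x_n, shape (N,)
    w_new = (X * (z**3)[:, None]).mean(axis=0) - 3.0 * w
    return w_new / np.linalg.norm(w_new)   # enforce ||w||_2 = 1
```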

Exercise 4: Programming (30 P)
Download the programming files on ISIS and follow the instructions.

Exercise sheet 3 (programming) - Machine Learning 2

 

Independent Component Analysis
In this exercise, you will implement an ICA algorithm similar to the FastICA method described in the paper "A. Hyvärinen and E. Oja. 2000. Independent component analysis: algorithms and applications" linked from ISIS, and apply it to model the independent components of a distribution of image patches.

 

 

The data is then centered and standardized.


 

Whitening (10 P)
A precondition for applying the ICA procedure is that the input data has variance $1$ under any projection. This can be achieved by whitening, which is a transformation $\mathcal{W}:\mathbb{R}^d \to \mathbb{R}^d$ with $z = \mathcal{W}(x)$ such that $\mathrm{E}[zz^\top] = I$.

A simple procedure for whitening a collection of data points $x_1,\dots,x_N$ (assumed to be centered) first computes the PCA components $u_1,\dots,u_d$ of the data and then applies the following three consecutive steps:

1. project the data on the PCA components, i.e. $p_{n,i} = x_n^\top u_i$,

2. divide the projected data by the standard deviation in PCA space, i.e. $\tilde{p}_{n,i} = p_{n,i} / \text{std}(p_{:,i})$,

3. backproject to the input space: $z_n = \sum_i \tilde{p}_{n,i} u_i$.

Task:

- Implement this whitening procedure; in particular, write a function that receives the input data matrix and returns the matrix containing all whitened data points.

For efficiency, your whitening procedure should be implemented in matrix form (a possible sketch is given below).
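A minimal sketch of such a function, assuming the input matrix `X` has shape `(N, d)` with one (already centered) data point per row; the function name `whiten`, the `eps` safeguard, and the use of `numpy.linalg.eigh` on the covariance matrix are our choices, not prescribed by the exercise:

```python
import numpy as np

def whiten(X, eps=1e-10):
    # X: centered data matrix of shape (N, d), one data point per row.
    N, d = X.shape
    C = (X.T @ X) / N                     # empirical covariance E[x x^T]
    lam, U = np.linalg.eigh(C)            # eigenvalues and PCA components (columns of U)
    P = X @ U                             # 1. project on the PCA components
    P_tilde = P / np.sqrt(lam + eps)      # 2. divide by the standard deviation in PCA space
    Z = P_tilde @ U.T                     # 3. backproject to the input space
    return Z
```

A quick sanity check is that the empirical covariance of the returned matrix, `whiten(X).T @ whiten(X) / N`, is close to the identity.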

 

 

Finally, to get a visual picture of what will enter into our ICA algorithm, the whitened data can be visualized in the same way as the original input data.

 

We observe that all high contrasts and spatial correlations have been removed after whitening. The remaining patterns include high-frequency textures and oriented edges of different colors.

Implementing ICA (20 P)
We now would like to learn $h=64$ independent components of the distribution of whitened image patches. For this, we adopt a procedure similar to the FastICA procedure described in the paper above. In particular, we start with random weights $w_1,\dots,w_h \in \mathbb{R}^d$ and iterate multiple times the following sequence of operations:

1. $\forall_{i=1}^h~w_i = \mathbb{E}[x \cdot g(w_i^\top x)] - w_i \cdot \mathbb{E}[g'(w_i^\top x)]$

2. $w_1,\dots,w_h=\text{decorrelate}\{w_1,\dots,w_h\}$

where $\mathbb{E}[\cdot]$ denotes the expectation with respect to the data distribution.

The first step increases the non-Gaussianity of the projected data. Here, we will make use of the nonquadratic function $G(x) = \frac1a \log \cosh (a x)$ with $a=1.5$. This function admits as derivative the function $g(x) = \tanh(a x)$, and as second derivative the function $g'(x) = a \cdot (1-\tanh^2(a x))$.
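These three functions translate directly into numpy; a small sketch (the names `G`, `g` and `gprime` are ours):

```python
import numpy as np

a = 1.5

def G(x):       # contrast function  G(x) = (1/a) log cosh(a x)
    return np.log(np.cosh(a * x)) / a

def g(x):       # its derivative     g(x) = tanh(a x)
    return np.tanh(a * x)

def gprime(x):  # its second derivative  g'(x) = a (1 - tanh^2(a x))
    return a * (1.0 - np.tanh(a * x)**2)
```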

The second step enforces that the learned projections are decorrelated, i.e. $w_i^\top w_j = 1_{i=j}$. It will be implemented by calling, in an appropriate manner, the whitening procedure which we have already implemented to decorrelate the different input dimensions.

This procedure maximizes the non-Gaussianity of the projected data as measured by the objective:

$$J(W) = \sum_{i=1}^h (\mathbb{E}[G(w_i^\top x)] - \mathbb{E}[G(\varepsilon)])^2 \qquad \text{where} \quad \varepsilon \sim \mathcal{N}(0,1).$$

Task:

- Implement the ICA procedure described above, run it for 200 iterations, and print the value of the objective function every 25 iterations.

In order to keep the learning procedure computationally affordable, the code must be parallelized; in particular, make use of numpy matrix multiplications instead of loops whenever possible (a possible structure is sketched below).
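The sketch below assumes the whitened data matrix `Z` of shape `(N, d)` and the functions `G`, `g`, `gprime` defined above. For concreteness, the decorrelation step is written here as a symmetric decorrelation $(WW^\top)^{-1/2}W$ rather than through the whitening function, and all names are ours, so the actual solution may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def decorrelate(W):
    # Symmetric decorrelation: W <- (W W^T)^(-1/2) W, so that w_i^T w_j = 1_{i=j}.
    # Assumes h <= d so that W W^T is invertible.
    lam, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(1.0 / np.sqrt(lam)) @ E.T @ W

def ica(Z, h=64, iters=200):
    N, d = Z.shape
    W = decorrelate(rng.standard_normal((h, d)))      # random initial weights
    EG_gauss = G(rng.standard_normal(100000)).mean()  # Monte-Carlo estimate of E[G(eps)]
    for it in range(1, iters + 1):
        Y = Z @ W.T                                   # projections w_i^T x_n, shape (N, h)
        # 1. FastICA-style update for all components at once.
        W = (g(Y).T @ Z) / N - gprime(Y).mean(axis=0)[:, None] * W
        # 2. Decorrelate the weight vectors.
        W = decorrelate(W)
        if it % 25 == 0:
            J = ((G(Z @ W.T).mean(axis=0) - EG_gauss)**2).sum()
            print(f'it = {it:3d}   J(W) = {J:.2f}')
    return W
```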

 

it =  75   J(W) = 1.96
it = 100   J(W) = 2.03
it = 125   J(W) = 2.07
it = 150   J(W) = 2.09
it = 175   J(W) = 2.10
it = 200   J(W) = 2.12

Because the learned ICA components live in a space of the same dimension as the input data, they can also be visualized as image patches.

 

We observe that an interesting decomposition appears, composed of frequency filters, edge filters and localized texture filters. The decomposition further aligns with specific directions of the RGB space, specifically yellow/blue and red/cyan.

To verify that strongly non-Gaussian components have been learned, we build a histogram of the projections on the various ICA components and compare it to histograms for random projections.
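A minimal sketch of such a comparison, assuming the whitened data `Z` and the learned weight matrix `W` from above (the use of matplotlib and all names are our choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Random unit-norm directions, same shape as the learned ICA components.
R = rng.standard_normal(W.shape)
R /= np.linalg.norm(R, axis=1, keepdims=True)

plt.hist((Z @ W.T).ravel(), bins=100, density=True, histtype='step', label='ICA projections')
plt.hist((Z @ R.T).ravel(), bins=100, density=True, histtype='step', label='random projections')
plt.yscale('log')          # heavy tails are easier to see on a log scale
plt.legend()
plt.show()
```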

 

We observe that the ICA projections have much heavier tails. This is a typical characteristic of independent components of a data distribution.
