machine_learning1 - Exercise Sheet 5 - Solved

Exercise 1: Bias and Variance of Mean Estimators (20 P)
Assume we have an estimator θ̂ for a parameter θ. The bias of the estimator θ̂ is the difference between its expected value and the true value of the parameter:

Bias(θ̂) = E[θ̂] − θ.

If Bias(θ̂) = 0, then θ̂ is called unbiased. The variance of the estimator θ̂ is the expected square deviation from its expected value:

Var(θ̂) = E[(θ̂ − E[θ̂])²].

The mean squared error of the estimator θ̂ is

Error(θ̂) = E[(θ̂ − θ)²] = Bias(θ̂)² + Var(θ̂).
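This identity can be checked by expanding the square around E[θ̂]; the cross term vanishes because E[θ̂ − E[θ̂]] = 0:

```latex
\begin{aligned}
\mathrm{Error}(\hat\theta)
&= \mathbb{E}\big[(\hat\theta - \theta)^2\big]
 = \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta]
    + \mathbb{E}[\hat\theta] - \theta)^2\big]\\
&= \underbrace{\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]}_{\mathrm{Var}(\hat\theta)}
 + 2\,\underbrace{\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]}_{=\,0}
    \big(\mathbb{E}[\hat\theta] - \theta\big)
 + \underbrace{\big(\mathbb{E}[\hat\theta] - \theta\big)^2}_{\mathrm{Bias}(\hat\theta)^2}.
\end{aligned}
```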

Let X1,...,XN be a sample of i.i.d. random variables. Assume that Xi has mean µ and variance σ². Calculate the bias, variance, and mean squared error of the mean estimator:

 

where α is a parameter between 0 and 1.
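The sheet's own formula for this estimator was lost in extraction. Assuming, purely for illustration, the common shrinkage form μ̂α = α · (1/N) Σᵢ Xᵢ (a hypothetical choice, not necessarily the one on the original sheet), the closed-form bias (α − 1)µ and variance α²σ²/N can be checked by simulation:

```python
import numpy as np

# Hypothetical estimator form (the sheet's formula was lost in extraction):
# mu_hat = alpha * empirical mean. Under this assumption,
# Bias = (alpha - 1) * mu and Var = alpha**2 * sigma**2 / N.
rng = np.random.default_rng(0)
mu, sigma, N, alpha = 2.0, 1.5, 10, 0.7

samples = rng.normal(mu, sigma, size=(100_000, N))  # many samples of size N
est = alpha * samples.mean(axis=1)                  # one estimate per sample

bias_mc = est.mean() - mu
var_mc = est.var()
error_mc = np.mean((est - mu) ** 2)

bias_th = (alpha - 1) * mu              # closed form: -0.6
var_th = alpha ** 2 * sigma ** 2 / N    # closed form: 0.11025

print(bias_mc, var_mc, error_mc)
```

The Monte Carlo estimates match the closed forms, and the mean squared error equals Bias² + Var, as derived in the text above.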

Exercise 2: Bias-Variance Decomposition for Classification (30 P)
The bias-variance decomposition usually applies to regression data. In this exercise, we would like to obtain a similar decomposition for classification, in particular, when the prediction is given as a probability distribution over C classes. Let P = [P1,...,PC] be the ground truth class distribution associated with a particular input pattern. Assume a random estimator of class probabilities P̂ = [P̂1,...,P̂C] for the same input pattern. The error function is given by the expected KL divergence between the ground truth and the estimated probability distribution:

Error(P̂) = E[DKL(P || P̂)].

First, we would like to determine the mean of the class distribution estimator P̂. We define the mean as the distribution that minimizes its expected KL divergence from the class distribution estimator, that is, the distribution R that solves

min over R of E[DKL(R || P̂)].

(a)     Show that the solution to the optimization problem above is given by

R = [R1,...,RC]          where          Rᵢ = exp(E[log P̂ᵢ]) / Σⱼ exp(E[log P̂ⱼ])          ∀ 1 ≤ i ≤ C.

(Hint: To implement the positivity constraint on R, you can reparameterize its components as Ri = exp(Zi), and minimize the objective w.r.t. Z.)

(b)    Prove the bias-variance decomposition

Error(Pˆ) = Bias(Pˆ) + Var(Pˆ)

where the error, bias and variance are given by

Error(P̂) = E[DKL(P || P̂)],              Bias(P̂) = DKL(P || R),               Var(P̂) = E[DKL(R || P̂)].

(Hint: as a first step, it can be useful to show that E[log Rᵢ − log P̂ᵢ] does not depend on the index i.)
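The decomposition can also be verified numerically. The sketch below (assuming nothing beyond NumPy; the Dirichlet estimator is an arbitrary stand-in) computes R as the normalized geometric mean of the sampled distributions, which is the closed form from part (a), and checks that Error = Bias + Var holds exactly on the empirical ensemble:

```python
import numpy as np

rng = np.random.default_rng(1)
C, S = 4, 500  # number of classes, number of draws of the estimator

P = np.array([0.1, 0.2, 0.3, 0.4])            # ground-truth distribution
Phat = rng.dirichlet(5 * np.ones(C), size=S)  # draws of the random estimator

def kl(p, q):
    """KL divergence along the last axis (supports broadcasting)."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# R: normalized geometric mean of the Phat draws (result of part (a))
logR = np.log(Phat).mean(axis=0)
R = np.exp(logR) / np.exp(logR).sum()

error = kl(P, Phat).mean()   # E[DKL(P || Phat)]
bias = kl(P, R)              # DKL(P || R)
var = kl(R, Phat).mean()     # E[DKL(R || Phat)]

print(error, bias + var)     # equal up to floating-point error
```

With this choice of R the identity is exact, not just approximate: the normalization constant of R contributes +log Z to the bias and −log Z to the variance, so the two cancel.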

Exercise 3: Programming (50 P)
Download the programming files on ISIS and follow the instructions.



Part 1: The James-Stein Estimator (20 P)
Let x1, …, xN ∈ Rd be independent draws from a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ = σ²I. It can be shown that the maximum-likelihood estimator of the mean parameter μ is the empirical mean, given by:

μ̂ML = (1/N) Σᵢ₌₁ᴺ xᵢ

Maximum likelihood appears to be a strong estimator. However, it was demonstrated that the following estimator

μ̂JS = (1 − (d − 2)σ² / (N‖μ̂ML‖²)) · μ̂ML

(a shrunken version of the maximum-likelihood estimator towards the origin) actually has a smaller expected distance from the true mean when d ≥ 3. This, however, assumes knowledge of the variance of the distribution for which the mean is estimated. This estimator is called the James-Stein estimator. While the proof is a bit involved, the fact can easily be demonstrated empirically through simulation. This is the object of this exercise.

The code below draws ten 50-dimensional points from a normal distribution with mean vector μ = (1, …, 1) and covariance Σ = I.

 


 

Implementing the James-Stein Estimator (10 P)
Based on the ML estimator function, write a function that receives as input the data (xᵢ)ᵢ₌₁ᴺ and the (known) variance σ² of the generating distribution, and computes the James-Stein estimator.

 

Comparing the ML and James-Stein Estimators (10 P)
We would like to compute the error of the maximum-likelihood estimator and the James-Stein estimator for 100 different samples (where each sample consists of 10 draws generated by the function getdata with a different random seed). Here, for reproducibility, we use seeds from 0 to 99. The error should be measured as the Euclidean distance between the true mean vector and the estimated mean vector.

- Compute the maximum-likelihood and James-Stein estimations.
- Measure the error of these estimations.
- Build a scatter plot comparing these errors for different samples.
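The comparison loop can be sketched as follows. Here getdata is a stand-in reimplementation (the real one is provided on ISIS), the estimator functions assume the standard James-Stein shrinkage factor, and the plotting step is summarized in a comment:

```python
import numpy as np

def getdata(seed):
    """Stand-in for the getdata function from ISIS: ten draws from
    N(mu, I) with mu = (1, ..., 1) in 50 dimensions."""
    rng = np.random.default_rng(seed)
    return rng.normal(1.0, 1.0, size=(10, 50))

def ml_estimator(X):
    return X.mean(axis=0)

def js_estimator(X, sigma2):
    N, d = X.shape
    mu_ml = ml_estimator(X)
    return (1.0 - (d - 2) * sigma2 / (N * np.sum(mu_ml ** 2))) * mu_ml

mu_true = np.ones(50)
err_ml, err_js = [], []
for seed in range(100):                          # seeds 0 to 99
    X = getdata(seed)
    err_ml.append(np.linalg.norm(ml_estimator(X) - mu_true))
    err_js.append(np.linalg.norm(js_estimator(X, 1.0) - mu_true))
err_ml, err_js = np.array(err_ml), np.array(err_js)

# Scatter plot (one point per seed); points below the diagonal are samples
# where James-Stein beats maximum likelihood, e.g. with matplotlib:
#   plt.scatter(err_ml, err_js); plt.axline((2, 2), slope=1)
print(np.mean(err_js < err_ml))                  # fraction of seeds JS wins
```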

 

Part 2: Bias/Variance Decomposition (30 P)
In this part, we would like to implement a procedure to find the bias and variance of different predictors. We consider one for regression and one for classification. These predictors are available in the module utils.

utils.ParzenRegressor: A regression method based on Parzen windows. The hyperparameter corresponds to the scale of the Parzen window. A large scale creates a more rigid model; a small scale creates a more flexible one.

utils.ParzenClassifier: A classification method based on Parzen windows. The hyperparameter corresponds to the scale of the Parzen window. A large scale creates a more rigid model; a small scale creates a more flexible one. Note that instead of returning a single class for a given data point, it outputs a probability distribution over the set of possible classes.

Each class of predictor implements the following three methods:

__init__(self, parameter): Create an instance of the predictor with a certain scale parameter.
fit(self, X, T): Fit the predictor to the data (a set of data points X and targets T).
predict(self, X): Compute the output values for arbitrary inputs X.

To compute the bias and variance estimates, we require multiple samples from the training set for a single set of observation data. To accomplish this, we utilize the provided Sampler class. The sampler is initialized with the training data and passed to the method for estimating bias and variance, where its function sampler.sample() is called repeatedly in order to fit multiple models and create an ensemble of predictions for each test data point.

Regression Case (15 P)
For the regression case, bias, variance and error are given by:

Bias(Y)² = (EY[Y] − T)²
Var(Y) = EY[(Y − EY[Y])²]
Error(Y) = EY[(Y − T)²]

Task: Implement the squared-error bias-variance decomposition defined above. The function should repeatedly sample training sets from the sampler (as many times as specified by the argument nbsamples), learn the predictor on them, and evaluate the bias, variance, and error on the out-of-sample distribution given by X and T.
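A sketch of such a function, assuming the interfaces described above (sampler.sample() returns a training set, and predictors expose __init__(parameter), fit, and predict); the toy sampler and predictor at the end are stand-ins that exist only to make the example self-contained without the utils module:

```python
import numpy as np

def biasVarianceRegression(sampler, predictor_class, parameter, X, T, nbsamples):
    """Squared-error bias/variance decomposition of a regression predictor."""
    preds = []
    for _ in range(nbsamples):
        Xtrain, Ttrain = sampler.sample()      # fresh training sample
        model = predictor_class(parameter)
        model.fit(Xtrain, Ttrain)
        preds.append(model.predict(X))
    Y = np.stack(preds)                        # (nbsamples, npoints)
    mean_Y = Y.mean(axis=0)                    # E_Y[Y] per test point
    bias2 = np.mean((mean_Y - T) ** 2)         # averaged over test points
    variance = np.mean((Y - mean_Y) ** 2)
    error = np.mean((Y - T) ** 2)
    return bias2, variance, error

# Stand-in sampler/predictor so the sketch runs without the utils module:
class _ToySampler:
    def __init__(self, rng):
        self.rng = rng
    def sample(self):
        Xtr = self.rng.uniform(-1, 1, 20)
        return Xtr, Xtr ** 2 + self.rng.normal(0, 0.1, 20)

class _MeanPredictor:
    def __init__(self, parameter):
        pass
    def fit(self, X, T):
        self.c = np.mean(T)
    def predict(self, X):
        return np.full(len(X), self.c)

Xtest = np.linspace(-1, 1, 11)
Ttest = Xtest ** 2
b2, var, err = biasVarianceRegression(
    _ToySampler(np.random.default_rng(0)), _MeanPredictor, None,
    Xtest, Ttest, 50)
print(err, b2 + var)  # equal up to floating-point error
```

Note that on the empirical ensemble the decomposition Error = Bias² + Var holds exactly, which is a useful sanity check for the implementation.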

 

Your implementation can be tested with the following code:

 

Classification Case (15 P)
We consider here the Kullback-Leibler divergence as a measure of classification error. As derived in the exercise, the bias-variance decomposition for such error is:

Bias(Y) = DKL(T || R)
Var(Y) = EY[DKL(R || Y)]
Error(Y) = EY[DKL(T || Y)]

where R is the distribution that minimizes its expected KL divergence from the probability distribution estimator Y (see the theoretical exercise for how it is computed exactly), and where T is the target class distribution.

Task: Implement the KL-based bias-variance decomposition defined above. The function should repeatedly sample training sets from the sampler (as many times as specified by the argument nbsamples), learn the predictor on them, and evaluate the bias, variance, and error on the out-of-sample distribution given by X and T.
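A sketch under the same assumed interfaces as in the regression case, with predict(X) now returning one probability distribution over the classes per test point; R is the normalized geometric mean of the ensemble, as derived in the theoretical exercise. The toy sampler and predictor are stand-ins to keep the example self-contained:

```python
import numpy as np

def biasVarianceClassification(sampler, predictor_class, parameter, X, T, nbsamples):
    """KL-based bias/variance decomposition of a probabilistic classifier."""
    preds = []
    for _ in range(nbsamples):
        Xtrain, Ttrain = sampler.sample()
        model = predictor_class(parameter)
        model.fit(Xtrain, Ttrain)
        preds.append(model.predict(X))
    Y = np.stack(preds)                     # (nbsamples, npoints, nclasses)

    # R: normalized geometric mean of the ensemble (theoretical exercise)
    R = np.exp(np.log(Y).mean(axis=0))
    R /= R.sum(axis=-1, keepdims=True)

    def kl(p, q):
        return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

    bias = np.mean(kl(T, R))                # DKL(T || R)
    variance = np.mean(kl(R, Y))            # E_Y[DKL(R || Y)]
    error = np.mean(kl(T, Y))               # E_Y[DKL(T || Y)]
    return bias, variance, error

# Stand-in sampler/predictor so the sketch runs without the utils module:
class _ConstSampler:
    def sample(self):
        return None, None                   # unused by the toy predictor

class _DirichletPredictor:
    _rng = np.random.default_rng(0)
    def __init__(self, parameter):
        pass
    def fit(self, X, T):
        pass
    def predict(self, X):
        return self._rng.dirichlet(5 * np.ones(3), size=len(X))

Ttest = np.tile([0.2, 0.3, 0.5], (4, 1))    # target distribution per point
bias, var, err = biasVarianceClassification(
    _ConstSampler(), _DirichletPredictor, None, np.zeros(4), Ttest, 30)
print(err, bias + var)  # equal up to floating-point error
```

As in the regression case, the identity Error = Bias + Var is exact on the empirical ensemble when R is the normalized geometric mean, which gives a direct correctness check.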

 

Your implementation can be tested with the following code:
