Exercise 1: Bias and Variance of Mean Estimators (20 P)
Assume we have an estimator θˆ for a parameter θ. The bias of the estimator θˆ is the difference between its expected value and the true value of the parameter:

Bias(θˆ) = E[θˆ] − θ.
If Bias(θˆ) = 0, then θˆ is called unbiased. The variance of the estimator θˆ is the expected squared deviation from its expected value:

Var(θˆ) = E[(θˆ − E[θˆ])²].
The mean squared error of the estimator θˆ is

Error(θˆ) = E[(θˆ − θ)²] = Bias(θˆ)² + Var(θˆ).
Let X1,...,XN be a sample of i.i.d. random variables, where each Xi has mean µ and variance σ². Calculate the bias, variance and mean squared error of the mean estimator:
where α is a parameter between 0 and 1.
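The estimator itself is not reproduced in this text. As an illustration only, assuming the common shrinkage form µˆ = α · (1/N) ∑ Xi (an assumption, not necessarily the form intended in the exercise), the three quantities can be checked empirically against their theoretical values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, alpha = 2.0, 1.0, 10, 0.8

# Draw many independent samples and evaluate the estimator on each
# (assumed illustrative form: mu_hat = alpha * empirical mean).
estimates = np.array([alpha * rng.normal(mu, sigma, size=N).mean()
                      for _ in range(100000)])

bias = estimates.mean() - mu          # theory: (alpha - 1) * mu
var = estimates.var()                 # theory: alpha**2 * sigma**2 / N
mse = ((estimates - mu) ** 2).mean()  # theory: bias**2 + var
print(bias, var, mse)
```

Note that mse = bias² + var holds exactly for the empirical moments as well, which is a useful sanity check for any such simulation.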
Exercise 2: Bias-Variance Decomposition for Classification (30 P)
The bias-variance decomposition usually applies to regression data. In this exercise, we would like to obtain a similar decomposition for classification, in particular, when the prediction is given as a probability distribution over C classes. Let P = [P1,...,PC] be the ground truth class distribution associated with a particular input pattern. Assume a random estimator of class probabilities Pˆ = [Pˆ1,...,PˆC] for the same input pattern. The error function is given by the expected KL divergence between the ground truth and the estimated probability distribution:
Error(Pˆ) = E[DKL(P ‖ Pˆ)].
First, we would like to determine the mean of the class distribution estimator Pˆ. We define the mean as the distribution that minimizes its expected KL divergence from the class distribution estimator, that is, the distribution R that solves

min_R E[DKL(R ‖ Pˆ)].
(a) Show that the solution to the optimization problem above is given by R = [R1,...,RC] where

Ri = exp(E[log Pˆi]) / ∑_{j=1}^{C} exp(E[log Pˆj])  ∀ 1 ≤ i ≤ C.
(Hint: To implement the positivity constraint on R, you can reparameterize its components as Ri = exp(Zi), and minimize the objective w.r.t. Z.)
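Before attempting the proof, the claimed solution (a normalized geometric mean, Ri ∝ exp(E[log Pˆi])) can be checked numerically. A small sketch, with the number of classes and sample size chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
C, n = 3, 500

# A random sample of class-probability estimates P_hat (each row sums to 1).
P_hat = rng.dirichlet(np.ones(C), size=n)

# Candidate solution: normalized geometric mean, R_i ∝ exp(E[log P_hat_i]).
R = np.exp(np.log(P_hat).mean(axis=0))
R /= R.sum()

def expected_kl(Q):
    # Empirical expectation over the sample of D_KL(Q || P_hat).
    return (Q * (np.log(Q) - np.log(P_hat))).sum(axis=1).mean()

# R should do at least as well as any other candidate, e.g. the
# arithmetic mean of the P_hat's.
print(expected_kl(R) <= expected_kl(P_hat.mean(axis=0)))  # True
```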
(b) Prove the bias-variance decomposition
Error(Pˆ) = Bias(Pˆ) + Var(Pˆ)
where the error, bias and variance are given by
Error(Pˆ) = E[DKL(P ‖ Pˆ)], Bias(Pˆ) = DKL(P ‖ R), Var(Pˆ) = E[DKL(R ‖ Pˆ)].
(Hint: as a first step, it can be useful to show that E[logRi − logPˆi] does not depend on the index i.)
Exercise 3: Programming (50 P)
Download the programming files on ISIS and follow the instructions.
Part 1: The James-Stein Estimator (20 P)
Let x1, …, xN ∈ Rd be independent draws from a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ = σ²I. It can be shown that the maximum-likelihood estimator of the mean parameter μ is the empirical mean:

μˆML = (1/N) ∑_{i=1}^{N} xi
Maximum likelihood appears to be a strong estimator. However, it was demonstrated that the following estimator

μˆJS = (1 − (d − 2)σ² / (N ‖μˆML‖²)) · μˆML

(a version of the maximum-likelihood estimator shrunk towards the origin) actually has a smaller expected distance from the true mean when d ≥ 3. This, however, assumes that the variance of the distribution whose mean is estimated is known. This estimator is called the James-Stein estimator. While the proof is somewhat involved, the fact can easily be demonstrated empirically through simulation. This is the object of this exercise.
The code below draws ten 50-dimensional points from a normal distribution with mean vector μ = (1, …, 1) and covariance Σ = I.
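The original notebook cell is not reproduced in this text. A minimal sketch of such a data-generating function (the name getdata and its seed argument are taken from the instructions below; the exact signature is an assumption):

```python
import numpy as np

def getdata(seed):
    # Draw 10 points in 50 dimensions from N(mu, I) with mu = (1, ..., 1).
    # (Sketch only -- the actual cell from the notebook is not shown here.)
    rng = np.random.default_rng(seed)
    d, N = 50, 10
    mu = np.ones(d)
    return mu + rng.normal(size=(N, d))

X = getdata(0)
print(X.shape)  # (10, 50)
```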
Implementing the James-Stein Estimator (10 P)
Based on the ML estimator function, write a function that receives as input the data x1, …, xN and the (known) variance σ² of the generating distribution, and computes the James-Stein estimator.
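A sketch of such a pair of functions, assuming the data is given as an N × d array and using the shrinkage formula above (the function names ML and JS are placeholders, not necessarily those used in the notebook):

```python
import numpy as np

def ML(X):
    # Maximum-likelihood estimate of the mean: the empirical average.
    return X.mean(axis=0)

def JS(X, sigma2):
    # James-Stein estimator: shrink the ML estimate towards the origin by
    # the factor (1 - (d - 2) * sigma2 / (N * ||mu_ML||^2)).
    N, d = X.shape
    mu_ml = ML(X)
    shrink = 1.0 - (d - 2) * sigma2 / (N * (mu_ml ** 2).sum())
    return shrink * mu_ml
```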
Comparing the ML and James-Stein Estimators (10 P)
We would like to compute the error of the maximum likelihood estimator and the James-Stein estimator for 100 different samples (where each sample consists of 10 draws generated by the function getdata with a different random seed). Here, for reproducibility, we use seeds from 0 to 99. The error should be measured as the Euclidean distance between the true mean vector and the estimated mean vector.
Compute the maximum-likelihood and James-Stein estimates.
Measure the error of these estimates.
Build a scatter plot comparing these errors across samples.
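The steps above can be sketched as follows. The data generator is re-created inline as a stand-in for the provided getdata (an assumption about its behaviour), σ² = 1 is taken as known, and the plotting part is left as comments:

```python
import numpy as np

d, N = 50, 10
mu_true = np.ones(d)

def getdata(seed):
    # Stand-in for the provided data generator (assumed behaviour).
    rng = np.random.default_rng(seed)
    return mu_true + rng.normal(size=(N, d))

err_ml, err_js = [], []
for seed in range(100):
    X = getdata(seed)
    mu_ml = X.mean(axis=0)
    # James-Stein shrinkage with known variance sigma^2 = 1.
    shrink = 1.0 - (d - 2) / (N * (mu_ml ** 2).sum())
    mu_js = shrink * mu_ml
    err_ml.append(np.linalg.norm(mu_ml - mu_true))
    err_js.append(np.linalg.norm(mu_js - mu_true))

print(np.mean(err_ml), np.mean(err_js))  # the JS error is typically smaller

# Scatter plot, one point per sample:
# import matplotlib.pyplot as plt
# plt.scatter(err_ml, err_js)
# plt.plot([0, 3], [0, 3], 'k--')  # diagonal: equal error
# plt.xlabel('ML error'); plt.ylabel('JS error'); plt.show()
```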
Part 2: Bias/Variance Decomposition (30 P)
In this part, we would like to implement a procedure to find the bias and variance of different predictors. We consider one for regression and one for classification. These predictors are available in the module utils.
utils.ParzenRegressor: a regression method based on a Parzen window. The hyperparameter corresponds to the scale of the Parzen window. A large scale creates a more rigid model; a small scale creates a more flexible one.
utils.ParzenClassifier: a classification method based on a Parzen window. The hyperparameter corresponds to the scale of the Parzen window. A large scale creates a more rigid model; a small scale creates a more flexible one. Note that instead of returning a single class for a given data point, it outputs a probability distribution over the set of possible classes.
Each class of predictor implements the following three methods:
__init__(self, parameter): create an instance of the predictor with a certain scale parameter.
fit(self, X, T): fit the predictor to the data (a set of data points X and targets T).
predict(self, X): compute the output values for arbitrary inputs X.
To compute the bias and variance estimates, we require multiple training sets drawn for a single set of observation data. To accomplish this, we utilize the Sampler class provided. The sampler is initialized with the training data and passed to the method for estimating bias and variance, where its function sampler.sample() is called repeatedly in order to fit multiple models and create an ensemble of predictions for each test data point.
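Since the Sampler class itself is provided on ISIS and not shown here, the following is only a hypothetical stand-in illustrating the assumed interface (each call to sample() returning a bootstrap resample of the training set):

```python
import numpy as np

class Sampler:
    # Hypothetical stand-in for the provided Sampler class: each call to
    # sample() returns a resample (with replacement) of the training set.
    def __init__(self, X, T, seed=0):
        self.X, self.T = X, T
        self.rng = np.random.default_rng(seed)

    def sample(self):
        idx = self.rng.integers(0, len(self.X), size=len(self.X))
        return self.X[idx], self.T[idx]

# Typical usage inside a bias/variance estimation loop:
X = np.linspace(0, 1, 20)[:, None]
T = np.sin(2 * np.pi * X[:, 0])
sampler = Sampler(X, T)
Xs, Ts = sampler.sample()  # one resampled training set
print(Xs.shape, Ts.shape)  # (20, 1) (20,)
```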
Regression Case (15 P)
For the regression case, Bias, Variance and Error are given by:
Bias(Y)2 = (EY[Y − T])2
Var(Y) = EY[(Y − EY[Y])2]
Error(Y) = EY[(Y − T)2]
Task: Implement the Bias-Variance Decomposition for regression defined above. The function should repeatedly sample training sets from the sampler (as many times as specified by the argument nbsamples), learn the predictor on them, and evaluate the bias, variance and error on the out-of-sample distribution given by X and T.
Your implementation can be tested with the following code:
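The original test cell is not reproduced in this text. A sketch of a decomposition function following the interface described above, together with toy stand-ins for the predictor and sampler (both placeholders, not the provided classes):

```python
import numpy as np

def biasVarianceRegression(sampler, predictor, X, T, nbsamples):
    # One prediction per resampled training set: Y has shape (nbsamples, npoints).
    Y = []
    for _ in range(nbsamples):
        Xs, Ts = sampler.sample()
        predictor.fit(Xs, Ts)
        Y.append(predictor.predict(X))
    Y = np.array(Y)

    EY = Y.mean(axis=0)                # mean prediction at each test point
    bias = ((EY - T) ** 2).mean()      # Bias^2, averaged over the test set
    variance = ((Y - EY) ** 2).mean()  # Var, averaged over samples and test set
    error = ((Y - T) ** 2).mean()      # mean squared error
    return bias, variance, error

# Toy stand-ins (placeholders, not the provided classes):
class MeanPredictor:
    def fit(self, X, T): self.c = T.mean()
    def predict(self, X): return np.full(len(X), self.c)

class Resampler:
    def __init__(self, X, T, seed=0):
        self.X, self.T = X, T
        self.rng = np.random.default_rng(seed)
    def sample(self):
        idx = self.rng.integers(0, len(self.X), size=len(self.X))
        return self.X[idx], self.T[idx]

X = np.linspace(0, 1, 50)[:, None]
T = np.sin(2 * np.pi * X[:, 0])
b, v, e = biasVarianceRegression(Resampler(X, T), MeanPredictor(), X, T, 100)
print(b, v, e)  # e equals b + v up to rounding
```

The decomposition Error = Bias² + Var holds exactly for the empirical averages, which makes e ≈ b + v a convenient correctness check.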
Classification Case (15 P)
We consider here the Kullback-Leibler divergence as a measure of classification error. As derived in the theoretical exercise, the bias-variance decomposition for this error is:
Bias(Y) = DKL(T ‖ R)
Var(Y) = EY[DKL(R ‖ Y)]
Error(Y) = EY[DKL(T ‖ Y)]
where R is the distribution that minimizes its expected KL divergence from the probability distribution estimator Y (see the theoretical exercise for how it is computed exactly), and where T is the target class distribution.
Task: Implement the KL-based Bias-Variance Decomposition defined above. The function should repeatedly sample training sets from the sampler (as many times as specified by the argument nbsamples), learn the predictor on them, and evaluate the bias, variance and error on the out-of-sample distribution given by X and T.
Your implementation can be tested with the following code:
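As in the regression case, the original test cell is not shown here. A sketch of the KL-based decomposition, using the normalized geometric mean from Exercise 2(a) as the mean distribution R, with toy stand-ins for the classifier and sampler (placeholders, not the provided classes):

```python
import numpy as np

def biasVarianceClassification(sampler, predictor, X, T, nbsamples):
    # Y has shape (nbsamples, npoints, nclasses): one predicted class
    # distribution per resampled training set and test point.
    Y = []
    for _ in range(nbsamples):
        Xs, Ts = sampler.sample()
        predictor.fit(Xs, Ts)
        Y.append(predictor.predict(X))
    Y = np.array(Y)

    # Mean distribution R: the normalized geometric mean (Exercise 2a).
    R = np.exp(np.log(Y).mean(axis=0))
    R /= R.sum(axis=1, keepdims=True)

    def kl(P, Q):
        return (P * (np.log(P) - np.log(Q))).sum(axis=-1)

    bias = kl(T, R).mean()       # D_KL(T || R), averaged over test points
    variance = kl(R, Y).mean()   # E_Y[D_KL(R || Y)]
    error = kl(T, Y).mean()      # E_Y[D_KL(T || Y)]
    return bias, variance, error

# Toy stand-ins (placeholders, not the provided classes):
class CountPredictor:
    def fit(self, X, T):
        counts = np.bincount(T, minlength=3) + 1.0  # Laplace smoothing
        self.p = counts / counts.sum()
    def predict(self, X):
        return np.tile(self.p, (len(X), 1))

class Resampler:
    def __init__(self, X, T, seed=0):
        self.X, self.T = X, T
        self.rng = np.random.default_rng(seed)
    def sample(self):
        idx = self.rng.integers(0, len(self.X), size=len(self.X))
        return self.X[idx], self.T[idx]

labels = np.array([0, 0, 0, 1, 1, 2, 0, 1, 2, 0])
Xtrain = np.arange(10)[:, None]
Xtest = np.arange(4)[:, None]
Tdist = np.tile([0.5, 0.3, 0.2], (4, 1))  # target class distributions

b, v, e = biasVarianceClassification(Resampler(Xtrain, labels), CountPredictor(),
                                     Xtest, Tdist, 200)
print(b, v, e)  # e equals b + v up to rounding
```

As in the regression case, the decomposition holds exactly for the empirical averages, so e ≈ b + v is a good sanity check for the implementation.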