Exercise 1: Maximum-Likelihood Estimation (5+5+5+5 P)
We consider the problem of estimating, using the maximum-likelihood approach, the parameters λ, η > 0 of the probability distribution:
p(x, y) = λη exp(−λx − ηy)
supported on R₊². We consider a dataset D = ((x1, y1), ..., (xN, yN)) composed of N independent draws from this distribution.
(a) Show that x and y are independent.
(b) Derive a maximum likelihood estimator of the parameter λ based on D.
(c) Derive a maximum likelihood estimator of the parameter λ based on D under the constraint η = 1/λ.
(d) Derive a maximum likelihood estimator of the parameter λ based on D under the constraint η = 1 − λ.
Exercise 2: Maximum Likelihood vs. Bayes (5+10+15 P)
An unfair coin is tossed seven times and the event (head or tail) is recorded at each iteration. The observed sequence of events is
D = (x1,x2,...,x7) = (head,head,tail,tail,head,head,head).
We assume that all tosses x1,x2,... have been generated independently following the Bernoulli probability distribution
P(x | θ) = θ if x = head, and P(x | θ) = 1 − θ if x = tail,
where θ ∈ [0, 1] is an unknown parameter.
(a) State the likelihood function P(D | θ), which depends on the parameter θ.
(b) Compute the maximum likelihood solution θˆ, and evaluate for this parameter the probability that the next two tosses are “head”, that is, evaluate P(x8 = head, x9 = head | θˆ).
(c) We now adopt a Bayesian view on this problem, where we assume a prior distribution for the parameter θ defined as:
p(θ) = 1 if 0 ≤ θ ≤ 1, and p(θ) = 0 otherwise.
Compute the posterior distribution p(θ | D), and evaluate the probability that the next two tosses are head, that is,
∫ P(x8 = head, x9 = head | θ) p(θ | D) dθ.
Exercise 3: Convergence of Bayes Parameter Estimation (5+5 P)
We consider Section 3.4.1 of Duda et al., where the data is generated according to the univariate probability density p(x | µ) ∼ N(µ, σ²), where σ² is known and where µ is unknown with prior distribution p(µ) ∼ N(µ0, σ0²). Having sampled a dataset D from the data-generating distribution, the posterior probability distribution over the unknown parameter µ becomes p(µ | D) ∼ N(µn, σn²), where
µn = (nσ0² / (nσ0² + σ²)) µˆn + (σ² / (nσ0² + σ²)) µ0   and   σn² = σ0² σ² / (nσ0² + σ²),
with µˆn = (1/n) ∑_{k=1}^{n} xk denoting the sample mean.
(a) Show that the variance of the posterior can be upper-bounded as σn² ≤ min(σ²/n, σ0²), that is, the variance of the posterior is bounded both by the uncertainty of the data mean and by that of the prior.
(b) Show that the mean of the posterior can be lower- and upper-bounded as min(µˆn, µ0) ≤ µn ≤ max(µˆn, µ0), that is, the mean of the posterior distribution lies somewhere on the segment between the mean of the prior distribution and the sample mean.
Exercise 4: Programming (40 P)
Download the programming files on ISIS and follow the instructions.
Maximum Likelihood Parameter Estimation
In this first exercise, we would like to use the maximum-likelihood method to estimate the best parameter of a data density model p(x | θ) with respect to some dataset D = (x1, …, xN), and use that approach to build a classifier. Assuming the data is generated independently and identically distributed (iid.), the dataset likelihood is given by
p(D | θ) = ∏_{k=1}^{N} p(xk | θ)
and the maximum likelihood solution is then computed as
θˆ = arg max_θ p(D | θ) = arg max_θ log p(D | θ)
where the log term can also be expressed as a sum, i.e.
log p(D | θ) = ∑_{k=1}^{N} log p(xk | θ).
As a first step, we load some useful libraries for numerical computations and plotting.
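Such a cell could look as follows, assuming numpy is used for the numerical computations (as in the rest of the notebook) and matplotlib for the plots:

import numpy                     # numerical computations (arrays, broadcasting)
import matplotlib.pyplot as plt  # plotting of likelihood curves and decision functions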
We now consider the univariate data density model
p(x | θ) = 1 / (π (1 + (x − θ)²)),
also known as the Cauchy distribution with fixed scale parameter γ = 1 and unknown location parameter θ. Compared to the Gaussian distribution, the Cauchy distribution is heavy-tailed, which can be useful for handling outliers in the data generation process. The probability density function is implemented below.
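A minimal sketch of such a density function, relying on numpy broadcasting so that scalars or arrays of compatible shapes can be passed (the name pdf and its signature are assumptions), is:

def pdf(x, theta):
    # Cauchy density with location parameter theta and fixed scale gamma = 1
    return 1.0 / (numpy.pi * (1.0 + (x - theta)**2))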
Note that the function can be called with scalars or with numpy arrays; if arrays of different shapes are passed, numpy broadcasting rules apply. Our first step will be to implement a function that estimates the optimal parameter θˆ in the maximum likelihood sense for some dataset D.
Task (10 P):
Implement a function that takes a dataset D as input (given as one-dimensional array of numbers) and a list of candidate parameters θ (also given as a one-dimensional array), and returns a one-dimensional array containing the log-likelihood w.r.t. the dataset D for each parameter θ.
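One possible sketch of this function is given below; the name ll matches the hint given further down, while the exact argument order is an assumption.

def ll(D, theta):
    # log-likelihood of the dataset D for each candidate parameter value in theta
    # D has shape (N,), theta has shape (T,); broadcasting yields a (T, N) array
    logp = numpy.log(pdf(D[numpy.newaxis, :], theta[:, numpy.newaxis]))
    # sum the log-densities over the data points, leaving one value per candidate parameter
    return logp.sum(axis=1)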
We observe that the likelihood has two peaks: one around θ = −0.5 and one around θ = 2. The second peak is higher and is therefore retained as the maximum likelihood solution.
Building a Classifier
We now would like to use the maximum likelihood technique to build a classifier. We consider a labeled dataset where the data associated with the two classes are given by:
In [5]:
D1 = numpy.array([ 2.803, -1.563, -0.853, 2.212, -0.334, 2.503])
D2 = numpy.array([-4.510, -3.316, -3.050, -3.108, -2.315])
To be able to classify new data points, we consider the discriminant function
g(x) = logP(x | θˆ1) − logP(x | θˆ2) + logP(ω1) − logP(ω2)
where the first two terms can be computed from our maximum likelihood estimates, and the last two terms are the prior class probabilities. The classifier decides ω1 if g(x) > 0 and ω2 if g(x) < 0. We would like to implement this maximum-likelihood-based classifier.
Tasks (10 P):
Implement the function fit that receives as input a vector of candidate parameters θ and the dataset associated with each class, and produces the maximum likelihood parameter estimates. (Hint: from your function fit, you can call the function ll you have previously implemented.)
Implement the function predict that takes as input the prior probability of each class and a vector of points X on which to evaluate the discriminant function, and that outputs a vector containing the value of g for each point in X. A possible sketch of both functions is given below.
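A minimal sketch of these two functions, reusing pdf and ll from above (the exact signatures expected by the programming files may differ; the ones below are assumptions), could be:

def fit(theta, D1, D2):
    # maximum likelihood estimate for each class: the candidate with the highest log-likelihood
    theta1 = theta[numpy.argmax(ll(D1, theta))]
    theta2 = theta[numpy.argmax(ll(D2, theta))]
    return theta1, theta2

def predict(theta1, theta2, p1, p2, X):
    # discriminant g(x) = log p(x|theta1) - log p(x|theta2) + log P(w1) - log P(w2)
    return (numpy.log(pdf(X, theta1)) - numpy.log(pdf(X, theta2))
            + numpy.log(p1) - numpy.log(p2))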
Once these two functions are implemented, the decision function that the classifier implements can be visualized.
Here, we observe that the model essentially learns a threshold classifier with a threshold of approximately −0.5. However, the threshold appears to be too high to properly classify the data. One reason for this is that the maximum likelihood estimate retains only the single best parameter: the model for the first class focuses mainly on the peak at x = 2 and treats examples x < 0 as outliers, without considering the possibility that the peak at θ = 2 might actually correspond to the outliers.
Bayes Parameter Estimation
Let us now bypass the computation of a maximum likelihood estimate of the parameters and adopt instead a full Bayesian approach. We consider the same data density model and datasets as in the maximum likelihood exercise, but we now include a prior distribution over the parameters. Specifically, we set for both classes the prior distribution:
p(θ) = (1 / (10π)) · 1 / (1 + (θ/10)²)
Given a dataset D, the posterior distribution for the unknown parameter θ can then be obtained from the Bayes rule:
p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
The integration can be performed numerically using the trapezoidal rule.
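For instance, numpy.trapz approximates the normalization integral from function values tabulated on a linearly spaced grid; as a quick sanity check on the prior above (the grid is an assumption chosen for illustration):

theta = numpy.linspace(-1000, 1000, 100001)             # linearly spaced parameter grid
scores = 1.0 / (10 * numpy.pi * (1 + (theta / 10)**2))  # prior evaluated on the grid
print(numpy.trapz(scores, theta))                       # approximately 0.99; the heavy tails carry the rest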
Task (10 P):
Implement the prior and posterior functions below. These functions receive as input a vector of parameters θ (assumed to be sorted from smallest to largest, linearly spaced, and covering the range of values where most of the probability mass lies). The posterior function also receives a dataset D as input. Both functions return a vector containing the probability score associated with each value of θ.
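A minimal sketch of these two functions, reusing the function ll from the maximum likelihood part and normalizing with the trapezoidal rule (the signatures are assumptions), could be:

def prior(theta):
    # Cauchy prior with location 0 and scale 10, evaluated at each value of theta
    return 1.0 / (10 * numpy.pi * (1 + (theta / 10)**2))

def posterior(theta, D):
    # log-likelihood of D at each candidate theta; shifting by the maximum before
    # exponentiating avoids numerical underflow (the constant cancels in the normalization)
    L = ll(D, theta)
    unnorm = numpy.exp(L - L.max()) * prior(theta)
    # normalize with the trapezoidal rule so the posterior integrates to one over the grid
    return unnorm / numpy.trapz(unnorm, theta)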
We observe that the posterior distribution is more concentrated on the specific values of the parameter that explain the dataset well. In particular, we observe the same two peaks around θ = −0.5 and θ = 2 as in the maximum likelihood exercise.
Building a Classifier
We now would like to build a Bayes classifier based on the discriminant function
h(x) = logP(x | D1) − logP(x | D2) + logP(ω1) − logP(ω2)
where the dataset-conditioned densities are obtained from the original data density model and the parameter posterior as
p(x | Dj) = ∫p(x | θ)p(θ | Dj)dθ
Tasks (10 P):
Implement a function fit that produces the parameter posteriors p(θ | D1) and p(θ | D2).
Implement a function predict computing the new discriminant function h based on the dataset-conditioned data densities.
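A minimal sketch of these functions, approximating the integral over θ with the trapezoidal rule (the signatures are chosen for illustration and may differ from the ones in the programming files), could be:

def fit(theta, D1, D2):
    # parameter posterior for each class
    return posterior(theta, D1), posterior(theta, D2)

def predict(theta, post1, post2, p1, p2, X):
    # dataset-conditioned densities p(x|Dj), obtained by integrating p(x|theta) p(theta|Dj)
    # over the parameter grid; broadcasting produces an array of shape (len(X), len(theta))
    px = pdf(X[:, numpy.newaxis], theta[numpy.newaxis, :])
    px1 = numpy.trapz(px * post1, theta, axis=1)
    px2 = numpy.trapz(px * post2, theta, axis=1)
    # discriminant h(x) = log p(x|D1) - log p(x|D2) + log P(w1) - log P(w2)
    return numpy.log(px1) - numpy.log(px2) + numpy.log(p1) - numpy.log(p2)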
The discriminant function h can then be evaluated for each point to be predicted. However, the quality of the prediction differs from that of the maximum likelihood method. In the plot below, we compare the ML and Bayes approaches.
We observe that the Bayes classifier generally produces lower output scores and that its decision boundary is noticeably shifted to the left, leading to better predictions for the current data. In this particular case, the difference between the two models can be explained by the fact that the Bayesian one better accounts for the possibility that negative examples from the first class are not necessarily outliers.