Exercise 1: Euclidean distance classifier. A Euclidean distance classifier represents each class k = 1,...,K by a prototype vector µk ∈ R^D and classifies a pattern x ∈ R^D as the class of its closest prototype: k∗ = argmin_{k=1,...,K} ‖x − µk‖. Prove that a Gaussian classifier with shared isotropic covariances (i.e., of the form Σk = σ²I for k = 1,...,K, where σ > 0) and equal class priors (i.e., p(C1) = ··· = p(CK) = 1/K) is equivalent to a Euclidean distance classifier. Prove that the class discriminant functions g1(x),...,gK(x) are linear and give the expression that defines them.
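Before writing the proof, the claimed equivalence can be checked numerically; the sketch below (prototypes, test point and σ are made up) verifies that the closest-prototype decision and the Gaussian log-posterior decision agree:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, sigma = 3, 2, 0.5               # made-up sizes and noise level
mu = rng.normal(size=(K, D))          # made-up class prototypes / Gaussian means
x = rng.normal(size=D)                # a made-up test pattern

# Euclidean distance classifier: pick the closest prototype.
k_euclid = int(np.argmin(np.linalg.norm(x - mu, axis=1)))

# Gaussian classifier with shared isotropic covariance sigma^2*I and equal priors:
# pick the class with the largest log-posterior (shared constants and priors cancel).
log_post = -0.5 * np.sum((x - mu) ** 2, axis=1) / sigma**2
k_gauss = int(np.argmax(log_post))

assert k_euclid == k_gauss
```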
Exercise 2: bias and variance of an estimator. Assume we have a sample X =
{x1,...,xN} ⊂ R of N iid (independent identically distributed) scalar random variables, each of which is drawn from a Gaussian distribution N(µ, σ²) with µ ∈ R and σ > 0. We want to estimate the mean µ of this Gaussian by computing a statistic of the sample X. Consider the following four different statistics of the sample:
1. φ1(X) = 7.
2. φ2(X) = x1.
3. φ3(X) = (1/N) ∑_{n=1}^{N} xn.
4. φ4(X) = x1x2.
For each statistic φ, compute:
• Its bias b_µ(φ) = E_X{φ(X)} − µ.
• Its variance var{φ} = E_X{(φ(X) − E_X{φ(X)})²}.
• Its mean square error e(φ, µ) = E_X{(φ(X) − µ)²}.
Based on that, answer the following for each estimator (statistic): is it unbiased? is it consistent? Hint: expectations wrt the distribution of the N-point sample X are like this one:
E_X{φ(X)} = ∫ φ(x1,...,xN) p(x1,...,xN) dx1 ··· dxN, which by the iid assumption equals ∫ φ(x1,...,xN) p(x1) ··· p(xN) dx1 ··· dxN.
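The analytic answers can be sanity-checked by simulation; the sketch below (arbitrary µ, σ, N, and taking φ4 as the product x1·x2 as written) estimates the bias, variance and mean square error of each statistic empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, S = 2.0, 1.5, 10, 200_000        # arbitrary parameters; S = number of simulated samples X
X = rng.normal(mu, sigma, size=(S, N))         # S independent N-point samples

stats = {
    "phi1 = 7":      np.full(S, 7.0),
    "phi2 = x1":     X[:, 0],
    "phi3 = mean":   X.mean(axis=1),
    "phi4 = x1*x2":  X[:, 0] * X[:, 1],
}
for name, phi in stats.items():
    bias = phi.mean() - mu                     # empirical E{phi} - mu
    var = phi.var()                            # empirical var{phi}
    mse = np.mean((phi - mu) ** 2)             # empirical E{(phi - mu)^2}
    print(f"{name:14s} bias={bias:+.3f}  var={var:.3f}  mse={mse:.3f}")
```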
Exercise 3: PCA and LDA. Consider 2D data points coming from a mixture of two Gaussians with equal proportions, different means, and equal, diagonal covariances (where µ, σ1, σ2 > 0):
x ∈ R²:  p(x) = π1 p(x|1) + π2 p(x|2),  p(x|1) ∼ N(µ1, Σ1),  p(x|2) ∼ N(µ2, Σ2),
with π1 = π2 = 1/2, Σ1 = Σ2 = diag(σ1², σ2²), and different means µ1 ≠ µ2.
1. Compute the mean µ and covariance Σ of the mixture distribution p(x).
Hint: let p(x) = ∑_{k=1}^{K} πk p(x|k) for x ∈ R^D be a mixture of K densities, where π1,...,πK ∈ [0,1] with ∑_{k=1}^{K} πk = 1 are the component proportions (prior probabilities) and p(x|k), for k = 1,...,K, the component densities (e.g. Gaussian, but not necessarily). Let µk and Σk be the mean and covariance of component density k, for k = 1,...,K.
Then, the mean and covariance of the mixture are (you should be able to prove this statement):
µ = ∑_{k=1}^{K} πk µk,   Σ = ∑_{k=1}^{K} πk (Σk + (µk − µ)(µk − µ)^T).
2. Compute the eigenvalues λ1 ≥ λ2 ≥ 0 and corresponding eigenvectors u1, u2 ∈ R² of Σ. Can we have λ2 > 0?
3. Find the PCA projection to dimension 1.
4. Compute the within-class and between-class scatter matrices S_W, S_B of p.
5. Compute the eigenvalues ν1 ≥ ν2 ≥ 0 and corresponding eigenvectors v1, v2 ∈ R² of S_W^{-1} S_B. Can we have ν2 > 0?
6. Compute the LDA projection.
7. When does PCA find the same projection as LDA? Give a condition and explain it.
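The whole pipeline of this exercise can be checked numerically; the sketch below uses assumed values for the means and the diagonal covariance (not the ones in the original statement), computes the mixture moments via the hint, and compares the PCA and LDA directions:

```python
import numpy as np

# Assumed illustrative values (NOT the ones in the original statement).
pi1 = pi2 = 0.5
mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 1.0])
Sigma12 = np.diag([1.0**2, 2.0**2])            # shared diagonal covariance diag(s1^2, s2^2)

# 1. Mixture mean and covariance (hint formula).
mu = pi1 * mu1 + pi2 * mu2
Sigma = (pi1 * (Sigma12 + np.outer(mu1 - mu, mu1 - mu)) +
         pi2 * (Sigma12 + np.outer(mu2 - mu, mu2 - mu)))

# 2-3. PCA: eigenvectors of Sigma; project on the leading one.
lam, U = np.linalg.eigh(Sigma)                 # eigenvalues in ascending order
u1 = U[:, -1]

# 4-6. LDA with two classes: S_W^{-1} S_B has rank 1,
# so the LDA direction is proportional to S_W^{-1} (mu1 - mu2).
S_W = pi1 * Sigma12 + pi2 * Sigma12            # one common convention for S_W
S_B = (pi1 * np.outer(mu1 - mu, mu1 - mu) +
       pi2 * np.outer(mu2 - mu, mu2 - mu))
v1 = np.linalg.solve(S_W, mu1 - mu2)
v1 /= np.linalg.norm(v1)

print("PCA eigenvalues:", lam[::-1], " PCA direction:", u1, " LDA direction:", v1)
```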
Exercise 4: variations of k-means clustering (30 points). Consider the k-means error function:
E({µk}_{k=1}^{K}, Z) = ∑_{n=1}^{N} ∑_{k=1}^{K} znk ‖xn − µk‖²   s.t.   Z ∈ {0,1}^{N×K},  Z1 = 1
over the centroids µ1,...,µK and the N×K matrix of cluster assignments Z = (znk), given training points x1,...,xN ∈ R^D.
• Variation 1: in k-means, the centroids can take any value in R^D: µk ∈ R^D ∀k = 1,...,K. Now we want the centroids to take values from among the training points only: µk ∈ {x1,...,xN} ∀k = 1,...,K.
1. Design a clustering algorithm that minimizes the k-means error function while respecting the above constraint. Your algorithm should converge to a local optimum of the error function. Give the steps of the algorithm explicitly (one possible scheme is sketched in code below).
2. Can you imagine when this algorithm would be useful, or preferable to k-means?
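For question 1 of Variation 1, one possible design is a K-medoids-style alternation; a minimal sketch follows (the function name, initialization and stopping rule are my own choices, not necessarily the intended answer):

```python
import numpy as np

def kmeans_on_datapoints(X, K, iters=100, seed=0):
    """K-means with centroids constrained to be training points
    (one possible alternating scheme; neither step can increase the error)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = X[rng.choice(N, size=K, replace=False)]       # initialize from the data
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # Update step: the new centroid of cluster k is the training point
        # minimizing the sum of squared distances to the cluster's members.
        new_centers = centers.copy()
        for k in range(K):
            members = X[z == k]
            if len(members) == 0:
                continue
            cost = ((members[:, None, :] - X[None, :, :]) ** 2).sum(axis=(0, 2))
            new_centers[k] = X[cost.argmin()]
        if np.allclose(new_centers, centers):                # converged to a local optimum
            break
        centers = new_centers
    return centers, z
```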
• Variation 2: in k-means, we seek K clusters, each characterized by a centroid µk. Imagine we seek instead K lines (or hyperplanes, in general), each characterized by a weight vector wk ∈ R^D and bias wk0 ∈ R, given a supervised dataset (see figure). Data points assigned to line k should have minimum least-squares error ∑_{n ∈ line k} (yn − wk^T xn − wk0)².
1. Give an error function that allows us to learn the lines' parameters wk, wk0 for k = 1,...,K.
2. Give an iterative algorithm that minimizes that error function (one possible scheme is sketched in code below).
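For Variation 2, a natural analogue of the k-means iteration alternates between assigning each point to its best-fitting line and refitting each line by least squares; a minimal sketch (the function name and initialization are assumptions):

```python
import numpy as np

def k_lines(X, y, K, iters=50, seed=0):
    """Cluster (x_n, y_n) pairs into K lines by alternating assignment
    and per-line least-squares refitting (one possible scheme)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])         # append 1 to absorb the bias w_k0
    W = rng.normal(size=(K, D + 1))              # random initial line parameters
    for _ in range(iters):
        # Assignment step: point n goes to the line with the smallest squared residual.
        resid = (y[:, None] - Xb @ W.T) ** 2     # N x K matrix of squared residuals
        z = resid.argmin(axis=1)
        # Refit step: least-squares fit of each line to its assigned points.
        for k in range(K):
            idx = z == k
            if idx.sum() > D:                    # need enough points for a fit
                W[k], *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
    return W, z
```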
Exercise 5: mean-shift algorithm. Consider a Gaussian kernel density estimate
p(x) = (1/N) ∑_{n=1}^{N} (2πσ²)^{−D/2} exp(−½ ‖(x − xn)/σ‖²),   x ∈ R^D,
with bandwidth σ > 0, defined on training points x1,...,xN ∈ R^D.
Derive the mean-shift algorithm, which iterates the following expression:
x ← ∑_{n=1}^{N} q(n|x) xn   where   q(n|x) = exp(−½ ‖(x − xn)/σ‖²) / ∑_{n'=1}^{N} exp(−½ ‖(x − xn')/σ‖²)
until convergence to a maximum of p (or, in general, a stationary point of p, satisfying ∇p(x) = 0). Hint: take the gradient of p wrt x, equate it to zero and rearrange the resulting expression.
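Once the update has been derived, it is a one-line fixed-point iteration; a minimal sketch for the Gaussian kernel density estimate above (the function name and stopping criteria are my own choices):

```python
import numpy as np

def mean_shift_mode(x0, X, sigma, tol=1e-6, max_iter=1000):
    """Iterate x <- sum_n q(n|x) x_n for a Gaussian KDE with bandwidth sigma,
    starting from x0, until the update stops moving (a stationary point of p)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        w = np.exp(-0.5 * np.sum(((X - x) / sigma) ** 2, axis=1))   # Gaussian kernel weights
        q = w / w.sum()                                             # posterior weights q(n|x)
        x_new = q @ X                                               # weighted mean of the data
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```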
Bonus exercise: nonparametric regression. Consider the Gaussian kernel smoother
g(x) = ∑_{n=1}^{N} wn(x) yn   where   wn(x) = exp(−½ ‖(x − xn)/σ‖²) / ∑_{n'=1}^{N} exp(−½ ‖(x − xn')/σ‖²),
estimated on a training set {(x1, y1),...,(xN, yN)} with xn ∈ R^D and yn ∈ R.
1. What is g(x) if the training set has only one point (N = 1)? Explain.
Sketch the solution in 1D (i.e., when both xn,yn ∈ R). Compare with using a least-squares linear regression.
2. Prove that, with N = 2 points, we can write g(x) = α(x)y1 + (1 − α(x))y2 where α(x) can be written using the logistic function. Give the detailed expression for α(x).
Sketch the solution in 1D.
Compare with using a least-squares linear regression.
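A minimal implementation of the smoother helps with sketching both questions; the code below (assumed function name, made-up data) evaluates g(x) as a weighted average of the yn:

```python
import numpy as np

def kernel_smoother(x, X, y, sigma):
    """Gaussian kernel smoother: g(x) is the average of the y_n weighted by
    exp(-||x - x_n||^2 / (2 sigma^2))."""
    x = np.atleast_2d(x)                          # query points, M x D
    sq = ((x[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * sq / sigma**2)              # M x N kernel weights
    return (w @ y) / w.sum(axis=1)

# N = 1: the single weight cancels, so g(x) = y_1 everywhere (a constant).
# N = 2: g(x) = a(x) y_1 + (1 - a(x)) y_2, with a(x) a logistic function of x (question 2).
X = np.array([[0.0], [2.0]])                      # made-up 1D training inputs
y = np.array([1.0, 3.0])
print(kernel_smoother(np.array([1.0]), X, y, sigma=0.5))   # midpoint -> average of y_1, y_2
```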