CSE176 Homework 2: Euclidean distance classifier and bias and variance of an estimator

Exercise 1: Euclidean distance classifier. A Euclidean distance classifier represents each class k = 1,...,K by a prototype vector µk ∈ RD and classifies a pattern x ∈ RD as the class of its closest prototype: k∗ = argmin_{k=1,...,K} ∥x − µk∥. Prove that a Gaussian classifier with shared isotropic covariances (i.e., of the form Σk = σ²I for k = 1,...,K, where σ > 0) and equal class priors (i.e., p(C1) = ··· = p(CK) = 1/K) is equivalent to a Euclidean distance classifier. Prove that the class discriminant functions g1(x),...,gK(x) are linear and give the expression that defines them.
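As a quick numerical illustration (not part of the required proof), the sketch below compares nearest-prototype classification with a linear discriminant of the form gk(x) = µk^T x − ½∥µk∥², which is one expression consistent with the Euclidean rule; the prototypes and test points are made-up random values.

```python
# Minimal numerical check (not a proof): nearest-prototype classification vs.
# the linear discriminant g_k(x) = mu_k^T x - 0.5*||mu_k||^2.
# The prototypes and test points are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
D, K = 2, 3
mus = rng.normal(size=(K, D))        # class prototypes mu_1, ..., mu_K
X = rng.normal(size=(10, D))         # some test patterns

# Euclidean distance classifier: argmin_k ||x - mu_k||
dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)   # (10, K)
labels_euclidean = dists.argmin(axis=1)

# Linear discriminants: argmax_k  mu_k^T x - 0.5*||mu_k||^2
g = X @ mus.T - 0.5 * np.sum(mus ** 2, axis=1)                    # (10, K)
labels_linear = g.argmax(axis=1)

print(np.array_equal(labels_euclidean, labels_linear))            # True
```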

Exercise 2: bias and variance of an estimator. Assume we have a sample X = {x1,...,xN} ⊂ R of N iid (independent identically distributed) scalar random variables, each of which is drawn from a Gaussian distribution N(µ,σ²) with µ ∈ R and σ > 0. We want to estimate the mean µ of this Gaussian by computing a statistic of the sample X. Consider the following four different statistics of the sample:

1.    φ1(X) = 7.

2.    φ2(X) = x1.

3.    φ3(X) = (1/N) ∑_{n=1}^N xn.

4.    φ4(X) = x1x2.

For each statistic φ, compute:

•    Its bias bµ(φ) = EX {φ(X)} − µ.

•    Its variance var {φ} = EX {(φ(X) − EX {φ(X)})2}.

•    Its mean square error e(φ,µ) = EX {(φ(X) − µ)2}.

Based on that, answer the following for each estimator (statistic): is it unbiased? is it consistent? Hint: expectations wrt the distribution of the N-point sample X are like this one:

EX{φ(X)} = ∫ φ(x1,...,xN) p(x1,...,xN) dx1···dxN, which by the iid assumption equals ∫ φ(x1,...,xN) p(x1)···p(xN) dx1···dxN.
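A Monte Carlo simulation is a convenient sanity check for the analytic answers. The sketch below estimates the bias, variance and mean square error of the four statistics empirically; the values of µ, σ, N and the number of trials are arbitrary illustrative choices.

```python
# Monte Carlo sanity check of bias, variance and MSE for the four statistics.
# mu, sigma, N and n_trials are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, n_trials = 2.0, 1.5, 20, 200_000

samples = rng.normal(mu, sigma, size=(n_trials, N))   # n_trials samples X of size N

stats = {
    "phi1 = 7":           np.full(n_trials, 7.0),
    "phi2 = x1":          samples[:, 0],
    "phi3 = sample mean": samples.mean(axis=1),
    "phi4 = x1*x2":       samples[:, 0] * samples[:, 1],
}

for name, phi in stats.items():
    bias = phi.mean() - mu                 # estimate of E{phi} - mu
    var = phi.var()                        # estimate of var{phi}
    mse = np.mean((phi - mu) ** 2)         # estimate of E{(phi - mu)^2}
    print(f"{name:20s} bias = {bias:+.3f}   var = {var:.3f}   mse = {mse:.3f}")
```

(Consistency concerns the behaviour as N grows, so re-running the simulation for several values of N is informative.)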

Exercise 3: PCA and LDA. Consider 2D data points coming from a mixture of two Gaussians with equal proportions, different means, and equal, diagonal covariances (where µ, σ1, σ2 > 0):

                       x ∈ R2:  p(x) = π1 p(x|1) + π2 p(x|2),        p(x|1) ∼ N(µ1,Σ1),    p(x|2) ∼ N(µ2,Σ2),

with π1 = π2 = 1/2, distinct means µ1 ≠ µ2 (determined by the parameter µ), and equal diagonal covariances Σ1 = Σ2 = diag(σ1², σ2²).

1.    Compute the mean µ and covariance Σ of the mixture distribution p(x).

Hint: let p(x) = ∑_{k=1}^K πk p(x|k) for x ∈ RD be a mixture of K densities, where π1,...,πK ∈ [0,1] with ∑_{k=1}^K πk = 1 are the component proportions (prior probabilities) and p(x|k), for k = 1,...,K, the component densities (e.g. Gaussian, but not necessarily). Let µk and Σk be the mean and covariance of component density k, for k = 1,...,K.

Then, the mean and covariance of the mixture are (you should be able to prove this statement):

µ = ∑_{k=1}^K πk µk,        Σ = ∑_{k=1}^K πk (Σk + µk µk^T) − µ µ^T.

(These moments, together with the PCA and LDA quantities below, are also computed numerically in the sketch after this exercise.)

2. Compute the eigenvalues λ1 ≥ λ2 ≥ 0 and corresponding eigenvectors u1,u2 ∈ R2 of Σ. Can we have λ2 > 0?

3. Find the PCA projection to dimension 1.

4. Compute the within-class and between-class scatter matrices S_W, S_B of p.

5. Compute the eigenvalues ν1 ≥ ν2 ≥ 0 and corresponding eigenvectors v1,v2 ∈ R2 of S_W^{-1} S_B. Can we have ν2 > 0?

6. Compute the LDA projection.

7. When does PCA find the same projection as LDA? Give a condition and explain it.
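The quantities in this exercise can also be checked numerically. The sketch below uses made-up values for the component means and the shared covariance (substitute the ones given in the exercise) and assumes the population-level scatter matrices S_W = ∑k πk Σk and S_B = ∑k πk (µk − µ)(µk − µ)^T; if your course uses different definitions, adjust accordingly.

```python
# Numerical illustration of the mixture moments and the PCA/LDA directions.
# The component parameters are made up; the scatter-matrix definitions
# (S_W = sum_k pi_k Sigma_k, S_B = sum_k pi_k (mu_k - mu)(mu_k - mu)^T) are assumptions.
import numpy as np

pi = np.array([0.5, 0.5])                     # equal proportions
mus = np.array([[ 2.0, 0.0],
                [-2.0, 0.0]])                 # component means mu_1, mu_2 (made up)
Sigma = np.diag([1.0, 4.0])                   # shared diagonal covariance (made up)
Sigmas = [Sigma, Sigma]

# Mixture mean and covariance: mu = sum_k pi_k mu_k,
# Cov = sum_k pi_k (Sigma_k + mu_k mu_k^T) - mu mu^T.
mu = sum(p * m for p, m in zip(pi, mus))
cov = sum(p * (S + np.outer(m, m)) for p, m, S in zip(pi, mus, Sigmas)) - np.outer(mu, mu)

# PCA: eigenvectors of the mixture covariance (eigh returns ascending eigenvalues).
lam, U = np.linalg.eigh(cov)
print("PCA eigenvalues:", lam[::-1], " u1 =", U[:, -1])

# LDA: eigenvectors of S_W^{-1} S_B.
S_W = sum(p * S for p, S in zip(pi, Sigmas))
S_B = sum(p * np.outer(m - mu, m - mu) for p, m in zip(pi, mus))
nu, V = np.linalg.eig(np.linalg.solve(S_W, S_B))
nu, V = nu.real, V.real                       # eigenvalues are real here
print("LDA eigenvalues:", np.sort(nu)[::-1], " v1 =", V[:, nu.argmax()])
```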

Exercise 4: variations of k-means clustering (30 points).              Consider the k-means error function:

                                              E({µk}_{k=1}^K, Z) = ∑_{n=1}^N ∑_{k=1}^K znk ∥xn − µk∥²                     s.t.    Z ∈ {0,1}^{N×K},  Z1 = 1

over the centroids µ1,...,µK and cluster assignments ZN×K, given training points x1,...,xN ∈ RD.

•    Variation 1: in k-means, the centroids can take any value in RD: µk ∈ RD ∀k = 1,...,K. Now we want the centroids to take values from among the training points only: µk ∈ {x1,...,xN} ∀k = 1,...,K.

1. Design a clustering algorithm that minimizes the k-means error function but respecting the above constraint. Your algorithm should converge to a local optimum of the error function. Give the steps of the algorithm explicitly.

2. Can you imagine when this algorithm would be useful, or preferable to k-means? (One possible such algorithm is sketched after this exercise.)

•    Variation 2: in k-means, we seek K clusters, each characterized by a centroid µk. Imagine we seek instead K lines (or hyperplanes, in general), each characterized by a weight vector wk ∈ RD and bias wk0 ∈ R, given a supervised dataset {(xn, yn)}_{n=1}^N with xn ∈ RD, yn ∈ R (see figure). Data points assigned to line k should have minimum least-squares error ∑_{n ∈ line k} (yn − wk^T xn − wk0)².

1. Give an error function that allows us to learn the lines’ parameters {wk, wk0} for k = 1,...,K.

2. Give an iterative algorithm that minimizes that error function.
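For Variation 1, the sketch below shows one possible algorithm (a k-medoids-style alternation; an illustration of one approach, not necessarily the intended answer): the assignment step is unchanged and the centroid step is restricted to the training points.

```python
# Sketch of one possible algorithm for Variation 1: k-means with centroids
# restricted to the training points (k-medoids-style). Initialization and
# tie-breaking are arbitrary choices here.
import numpy as np

def kmeans_on_datapoints(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    centroid_idx = rng.choice(N, size=K, replace=False)   # initial centroids = random training points
    for _ in range(n_iters):
        mus = X[centroid_idx]                              # current centroids, (K, D)
        # Assignment step (as in k-means): nearest centroid in Euclidean distance.
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z = d2.argmin(axis=1)
        # Centroid step, restricted to the training points: for each cluster, pick
        # the training point minimizing the sum of squared distances to its members.
        new_idx = centroid_idx.copy()
        for k in range(K):
            members = X[z == k]
            if len(members) == 0:
                continue                                   # keep the old centroid for an empty cluster
            cost = ((X[:, None, :] - members[None, :, :]) ** 2).sum(axis=(1, 2))  # (N,)
            new_idx[k] = cost.argmin()
        if np.array_equal(new_idx, centroid_idx):          # no change: stop
            break
        centroid_idx = new_idx
    return centroid_idx, z

# Example usage with made-up data:
X = np.random.default_rng(1).normal(size=(200, 2))
idx, z = kmeans_on_datapoints(X, K=3)
print("centroid indices:", idx)
```

Neither step increases the k-means error and there are finitely many configurations, so the alternation terminates at a local optimum in the sense required by the exercise.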

Exercise 5: mean-shift algorithm.             Consider a Gaussian kernel density estimate

                                                 p(x) = (1/N) ∑_{n=1}^N (2πσ²)^{−D/2} exp(−∥x − xn∥² / (2σ²)),        x ∈ RD,

with bandwidth σ > 0 and training points x1,...,xN ∈ RD.

Derive the mean-shift algorithm, which iterates the following expression:

                   x ← ∑_{n=1}^N p(n|x) xn          where          p(n|x) = exp(−∥x − xn∥²/(2σ²)) / ∑_{n′=1}^N exp(−∥x − xn′∥²/(2σ²)),

until convergence to a maximum of p (or, in general, a stationary point of p, satisfying ∇p(x) = 0). Hint: take the gradient of p wrt x, equate it to zero and rearrange the resulting expression.
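The iteration, as written above (Gaussian kernel with bandwidth σ), takes only a few lines to implement; in the sketch below the data set, bandwidth and starting point are made-up choices.

```python
# Sketch of the mean-shift iteration for a Gaussian KDE with bandwidth sigma.
# Data, bandwidth and starting point are made-up illustrative choices.
import numpy as np

def mean_shift(x, X, sigma, tol=1e-6, max_iters=1000):
    """Iterate x <- sum_n p(n|x) x_n until convergence to a stationary point of p."""
    for _ in range(max_iters):
        w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * sigma ** 2))  # unnormalized p(n|x)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()              # weighted mean of the data
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new

X = np.random.default_rng(0).normal(size=(100, 2))    # made-up data set
print(mean_shift(X[0].copy(), X, sigma=0.5))          # converges to a nearby mode
```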

Bonus exercise: nonparametric regression. Consider the Gaussian kernel smoother

              g(x) = ∑_{n=1}^N wn(x) yn          where          wn(x) = exp(−∥x − xn∥²/(2σ²)) / ∑_{n′=1}^N exp(−∥x − xn′∥²/(2σ²)),

estimated on a training set {(xn, yn)}_{n=1}^N with xn ∈ RD and yn ∈ R. (A small numerical illustration of this smoother is given after this exercise.)

1.  What is g(x) if the training set has only one point (N = 1)? Explain.

Sketch the solution in 1D (i.e., when both xn,yn ∈ R). Compare with using a least-squares linear regression.

2.   Prove that, with N = 2 points, we can write g(x) = α(x)y1 + (1 − α(x))y2 where α(x) can be written using the logistic function. Give the detailed expression for α(x).

Sketch the solution in 1D.

Compare with using a least-squares linear regression.
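A small numerical illustration of the smoother (with made-up 1D training points and bandwidth; it does not replace the sketches and comparisons asked for above):

```python
# Gaussian kernel smoother g(x) in 1D, with made-up training points and bandwidth.
import numpy as np

def kernel_smoother(x, X_train, y_train, sigma):
    """g(x) = sum_n w_n(x) y_n, with Gaussian weights normalized to sum to 1."""
    w = np.exp(-((x - X_train) ** 2) / (2 * sigma ** 2))
    w = w / w.sum()
    return (w * y_train).sum()

X_train = np.array([-1.0, 2.0])    # N = 2 training inputs (made up)
y_train = np.array([0.5, 3.0])     # corresponding outputs (made up)
for x in np.linspace(-3.0, 4.0, 8):
    print(f"g({x:+.2f}) = {kernel_smoother(x, X_train, y_train, sigma=1.0):.3f}")
# With N = 2, the weight on y_1 is a logistic function of x (cf. item 2);
# with N = 1, the weight would be identically 1, so g(x) = y_1 for every x.
```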
