$25
Uniform convergence
You are hired by CNN to help design the sampling procedure for making their electoral predictions for the next presidential election in the (fictitious) country of Elbania.
The country of Elbania is organized into states, and there are only two candidates running in this election: One from the Elbanian Democratic party, and another from the Labor Party of Elbania. The plan for making our electorial predictions is as follows: We’ll sample m voters from each state, and ask whether they’re voting democrat. We’ll then publish, for each state, the estimated fraction of democrat voters. In this problem, we’ll work out how many voters we need to sample in order to ensure that we get good predictions with high probability.
One reasonable goal might be to set m large enough that, with high probability, we obtain uniformly accurate estimates of the fraction of democrat voters in every state. But this might require surveying very many people, which would be prohibitively expensive. So, we’re instead going to demand only a slightly lower degree of accuracy.
Specifically, we’ll say that our prediction for a state is “highly inaccurate” if the estimated fraction of democrat voters differs from the actual fraction of democrat voters within that state by more than a tolerance factor γ. CNN knows that their viewers will tolerate some small number of states’ estimates being highly inaccurate; however, their credibility would be damaged if they reported highly inaccurate estimates for too many states. So, rather than trying to ensure that all states’ estimates are within γ of the true values (which would correspond to no state’s estimate being highly inaccurate), we will instead try only to ensure that the number of states with highly inaccurate estimates is small.
To formalize the problem, let there be n states, and let m voters be drawn IID from each state. Let the actual fraction of voters in state i that voted democrat be φi. Also let Xij (1 ≤ i ≤ n,1 ≤ j ≤ m) be a binary random variable indicating whether the j-th randomly chosen voter from state i voted democrat:
1 if the jth example from the ith state voted democrat
ij =
0 otherwise
We assume that the voters correctly disclose their vote during the survey. Thus, for each value of i, we have that Xij are drawn IID from a Bernoulli(φi) distribution. Moreover, the Xij’s (for all i,j) are all mutually independent.
After the survey, the fraction of democrat votes in state i is estimated as:
Also, let Zi = 1{|φˆi − φi| γ} be a binary random variable that indicates whether the prediction in state i was highly inaccurate.
Let ψi be the probability that Zi = 1. Using the Hoeffding inequality, find an upper bound on ψi.
In this part, we prove a general result which will be useful for this problem. Let Vi and Wi (1 ≤ i ≤ k) be Bernoulli random variables, and suppose
E[Vi] = P(Vi = 1) ≤ P(Wi = 1) = E[Wi] ∀i ∈ {1,2,...k}
Let the Vi’s be mutually independent, and similarly let the Wi’s also be mutually independent. Prove that, for any value of t, the following holds: !
[Hint: One way to do this is via induction on k. If you use a proof by induction, for the base case (k = 1), you must show that the inequality holds for t < 0, 0 ≤ t < 1, and t ≥ 1.]
The fraction of states on which our predictions are highly inaccurate is given by
. Prove a reasonable closed form upper bound on the probability P(Z τ) of being highly inaccurate on more than a fraction τ of the states.
[Note: There are many possible answers, but to be considered reasonable, your bound must decrease to zero as m → ∞ (for fixed n and τ 0). Also, your bound should either remain constant or decrease as n → ∞ (for fixed m and τ 0). It is also fine
if, for some values of τ, m and n, your bound just tells us that P(Z τ) ≤ 1 (the trivial bound).]
2. More VC dimension
Let the domain of the inputs for a learning problem be X = R. Consider using hypotheses of the following form:
hθ(x) = 1{θ0 + θ1x + θ2x2 + ··· + θdxd ≥ 0},
and let H = {hθ : θ ∈ Rd+[1]} be the corresponding hypothesis class. What is the VC dimension of H? Justify your answer.
[Hint: You may use the fact that a polynomial of degree d has at most d real roots. When doing this problem, you should not assume any other non-trivial result (such as that the VC dimension of linear classifiers in d-dimensions is d + 1) that was not formally proved in class.]
3. LOOCV and SVM
Linear Case. Consider training an SVM using a linear Kernel K(x,z) = xTz on a training set {(x(i),y(i)) : i = 1,...,m} that is linearly separable, and suppose we do not use ℓ1 Let |SV | be the number of support vectors obtained when training on the entire training set. (Recall x(i) is a support vector if and only if αi 0.) Let ˆεLOOCV denote the leave one out cross validation error of our SVM. Prove that
εˆLOOCV.
General Case. Consider a setting similar to in part (a), except that we now run an SVM using a general (Mercer) kernel. Assume that the data is linearly separable in the high dimensional feature space corresponding to the kernel. Does the bound in part (a) on ˆεLOOCV still hold? Justify your answer.
4. MAP estimates and weight decay
Consider using a logistic regression model hθ(x) = g(θTx) where g is the sigmoid function, and let a training set {(x(i),y(i));i = 1,...,m} be given as usual. The maximum likelihood estimate of the parameters θ is given by
m θ Y
θML = argmax p(y(i)|x(i);θ).
i=1
If we wanted to regularize logistic regression, then we might put a Bayesian prior on the parameters. Suppose we chose the prior θ ∼ N(0,τ2I) (here, τ 0, and I is the n+1-byn + 1 identity matrix), and then found the MAP estimate of θ as:
m θ Y
θMAP = argmaxp(θ) p(y(i)|x(i),θ)
i=1
Prove that
||θMAP||2 ≤ ||θML||2
[Hint: Consider using a proof by contradiction.]
Remark. For this reason, this form of regularization is sometimes also called weight decay, since it encourages the weights (meaning parameters) to take on generally smaller values.
5. KL divergence and Maximum Likelihood
The Kullback-Leibler (KL) divergence between two discrete-valued distributions P(X),Q(X) is defined as follows:1
K
For notational convenience, we assume P(x) 0,∀x. (Otherwise, one standard thing to do is to adopt the convention that “0log0 = 0.”) Sometimes, we also write the KL divergence as KL(P||Q) = KL(P(X)||Q(X)).
The KL divergence is an assymmetric measure of the distance between 2 probability distributions. In this problem we will prove some basic properties of KL divergence, and work out a relationship between minimizing KL divergence and the maximum likelihood estimation that we’re familiar with.
(a) Nonnegativity. Prove the following:
∀P,Q KL(PkQ) ≥ 0
and
KL(PkQ) = 0 if and only if P = Q.
[Hint: You may use the following result, called Jensen’s inequality. If f is a convex function, and X is a random variable, then E[f(X)] ≥ f(E[X]). Moreover, if f is strictly convex (f is convex if its Hessian satisfies H ≥ 0; it is strictly convex if H 0; for instance f(x) = −logx is strictly convex), then E[f(X)] = f(E[X]) implies that X = E[X] with probability 1; i.e., X is actually a constant.]
Chain rule for KL divergence. The KL divergence between 2 conditional distributions P(X|Y ),Q(X|Y ) is defined as follows: K!
This can be thought of as the expected KL divergence between the corresponding conditional distributions on x (that is, between P(X|Y = y) and Q(X|Y = y)), where the expectation is taken over the random y. Prove the following chain rule for KL divergence:
KL(P(X,Y )kQ(X,Y )) = KL(P(X)kQ(X)) + KL(P(Y |X)kQ(Y |X)).
KL and maximum likelihood.
Consider a density estimation problem, and suppose we are given a training set
{x(i);i = 1,...,m}. Let the empirical distribution be.
(Pˆ is just the uniform distribution over the training set; i.e., sampling from the empirical distribution is the same as picking a random example from the training set.) Suppose we have some family of distributions Pθ parameterized by θ. (If you like, think of Pθ(x) as an alternative notation for P(x;θ).) Prove that finding the maximum likelihood estimate for the parameter θ is equivalent to finding Pθ with minimal KL divergence from Pˆ. I.e. prove:
m
argminKL(PˆkPθ) = argmax logPθ(x(i)) θ θ X
i=1
Remark. Consider the relationship between parts (b-c) and multi-variate Bernoulli Naive Bayes parameter estimation. In the Naive Bayes model we assumed Pθ is of the following form: ). By the chain rule for KL divergence, we therefore have:
n
KL(PˆkPθ) = KL(Pˆ(y)kp(y)) + XKL(Pˆ(xi|y)kp(xi|y)).
i=1
This shows that finding the maximum likelihood/minimum KL-divergence estimate of the parameters decomposes into 2n + 1 independent optimization problems: One for the class priors p(y), and one for each of the conditional distributions p(xi|y) for each feature xi given each of the two possible labels for y. Specifically, finding the maximum likelihood estimates for each of these problems individually results in also maximizing the likelihood of the joint distribution. (If you know what Bayesian networks are, a similar remark applies to parameter estimation for them.)
6. K-means for compression
In this problem, we will apply the K-means algorithm to lossy image compression, by reducing the number of colors used in an image.
The directory /afs/ir.stanford.edu/class/cs229/ps/ps3/ contains a 512x512 image of a mandrill represented in 24-bit color. This means that, for each of the 262144 pixels in the image, there are three 8-bit numbers (each ranging from 0 to 255) that represent the red, green, and blue intensity values for that pixel. The straightforward representation of this image therefore takes about 262144 × 3 = 786432 bytes (a byte being 8 bits). To compress the image, we will use K-means to reduce the image to k = 16 colors. More specifically, each pixel in the image is considered a point in the three-dimensional (r,g,b)space. To compress the image, we will cluster these points in color-space into 16 clusters, and replace each pixel with the closest cluster centroid.
Follow the instructions below. Be warned that some of these operations can take a while (several minutes even on a fast computer)![2]
Copy mandrill-large.tiff from /afs/ir.stanford.edu/class/cs229/ps/ps3 on the leland system. Start up MATLAB, and type A = double(imread(’mandrill-large.tiff’)); to read in the image. Now, A is a “three dimensional matrix,” and A(:,:,1), A(:,:,2) and A(:,:,3) are 512x512 arrays that respectively contain the red, green, and blue values for each pixel. Enter imshow(uint8(round(A))); to display the image.
Since the large image has 262144 pixels and would take a while to cluster, we will instead run vector quantization on a smaller image. Repeat (a) with mandrill-small.tiff. Treating each pixel’s (r,g,b) values as an element of R3, run K-means3 with 16 clusters on the pixel data from this smaller image, iterating (preferably) to convergence, but in no case for less than 30 iterations. For initialization, set each cluster centroid to the (r,g,b)-values of a randomly chosen pixel in the image.
Take the matrix A from mandrill-large.tiff, and replace each pixel’s (r,g,b) values with the value of the closest cluster centroid. Display the new image, and compare it visually to the original image. Hand in all your code and a printout of your compressed image (printing on a black-and-white printer is fine).
6
If we represent the image with these reduced (16) colors, by (approximately) whatfactor have we compressed the image?