1 [Bayesian interpretation of ridge regression] Consider the following data-generating process for a linear regression problem in R^d. Nature first selects d weight coefficients w1,...,wd as wi ∼ N(0, τ²) i.i.d. Given n examples x1,...,xn ∈ R^d, nature generates the output variable yi as
yi = w · xi + εi,
where w := (w1,...,wd) and εi ∼ N(0, σ²) i.i.d.
Show that finding the coefficients w1,...,wd that maximize P[w1,...,wd | (x1,y1),...,(xn,yn)] is equivalent to minimizing the ridge regression optimization criterion.
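As a quick sanity check of the setup (not a substitute for the proof), the generative process can be simulated directly. The sketch below uses numpy, picks illustrative values for d, n, τ, σ, and treats the ridge strength λ as a free parameter; none of these specific values are given in the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, tau, sigma = 5, 200, 1.0, 0.5                # illustrative values, not given in the problem

w_true = rng.normal(0.0, tau, size=d)              # w_i ~ N(0, tau^2), i.i.d.
X = rng.normal(size=(n, d))                        # some fixed design x_1, ..., x_n
y = X @ w_true + rng.normal(0.0, sigma, size=n)    # y_i = w . x_i + eps_i, eps_i ~ N(0, sigma^2)

lam = 1.0                                          # ridge strength; a free parameter in this sketch
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # minimizer of ||y - Xw||^2 + lam ||w||^2
print("true w:  ", np.round(w_true, 3))
print("ridge w: ", np.round(w_ridge, 3))
```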
2 [Combining multiple classifiers] The concept of “wisdom-of-the-crowd” posits that the collective knowledge of a group, as expressed through its aggregated actions or opinions, is superior to the decision of any one individual in the group. Here we will study a version of the “wisdom-of-the-crowd” for binary classifiers: how can one combine prediction outputs from multiple, possibly low-quality, binary classifiers to achieve an aggregate high-quality final output? Consider the following iterative procedure to combine classifier results.
Input:
- S – a set of training samples: S = {(x1,y1),...,(xm,ym)}, where each yi ∈ {−1,+1}
- T – number of iterations (also, number of classifiers to combine)
- F – a set of (possibly low-quality) classifiers. Each f ∈ F is of the form f : X → {−1,+1}
Output:
- F – a set of selected classifiers {f1,...,fT}, where each fi ∈ F.
- A – a set of combination weights {α1,...,αT}
Iterative Combination Procedure:
- Initialize distribution weights D1(i) = 1/m, for i = 1,...,m
- for t = 1,...,T do
- Define εj := Σi Dt(i)·1[fj(xi) ≠ yi] for each fj ∈ F   // εj is the weighted error of the j-th classifier w.r.t. Dt
- ft = argmin_{fj ∈ F} εj, and let εt denote its weighted error   // select the classifier with the smallest (weighted) error
- // recompute weights w.r.t. performance of ft
- Compute classifier weight αt = (1/2) ln((1 − εt)/εt)
- Compute distribution weight Dt+1(i) = Dt(i)exp(−αtyift(xi))
- Normalize distribution weights: Dt+1(i) ← Dt+1(i) / Σ_{i'} Dt+1(i')
- endfor
- return weights αt, and classifiers ft for t = 1,...,T.
Final Combined Prediction:
- For any test input x, define the aggregation function as g(x) := Σt αt ft(x), and return the prediction as sign(g(x)).
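A minimal numpy sketch of the procedure above, assuming F is given as a list of functions that each map an array of inputs to ±1 predictions (the small constant inside the logarithm is an implementation guard for the zero-error case, not part of the procedure):

```python
import numpy as np

def combine_classifiers(X, y, F, T):
    """Sketch of the iterative combination procedure.

    X : (m, d) array of training inputs
    y : (m,) array of labels in {-1, +1}
    F : list of candidate classifiers, each mapping an (m, d) array to {-1, +1} predictions
    T : number of iterations / classifiers to combine
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                              # D_1(i) = 1/m
    preds = [f(X) for f in F]                            # cache predictions of every candidate
    chosen, alphas = [], []
    for t in range(T):
        errors = [np.sum(D * (p != y)) for p in preds]   # weighted error of each f_j w.r.t. D_t
        j = int(np.argmin(errors))                       # classifier with the smallest weighted error
        eps = errors[j]
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))  # classifier weight alpha_t
        D = D * np.exp(-alpha * y * preds[j])            # reweight examples
        D = D / D.sum()                                  # normalize (divide by Z_t)
        chosen.append(F[j])
        alphas.append(alpha)
    return chosen, alphas

def predict(x, chosen, alphas):
    """Final combined prediction sign(g(x)) for a batch of test inputs x."""
    g = sum(a * f(x) for f, a in zip(chosen, alphas))
    return np.sign(g)
```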
We’ll prove the following statement: If for each iteration t there is some γt > 0 such that
εt ≤ 1/2 − γt
(that is, assuming that at each iteration the error of the classifier ft is at least γt better than random guessing), then the training error of the aggregate classifier g satisfies
err(g) := (1/m) Σi 1[sign(g(xi)) ≠ yi] ≤ exp(−2 Σt γt²).
That is, the error of the aggregate classifier g decreases exponentially fast with the number of combinations T!
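As an illustrative calculation: if every classifier is only slightly better than chance, say γt = 0.1 for all t, the bound gives err(g) ≤ exp(−0.02·T), which is about 0.14 after T = 100 rounds and about 4.5 × 10⁻⁵ after T = 500 rounds.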
(i) Let Zt := Σi Dt+1(i) (i.e., Zt denotes the normalization constant for the weighted distribution Dt+1). Show that
Πt Zt = (1/m) Σi exp(−yi g(xi)).
(ii) Show that the error of the aggregate classifier g is upper bounded by the product of the Zt:
err(g) ≤ Πt Zt.
(hint: use the fact that 0-1 loss is upper bounded by exponential loss)
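In symbols, the hint amounts to the pointwise inequality 1[y g(x) ≤ 0] ≤ exp(−y g(x)), which holds because the right-hand side is at least 1 whenever y g(x) ≤ 0 and is nonnegative everywhere.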
(iii) Show that Zt = 2√(εt(1 − εt)).
(hint: noting Zt = Σi Dt(i) exp(−αt yi ft(xi)), separate the expression into the correctly and incorrectly classified cases and express it in terms of εt and αt.)
(iv) By combining the results from (ii) and (iii), we have that err(g) ≤ Πt 2√(εt(1 − εt)). Now show that:
2√(εt(1 − εt)) ≤ exp(−2γt²).
Thus establishing that err(g) ≤ exp(−2 Σt γt²).
3 [Low-dimensional information-preserving transformations] (hashing the cube) You have a collection of nonzero distinct binary vectors x1,...,xm ∈ {0,1}^n. To facilitate later lookup, you decide to hash them to vectors of length p < n by means of a linear mapping xi ↦ Axi, where A is a p × n matrix with 0-1 entries, and all computations are performed modulo 2. Suppose the entries of the matrix are picked uniformly at random (i.e., each entry is an independent fair coin toss).
(i) Pick any 1 ≤ i ≤ m, and any b ∈ {0,1}^p. Show that the probability (over the choice of A) that xi hashes to b is exactly 1/2^p. (Hint: focus on a coordinate 1 ≤ j ≤ n for which xij = 1.)
(ii) Pick any 1 ≤ i < j ≤ m. What is the probability that xi and xj hash to the same vector? This is called a collision.
(iii) Show that if p ≥ 2 log₂ m, then with probability at least 1/2, there are no collisions among the xi. Thus: to avoid collisions, it is enough to linearly hash into O(log m) dimensions!
(question credit: Prof. Sanjoy Dasgupta)
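A small numerical illustration of this hashing scheme (the values of n and m below are arbitrary choices; p is set to 2·⌈log₂ m⌉ in the spirit of part (iii)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 100
p = 2 * int(np.ceil(np.log2(m)))                    # p = 2 * ceil(log2(m))

# m candidate binary vectors in {0,1}^n; keep only nonzero, distinct ones
X = rng.integers(0, 2, size=(m, n))
X = np.unique(X[X.any(axis=1)], axis=0)

A = rng.integers(0, 2, size=(p, n))                 # entries are independent fair coin tosses
hashes = (X @ A.T) % 2                              # each row is the p-bit hash A x_i (mod 2)

n_collisions = len(hashes) - len(np.unique(hashes, axis=0))
print(f"{len(X)} vectors hashed to {p} bits, collisions: {n_collisions}")
```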
4 [Strange consequences of high dimensionality] As discussed in class, we often represent our data in high dimensions. Thus, to understand our data better and design effective prediction algorithms, it is good to understand how things behave in high dimensions. Obviously, since we cannot visualize or imagine high-dimensional spaces, we often tend to rely on how data behave in one, two, or three dimensions and extrapolate how they may behave in hundreds of dimensions. It turns out that our low-dimensional intuition can be very misleading about data and distributions in high-dimensional spaces. In this problem we will explore this in more detail.
Consider the Gaussian distribution with mean µ and identity covariance Id in R^d. Recall that the density assigned to any point x ∈ R^d is
f(x) = (2π)^{−d/2} exp(−‖x − µ‖²/2).
(i) Show that when x = µ, x gets assigned the highest density.
(This, of course, makes sense: the Gaussian density peaks at its mean and thus x = µ has the highest density.)
(ii) If the mean has the highest density, it stands to reason that if we draw a large i.i.d. sample from the distribution, then a large fraction of the points should lie close to the mean. Let’s try to verify this experimentally. For simplicity, let the mean µ = 0 (the covariance is still Id). Draw 10,000 points i.i.d. from a Gaussian N(0, Id).
To see how far away a sampled datapoint is from the mean, we can look at the squared distance ‖x − µ‖² = ‖x‖² (that is, the squared length of the sampled datapoint when the mean is zero). Plot the histogram of the squared lengths of the samples, for dimensions d = 1, 2, 3, 5, 10, 50 and 100. You should plot all these histograms on the same figure for a better comparison.
What interesting observations do you see from this plot? Do you notice anything strange about the samples that were drawn from the high-dimensional Gaussian distribution? Do most of the samples lie close to the mean?
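A minimal numpy/matplotlib sketch of this experiment (the number of bins and the plotting style are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_samples = 10_000

plt.figure()
for d in [1, 2, 3, 5, 10, 50, 100]:
    x = rng.normal(size=(n_samples, d))        # 10,000 i.i.d. draws from N(0, I_d)
    sq_lengths = np.sum(x**2, axis=1)          # squared length ||x||^2 of each sample
    plt.hist(sq_lengths, bins=100, density=True, histtype="step", label=f"d = {d}")
plt.xlabel("squared length of sample")
plt.ylabel("density")
plt.legend()
plt.show()
```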
(iii) Let’s mathematically derive where we expect these samples to lie. That is, calculate
E_{x∼N(0,Id)}[‖x‖²].
Is the empirical plot in part (ii) in agreement with the mathematical expression you derived here?
(iv) This “strangeness” is not specific to the Gaussian distribution; you can observe something similar even for the simplest of distributions in high dimensions. Consider the uniform distribution over the cube [−1,1]^d. Just like in part (ii), draw 10,000 i.i.d. samples from this d-dimensional cube with uniform density, and plot the histogram of how far away from the origin the sample points lie (do this for d = 1, 2, 3, 5, 10, 50 and 100, again on the same plot).
Recall that the cube has side length 2, while most of the high-dimensional samples have length far greater than 2! This means that even though you are drawing uniformly from the cube, most of your samples lie near the corners (and not near the center) of the cube!
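The sketch from part (ii) carries over with the sampling line changed to draw uniformly from the cube; for completeness (again with arbitrary plotting choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_samples = 10_000

plt.figure()
for d in [1, 2, 3, 5, 10, 50, 100]:
    x = rng.uniform(-1.0, 1.0, size=(n_samples, d))   # 10,000 i.i.d. draws from the cube [-1, 1]^d
    lengths = np.sqrt(np.sum(x**2, axis=1))           # distance of each sample from the origin
    plt.hist(lengths, bins=100, density=True, histtype="step", label=f"d = {d}")
plt.xlabel("distance from the origin")
plt.ylabel("density")
plt.legend()
plt.show()
```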
(v) Again, calculate the expected (squared) length of the samples. That is, calculate
E_{x∼unif([−1,1]^d)}[‖x‖²].
Is the plot in part (iv) in agreement with the expression you derive here?