Homework 2 is meant to help you better understand the curse of dimensionality, in theory and in practice.
Problem 1: The Curse of Dimensionality
In this problem, we will study a high-dimensional statistical phenomenon called the Curse of Dimensionality. Originally coined by Bellman in the context of dynamic programming, the term generally refers to the failure of certain statistical or algorithmic procedures to scale efficiently as the input dimension grows.
We will first study this curse through the geometry of the unit $d$-dimensional $\ell_2$ sphere. Recall that the $\ell_p$ norm of a vector $x \in \mathbb{R}^d$ is defined as $\|x\|_p := \left(\sum_i |x_i|^p\right)^{1/p}$, for $p \in [1,\infty)$. To get a first grasp on the phenomenon, we will first consider a 'Gaussian' approximation of the $d$-dimensional unit sphere, by considering a Gaussian random vector $X \sim \mathcal{N}(0, I/d)$, where $I$ is the $d \times d$ identity matrix.
1. Using the Central Limit Theorem, show that $\|X\|_2 = 1 + O(1/\sqrt{d})$ with high probability. In other words, show that $\|X\|_2$ is a random variable with expectation 1 and standard deviation proportional to $1/\sqrt{d}$. [Alternative: use the $\chi^2$ distribution with $d$ degrees of freedom.] (1 point)
2. Numerically verify this property by simulating $\|X\|_2$ in dimensions $d \in \{10, 100, 1000, 10000\}$ using a sample size of $n = 1000$. (2 points)
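A minimal sketch of one way to run this check in Python, assuming NumPy is available; the random seed, print format, and the rescaled standard deviation reported in the last column are our own choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # sample size per dimension

for d in (10, 100, 1000, 10000):
    # Draw n samples X ~ N(0, I/d): standard normals rescaled by 1/sqrt(d).
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    norms = np.linalg.norm(X, axis=1)  # ||X||_2 for each sample
    print(f"d={d:>6}  mean={norms.mean():.4f}  std={norms.std():.4f}  "
          f"std*sqrt(d)={norms.std() * np.sqrt(d):.3f}")
```

The mean should stay close to 1 while the standard deviation shrinks like $1/\sqrt{d}$, so the last column should remain roughly constant across dimensions.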
This means that for large input dimension $d$, a draw $X$ from the $\mathcal{N}(0, I/d)$ Gaussian distribution concentrates around the unit sphere, $\|X\|_2 \approx 1$. Let us also verify that the Gaussian distribution is rotationally invariant:
3. Let $R \in \mathbb{R}^{d \times d}$ be any orthogonal matrix, i.e. $R^\top R = R R^\top = I$. Show that the pdfs of $X \sim \mathcal{N}(0, I/d)$ and $\tilde{X} = RX$ are the same, i.e. they do not depend on the choice of $R$. (2 points)
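As an optional, weaker empirical sanity check (not a substitute for the proof), the following sketch, assuming NumPy, builds a random orthogonal matrix via a QR decomposition and compares the empirical covariances of $X$ and $RX$; the dimension, sample size, and tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 100000

# Random orthogonal matrix R from the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

X = rng.standard_normal((n, d)) / np.sqrt(d)  # rows are draws from N(0, I/d)
RX = X @ R.T                                  # each row transformed by R

# Both empirical covariances should be close to I/d.
print(np.allclose(X.T @ X / n, np.eye(d) / d, atol=5e-3),
      np.allclose(RX.T @ RX / n, np.eye(d) / d, atol=5e-3))
```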
For our purposes, we will thus use Gaussian $\mathcal{N}(0, I/d)$ samples as de facto draws from the unit $d$-dimensional sphere.
4. If we draw two datapoints $X, X'$ i.i.d. from $\mathcal{N}(0, I/d)$, show that $|\langle X, X'\rangle| = O(1/\sqrt{d})$ with high probability, using again the CLT. In other words, show that $\langle X, X'\rangle$ is a random variable of zero mean and standard deviation proportional to $1/\sqrt{d}$. Conclude that for a constant $C > 0$, $\|X - X'\| \in (\sqrt{2} \pm C/\sqrt{d})$ for large $d$ with high probability. (2 points)
This property reflects the intuition that independent draws from a high-dimensional Gaussian distribution (or any rotationally-invariant distribution, more generally) are nearly orthogonal as $d \to \infty$, since $|\langle X, X'\rangle|$ will have variance going to 0 as $d \to \infty$.
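An optional simulation (again assuming NumPy; not required by the assignment) that illustrates both claims of question 4, with arbitrary seed and sample sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000  # number of independent pairs (X, X')

for d in (10, 100, 1000, 10000):
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    Xp = rng.standard_normal((n, d)) / np.sqrt(d)
    dots = np.einsum("ij,ij->i", X, Xp)     # <X, X'> for each pair
    dists = np.linalg.norm(X - Xp, axis=1)  # ||X - X'|| for each pair
    print(f"d={d:>6}  std(<X,X'>)*sqrt(d)={dots.std() * np.sqrt(d):.3f}  "
          f"mean ||X-X'||={dists.mean():.4f}")
```

The rescaled standard deviation of the inner product should stay roughly constant, while the mean pairwise distance approaches $\sqrt{2} \approx 1.414$.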
We will now combine this intuition with a simple supervised learning setup. Assume a target function $f^* : \mathbb{R}^d \to \mathbb{R}$ is $\beta$-Lipschitz, i.e. $|f^*(x) - f^*(x')| \le \beta \|x - x'\|$, and a dataset $\{(x_i, y_i)\}_{i=1\dots n}$ with $x_i \sim \mathcal{N}(0, I/d)$ drawn independently and $y_i = f^*(x_i)$. We will consider the Nearest-Neighbor estimator $\hat{f}_{\mathrm{NN}}$ given by $\hat{f}_{\mathrm{NN}}(x) := y_{i(x)}$, where $i(x) = \arg\min_i \|x - x_i\|$.
5. Show that $\mathbb{E}\,|\hat{f}_{\mathrm{NN}}(x) - f^*(x)| \le \beta\, \mathbb{E} \min_i \|x - x_i\|$, where the expectation is taken over the test sample $x$ and the training samples $\{x_i\}$. (2 points)
6. Let $E_n$ denote the expectation of $\min_{i=1\dots n} Y_i$, where $Y_i \sim \mathcal{N}(0,1)$ i.i.d. Show that if now $X_i \sim \mathcal{N}(\mu, \sigma^2)$ i.i.d., then $\mathbb{E}\min_i X_i = \mu + \sigma E_n$. (1 point)

7. Using the fact that $\|x - x_i\|$ and $\|x - x_j\|$ are conditionally independent given $x$, and assuming the asymptotic Gaussianity $\|X - X'\| \sim \mathcal{N}(\sqrt{2}, C/d)$ for a constant $C > 0$, show that $\mathbb{E}\min_i \|x - x_i\| \approx \sqrt{2} + \sqrt{\tfrac{C}{d}}\, E_n$. (1 point)
8. Using the fact that $E_n \approx -\sqrt{2\log n}$, conclude that as long as $\log n \ll d$, the generalisation bound in (5) is vacuous (in other words, unless the sample size is exponential in the dimension, the bound does not provide any useful learning guarantee). (2 points)
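An optional Monte-Carlo illustration of questions 7 and 8 (assuming NumPy; the number of trials and the grid of $(d, n)$ values are arbitrary): the expected nearest-neighbour distance barely drops below $\sqrt{2}$ once $d$ is moderately large, even for fairly large $n$.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_nn_distance(d, n, trials=200):
    """Monte-Carlo estimate of E min_i ||x - x_i|| with x, x_i ~ N(0, I/d)."""
    vals = []
    for _ in range(trials):
        x = rng.standard_normal(d) / np.sqrt(d)        # test point
        xs = rng.standard_normal((n, d)) / np.sqrt(d)  # training points
        vals.append(np.linalg.norm(xs - x, axis=1).min())
    return np.mean(vals)

for d in (5, 20, 100):
    for n in (10, 100, 1000):
        print(f"d={d:>3}  n={n:>4}  E min_i ||x - x_i|| ~ {mean_nn_distance(d, n):.3f}")
```

For small $d$ the minimum distance shrinks quickly with $n$; for $d = 100$ it stays close to $\sqrt{2}$, as predicted by $\sqrt{2} - \sqrt{C/d}\,\sqrt{2\log n}$.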
While this argument shows that our Lipschitz-based upper bound is cursed by dimension, let us conclude this exercise by showing that the exponential dependency on dimension is necessary. For simplicity, we will replace the $d$-dimensional $\ell_2$ sphere by the $\ell_\infty$ sphere, i.e. the cube $B = [-1,1]^d$. Let $\Omega = [-1/2, 1/2]^d$ denote a smaller cube. For $x \in \Omega$, let $\Psi(x) = \mathrm{dist}(x, \partial\Omega)$. In words, $\Psi(x)$ is the distance from $x$ to the boundary of the cube $\Omega$.
9. Show that $\Psi$ is 1-Lipschitz. [Hint: Use the triangle inequality and the definition of $\Psi$ as $\Psi(x) = \min_{y \in \partial\Omega} \|x - y\|$. Also, drawing $\Psi$ in two dimensions might help!] (2 points)
We will now use this 1-Lipschitz function supported in the 'small' cube $\Omega$ to construct a hard-to-learn function $f^*$ defined on the large cube $B = [-1,1]^d$.
10. Verify that we can fit $2^d$ copies of $\Omega$ into $B$ by translating copies of $\Omega$ appropriately. [Hint: Use the fact that both $\Omega$ and $B$ are separable in the standard basis. Also, drawing this in two dimensions should help.] (2 points)
Let now $g \in \{\pm 1\}^{2^d}$ be a binary string of length $2^d$, which we index using $d$ binary variables $z_1 = \pm 1, \dots, z_d = \pm 1$. We define
$$f^*(x) = \sum_{z = (z_1, \dots, z_d) \in \{\pm 1\}^d} g(z)\, \Psi(x - z/2) \,. \tag{0.1}$$
In words, $f^*$ is constructed by tiling $2^d$ shifted versions of the window $\Psi$, and flipping the sign of each tile with the bit $g(z)$. From part 9, the support of $f^*$ is the 'large' cube $B$.
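A minimal sketch, assuming NumPy, of one way to evaluate $f^*$ from equation (0.1); the tiling convention (the tile containing $x$ is indexed by the sign pattern of its coordinates, and $\Psi$ is extended by zero outside its tile) is our reading of the construction, and the helper name make_fstar is ours.

```python
import numpy as np
from itertools import product

def make_fstar(d, rng):
    """Return a random instance of f* from equation (0.1)."""
    # One random sign g(z) per tile index z in {-1, +1}^d.
    signs = {z: rng.choice((-1.0, 1.0)) for z in product((-1, 1), repeat=d)}

    def fstar(x):
        x = np.asarray(x, dtype=float)      # a point in B = [-1, 1]^d
        z = tuple(np.where(x >= 0, 1, -1))  # tile index: sign pattern of x
        u = x - np.array(z) / 2.0           # local coordinate inside Omega
        # Psi(u) = distance from u to the boundary of [-1/2, 1/2]^d.
        psi = max(0.0, 0.5 - np.abs(u).max())
        return signs[z] * psi

    return fstar

rng = np.random.default_rng(4)
f = make_fstar(d=2, rng=rng)
print(f([0.25, 0.25]), f([-0.75, 0.6]))
```

For question 12, evaluating this function on a fine 2-D grid and plotting the result (e.g. with matplotlib's imshow) is one way to produce the drawing.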
11. Verify that $f^*$ is 1-Lipschitz. [Hint: Given two points $x, x' \in B$, consider the line segment joining them, and let $y_k$ be its intersections with the boundaries of the tiled copies of $\Omega$, treating each resulting segment separately.] (1 point)
12. For $d = 2$, draw an instance of $f^*$ (you can use either a manual drawing or drawing software). (2 points)
13. Finally, show that if $n$, the number of training samples, satisfies $n \le 2^{d-1}$, then the generalisation error of any learning algorithm producing $\hat{f}$ will be such that
$$\frac{\mathbb{E}_{x \sim \mathrm{Unif}([-1,1]^d)} \,|f^*(x) - \hat{f}(x)|}{\mathbb{E}_{x \sim \mathrm{Unif}([-1,1]^d)} \,|f^*(x)|} \ge \frac{1}{2} \,.$$
In words, the relative generalisation error won't go to zero unless $n \gtrsim 2^d$. [Hint: Argue in terms of the tiling you constructed in question 10; what happens if no datapoint intersects a given tile?] (2 points)
14. Choosing $d \in [5, 13]$, implement this experiment, using any predictor for $f^*$ you want (e.g. a neural net) and the Mean-Squared Error (MSE) loss. Verify that the sample size $n$ required before your model starts generalising grows exponentially with $d$. For each $d$, draw two large datasets $\{x_i\}_{i=1\dots n}$ and $\{\tilde{x}_i\}_{i=1\dots n}$ with $x_i, \tilde{x}_i \sim \mathrm{Unif}([-1,1]^d)$, then draw $K = 10$ different target functions $f_k^*$, $k = 1\dots K$, by picking random bits in equation (0.1), and fit your model to the training data $\{(x_i, y_i = f_k^*(x_i))\}_{i=1\dots n}$. Then estimate your relative generalisation error using the test set $\{(\tilde{x}_i, \tilde{y}_i = f_k^*(\tilde{x}_i))\}$ (MSE error divided by the standard deviation of the target function on the test set), and average the performance across the $K$ runs. You can pick $n \in \{2^j;\ j = 5 \dots 16\}$. (3 points)
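A reduced-scale sketch of this experiment, assuming NumPy and scikit-learn's MLPRegressor; the network architecture, training budget, number of repetitions, the vectorised evaluation of $f^*$, and the normalisation of the error by the test variance are all our own choices and should be adapted to the full protocol above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)

def fstar_values(X, signs):
    """Evaluate f* from equation (0.1) at the rows of X, given one sign per tile."""
    Z = np.where(X >= 0, 1, -1)        # tile index (sign pattern) per point
    U = X - Z / 2.0                    # local coordinates inside Omega
    psi = 0.5 - np.abs(U).max(axis=1)  # Psi = distance to the tile boundary
    idx = (((Z + 1) // 2) * (2 ** np.arange(X.shape[1]))).sum(axis=1)
    return signs[idx] * psi

def relative_test_error(d, n, n_test=4096):
    signs = rng.choice((-1.0, 1.0), size=2 ** d)  # random bits g(z)
    Xtr = rng.uniform(-1, 1, size=(n, d))
    Xte = rng.uniform(-1, 1, size=(n_test, d))
    ytr, yte = fstar_values(Xtr, signs), fstar_values(Xte, signs)
    model = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
    model.fit(Xtr, ytr)
    mse = np.mean((model.predict(Xte) - yte) ** 2)
    return mse / np.var(yte)           # MSE normalised by the test variance

for d in (5, 8):
    for n in (2 ** 5, 2 ** 8, 2 ** 11, 2 ** 14):
        errs = [relative_test_error(d, n) for _ in range(3)]  # K = 3 here for speed
        print(f"d={d}  n={n:>6}  relative error ~ {np.mean(errs):.2f}")
```

With this normalisation, a relative error near 1 means the model does no better than predicting the mean, while a value near 0 means it has started to generalise; the sample size at which that transition happens should grow rapidly with $d$.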