1 Principal Component Analysis
Download three.txt and eight.txt. Each file contains 200 handwritten digits, one per line, each vectorized from a 16x16 gray scale image.
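As a minimal starting point, the following Python/NumPy sketch loads the two files; it assumes the 256 values on each line are whitespace-separated (adjust the loadtxt call if the files use a different delimiter).

import numpy as np

# Load the two digit files; each is expected to be 200 lines x 256 pixel values.
# Assumption: values are whitespace-separated (pass delimiter="," if they are CSV).
threes = np.loadtxt("three.txt")
eights = np.loadtxt("eight.txt")
print(threes.shape, eights.shape)  # expected: (200, 256) (200, 256)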
1. Each line has 256 numbers: they are the pixel values (0=black, 255=white), vectorized from the image column by column, i.e., the first column (top down), then the second column, and so on. Visualize the two gray scale images corresponding to the first line in three.txt and the first line in eight.txt. (A code sketch after item 6 illustrates one possible implementation of steps 1-6.)
2. Put the two data files together (threes first, eights next) to form an n × D matrix X, where n = 400 digits and D = 256 pixels. Note that we use an n × D size for X instead of D × n to be consistent with the convention in linear regression. The ith row of X is x_i, where x_i ∈ R^D is the ith image in the combined data set. Compute the sample mean y = (1/n) ∑_{i=1}^n x_i. Visualize y as a 16x16 gray scale image.
3. Center X using y above. Then form the sample covariance matrix S = X^T X / (n − 1), where X is now the centered matrix. Show the 5x5 submatrix S(1...5, 1...5).
4. Use appropriate software to compute the two largest eigenvalues λ1 ≥ λ2 and the corresponding eigenvectors v1, v2 of S. For example, in Matlab one can use eigs(S,2). Show the values of λ1 and λ2. Visualize v1 and v2 as two 16x16 gray scale images. Hint: their elements will not be in [0, 255], but you can shift and scale them appropriately. It is best if you can show an accompanying “colorbar” that maps gray scale to values.
5. Now we project the (centered) X down to the two PCA directions. Let V = [v1 v2] be the D × 2 matrix. The projection is simply XV. Show the resulting two coordinates for the first line in three.txt and the first line in eight.txt, respectively.
6. Now plot the 2D point cloud of the 400 digits after projection. For visual interest, color the points from three.txt red and the points from eight.txt blue. But keep in mind that PCA is an unsupervised learning method and it does not know such class labels.
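The following Python/NumPy sketch shows one possible way to carry out steps 1-6 above; it is a rough illustration, not a required implementation. It assumes whitespace-separated files as in the loading snippet earlier, uses the 1/(n−1) covariance convention, and reshapes with order='F' because the vectors list the pixels column by column (item 1).

import numpy as np
import matplotlib.pyplot as plt

# Load and stack the data: threes first, then eights (n = 400, D = 256).
threes = np.loadtxt("three.txt")
eights = np.loadtxt("eight.txt")
X = np.vstack([threes, eights])            # shape (400, 256)

def show_image(vec, title=""):
    # Display a length-256 vector as a 16x16 image; order='F' because the
    # vector lists the first column top-down, then the second column, etc.
    plt.imshow(vec.reshape(16, 16, order="F"), cmap="gray")
    plt.colorbar()                         # maps gray scale back to values
    plt.title(title)
    plt.show()

# Step 1: first digit in each file.
show_image(threes[0], "first line of three.txt")
show_image(eights[0], "first line of eight.txt")

# Step 2: sample mean y = (1/n) * sum_i x_i.
y = X.mean(axis=0)
show_image(y, "sample mean")

# Step 3: center X, then form the sample covariance matrix (1/(n-1) convention assumed).
Xc = X - y
S = Xc.T @ Xc / (X.shape[0] - 1)           # 256 x 256
print(S[:5, :5])                           # the submatrix S(1..5, 1..5)

# Step 4: two largest eigenvalues/eigenvectors (analogue of Matlab's eigs(S, 2)).
evals, evecs = np.linalg.eigh(S)           # eigh returns them in ascending order
lam1, lam2 = evals[-1], evals[-2]
v1, v2 = evecs[:, -1], evecs[:, -2]
print(lam1, lam2)
show_image(v1, "v1 (imshow rescales the values automatically)")
show_image(v2, "v2 (imshow rescales the values automatically)")

# Step 5: project the centered data onto the two PCA directions.
V = np.column_stack([v1, v2])              # D x 2
Z = Xc @ V                                 # 400 x 2
print(Z[0], Z[200])                        # first line of three.txt, first line of eight.txt

# Step 6: 2D point cloud, threes in red and eights in blue.
plt.scatter(Z[:200, 0], Z[:200, 1], c="red", label="threes")
plt.scatter(Z[200:, 0], Z[200:, 1], c="blue", label="eights")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()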
2 Q-learning
Consider the following Markov Decision Process. It has two states s and two actions a: move and stay. The state transition is deterministic: “move” moves to the other state, while “stay” stays at the current state. The reward r is 0 for move and +1 for stay. There is a discount factor γ = 0.9.
[Figure: the two-state MDP; each state's “stay” self-loop is labeled +1 and the “move” transitions between the two states are labeled 0.]
The reinforcement learning agent performs Q-learning. Recall the Q table has entries Q(s, a). The Q table is initialized with all zeros. The agent starts in state s_1 = A. In any state s_t, the agent chooses the action a_t according to a behavior policy a_t = π_B(s_t). Upon experiencing the next state s_{t+1} and reward r_t, the update is

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ].
Let the step size parameter α = 0.5.
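As a quick worked example (not part of the problem statement): starting from the all-zero table, if the first action taken in state A happens to be stay, the agent receives r_1 = 1, remains in A, and the update gives Q(A, stay) ← 0 + 0.5 (1 + 0.9 · 0 − 0) = 0.5, while all other entries stay 0.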
1. Run Q-learning for 200 steps with a uniformly random behavior policy: π_B(s_t) = move or stay with probability 1/2 each, for any s_t. Show the Q table at the end. (A simulation sketch after item 4 illustrates one possible setup for items 1-3.)
2. Reset and repeat the above, but with an ε-greedy behavior policy: at each state s_t, with probability 1 − ε choose what the current Q table says is the best action, argmax_a Q(s_t, a), breaking ties arbitrarily; otherwise (with probability ε) choose uniformly between move and stay. Use the given value of ε.
3. Reset and repeat the above, but with a deterministic greedy behavior policy: at each state s_t use the best action a_t ∈ argmax_a Q(s_t, a) indicated by the current Q table. If there is a tie, prefer move.
4. Without doing simulation, use the Bellman equation to derive the true Q table induced by the MDP.
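The following Python sketch shows one possible setup for the simulations in items 1-3; it is illustrative only. States are encoded as 0 and 1 (with 0 playing the role of A), actions as 0 = move and 1 = stay, and the ε value passed to the ε-greedy policy below is a placeholder, not the assigned value.

import numpy as np

GAMMA, ALPHA, STEPS = 0.9, 0.5, 200
MOVE, STAY = 0, 1

def step(s, a):
    # Deterministic dynamics: move switches state (reward 0), stay keeps it (reward +1).
    return (1 - s, 0) if a == MOVE else (s, 1)

def run(policy, eps=None, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((2, 2))                   # Q[s, a], initialized to all zeros
    s = 0                                  # start in state A
    for _ in range(STEPS):
        a = policy(Q, s, rng, eps)
        s_next, r = step(s, a)
        # Q-learning update with step size ALPHA
        Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q

def random_policy(Q, s, rng, eps):
    return int(rng.integers(2))            # move or stay with probability 1/2 each

def eps_greedy(Q, s, rng, eps):
    if rng.random() < eps:                 # explore with probability eps
        return int(rng.integers(2))
    return int(np.argmax(Q[s]))            # greedy action (argmax breaks ties toward move)

def greedy(Q, s, rng, eps):
    return MOVE if Q[s, MOVE] >= Q[s, STAY] else STAY   # deterministic greedy; ties prefer move

print(run(random_policy))                  # item 1
print(run(eps_greedy, eps=0.1))            # item 2 (0.1 is a placeholder for the given eps)
print(run(greedy))                         # item 3

For item 4, recall that with deterministic transitions the Bellman optimality equation takes the form Q(s, a) = r(s, a) + γ max_{a'} Q(s', a'), where s' is the state reached from s under action a.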
3 Extra Credit: VC dimension [10 pts]
Let the input x ∈ X = R. Consider F = {f(x) = sgn(ax^2 + bx + c) : a, b, c ∈ R}, where sgn(z) = 1 if z ≥ 0, and 0 otherwise. What is VC(F)? Prove it.
4 Extra Credit: VC-dimension of Linear Separators
In this problem, you will prove that the VC-dimension of the class H_n of halfspaces (another term for linear threshold functions f_{w,b}(x) = sign(w · x + b)) in n dimensions is n + 1. We will use the following definition: the convex hull of a set of points S is the set of all convex combinations of points in S; this is the set of all points that can be written as ∑_{x_i ∈ S} λ_i x_i, where each λ_i ≥ 0 and ∑_i λ_i = 1. It is not hard to see that if a halfspace has all points from a set S on one side, then the entire convex hull of S must be on that side as well.
(a) [lower bound] Prove that VC-dim(H_n) ≥ n + 1 by presenting a set of n + 1 points in n-dimensional space such that one can partition that set with halfspaces in all possible ways, i.e., the set of points is shattered by H_n. (And show how one can partition the set in any desired way.)
(b) [upper bound part 1] The following is Radon’s Theorem, from the 1920s.
Theorem 1. Let S be a set of n + 2 points in n dimensions. Then S can be partitioned into two (disjoint) subsets S1 and S2 whose convex hulls intersect.
Show that Radon’s Theorem implies that the VC-dimension of halfspaces is at most n + 1. Conclude that VC-dim(H_n) = n + 1.
(c) [upper bound part 2] Now we prove Radon’s Theorem. We will need the following standard fact from linear algebra: if x_1, ..., x_{n+1} are n + 1 points in n-dimensional space, then they are linearly dependent. That is, there exist real values λ_1, ..., λ_{n+1}, not all zero, such that λ_1 x_1 + ... + λ_{n+1} x_{n+1} = 0. You may now prove Radon’s Theorem however you wish. However, as a suggested first step, prove the following: for any set of n + 2 points x_1, ..., x_{n+2} in n-dimensional space, there exist λ_1, ..., λ_{n+2}, not all zero, such that ∑_i λ_i x_i = 0 and ∑_i λ_i = 0. (This is called affine dependence.)