$29.99
100 points
Overview
In this assignment you will explore regression techniques on high-dimensional data. You will use a data set for this assignment:
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A7/A.csv
and a file stub:
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A7/FD.py
For python, you can use the following approach to load the data:
A = np.loadtxt(’A.csv’, delimiter=’,’)
1 Singular Value Decomposition (70 points)
First we will compute the SVD of the matrix A we have loaded
import numpy as np from scipy import linalg as LA U, s, Vt = LA.svd(A, fullmatrices=False)
Then take the top k components of A for values of k = 1 through k = 10 using
Uk = U[:,:k)
Sk = S[:k,:k]
Vtk = Vt[:k,:]
Ak = Uk @ Sk @ Vtk
A (40 points): Compute and report the L2 norm of the difference between A and Ak for each value of k using
LA.norm(A-Ak,2)
B (10 points): Find the smallest value k so that the L2 norm of A-Ak is less than 10% that of A; k might or might not be larger than 10.
C (20 points): Treat the matrix as 4000 points in 20 dimensions. Plot the points in 2 dimensions in the way that minimizes the sum of residuals squared, and describe briefly how you did it.
2 Frequent Directions and Random Projections (30 points)
Use the stub file FD.py to create a function for the Frequent Directions algorithm (Algorithm 16.2.1). Consider running this code on matrix A.
A (30 points): Measure the error maxkxk=1 |kAxk2 − kBxk2| as
LA.norm(A.T @ A - B.T @ B)
• How large does l need to be for the above error to be at most kAk2F /10?
• How does this compare to the theoretical bound (e.g. for k = 0).
• How large does l need to be for the above error to be at most kA − Akk2F /10 (for k = 2)?
Note: you can calculate kAk2F as LA.norm(A, ’fro’)ˆ2.
3 BONUS
A (10 points): Create another l x d matrix B, but using random projections. You can do this by creating an l x n√ matrix S, and letting B = SA. Fill each entry of S by an independent normal random variable
.
Estimate how large should l be in order to achieve maxkxk=1 |kAxk2−kBxk2| ≤ kAk2F /10. To estimate the relationship between l and the error in this randomized algorithm, you will need to run multiple trials. Be sure to describe how you used these multiple trials, and discuss how many you ran and why you thought this was enough trials to run to get a good estimate.
B (2 points) Consider any n × d matrix A. For some parameter t ≥ 1, assume that n > d > 100t2. Without knowing A, explain at most how many squared singular values of A could be larger than kAk2F /t.