1. Linear Algebra (20 pts)
Are the following statements true or false? If true, prove it; if false, show a counterexample.
(a) The inverse of a symmetric matrix is itself symmetric. (b) All 2 × 2 orthogonal matrices have the form $\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ for some angle $\theta$.
(c) Let $A$ be a positive semi-definite matrix. Then $A$ can be written as $A = CC^T$ for some matrix $C$.
2. Divide and Conquer Matrix Multiplication
If $A$ is a matrix, then $A^2 = AA$ is the square of $A$.
(a) Show that five multiplications are sufficient to compute the square of a 2 × 2 matrix $A = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix}$, where $a_1, a_2, a_3, a_4$ are scalars.
(b) Generalize the formula in part (a) to a 2 × 2 block matrix $A = \begin{pmatrix} A_1 & A_2 \\ A_3 & A_4 \end{pmatrix}$, where $A_1, A_2, A_3, A_4$ are arbitrary matrices.
(c) Instead of using the classical matrix multiplication (three-loop) algorithm for computing $A^2$, we may apply the block formula you derived in (b) to reduce $2n \times 2n$ problems to several $n \times n$ computations, which can be tackled with classical matrix multiplication. Compare the total number of arithmetic operations. Generate $2n \times 2n$ random $A$ matrices and plot the wall-clock time of the classical matrix multiplication algorithm and the algorithm using the formula in (b) to compute $A^2$ for $n = 4, \dots, 10000$ (or as large as your system memory allows). You can use standard packages for matrix multiplication, e.g., numpy.matmul.
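As a starting point, one possible timing harness in Python/numpy is sketched below. It is only an illustration of the experimental setup: square_blockwise is a placeholder that uses the naive eight block products, and should be replaced by your five-multiplication formula from part (b); the sizes, seed, and function names are choices made here, not part of the problem.

    import time
    import numpy as np

    def square_classical(A):
        # One full 2n x 2n product via classical (library) matrix multiplication.
        return np.matmul(A, A)

    def square_blockwise(A):
        # Placeholder: computes A^2 from the four n x n blocks with the naive
        # eight block products. Replace this with your formula from part (b).
        n = A.shape[0] // 2
        A1, A2, A3, A4 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
        top = np.hstack([A1 @ A1 + A2 @ A3, A1 @ A2 + A2 @ A4])
        bot = np.hstack([A3 @ A1 + A4 @ A3, A3 @ A2 + A4 @ A4])
        return np.vstack([top, bot])

    def wall_clock(fn, A, repeats=3):
        # Best-of-several wall-clock time, in seconds.
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(A)
            best = min(best, time.perf_counter() - t0)
        return best

    rng = np.random.default_rng(0)
    for n in [4, 16, 64, 256, 1024]:   # extend as far as memory allows
        A = rng.standard_normal((2 * n, 2 * n))
        print(n, wall_clock(square_classical, A), wall_clock(square_blockwise, A))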
(d) Show that if you have an algorithm for squaring an $n \times n$ matrix in $O(n^c)$ time, then you can use it to multiply any two arbitrary $n \times n$ matrices in $O(n^c)$ time. [Hint: Consider multiplying two matrices $A$ and $B$. Can you define a matrix whose square contains $AB$?]

3. Probability (30 pts)
(a) Random variables X and Y have a joint distribution p(x,y). Prove the following results. You can assume continuous distributions for simplicity.
i. $\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}_X[X \mid Y]]$
ii. $\mathbb{E}[\mathbb{I}[X \in C]] = P(X \in C)$, where $\mathbb{I}[X \in C]$ is the indicator function[1] of an arbitrary set $C$.
iii. $\mathrm{var}[X] = \mathbb{E}_Y[\mathrm{var}_X[X \mid Y]] + \mathrm{var}_Y[\mathbb{E}_X[X \mid Y]]$
iv. If $X$ and $Y$ are independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$.
v. If $X$ and $Y$ take values in $\{0,1\}$ and $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$, then $X$ and $Y$ are independent.
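As a quick numerical sanity check of the law of total variance in (iii) (an illustration only, not a proof, on a toy Gaussian model chosen here for convenience):

    import numpy as np

    rng = np.random.default_rng(0)
    m = 1_000_000

    # Toy joint distribution: Y ~ N(0, 1) and X | Y ~ N(Y, 2).
    y = rng.normal(0.0, 1.0, size=m)
    x = rng.normal(y, np.sqrt(2.0))

    # var[X] should match E_Y[var(X|Y)] + var_Y(E[X|Y]) = 2 + var(Y) = 3.
    print(x.var(), 2.0 + y.var())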
(b) Show that the approximate randomized counting algorithm described in Lemma 1 of Lecture 2 slides (page 14) is unbiased:
$\mathbb{E}[\tilde{n}] = n. \qquad (1)$
(c) Prove the variance formula in Lemma 2 of the Lecture 2 slides (page 38) for approximate matrix multiplication $AB \approx CR$:
$$\mathbb{E}\left[\|AB - CR\|_F^2\right] = \sum_{k=1}^{d} \frac{\|A^{(k)}\|_2^2\,\|B_{(k)}\|_2^2}{c\,p_k} - \frac{1}{c}\|AB\|_F^2, \qquad (2)$$
where $A^{(k)}$ is the $k$-th column of $A$, $B_{(k)}$ is the $k$-th row of $B$, $c$ is the number of samples, and $\{p_k\}_{k=1}^{d}$ are the sampling probabilities.
4. Positive (Semi-)Definite Matrices
Let $A$ be a real, symmetric $d \times d$ matrix. We say $A$ is positive semi-definite (PSD) if, for all $x \in \mathbb{R}^d$, $x^T A x \ge 0$. We say $A$ is positive definite (PD) if, for all $x \neq 0$, $x^T A x > 0$. We write $A \succeq 0$ when $A$ is PSD, and $A \succ 0$ when $A$ is PD.
The spectral theorem says that every real symmetric matrix $A$ can be expressed as $A = U \Lambda U^T$, where $U$ is a $d \times d$ matrix such that $U^T U = U U^T = I$ (called an orthogonal matrix), and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$. Multiplying on the right by $U$, we see that $AU = U\Lambda$. If we let $u_i$ denote the $i$-th column of $U$, we have $Au_i = \lambda_i u_i$ for each $i$. This expression reveals that the $\lambda_i$ are eigenvalues of $A$, and the corresponding columns $u_i$ are eigenvectors associated with $\lambda_i$.
(a) $A$ is PSD iff $\lambda_i \ge 0$ for each $i$.
(b) $A$ is PD iff $\lambda_i > 0$ for each $i$. Hint: Use the representation
$$U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T.$$
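If it helps to see the hinted representation concretely, the short numpy check below (an illustration added here, not required by the problem) verifies on a random symmetric matrix that the eigendecomposition satisfies this outer-product form with an orthogonal $U$.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5

    # Random real symmetric matrix.
    B = rng.standard_normal((d, d))
    A = (B + B.T) / 2

    # eigh returns real eigenvalues and an orthogonal matrix of eigenvectors.
    lam, U = np.linalg.eigh(A)

    recon = sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(d))
    print(np.allclose(A, recon))             # A equals sum_i lam_i u_i u_i^T
    print(np.allclose(U.T @ U, np.eye(d)))   # U^T U = I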
5. Norms (20 pts)
(a) For $p = 1, 2, \infty$, verify that the functions $\|\cdot\|_p$ are norms. Then, for a vector $x \in \mathbb{R}^n$, show that
$$\|x\|_\infty \le \|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2 \le n\,\|x\|_\infty$$
and, for each inequality, provide an example demonstrating that the inequality can be tight.
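A one-vector numerical check of the chain of inequalities (an illustration only; it does not replace the proof or the tightness examples):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    x = rng.standard_normal(n)

    inf_n = np.linalg.norm(x, np.inf)
    two_n = np.linalg.norm(x, 2)
    one_n = np.linalg.norm(x, 1)

    # Check ||x||_inf <= ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2 <= n ||x||_inf.
    print(inf_n <= two_n <= one_n <= np.sqrt(n) * two_n <= n * inf_n)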
(b) For vectors $x, y \in \mathbb{R}^n$, show that $|x^T y| \le \|x\|_2 \|y\|_2$, with equality if and only if $x$ and $y$ are linearly dependent. More generally, show that $x^T y \le \|x\|_1 \|y\|_\infty$. Note that these inequalities are special cases of Hölder's inequality.
(c) For $A \in \mathbb{R}^{m \times n}$, show that $\mathrm{Trace}(A^T A) = \sum_{ij} A_{ij}^2$, and show that $\sqrt{\sum_{ij} A_{ij}^2}$ is a norm on $m \times n$ matrices. This is the Frobenius norm, denoted $\|\cdot\|_F$. Show that, in addition to satisfying the defining properties of a norm, the Frobenius norm is submultiplicative, in that
$$\|AB\|_F \le \|A\|_F \|B\|_F$$
whenever the dimensions are such that the product $AB$ is defined.
(d) Recall the definition of the spectral norm of an $m \times n$ matrix $A$: $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)} = \sigma_{\max}(A)$, where $\lambda_{\max}(A^T A)$ is the largest eigenvalue of $A^T A$ and $\sigma_{\max}(A)$ is the largest singular value of $A$. Show that the Frobenius norm and the spectral norm are unitarily invariant: if $U$ and $V$ are unitary (orthogonal in the real case) matrices, then $\|U^T A V\|_\xi = \|A\|_\xi$ for $\xi = 2, F$.
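The following snippet is a quick numerical illustration (added here as a sketch, using np.linalg.norm with ord=2 and ord='fro' for the spectral and Frobenius norms) of the trace identity and submultiplicativity in (c) and the unitary invariance claimed in (d), with random orthogonal matrices obtained from QR factorizations.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 6, 4
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, 3))

    # (c) Trace identity and submultiplicativity of the Frobenius norm.
    print(np.isclose(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, "fro")))
    print(np.linalg.norm(A @ B, "fro")
          <= np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro"))

    # (d) Unitary invariance: random orthogonal U (m x m) and V (n x n) via QR.
    U, _ = np.linalg.qr(rng.standard_normal((m, m)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    for xi in (2, "fro"):
        print(np.isclose(np.linalg.norm(U.T @ A @ V, xi), np.linalg.norm(A, xi)))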
6. Approximate Matrix Multiplication (20 pts)
Here, we will consider the empirical performance of random sampling and random projection algorithms for approximating the product of two matrices. You may use Matlab, C, R, or any other software package you prefer for your implementations. Please describe what you used in sufficient detail that someone else could reproduce your results. Let $A$ be an $n \times d$ matrix with $n \gg d$, and consider approximating the product $A^T A$. First, generate the matrices $A$ from the three classes of distributions introduced below.
• Generate a matrix $A$ from the multivariate normal $N(1_d, \Sigma)$, where the $(i,j)$-th element of $\Sigma$ is $\Sigma_{ij} = 2 \times 0.5^{|i-j|}$. (Refer to this as the GA data.)
• Generate a matrix $A$ from the multivariate $t$-distribution with 3 degrees of freedom and covariance matrix $\Sigma$ as before. (Refer to this as the T3 data.)
• Generate a matrix $A$ from the multivariate $t$-distribution with 1 degree of freedom and covariance matrix $\Sigma$ as before. (Refer to this as the T1 data.)
To start, consider matrices of size $n \times d = 500 \times 50$. (So, you should have three matrices, one matrix $A$ generated in each of the above ways.)
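One possible way to generate the three matrices in Python/numpy is sketched below. It treats $\Sigma$ as the scale (shape) matrix of the $t$-distributions and reuses the location $1_d$ for all three data sets; both are assumptions made here, so adjust them if the intended setup differs (note that for 1 degree of freedom the covariance does not actually exist).

    import numpy as np

    def ar1_cov(d, rho=0.5, scale=2.0):
        # Sigma_ij = 2 * 0.5**|i-j|, as specified above.
        idx = np.arange(d)
        return scale * rho ** np.abs(idx[:, None] - idx[None, :])

    def generate(n, d, dist, rng):
        Sigma = ar1_cov(d)
        mean = np.ones(d)               # location 1_d (assumed for the t data too)
        if dist == "GA":
            return rng.multivariate_normal(mean, Sigma, size=n)
        # Multivariate t via the normal / chi-square mixture representation:
        # X = mean + Z / sqrt(W / df), with Z ~ N(0, Sigma) and W ~ chi^2_df.
        df = {"T3": 3, "T1": 1}[dist]
        Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        W = rng.chisquare(df, size=n)
        return mean + Z / np.sqrt(W / df)[:, None]

    rng = np.random.default_rng(0)
    n, d = 500, 50
    A_ga, A_t3, A_t1 = (generate(n, d, name, rng) for name in ("GA", "T3", "T1"))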
(a) For each matrix, approximate the product $A^T A$ with the random sampling algorithm we discussed in class, i.e., by sampling with respect to a probability distribution that depends on the norm squared of the rows of the input matrix. Plot the probability distribution. Does it look uniform or nonuniform? Plot the performance of the spectral and Frobenius norm error as a function of the number of samples.
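A minimal sketch of a norm-squared row-sampling estimator of $A^T A$ follows (relative errors are reported; the function and variable names are choices made here, and you should check that the rescaling matches the version from lecture):

    import numpy as np

    def sample_ata(A, c, probs=None, rng=None):
        # Approximate A^T A by sampling c rows (with replacement) and rescaling.
        rng = rng if rng is not None else np.random.default_rng()
        if probs is None:
            # Norm-squared probabilities p_i = ||A_i||^2 / ||A||_F^2.
            probs = (A ** 2).sum(axis=1)
            probs = probs / probs.sum()
        idx = rng.choice(A.shape[0], size=c, p=probs)
        C = A[idx] / np.sqrt(c * probs[idx])[:, None]
        return C.T @ C, probs

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 50))   # stand-in for the GA / T3 / T1 matrices
    exact = A.T @ A
    for c in (10, 50, 100, 250, 500):
        approx, probs = sample_ata(A, c, rng=rng)
        err_2 = np.linalg.norm(exact - approx, 2) / np.linalg.norm(exact, 2)
        err_f = np.linalg.norm(exact - approx, "fro") / np.linalg.norm(exact, "fro")
        print(c, err_2, err_f)
    # Plot `probs` (e.g. with matplotlib) to judge how far it is from uniform.

For part (b), the same function can be reused with probs set to the uniform distribution, e.g. np.full(A.shape[0], 1 / A.shape[0]).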
(b) For each matrix, approximate the product $A^T A$ with the random sampling algorithm we discussed in class, except that the uniform distribution, rather than the norm-squared distribution, should be used to construct the random sample. Plot the performance of the spectral and Frobenius norm error as a function of the number of samples. For which matrices are the results similar, and for which are they different, compared to when the norm-squared distribution is used?
(c) Now you will apply the matrix approximation technique to the MNIST dataset for handwritten digit classification. Details about the MNIST dataset can be found at http://yann.lecun.com/exdb/mnist/. We provide the dataset as a .mat file so that you can easily import it into Matlab using load('mnist matrix.mat'). To import the dataset in Python you can use:
import scipy.io
data = scipy.io.loadmat('mnist matrix.mat')
In the .mat file you will find one matrix, $A \in \mathbb{R}^{60000 \times 784}$. For this matrix, approximate the product $A^T A$ with the random sampling algorithm we discussed in class, i.e., by sampling with respect to a probability distribution that depends on the norm squared of the rows of the input matrix. Plot the probability distribution. Does it look uniform or nonuniform? Plot the performance of the spectral and Frobenius norm error as a function of the number of samples.
[1] $\mathbb{I}[X \in C] := 1$ if $X \in C$ and $0$ otherwise.