$25
1. Phase transition in PCA “spike” model: Consider a finite sample of n i.i.d vectors x1,x2,...,xn drawn from the p-dimensional Gaussian distribution N(0,σ2Ip×p + λ0uuT ), where λ0/σ2 is the signal-to-noise ratio (SNR) and u ∈ Rp. In class we showed that the largest eigenvalue λ of the sample covariance matrix Sn
pops outside the support of the Marcenko-Pastur distribution if
or equivalently, if
SNR .
√ √ 2, that is, λ0 can be “buried” well inside the support Marcenko(Notice that γ < (1 + γ)
Pastur distribution and still the largest eigenvalue pops outside its support). All the following questions refer to the limit n → ∞ and to almost surely values:
√
(a) Find λ given SNR > γ.
(b) Use your previous answer to explain how the SNR can be estimated from the eigenvaluesof the sample covariance matrix.
(c) Find the squared correlation between the eigenvector v of the sample covariance matrix (corresponding to the largest eigenvalue λ) and the “true” signal component u, as a function of the SNR, p and n. That is, find |hu,vi|2.
(d) Confirm your result using MATLAB, Python, or R simulations (e.g. set u = e; and choose σ = 1 and λ0 in different levels. Compute the largest eigenvalue and its associated eigenvector, with a comparison to the true ones.)
1
Homework 2. Random Matrix Theory and PCA 2
2. Exploring S&P500 Stock Prices: Take the Standard & Poor’s 500 data:
https://github.com/yao-lab/yao-lab.github.io/blob/master/data/snp452-data.mat which contains the data matrix X ∈ Rp×n of n = 1258 consecutive observation days and p = 452 daily closing stock prices, and the cell variable “stock” collects the names, codes, and the affiliated industrial sectors of the 452 stocks. Use Matlab, Python, or R for the following exploration.
(a) Take the logarithmic prices Y = logX;
(b) For each observation time t ∈ {1,...,1257}, calculate logarithmic price jumps
∆Yi,t = Yi,t − Yi,t−1, i ∈ {1,...,452};
(c) Construct the realized covariance matrix Σˆ ∈ R452×452 by,
1257
;
τ=1
(d) Compute the eigenvalues (and eigenvectors) of Σ and store them in a descending orderˆ by {λˆk,k = 1,...,p}.
(e) Horn’s Parallel Analysis: the following procedure describes a so-called Parallel Analysis of PCA using random permutations on data. Given the matrix [∆Yi,t], apply random permutations πi : {1,...,t} → {1,...,t} on each of its rows: ∆Y˜i,πi(j) such that
∆Y1,1 ∆Y
2,π2(1)
[∆Y˜π(i),t] = ∆Y3,π3(1)
...
∆Yn,πn(1)
∆Y1,2
∆Y2,π2(2)
∆Y3,π3(2) ...
∆Yn,πn(2)
∆Y1,3
∆Y2,π2(3)
∆Y3,π3(3) ...
∆Yn,πn(3)
...
...
... ... ...
∆Y1,t
∆Y2,π2(t)
∆Y3,π3(t) .
...
∆Yn,πn(t)
Define as the null covariance matrix. Repeat this for R times and compute the eigenvalues of Σ˜r for each 1 ≤ r ≤ R. Evaluate the p-value for each estimated eigenvalue λˆk by (Nk+1)/(R+1) where Nk is the counts that λˆk is less than the k-th largest eigenvalue of Σ˜r over 1 ≤ r ≤ R. Eigenvalues with small p-values indicate that they are less likely arising from the spectrum of a randomly permuted matrix and thus considered to be signal. Draw your own conclusion with your observations and analysis on this data. A reference is: Buja and Eyuboglu, ”Remarks on Parallel Analysis”, Multivariate Behavioral Research, 27(4): 509-540, 1992.
3. *Finite rank perturbations of random symmetric matrices: Wigner’s semi-circle law (proved by Eugene Wigner in 1951) concerns the limiting distribution of the eigenvalues of random symmetric matrices. It states, for example, that the limiting eigenvalue distribution of n × n symmetric matrices whose entries wij on and above the diagonal (i ≤ j) are i.i.d Gaussians
) (and the entries below the diagonal are determined by symmetrization, i.e., wji = wij) is the semi-circle:
,
where the distribution is supported in the interval [−1,1].
Homework 2. Random Matrix Theory and PCA 3
(a) Confirm Wigner’s semi-circle law using MATLAB, Python, or R simulations (take, e.g.,n = 400).
(b) Find the largest eigenvalue of a rank-1 perturbation of a Wigner matrix. That is, findthe largest eigenvalue of the matrix
W + λ0uuT ,
where W is an n × n random symmetric matrix as above, and u is some deterministic unit-norm vector. Determine the value of λ0 for which a phase transition occurs. What is the correlation between the top eigenvector of W + λ0uuT and the vector u as a function of λ0? Use techniques similar to the ones we used in class for analyzing finite rank perturbations of sample covariance matrices.
[Some Hints about homework] For Wigner Matrix ), the answer is