
MATH5473 Homework 5: SDP Extensions of PCA and MDS

1.    RPCA: Construct a random rank-r matrix: let A ∈ R^{m×n} with a_ij ∼ N(0,1), whose top-r singular values/vectors are λ_i, u_i ∈ R^m, and v_i ∈ R^n (i = 1,...,r); define L = Σ_{i=1}^r λ_i u_i v_i^T. Construct a sparse matrix E with a fraction p (p ∈ [0,1]) of nonzero entries whose locations are distributed uniformly at random. Then define

M = L + E.
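A minimal data-construction sketch in Python/NumPy (the problem statement allows either Matlab or Python). The distribution of the nonzero values of E is not fixed above, so drawing them from N(0,1) is an assumption here:

```python
# Minimal sketch: construct L (rank r), sparse E, and M = L + E.
import numpy as np

def make_rpca_data(m=20, n=20, r=1, p=0.1, rng=None):
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((m, n))              # a_ij ~ N(0,1)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]    # L = sum_i lambda_i u_i v_i^T
    # Sparse E: a fraction p of entries are nonzero, locations uniform at random;
    # the nonzero values are N(0,1) here, which is an assumption since the
    # problem only specifies the locations.
    mask = rng.random((m, n)) < p
    E = mask * rng.standard_normal((m, n))
    return L, E, L + E

L, E, M = make_rpca_data()
```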

(a)     Set m = n = 20, r = 1, and p = 0.1, and use the Matlab toolbox CVX to formulate a semidefinite program for Robust PCA of M:

                min_{L,S,W1,W2}   (1/2)(trace(W1) + trace(W2)) + λ‖S‖1                                        (1)
                s.t.   [W1, L; L^T, W2] ⪰ 0,
                       L + S = M,

where you can use the Matlab implementation in the lecture notes as a reference (a Python sketch is also given after part (d) below);

(b)    Choose different parameters p ∈ [0,1] to explore the probability of successful recovery;

(c)     Increase r to explore the probability of successful recovery;

(d)    ? Increasing m and n to values beyond 50 will make CVX difficult to solve. In this case, use the Augmented Lagrange Multiplier method, e.g. as in E. J. Candès, X. Li, Y. Ma, and J. Wright (2011), “Robust Principal Component Analysis?”, Journal of the ACM, 58(3), 1-37. Write the code yourself (just a few lines of Matlab or Python; see the sketch after this item) and test it for m = n = 1000. A convergence criterion often used is ‖M − L_k − S_k‖_F / ‖M‖_F ≤ ε (ε = 10⁻⁶, for example).
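For part (a), a minimal sketch of the SDP (1), written in Python with the cvxpy package standing in for Matlab CVX (an assumption; the Matlab implementation in the lecture notes remains the reference):

```python
# Sketch of the RPCA SDP (1); cvxpy is assumed here in place of Matlab CVX.
import cvxpy as cp
import numpy as np

def rpca_sdp(M, lam=None):
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))  # usual default
    Z = cp.Variable((m + n, m + n), PSD=True)   # Z = [W1, L; L^T, W2] >= 0
    S = cp.Variable((m, n))
    W1, W2, L = Z[:m, :m], Z[m:, m:], Z[:m, m:]
    obj = 0.5 * (cp.trace(W1) + cp.trace(W2)) + lam * cp.sum(cp.abs(S))
    cp.Problem(cp.Minimize(obj), [L + S == M]).solve()
    return L.value, S.value
```

For part (d), a sketch of an inexact Augmented Lagrange Multiplier iteration in the spirit of the reference above; the step parameters (mu, rho, tolerance) are illustrative assumptions:

```python
# Sketch of an inexact ALM iteration for RPCA; step parameters are assumptions.
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    """Singular value thresholding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca_alm(M, lam=None, tol=1e-6, max_iter=500):
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_M = np.linalg.norm(M, 'fro')
    mu, rho = 1.25 / np.linalg.norm(M, 2), 1.5   # common initialization (assumed)
    S, Y = np.zeros_like(M), np.zeros_like(M)
    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)   # nuclear-norm proximal step
        S = shrink(M - L + Y / mu, lam / mu)       # l1 proximal step
        R = M - L - S
        Y += mu * R
        mu *= rho
        if np.linalg.norm(R, 'fro') / norm_M < tol:  # stopping rule from (d)
            break
    return L, S
```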

2.    SPCA: Define three hidden factors:

V1 ∼ N(0, 290),   V2 ∼ N(0, 300),   V3 = −0.3 V1 + 0.925 V2 + ε,   ε ∼ N(0, 1),

where V1, V2, and ε are independent. Construct 10 observed variables as follows:

X_i = V_j + ε_i^j,   ε_i^j ∼ N(0, 1),

with j = 1 for i = 1,...,4, j = 2 for i = 5,...,8, and j = 3 for i = 9,10, and ε_i^j independent for j = 1,2,3, i = 1,...,10.

The first two principal components should be concentrated on (X1,X2,X3,X4) and (X5,X6,X7,X8), respectively. This is an example given by H. Zou, T. Hastie, and R. Tibshirani, Sparse principal component analysis, J. Comput. Graphical Statist., 15 (2006), pp. 265-286.
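A minimal data-generation sketch in Python/NumPy for the setup above; n = 1000 samples follow the suggestion in part (a) below:

```python
# Sketch: generate the ten observed variables and the sample covariance matrix.
import numpy as np

def spca_data(n=1000, rng=None):
    rng = np.random.default_rng(rng)
    V1 = rng.normal(0.0, np.sqrt(290.0), n)            # V1 ~ N(0, 290)
    V2 = rng.normal(0.0, np.sqrt(300.0), n)            # V2 ~ N(0, 300)
    V3 = -0.3 * V1 + 0.925 * V2 + rng.normal(0.0, 1.0, n)
    factors = [V1] * 4 + [V2] * 4 + [V3] * 2           # j = 1 (i=1..4), 2 (i=5..8), 3 (i=9,10)
    # X_i = V_j + eps_i^j with independent eps_i^j ~ N(0,1)
    return np.column_stack([v + rng.normal(0.0, 1.0, n) for v in factors])

X = spca_data()
Sigma_hat = np.cov(X, rowvar=False)                    # 10 x 10 sample covariance
```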


(a)    Compute the true covariance matrix Σ (and the sample covariance matrix with n examples, say n = 1000);

(b)    Compute the top 4 principal components of Σ using eigenvector decomposition (by Matlab or R);

(c)     Use the Matlab CVX toolbox to compute the first sparse principal component by solving the SDP problem (a Python sketch is also given after part (f) below)

                  max    trace(ΣX) − λ‖X‖1
                  s.t.    trace(X) = 1,
                          X ⪰ 0.

Choose λ = 0 and other positive numbers to compare your results with normal PCA;

(d)    Remove the first sparse PCA from Σ and compute the second sparse PCA with the same code;

(e)     Again compute the 3rd and the 4th sparse PCA of Σ and compare them against the normal PCAs.

(f)      ? Construct an example with 200 observed variables which is hard to deal with by CVX. In this case, try the Augmented Lagrange Multiplier method by Allen Yang et al. (UC Berkeley), whose Matlab codes can be found at http://www.eecs.berkeley.edu/~yang/software/SPCA/SPCA_ALM.zip, or the Python scikit-learn Sparse PCA package.
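For parts (c)-(e), a minimal sketch of the SDP above in Python, with cvxpy assumed in place of Matlab CVX; the "remove the component" step in (d) is shown as Hotelling's deflation, which is one common choice rather than a prescribed one:

```python
# Sketch of the sparse PCA SDP in (c); cvxpy is assumed in place of Matlab CVX.
import cvxpy as cp
import numpy as np

def sparse_pc(Sigma, lam):
    p = Sigma.shape[0]
    X = cp.Variable((p, p), PSD=True)
    obj = cp.Maximize(cp.trace(Sigma @ X) - lam * cp.sum(cp.abs(X)))
    cp.Problem(obj, [cp.trace(X) == 1]).solve()
    w, V = np.linalg.eigh(X.value)        # extract a unit vector from the
    return V[:, -1]                       # (near rank-one) optimal X

def hotelling_deflate(Sigma, x):
    """One possible way to 'remove' a computed component for (d)."""
    return Sigma - (x @ Sigma @ x) * np.outer(x, x)

# Usage sketch (Sigma_hat from the data snippet above; lam values are illustrative):
# x1 = sparse_pc(Sigma_hat, lam=10.0)
# x2 = sparse_pc(hotelling_deflate(Sigma_hat, x1), lam=10.0)
```

With λ = 0 the optimal X concentrates on the leading eigenvector of Σ, which gives the comparison with normal PCA requested in (c).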

3.    ? Protein Folding: Consider the 3D structure reconstruction based on incomplete MDS with uncertainty. Data file: http://yao-lab.github.io/data/protein3D.zip

 

Figure 1: 3D graphs of file PF00018_2HDA.pdb (YES HUMAN/97-144, PDB 2HDA)

In the file, you will find 3D coordinates for the following three protein families:

PF00013 (PCBP1 HUMAN/281-343, PDB 1WVN),


PF00018 (YES HUMAN/97-144, PDB 2HDA), and PF00254 (O45418 CAEEL/24-118, PDB 1R9H).

For example, the file PF00018_2HDA.pdb contains the 3D coordinates of the alpha-carbons for a particular amino acid sequence in the family, YES HUMAN/97-144, which reads

VALYDYEARTTEDLSFKKGERFQIINNTEGDWWEARSIATGKNGYIPS

where the first line in the file is

97 V 0.967 18.470 4.342

Here

•    ‘97’: start position 97 in the sequence

•    ‘V’: first character in the sequence

•    [x, y, z]: 3D coordinates, in units of Å.

Figure 1 gives a 3D representation of its structure.

Given the 3D coordinates of the amino acids in the sequence, one can compute the pairwise distances between amino acids, [d_ij]_{l×l}, where l is the sequence length. A contact map is defined to be a graph Gθ = (V,E) consisting of l vertices, one per amino acid, with an edge (i,j) ∈ E if d_ij ≤ θ, where the threshold is typically θ = 5Å or 8Å here.
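A minimal sketch in Python/NumPy for the distances and the contact map; the per-line file layout (position, residue, x, y, z) is assumed from the sample line shown above, and Gaussian noise on the observed distances is included as one possible noise model for the question that follows:

```python
# Sketch: pairwise distances, contact map at threshold theta, and noisy
# distances observed only on the contact-map edges. The file layout
# (position, residue, x, y, z per line) is an assumption from the sample above.
import numpy as np

def read_coords(path):
    rows = [line.split() for line in open(path) if line.strip()]
    return np.array([[float(x), float(y), float(z)] for _, _, x, y, z in rows])

def contact_map(P, theta=8.0):
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)   # l x l distances
    mask = (D <= theta) & ~np.eye(len(P), dtype=bool)            # edges of G_theta
    return D, mask

def noisy_distances(D, mask, sigma=0.1, rng=None):
    """Gaussian noise on the observed distances (one possible model)."""
    rng = np.random.default_rng(rng)
    N = rng.normal(0.0, sigma, D.shape)
    N = (N + N.T) / 2                                            # keep symmetry
    return np.where(mask, D + N, np.nan)                         # unobserved -> NaN

# Usage sketch (the file name is an assumption based on the description above):
# P = read_coords("PF00018_2HDA.pdb")
# D, mask = contact_map(P, theta=8.0)
# D_obs = noisy_distances(D, mask)
```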

Can you recover the 3D structure of such proteins, up to a Euclidean transformation (rotation and translation), given noisy pairwise distances restricted to the contact map graph Gθ, i.e., given noisy pairwise distances between vertex pairs whose true distances are no more than θ? Design a noise model (e.g. Gaussian or uniformly bounded) for your experiments.

When θ = ∞ and there is no noise, classical MDS will work; but for a finite θ with noisy measurements, an SDP approach can be useful. You may try the Matlab package SNLSDP by Kim-Chuan Toh, Pratik Biswas, and Yinyu Ye, or the facial-reduction speedup by Nathan Krislock and Henry Wolkowicz.
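For reference, a minimal sketch of classical MDS on the full distance matrix (the θ = ∞, noiseless case mentioned above), together with an orthogonal Procrustes alignment for judging recovery up to a rigid motion; the SDP route (SNLSDP or facial reduction) for the incomplete, noisy case is not reproduced here:

```python
# Sketch: classical MDS from a complete distance matrix, plus Procrustes RMSD
# to compare the reconstruction with the true coordinates.
import numpy as np

def classical_mds(D, dim=3):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]              # top-dim eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def procrustes_rmsd(P, Q):
    """RMSD after the best orthogonal alignment and translation of Q onto P."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)
    R = U @ Vt                                   # orthogonal Procrustes solution
    return np.sqrt(np.mean(np.sum((Qc @ R - Pc) ** 2, axis=1)))

# Usage sketch, reusing P and D from the contact-map snippet above:
# X = classical_mds(D)                  # reconstructed 3D coordinates
# print(procrustes_rmsd(P, X))          # near zero in the noiseless, theta = infinity case
```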
