$25
The purpose of this homework assignment is the practical validation of applying The Johnson-Lindenstrauss
Lemma. The lemma formally says that, given vectors x1,...,xn ∈ Rm, for some n,m ∈ N, and given and, then there exists a projection J from Rm to Rk that approximately preserves the
distances, that is,
, (1)
for all 1 ≤ i,j ≤ n. As stated in page 155 of the book, a key point given in the proof of the lemma is that a random projection will do for such a purpose, and therefore, there must be many such k × m projections.
Given the values of, you will first determine the value of k to be used and then generate 10 different realizations of the random projection J to be used, as indicated in page 153 of the book. Formally, every column of J will be independently generated from a 0 mean normal distribution with σ2 = 1/k. For each realization of J, you will write a code that determines whether or not condition (1) holds, for all pairs of independent vectors xi and xj, 1 ≤ i,j ≤ n. Note that the cases when i = j should not be checked. Here, you should first determine the number of all possible such pairs (let us denote it by N) and then the number of cases when (1) holds (let us denote it by Ntrue). Denoting by vi the percentage of cases that condition (1) holds for the ith realization of J, you will have
.
Then, vav will denote the average of these percentages over all realizations of J, that is,
.
So, given the values for m,n,, and the vectors xi, 1 ≤ i ≤ n, you now have a number vav. What we want to see is the accuracy of the preservation of distances over different values of. Therefore, you will find the values of vav for the cases when and. The value of n is dependent on m and you will have three such values for it, namely, n = m,m/10,m/100. The tasks you should complete are listed below. First, note that for each n and m, you will first generate the vectors xi ∈Rm, for 1 ≤ i ≤ m. You will do this by independently generating n realizations of length m vectors with i.i.d. zero-mean unit-variance normal random variables.
For each m ∈ {104,5 × 104,105}, you will plot the values of vav with respect to the values of
{0.1,0.3,0.7,0.9}. There will be three lines in the plot, each corresponding to the three different values of n as given above. Note that you will end up with a total of three figures, each containing three lines.
For each, you will plot the values of vav with respect to the values of m ∈
{104,5 × 104,105}. There will be three lines in the plot, each corresponding to the three values of n as given above. Note that you will end up with a total of four figures, each containing three lines.
Report your plots (note that there will be a total of 7 such plots, each of which contains three different lines corresponding to the values of n) and comment on the results. What trend do you see and why? How does the dimension affect the accuracy? How does the sample size and affect it?