CS613 - Assignment 3: Dimensionality Reduction & Clustering

Introduction
In this assignment you’ll work on visualizing data, reducing its dimensionality and clustering it.

You may not use any functions from a machine learning library in your code; however, you may use statistical functions. For example, if available, you may NOT use functions like:

pca
k-nearest neighbors functions
unless explicitly told to do so. But you MAY use basic statistical functions like:

std
mean
cov
eig
Table 1: Grading Rubric

Datasets
Labeled Faces in the Wild Dataset. This dataset consists of face images of celebrities downloaded from the Internet in the early 2000s. We use the grayscale version from sklearn.datasets.

We will download the images in a specific way, as shown below. You will have 3,023 images, each 87x65 pixels, belonging to 62 different people.

from sklearn.datasets import fetch_lfw_people
import matplotlib.pyplot as plt
import matplotlib.cm as cm

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)
image_shape = people.images[0].shape

fig, axes = plt.subplots(2, 5, figsize=(15, 8),
                         subplot_kw={'xticks': (), 'yticks': ()})

for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image, cmap=cm.gray)
    ax.set_title(people.target_names[target])

1             Theory Questions
Consider the following data:
−2

−  5

−3

  0  −8

 −2



  1



 5 −1



6
1 

−4

1 

3 

11

5 

0  −1

−3

1
Find the principal components of the data (you must show the math, including how you compute the eigenvectors and eigenvalues). Make sure you standardize the data first and that your principal components are normalized to be unit length. As for the amount of detail needed in your work, imagine that you were working on paper with a basic calculator. Show me whatever you would be writing on that paper. (5pts)
Project the data onto the principal component corresponding to the largest eigenvalue found in the previous part (3pts).
Consider the following data:
−2



 5

Class 1 = −3



0



−8
1 

−4

1  , Class 2 =

3 

11
5 

0 

−1

−3

1
Compute the information gain for each feature. You could standardize the data overall, although it won't make a difference (5pts).
Which feature is more discriminating based on the results in part (a)? (1pt)
Using LDA, find the direction of projection (you must show the math; however, for this one you don't have to show the computation for finding the eigenvalues and eigenvectors). Normalize this vector to be unit length (5pts).
Project the data onto the direction of projection found in the previous part (3pts).
Does the projection you performed in the previous part seem to provide good class separation? Why or why not? (1pt)
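Although the theory answers must be worked out by hand, a few lines of numpy can be used to check the arithmetic. The following is only a minimal sketch under assumptions: it uses the 10x2 data matrix and the Class 1 / Class 2 split as reconstructed above, the variable names (X, Z, w1, w) are not part of the assignment, and whether to standardize before LDA is left to whatever convention your write-up uses (raw data is used here).

import numpy as np

# data matrix from question 1 (rows are observations)
X = np.array([[-2, 1], [-5, -4], [-3, 1], [0, 3], [-8, 11],
              [-2, 5], [1, 0], [5, -1], [-1, -3], [6, 1]], dtype=float)

# --- question 1: PCA check ---
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # standardize first
vals, vecs = np.linalg.eig(np.cov(Z, rowvar=False))   # eigenvalues/eigenvectors of the covariance
w1 = vecs[:, np.argmax(vals)]                         # unit-length principal component
print("PC1:", w1, "projection:", Z @ w1)

# --- question 3: LDA direction check (classes split as above, unstandardized) ---
C1, C2 = X[:5], X[5:]
Sw = (len(C1) - 1) * np.cov(C1, rowvar=False) + (len(C2) - 1) * np.cov(C2, rowvar=False)
w = np.linalg.solve(Sw, C1.mean(axis=0) - C2.mean(axis=0))
w = w / np.linalg.norm(w)                             # normalize to unit length
print("LDA direction:", w, "projections:", C1 @ w, C2 @ w)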
2             Dimensionality Reduction via PCA
Import the data as shown above. This is the Labeled Faces in the Wild dataset. Verify that you have the correct number of people and classes:

print("people.images.shape: {}".format(people.images.shape))
print("Number of classes: {}".format(len(people.target_names)))

people.images.shape: (3023, 87, 65)
Number of classes: 62

This dataset is skewed toward George W. Bush and Colin Powell, as you can verify here:

import numpy as np

# count how often each target appears
counts = np.bincount(people.target)

# print counts next to target names
for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end='   ')
    if (i + 1) % 3 == 0:
        print()

To make the data less skewed, we will only take up to 50 images of each person (otherwise, the feature extraction would be overwhelmed by the likelihood of George W. Bush):

mask = np.zeros(people.target.shape, dtype=bool)

for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1

X_people = people.data[mask]
y_people = people.target[mask]

# scale the grayscale values to be between 0 and 1
# instead of 0 and 255 for better numeric stability
X_people = X_people / 255.

We are now going to compute how well a KNN classifier does using just the pixels alone.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_people, y_people, stratify=y_people, random_state=0)

# build a KNeighborsClassifier using one neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score of 1-nn: {:.2f}".format(knn.score(X_test, y_test)))

You should have an accuracy around 23% - 27%.

Once you have your setup complete, write a script to do the following:

Write your own version of KNN (k=1) where you use the SSD (sum of squared differences) to compute similarity (see the sketch after this list)
Verify that your KNN has a similar accuracy to sklearn's version
Standardize your data (zero mean, divide by standard deviation)
Reduce the data to 100D using PCA
Compute the KNN again where k=1 with the 100D data. Report the accuracy
Compute the KNN again where k=1 with the 100D whitened data. Report the accuracy
Reduce the data to 2D using PCA
Graph the data for visualization
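For the first two items, a minimal 1-NN sketch using SSD is shown below. It assumes the X_train, y_train, X_test, y_test arrays produced by the split earlier; the loop-plus-vectorized-SSD layout is just one reasonable way to organize it.

import numpy as np

def knn1_predict(X_train, y_train, X_test):
    """1-NN prediction using the sum of squared differences (SSD) as the distance."""
    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        ssd = np.sum((X_train - x) ** 2, axis=1)  # SSD from x to every training image
        preds[i] = y_train[np.argmin(ssd)]        # take the label of the closest one
    return preds

acc = np.mean(knn1_predict(X_train, y_train, X_test) == y_test)
print("Test set score of my 1-nn: {:.2f}".format(acc))

On the raw pixel data this should land in roughly the same 23%-27% range as the sklearn classifier above.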
Recall that although you may not use any package ML functions like pca, you may use statistical functions like eig.
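In that spirit, the sketch below shows one way to get the 100D (plain and whitened) and 2D projections using only an eigendecomposition. Assumptions not in the assignment: the standardized train/test data are called Xs_train and Xs_test, eigenvectors are sorted by decreasing eigenvalue, and whitening divides each projected coordinate by the square root of its eigenvalue.

import numpy as np

def pca_components(X, k):
    """Top-k eigenvectors (as columns) and eigenvalues of the covariance of X."""
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))  # covariance is symmetric, so eigh
    order = np.argsort(vals)[::-1][:k]                    # largest eigenvalues first
    return vecs[:, order], vals[order]

# Xs_train / Xs_test are hypothetical names for the standardized data;
# the 5655x5655 covariance makes this slow but workable
W, lam = pca_components(Xs_train, 100)
train_100, test_100 = Xs_train @ W, Xs_test @ W            # 100D projections
train_100w, test_100w = train_100 / np.sqrt(lam), test_100 / np.sqrt(lam)  # whitened

W2, _ = pca_components(Xs_train, 2)                        # 2D projection for the plot
train_2d = Xs_train @ W2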

Your graph should end up looking similar to Figure 1 (although it may be rotated differently, depending on how you ordered things).
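For the visualization itself, a minimal scatter plot along the lines of Figure 1, assuming train_2d from the sketch above and the y_train labels; coloring by person id is one choice, not a requirement.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(train_2d[:, 0], train_2d[:, 1], c=y_train, cmap='tab20', s=8)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('2D PCA projection')
plt.show()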

Figure 1: 2D PCA Projection of data

3             Eigenfaces
Import the data as shown above. This is the Labeled Faces in the Wild dataset.

Use the X_train data from above. Let's analyze the first and second principal components.

Write a script that:

Imports the data as mentioned above.
Standardizes the data.
Performs PCA on the data (again, although you may not use any package ML functions like pca, you may use statistical functions like eig). No need to whiten here.
Finds the max and min image on PC1's axis, and the max and min on PC2's axis. Plots and reports these faces; what variation do these components capture?
Visualizes the most important principal component as an 87x65 image (see Figure 2).
Reconstructs the X_train[0,:] image using the primary principal component (as sketched below). To best see the full reconstruction, "unstandardize" the reconstruction by multiplying it by the original standard deviation and adding back in the original mean.
Determines the number of principal components, k, necessary to encode at least 95% of the information.
Reconstructs the X_train[0,:] image using the k most significant eigenvectors (found in the previous step; see Figure 4). For the fun of it, maybe even look to see whether you can perfectly reconstruct the face if you use all the eigenvectors! Again, to best see the full reconstruction, "unstandardize" the reconstruction by multiplying it by the original standard deviation and adding back in the original mean.
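A compact way to organize the eigenface and reconstruction steps is sketched below, under these assumptions (none of the names come from the assignment): W holds all eigenvectors as columns sorted by decreasing eigenvalue, vals the matching eigenvalues, mu and sigma the per-pixel mean and standard deviation used for standardization, and Xs_train the standardized training data.

import numpy as np
import matplotlib.pyplot as plt

# faces at the extremes of PC1 (indices into X_train)
pc1 = Xs_train @ W[:, 0]
i_max, i_min = int(np.argmax(pc1)), int(np.argmin(pc1))

# the primary principal component viewed as an 87x65 image
plt.imshow(W[:, 0].reshape(87, 65), cmap='gray')

# reconstruct X_train[0, :] from the primary principal component only
z1 = Xs_train[0] @ W[:, :1]
recon1 = (z1 @ W[:, :1].T) * sigma + mu                  # "unstandardize"
plt.figure(); plt.imshow(recon1.reshape(87, 65), cmap='gray')

# smallest k whose eigenvalues capture at least 95% of the variance
k = int(np.argmax(np.cumsum(vals) / np.sum(vals) >= 0.95)) + 1

# reconstruct the same face from the k most significant eigenvectors
zk = Xs_train[0] @ W[:, :k]
recon_k = (zk @ W[:, :k].T) * sigma + mu
plt.figure(); plt.imshow(recon_k.reshape(87, 65), cmap='gray')
plt.show()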
Your principal eigenface should end up looking similar to Figure 2.

Figure 2: Primary Principle Component

Your principal reconstruction should end up looking similar to Figure 3.

Figure 3: Reconstruction of first person

Your 95% reconstruction should end up looking similar to Figure 4.

Figure 4: Reconstruction of first person

4             Clustering
Let’s implement our own version of k-means to cluster data!

Once you have your setup complete as shown above, write a script to do the following:

Write your own version of K-means clustering where you use the L2 distance to compute similarity (a sketch follows the implementation details below).
Standardize your data (zero mean, divide by standard deviation).
Reduce the data to 100D using PCA.
Run K-means clustering with K = 10.
Report the number of images within each cluster.
Reconstruct the cluster centers for each of the K clusters. You will have to rotate the cluster centers back to the original space to visualize. Report these images.
Find the image closest to each cluster center and the image furthest from it, and report these images. Again, you will have to rotate these images back to the original space to visualize.
Implementation Details

Seed your random number generator with zero (do this right before running your k-means).
Randomly select k observations and use them for the initial reference vectors. I suggest you randomize the indices and use the first k randomized indices to select the observations.
Use the L2 distance (Euclidean) to measure the distance between observations and reference vectors.
Terminate the training process when the sum of the magnitudes of the changes of the cluster centers (from the previous iteration to the current one) is less than ε, that is, when

    ∑_{i=1}^{k} d(a_i(t), a_i(t−1)) < ε        (1)

where k is the number of clusters, a_i(t) is the reference vector for cluster i at time t, and d(x, y) is the L1 distance (Manhattan) between vectors x and y (as defined in the Similarity and Distance Functions link on BBlearn), or when you've hit 10,000 iterations.
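A sketch of the k-means loop under the implementation details above is shown below. The name X100 (the 100D PCA-reduced data) and the epsilon default are placeholders for illustration, not values given in the assignment.

import numpy as np

def kmeans(X, k=10, epsilon=1e-3, max_iter=10000):
    """K-means with L2 assignment and the L1 center-change stopping rule above."""
    np.random.seed(0)                                       # seed with zero right before running
    centers = X[np.random.permutation(len(X))[:k]].copy()   # first k randomized indices
    for _ in range(max_iter):
        # assign each observation to its nearest reference vector (L2 distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        # recompute each reference vector as the mean of its cluster
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # stop once the summed L1 change of the centers drops below epsilon
        # (epsilon here is a placeholder; use the threshold from the assignment)
        if np.abs(new_centers - centers).sum() < epsilon:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

centers, labels = kmeans(X100, k=10)
print(np.bincount(labels, minlength=10))                    # images per cluster

The returned centers live in the 100D PCA space; to visualize them, rotate them back with the eigenvector matrix and unstandardize before reshaping to 87x65. The closest and furthest images for each cluster can simply be looked up by index in the original data.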
