In this project, you will implement two unsupervised learning techniques. In the first task, you will implement the k-means clustering algorithm using the data provided in kmeans_data.zip. In the second task, you will implement PCA and apply dimensionality reduction to the data provided in USPS.mat.
Task 1
Please download kmeans_data.zip. In this problem, the ground-truth cluster assignments are given in labels.npy. Please do the following.
1. Plot the data as a scatter plot. Assign a different color to each class.
2. Implement the k-means clustering algorithm yourself, using the number of iterations as the stopping condition. You may use built-in functions only for side tasks such as norm computation, minimum-element search, and mean calculation, not for the clustering itself.
3. Run k-means 9 times with the number of iterations N = {1,2,...,9}. Plot the final cluster assignments of each run as a scatter plot in a 3x3 matplotlib subplot grid. Visually investigate the effect of the number of iterations on reaching the optimal clustering, and find the convergence point by comparing the plots with the one from Task 1.1. If the model has not converged after 9 iterations, you may select 9 other values of N that effectively show the progress of the clustering.
For a fair comparison, start each run with the same initial random assignments. You can use np.random.seed(1) for this purpose.
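The iteration-limited procedure described in Task 1 could be sketched roughly as follows. This is only one possible reading of the task, not a reference solution: the function name kmeans, its parameters, and the way an empty cluster is handled (re-seeding its centroid at a random data point) are assumptions made for illustration. Only NumPy helpers for the mean, the norm, and the minimum search are used.

```python
import numpy as np

def kmeans(X, k, n_iters, seed=1):
    """Hypothetical helper: k-means stopped after a fixed number of iterations."""
    np.random.seed(seed)  # same seed -> same initial random assignments in every run
    labels = np.random.randint(0, k, size=X.shape[0])
    centroids = np.zeros((k, X.shape[1]))
    for _ in range(n_iters):
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
            else:
                # Assumption: an empty cluster is re-seeded at a random data point.
                centroids[j] = X[np.random.randint(X.shape[0])]
        # Assignment step: each point moves to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
    return labels, centroids
```

Calling kmeans(X, k, N) for N = 1,...,9 with the default seed starts every run from the same random assignments, so the nine label arrays can be drawn side by side and compared with the ground-truth plot.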
Task 2
Please load the whole dataset in USPS.mat into Python using the loadmat function from scipy.io. The matrix A contains all the images, each of size 16 by 16. Each of the 3000 rows in A corresponds to the image of one handwritten digit (between 0 and 9). Please do the following.
1. Implement PCA and apply it to the data using d = 50,100,200,300 principal components. You are not allowed to use an existing PCA implementation, but you may use existing packages for the eigendecomposition. Do not forget to standardize the data before the eigendecomposition.
2. Reconstruct the images using the selected principal components from Task 2.1.
3. Visualize the reconstructed images for the images at indices i = 0,500,1000,2000 for d = 50,100,200,300. Create a 4x5 subplot grid where the rows correspond to the images at each index, the first four columns correspond to the images reconstructed using each d, and the last column is the raw image, i.e. before PCA. Comment on your results.
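The PCA and reconstruction steps of Task 2 might look something like the sketch below. It is a minimal illustration under stated assumptions rather than the expected solution: the function name pca_reconstruct is hypothetical, standardization is taken to mean subtracting the per-feature mean and dividing by the per-feature standard deviation, and constant features are given a divisor of 1 to avoid division by zero. The eigendecomposition uses np.linalg.eigh on the covariance matrix of the standardized data.

```python
import numpy as np

def pca_reconstruct(A, d):
    """Hypothetical helper: project rows of A onto d principal components and map back."""
    A = np.asarray(A, dtype=float)
    mean = A.mean(axis=0)
    std = A.std(axis=0)
    std[std == 0] = 1.0                    # assumption: leave constant features unscaled
    Z = (A - mean) / std                   # standardized data
    cov = np.cov(Z, rowvar=False)          # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov) # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]  # indices of the d largest eigenvalues
    W = eigvecs[:, order]                  # columns are the principal directions
    scores = Z @ W                         # reduced representation
    Z_hat = scores @ W.T                   # back-projection in standardized space
    return Z_hat * std + mean              # undo the standardization
```

Assuming the data has been loaded as A = loadmat("USPS.mat")["A"], a row such as pca_reconstruct(A, 50)[0] can be reshaped with reshape(16, 16) and displayed next to the raw image A[0].reshape(16, 16).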