$34.99
In this homework, the superscript of a symbol xi denotes the index of samples (not raising to ith power); this is a convention in this class. Please follow the homework submission instructions in the syllabus.
1 Concept questions [25 points]
Please provide a brief answer to each question.
1. (5 points) What’s the main difference between supervised and unsupervised learning? Give one benefit and drawback for supervised and unsupervised learning, respectively.
2. (5 points) Will different initializations for k-means lead to different results?
3. (5 points) Give a short proof (can be in words but using correct logic) of why k-means algorithm will converge in a finite number of iterations.
4. (5 points) What is the main difference between k-means and generalized k-means algorithm? Explain how the choice of similarity/dissimilarity/distance will impact the result.
5. (5 points) Consider the following simple graph
Write down the graph Laplacian matrix and find the eigenvectors associated with the zero eigenvalues. Explain how you find out the number of disconnected clusters in the graph and identify these disconnected clusters using these eigenvectors.
2 Math of k-means clustering [5 points]
Given m data points xi, i = 1,...,m, K-means clustering algorithm groups them into k clusters by minimizing the distortion function over {rij,µj}
m k
J = XXrij∥xi − µj∥2, (1)
i=1 j=1
where rij = 1 if xi belongs to the j-th cluster and rij = 0 otherwise.
1. (3 points) Derive mathematically that using the squared Euclidean distance ∥xi − µj∥2 as the dissimilarity function, the centroid that minimizes the distortion function J for given assignments rij are given by
.
That is, µj is the center of j-th cluster.
2. (2 points) Derive mathematically what should be the assignment variables rij be to minimize the distortion function J, when the centroids µj are fixed.
3 Image compression using clustering [20 points]
In this programming assignment, you are going to apply clustering algorithms for image compression. This can also be viewed as an example of segmenting colors in an automated fashion using K-means clustering.
Your task is to implement K-means for this purpose. It is required you implement the algorithms yourself rather than calling k-means from a package. However, it is ok to use standard packages for supplementary tasks, e.g., file i/o, linear algebra, and visualization.
Formatting instruction
As a starting point, we suggest the following input/output signature for your k-means algorithm.
Input
• pixels: the input image representation. Each row contains one data point (pixel). For image dataset, it contains 3 columns, each column corresponding to Red, Green, and Blue components. Each component has an integer value between 0 and 255.
• k: the number of desired clusters.
Output
• class: cluster assignment of each data point in pixels. The assignment should be 1, 2, 3, etc. For k = 5, for example, each cell of the class should be either 1, 2, 3, 4, or 5. The output should be a column vector with size(pixels, 1) elements.
• centroid: location of k centroids (or representatives) in your result. With images, each centroid corresponds to the representative color of each cluster. The output should be a matrix with K rows and 3 columns. The range of values should be [0, 255], possibly floating point numbers.
Hand-in
Both of your code and report will be evaluated. Upload the code as a zip file, and the report as a pdf, separately from the zip file. In your report, answer the following questions:
1. (15 points) Use k-means with squared-ℓ2 norm as a metric for hestain.bmp and football.bmp and also choose a third picture of your own to work on. We recommend the size of 320 × 240 or smaller. Run your k-means implementation with these pictures, with several different k = 2,3,4,5,6.
Comment: hestain.png is an image of tissue stained with hematoxylin and eosin (H&E). This staining method helps pathologists distinguish between tissue types that are stained blue-purple and pink. Then the algorithm will segment the image into k regions in the RGB color space. For each pixel in the input image, the algorithm returns a label corresponding to a cluster.
Run your k-means implementation (with squared-ℓ2 norm) with random initialization centroids. Please try multiple times and report the best one for each k (in terms of image quality).
2. (5 points) Please report how long it takes to converge for each k (report the number of iterations and elapsed time in seconds).
3. (5 points) Describe a method to find the best k. What is your best k?
Note
• We recommend you test your code with several different pictures so that you can detect some problems that might happen occasionally.
4 MNIST Dataset clustering [20 points]
This question is to compare different classifiers and their performance for multi-class classifications on the complete MNIST dataset at http://yann.lecun.com/exdb/mnist/. You can find the data file mnist 10digits.mat in the homework folder. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. Use the number of clusters K = 10.
We suggest you “standardize” the features (pixels in this case) by dividing the values of the features by 255 (thus mapping the range of the features from [0, 255] to [0, 1]).
for the cluster i.
5 Political blogs dataset [30 points]
We will study a political blog dataset first compiled for the paper Lada A. Adamic and Natalie Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). It is assumed that blog-site with the same political orientation are more likely to link to each other, thus, forming a “community” or “cluster” in a graph. In this question, we will see whether or not this hypothesis is likely to be true based on the data.
• The dataset nodes.txt contains a graph with n = 1490 vertices (“nodes”) corresponding to political blogs.
We will treat the network as an undirected graph; thus, when constructing the adjacency matrix, make it symmetrical by, e.g., set the entry in the adjacency matrix to be one whether there is an edge between the two nodes (in either direction).
In addition, each vertex has a 0-1 label (in the 3rd column of the data file) corresponding to the true political orientation of that blog. We will consider this as the true label and check whether spectral clustering will cluster nodes with the same political orientation as possible.
1. (15 points) Use spectral clustering to find the k = 2,5,10,25 clusters in the network of political blogs (each node is a blog, and their edges are defined in the file edges.txt). Find majority labels in each cluster for different k values, respectively. For example, if there are k = 2 clusters, and their labels are {0,1,1,1} and {0,0,1} then the majority label for the first cluster is 1 and for the second cluster is 0. It is required you implement the algorithms yourself rather than calling from a package.
Now compare the majority label with the individual labels in each cluster, and report the mismatch rate for each cluster, when k = 2,5,10,25. For instance, in the example above, the mismatch rate for the first cluster is 1/4 (only the first node differs from the majority), and the second cluster is 1/3.
2. (15 points) Tune your k and find the number of clusters to achieve a reasonably small mismatch rate. Please explain how you tune k and what is the achieved mismatch rate. Please explain intuitively what this result tells about the network community structure.