Starting from:

$34.99

ISYE6740 Homework 1 Solution

1 Clustering 
[a-b] Given N data points xn(n = 1,...,N), K-means clustering algorithm groups them into K clusters by minimizing the distortion function over {rnk,µk}
N K
J = XXrnkkxn − µkk2,
n=1k=1
where rnk = 1 if xn belongs to the k-th cluster and rnk = 0 otherwise.
(a)  Prove that using the squared Euclidean distance kxn − µkk2 as the dissimilarity function and minimizing the distortion function, we will have
.
That is, µk is the center of k-th cluster.
(b)  Prove that K-means algorithm converges to a local optimum in finite steps.
(c) For the following data (two moons), which of these three distance metrics below
(if any) would successfully separate the two moons? Explain the reasoning Some of the most commonly used distance metrics between two clusters are:
• Single linkage: the minimum distance between any pairs of points from the two clusters, i.e.

• Complete linkage: the maximum distance between any parts of points from the two clusters, i.e.

• Average linkage: the average distance between all pair of points from the two clusters, i.e.


2 Image compression using clustering [40 points]
In this programming assignment, you are going to apply clustering algorithms for image compression. Your task is implementing the clustering parts with two algorithms: K-means and K-medoids. It is required you implementing the algorithms yourself rather than calling from a package.
K-medoids
Given N data points xn(n = 1,...,N), K-medoids clustering algorithm groups them into K clusters by minimizing the distortion function ), where D(x,y) is a distance measure between two vectors x and y in same size (in case of K-means, D(x,y) = kx − yk2), µk is the center of k-th cluster; and rnk = 1 if xn belongs to the k-th cluster and rnk = 0 otherwise. In this exercise, we will use the following iterative procedure:
• Initialize the cluster center µk, k = 1,...,K.
• Iterate until convergence:
– Update the cluster assignments for every data point xn: rnk = 1 if k = argminj D(xn,µj), and rnk = 0 otherwise.
– Update the center for each cluster k: choosing another representative if necessary.
There can be many options to implement the procedure; for example, you can try many distance measures in addition to Euclidean distance, and also you can be creative for deciding a better representative of each cluster. We will not restrict these choices in this assignment. You are encouraged to try many distance measures as well as way of choosing representatives.
Formatting instruction
Input
• pixels: the input image representation. Each row contains one data point (pixel). For image dataset, it contains 3 columns, each column corresponding to Red, Green, and Blue component. Each component has an integer value between 0 and 255.
Output
• class: cluster assignment of each data point in pixels. The assignment should be 1, 2, 3, etc. For K = 5, for example, each cell of class should be either 1, 2, 3, 4, or 5. The output should be a column vector with size(pixels, 1) elements.
• centroid: location of K centroids (or representatives) in your result. With images, each centroid corresponds to the representative color of each cluster. The output should be a matrix with K rows and 3 columns. The range of values should be [0, 255], possibly floating point numbers.
Hand-in
Both of your code and report will be evaluated. Upload them together as a zip file. In your report, answer to the following questions:
1. Within the K-medoids framework, you have several choices for detailed implementation. Explain how you designed and implemented details of your K-medoids algorithm, including (but not limited to) how you chose representatives of each cluster, what distance measures you tried and chose one, or when you stopped iteration.
2. Attach a picture of your own. We recommend size of 320 × 240 or smaller.
3. Run your K-medoids implementation with the picture you chose above, with several different K. (e.g, small values like 2 or 3, large values like 16 or 32) What did you observe with different K? How long does it take to converge for each K?
5. Repeat question 2 and 3 with K-means. Do you see significant difference between K-medoids and K-means, in terms of output quality, robustness, or running time?
Note
• We will grade using test pictures which are not provided. We recommend you to test your code with several different pictures so that you can detect some problems that might happen occasionally.
• If we detect copy from any other student’s code or from the web, you will not be eligible for any credit for the entire homework, not just for the programming part. Also, directly calling built-in functions or from other package functions is not allowed.

More products