Starting from:

$30

ISYE6740- Homework 1 Solved

In this homework, the superscript of a symbol xi denotes the index of samples (not raising to ith power); this is a convention in this class.

1           K-means clustering 
Given m data points xi, i = 1,...,m, K-means clustering algorithm groups them into k clusters by minimizing the distortion function over {rij,µj}

                                                                                          m        k

                                                                                        J = XXrijkxi − µjk2,                                                                                  (1)

i=1 j=1

where rij = 1 if xi belongs to the j-th cluster and rij = 0 otherwise.

(10 points) Derive mathematically that using the squared Euclidean distance kxi − µjk2 as the dissimilarity function, the centroid that minimizes the distortion function J for given assignments rij are given by
.

That is, µj is the center of j-th cluster.

Hint: You may start by taking the partial derivative of J with respect to µj, with rij fixed.

(10 points) Derive mathematically what should be the assignment variables rij be to minimize the distortion function J, when the centroids µj are fixed.
(10 points) Write down a pseudocode for K-means algorithm here, based on your derived results.
(10 points) Explain why K-means algorithm converges to a local optimum in finite steps.
(20 points) Calculate k-means by hands using Euclidean distance, i.e., the set up in Equation (1). Given 5 data points configuration in Figure 1. Assume k = 2. Assuming the initialization of centroid as shown.(15 points) Complete the following table for all iterations until the algorithm converges.
Iteration No.
µ1
µ2
r11
r21
r31
r41
r51
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
(5) points How many iterations it takes for k-means to converge?
Figure 1: K-means.

2           Image compression using clustering
In this programming assignment, you are going to apply clustering algorithms for image compression. Your task is implementing K-means for this purpose. It is required you implementing the algorithms yourself rather than calling from a package.

Formatting instruction

Input
pixels: the input image representation. Each row contains one data point (pixel). For image dataset, it contains 3 columns, each column corresponding to Red, Green, and Blue component. Each component has an integer value between 0 and 255.
k: the number of desired clusters. Too high value of K may result in empty cluster error. Then, you need to reduce it.
Output
class: cluster assignment of each data point in pixels. The assignment should be 1, 2, 3, etc. For k = 5, for example, each cell of class should be either 1, 2, 3, 4, or 5. The output should be a column vector with size(pixels, 1)
centroid: location of k centroids (or representatives) in your result. With images, each centroid corresponds to the representative color of each cluster. The output should be a matrix with K rows and 3 columns. The range of values should be [0, 255], possibly floating point numbers.
Hand-in
Both of your code and report will be evaluated. Upload them together as a zip file. In your report, answer to the following questions:

(30 points) Compress pictures using k-means, for bmp and football.bmp and also choose a third picture of your own to work on. We recommend size of 320 × 240 or smaller. Run your k-means implementation with these pictures, with several different k = 2,4,8,16. How long does it take to converge for each k (report the number of iterations, as well as actual running time)? Please write in your report, and also include the resulted compressed pictures for each k.
(10 points) Run your k-means implementation with different initialization centroids. How does this it affect your final result? (We usually randomize initial location of centroids in general. To answer this question, an intentional poor assignment may be useful.) Please write in your report.
Note
You may see some error message about empty clusters when you use too large k. Your implementation should treat this exception as well. That is, do not terminate even if you have an empty cluster, but use smaller number of clusters in that case.
We recommend you to test your code with several different pictures so that you can detect some problems that might happen occasionally.
If we detect copy from any other student’s code or from the web, you will not be eligible for any credit for the entire homework, not just for the programming part. Also, directly calling built-in functions or from other package functions is not allowed.

More products