$25
Weighted K-Means Clustering
In this exercise we will simulate finding good locations for production plants of a company in order to minimize its logistical costs. In particular, we would like to place production plants near customers so as to reduce shipping costs and delivery time.
We assume that the probability of someone being a customer is independent of its geographical location and that the overall cost of delivering products to customers is proportional to the squared Euclidean distance to the closest production plant. Under these assumptions, the K-Means algorithm is an appropriate method to find a good set of locations. Indeed, K-Means finds a spatial clustering of potential customers and the centroid of each cluster can be chosen to be the location of the plant.
Because there are potentially millions of customers, and that it is not scalable to model each customer as a data point in the K-Means procedure, we consider instead as many points as there are geographical locations, and assign to each geographical location a weight wi corresponding to the number of inhabitants at that location. The resulting problem becomes a weighted version of K-Means where we seek to minimize the objective:
∑iwi mink | | xi − ck | | 2
J(c1, …, cK) = ,
∑iwi
where ck is the kth centroid, and wi is the weight of each geographical coordinate xi. In order to minimize this cost function, we iteratively perform the following EM computations:
Expectation step: Compute the set of points associated to each centroid:
arg min ‖xi − ck‖2}
k
Minimization step: Recompute the centroid as a the (weighted) mean of the associated data points:
∑i∈C(k)wi ⋅ xi
: ck ←
∑i∈C(k)wi
until the objective J(c1, …, cK) has converged.
Getting started
In this exercise we will use data from http://sedac.ciesin.columbia.edu/ (http://sedac.ciesin.columbia.edu/), that we store in the files data.mat as part of the zip archive. The data contains for each geographical coordinates (latitude and longitude), the number of inhabitants and the corresponding country.
Several variables and methods are provided in the file utils.py :
utils.population A 2D array with the number of inhabitants at each latitude/longitude.
utils.plot(latitudes,longitudes) Plot a list of centroids given as geographical coordinates in overlay to the population density map.
The code below plots three factories (white squares) with geographical coordinates (60,80), (60,90),(60,100) given as input.
Also, to get a dataset of geographical coordinates associated to the image given as an array, we can use:
In [2]:
Initializing Weighted K-Means (25 P)
Because K-means has a non-convex objective, choosing a good initial set of centroids is important. Centroids are drawn from from the following discrete probability distribution:
P population(x, y)
Z
where Z is a normalization constant. Furthermore, to avoid identical centroids, we add a small Gaussian noise to the location of centroids, with standard deviation 0.01.
Task:
Implement the initialization procedure above.
utils.plot .
Implementing Weighted K-Means (75 P)
Task:
Implement the weighted K-Means algorithm. Your algorithm should run for nbit iterations and print the value of the objective after training. If verbose , it should also print the value of the objective at each iteration.
it = 1: J = 12.66 it = 2: J = 10.08 it = 3: J = 8.39 it = 4: J = 7.82 it = 5: J = 7.52 it = 6: J = 7.35 it = 7: J = 7.22 it = 8: J = 7.14 it = 9: J = 7.10 it = 10: J = 7.07 it = 11: J = 7.04 it = 12: J = 7.01 it = 13: J = 6.98 it = 14: J = 6.96 it = 15: J = 6.95 it = 16: J = 6.95 it = 17: J = 6.94 it = 18: J = 6.94 it = 19: J = 6.93 it = 20: J = 6.93 it = 21: J = 6.93 it = 22: J = 6.93 it = 23: J = 6.93 it = 24: J = 6.93 it = 25: J = 6.93 it = 26: J = 6.93 it = 27: J = 6.93 it = 28: J = 6.93 it = 29: J = 6.93 it = 30: J = 6.93 it = 31: J = 6.93 it = 32: J = 6.93 it = 33: J = 6.93 it = 34: J = 6.93 it = 35: J = 6.93 it = 36: J = 6.93 it = 37: J = 6.93 it = 38: J = 6.93 it = 39: J = 6.93 it = 40: J = 6.93 it = 41: J = 6.93 it = 42: J = 6.93 it = 43: J = 6.93 it = 44: J = 6.93 it = 45: J = 6.93 it = 46: J = 6.93 it = 47: J = 6.93 it = 48: J = 6.93 it = 49: J = 6.93 it = 50: J = 6.93
Observe that the k-means algorithm is non-convex, and arrives in local optima of different quality depending on the initialization:
In [7]:
it = 50: J = 6.57 it = 50: J = 7.37 it = 50: J = 8.33 it = 50: J = 8.10 it = 50: J = 7.77