CSCI4390/6390 - Data Mining - Assignment 7 - Clustering - EM and Kernel K-Means

You will use the Appliances energy prediction data set. You should ignore the first attribute, which is a date-time variable, and you should also remove the last attribute, which is a duplicate of the previous one. Also, the first attribute (after removing the date-time variable), which denotes the Appliances Energy Use, will NOT be used for clustering; instead we will use it as the cluster label to assess the performance of the clustering methods. Thus, ignoring "date", "Appliances", and "rv2", the remaining attributes will be used as the data for clustering.
Note that the Appliances Energy Use attribute takes values in the range [10, 1080]. However, for clustering, we will group these values into 6 "true" clusters: c1 is for values in the range [10, 40], c2 for values [50, 50], c3 for values [60, 60], c4 for values [70, 90], c5 for values [100, 160], and c6 for values [170, 1080]. This will result in cluster sizes |C1| = 3094, |C2| = 4368, |C3| = 3282, |C4| = 3780, |C5| = 3003, and |C6| = 2208.
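A minimal preprocessing sketch; the filename energydata_complete.csv (the UCI distribution of the dataset) is an assumption, and the bin edges rely on the energy values being multiples of 10, as the ranges above imply.

```python
import numpy as np
import pandas as pd

# Filename is an assumption; adjust to your local copy of the dataset.
df = pd.read_csv("energydata_complete.csv")
df = df.drop(columns=["date", "rv2"])   # drop the date-time and duplicate attributes

# Bin Appliances energy use into the 6 "true" labels c1..c6 (encoded 0..5).
labels = np.digitize(df["Appliances"], bins=[45, 55, 65, 95, 165])

X = df.drop(columns=["Appliances"]).to_numpy(dtype=float)
```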
CSCI4390/6390: Expectation Maximization Clustering
Implement the Expectation-Maximization (EM) algorithm for clustering (see Algorithm 13.3 in Chapter 13). Use the 'Appliances' attribute as the true cluster label as described above, and use it for the purity-based clustering evaluation (see below). Run with k = 6 clusters.
For initializing the clusters, you can either choose random means as described in the algorithm, or you can shuffle the data and pick the first k points as the means. The latter has the advantage that the means are always points in the dataset; a sketch of this initialization is given below.
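The following sketch illustrates the shuffle-and-pick initialization, assuming X and k = 6 from the preprocessing sketch above; starting every cluster from the full-data covariance and uniform priors is an assumption, not something the assignment mandates.

```python
# Initialization sketch: shuffle the data and take the first k points as means.
rng = np.random.default_rng(0)
n, d = X.shape
k = 6

means = X[rng.permutation(n)[:k]].copy()        # means are actual data points
covs = np.array([np.cov(X, rowvar=False)] * k)  # full-data covariance per cluster (an assumption)
priors = np.full(k, 1.0 / k)                    # uniform cluster priors
```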
For practical purposes, you may want to use the log-sum-exp trick in the expectation step, where you compute log probabilities so that you can deal with very small probability values; otherwise, you may find that the weights of a point for all clusters are zero.
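Here is a minimal E-step sketch using the log-sum-exp trick via scipy; using scipy.special.logsumexp and scipy.stats.multivariate_normal is an assumption about your toolchain, and the variable names follow the initialization sketch above.

```python
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, means, covs, priors):
    """Return the n x k matrix of posteriors P(c_i | x_j), computed in log space."""
    n, k = X.shape[0], len(priors)
    log_w = np.empty((n, k))
    for i in range(k):
        # log prior + log Gaussian density; logpdf avoids underflow of tiny densities
        log_w[:, i] = np.log(priors[i]) + multivariate_normal.logpdf(
            X, mean=means[i], cov=covs[i], allow_singular=True)
    # Normalize in log space: log w_ij - log sum over clusters of exp(log w_ij)
    log_w -= logsumexp(log_w, axis=1, keepdims=True)
    return np.exp(log_w)
```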
As another practical point, you can get an error when inverting the covariance matrix, so you should add a small ridge value λ along the diagonal entries to make the matrix invertible. This can be considered as a regularized estimate of the covariance matrix, i.e., Σ + λI.
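A matching M-step sketch with the ridge-regularized covariance estimate; the default λ = 1e-6 is an assumption you may need to tune.

```python
def m_step(X, w, ridge=1e-6):
    """Re-estimate priors, means, and ridge-regularized covariances Sigma + ridge*I."""
    n, d = X.shape
    nk = w.sum(axis=0)                     # effective cluster sizes
    priors = nk / n
    means = (w.T @ X) / nk[:, None]
    covs = np.empty((len(nk), d, d))
    for i in range(len(nk)):
        Xc = X - means[i]
        covs[i] = (w[:, i, None] * Xc).T @ Xc / nk[i] + ridge * np.eye(d)
    return means, covs, priors
```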
Your program output should consist of the following information:
 
The final mean for each cluster
The final covariance matrix for each cluster
Size of each cluster, after assigning each point to the cluster with the highest posterior probability P(c_i | x_j).
The purity score for your clustering, computed as follows: assume that C_i denotes the set of points assigned to cluster i by the EM algorithm, and let T_j denote the set of points whose true cluster id is j. The purity score is defined as:

purity = (1/n) Σ_{i=1}^{k} max_{j=1,…,K} |C_i ∩ T_j|

where K is the true number of clusters and k is the input number of clusters to EM. See Eq. (17.1) on pg 427 for more details on the purity measure.
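A direct translation of this formula, assuming integer label vectors pred (the EM assignments) and true_labels (the binned Appliances labels), both 0-indexed; the names are hypothetical.

```python
def purity(pred, true_labels):
    """purity = (1/n) * sum_i max_j |C_i ∩ T_j|, via a k x K contingency table."""
    n = len(pred)
    counts = np.zeros((pred.max() + 1, true_labels.max() + 1), dtype=int)
    for c, t in zip(pred, true_labels):
        counts[c, t] += 1                  # counts[i, j] = |C_i ∩ T_j|
    return counts.max(axis=1).sum() / n
```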
CSCI6390: Kernel K-Means
Also implement the Kernel K-Means algorithm (Algorithm 13.2 on pg 341). Use only the Gaussian kernel, but you will have to choose the value of the spread parameter. A minimal sketch is given below.
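A kernel K-means sketch under these assumptions: spread is interpreted as σ² in the Gaussian kernel exp(−‖x−y‖² / (2σ²)), clusters are initialized by random assignment, and eps is a threshold on the change in the objective; none of these choices are mandated by the assignment. Note that it materializes the full kernel matrix, which is memory intensive for 19K+ points (see the note further below).

```python
from scipy.spatial.distance import pdist, squareform

def kernel_kmeans(X, k, eps, spread, seed=0):
    rng = np.random.default_rng(seed)
    K = np.exp(-squareform(pdist(X, "sqeuclidean")) / (2.0 * spread))
    assign = rng.integers(0, k, size=len(X))      # random initial clusters
    prev = np.inf
    while True:
        d = np.empty((len(X), k))
        for c in range(k):
            m = assign == c
            nc = max(m.sum(), 1)                  # guard against empty clusters
            # Kernel distance of every point to the mean of cluster c; the
            # K(x, x) term is constant across clusters and therefore dropped.
            d[:, c] = K[np.ix_(m, m)].sum() / nc**2 - 2.0 * K[:, m].sum(axis=1) / nc
        assign = d.argmin(axis=1)
        obj = d.min(axis=1).sum()
        if abs(prev - obj) <= eps:                # converged: objective stabilized
            return assign
        prev = obj
```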
Your code must output the following information:
 
Size of each cluster
The Purity for your clustering

 
For CSCI6390, for Kernel K-Means, write a script Assign7-KK.py, which will be run as

Assign7-KK.py FILENAME k EPS SPREAD

The parameters have the same meaning as given above, but SPREAD is the spread parameter for the Gaussian kernel.
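A hypothetical way to read these command-line arguments (the variable names are illustrative):

```python
import sys

filename = sys.argv[1]          # FILENAME: path to the data file
k = int(sys.argv[2])            # k: number of clusters
eps = float(sys.argv[3])        # EPS: convergence threshold
spread = float(sys.argv[4])     # SPREAD: Gaussian kernel spread
```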
Note that computing the full kernel matrix for 19K+ points will be memory intensive, so if you do not have enough memory, one option is to repeatedly compute the required kernel values on the fly. Alternatively, you can show results on at least 5000 points. However, you have to select these points using stratified sampling, so that you choose a proportional number of points from each cluster label. You can use StratifiedShuffleSplit from scikit-learn for stratified sampling if you wish.
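A stratified-sampling sketch with StratifiedShuffleSplit, assuming X and the binned label vector labels from the preprocessing sketch above:

```python
from sklearn.model_selection import StratifiedShuffleSplit

# Draw 5000 points whose label proportions match the full dataset.
sss = StratifiedShuffleSplit(n_splits=1, train_size=5000, random_state=0)
sample_idx, _ = next(sss.split(X, labels))
X_small, labels_small = X[sample_idx], labels[sample_idx]
```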
