INF553 - Foundations and Applications of Data Mining - Assignment 5: Clustering

1.   Overview of the Assignment
In Assignment 5, you will implement the K-Means and Bradley-Fayyad-Reina (BFR) algorithms. The goal is to help you become familiar with clustering algorithms under various distance measures. The datasets you will use are synthetic.

 

2.   Assignment Requirements
2.1 Programming Language and Library Requirements 

a. You must use Python to implement all the tasks. You can only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed). Spark RDD is optional for Python. If you want to use Spark, please specify the following environment in your code:

  

b.       There will be a 10% bonus for the Scala implementation of each task. Spark RDD is required for Scala. You can get the bonus only when both the Python and Scala implementations are correct.

c.       Spark DataFrame and DataSet are not allowed.

 

2.2  Programming Environment 

Python 3.6, Scala 2.11, and Spark 2.3.0.

 3.   Dataset
Since the BFR algorithm makes the strong assumption that clusters are normally distributed with independent dimensions, we have generated synthetic datasets by initializing some random centroids and creating data points around those centroids with some standard deviations to form the clusters. We have also added some data points as outliers. The "cluster" number of these outliers is represented by -1 (i.e., no cluster). Figure 1 shows an example of the data points (in CSV format). The first column is the data point index; the remaining columns represent the features/dimensions of the data point.

  

Figure 1: An example of the data points with 3 dimensions 
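A chunk file in this format can be parsed with only the standard library. Below is a minimal sketch; the `load_chunk` name and the dict-of-lists layout are illustrative choices, not part of the assignment spec:

```python
import csv

def load_chunk(path):
    """Load one data chunk: returns {point_index (str): [feature values]}.

    Per the assignment, column 0 is the point index and the remaining
    columns are the features/dimensions.
    """
    points = {}
    with open(path) as f:
        for row in csv.reader(f):
            if not row:  # skip blank lines defensively
                continue
            points[row[0]] = [float(x) for x in row[1:]]
    return points
```

Keeping the index as a string matches the ground-truth JSON files, whose keys are the point indices as strings.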

You can access and download the datasets either from the directory resource/asnlib/publicdata/ on Vocareum or from Google Drive (USC email only): https://drive.google.com/open?id=1rQJbHRBewL_W5VQuQiblsW1t1DR4k3YB 

a.       Folder test1 (or test2) contains multiple files of data points. We will treat these files as separate data chunks. In each iteration, you will load one file (one chunk of points) into memory and process these data points with the BFR algorithm.

b.       Files cluster1.json and cluster2.json provide the ground truth clusters for the data points in test1 and test2. The key is the data point index (as a string). The value is its corresponding cluster index. The cluster of outliers is represented as -1. Both datasets have 10 clusters.

c.       We have generated 10 testing sets using a similar method (two of them are provided here, i.e., test1 and test2). Notice that the number of dimensions, the number of files, and the number of data points could vary across datasets.

 


 

4.1 Task

You will write the K-Means and Bradley-Fayyad-Reina (BFR) algorithms from scratch. You should implement K-Means as the main-memory clustering algorithm used inside BFR. You will iteratively load the data points from a file and process them with the BFR algorithm. See the pseudocode below for reference.
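Since external libraries such as numpy are not allowed, K-Means must be written in plain Python. The sketch below implements Lloyd's algorithm with random initialization; the `kmeans` function name, its defaults, and the label-list return value are illustrative assumptions, not a required interface:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: returns a cluster label (0..k-1) per point."""
    rnd = random.Random(seed)
    # Initialize centroids as k distinct random points.
    centroids = [list(p) for p in rnd.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: euclidean(p, centroids[c]))
        # Update step: move each centroid to the mean of its members
        # (keep the old centroid if the cluster became empty).
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels
```

A production version would also check for convergence (no label changes) to stop early, and might use k-means++-style seeding for more stable initialization.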

  

In BFR, there are three sets of points that you need to keep track of: the Discard Set (DS), the Compression Set (CS), and the Retained Set (RS). Each cluster in the DS and CS is summarized by:

N: the number of points

SUM: the sum of the coordinates of the points

SUMSQ: the sum of squares of the coordinates

The conceptual steps of the BFR algorithm: please refer to the slides.
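These three statistics are enough to recover the centroid (SUM_i / N) and the per-dimension variance (SUMSQ_i / N − (SUM_i / N)²) of a cluster without keeping its points. A minimal sketch (the function names are illustrative):

```python
def summarize(points):
    """Summarize a cluster of points as (N, SUM, SUMSQ), per dimension."""
    n = len(points)
    d = len(points[0])
    s = [sum(p[i] for p in points) for i in range(d)]       # SUM
    sq = [sum(p[i] ** 2 for p in points) for i in range(d)]  # SUMSQ
    return n, s, sq

def centroid_and_variance(n, s, sq):
    """Recover centroid and per-dimension variance from (N, SUM, SUMSQ)."""
    centroid = [si / n for si in s]
    variance = [sqi / n - (si / n) ** 2 for si, sqi in zip(s, sq)]
    return centroid, variance
```

Because the summaries are additive, merging two clusters or absorbing a new point only requires adding the corresponding N, SUM, and SUMSQ components.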

The implementation details of the BFR algorithm: please refer to Section 7 (Appendix).

 

4.2 Execution commands 

Python command: $ python3 bfr.py <input_path> <n_cluster> <out_file1> <out_file2>

Scala command: $ spark-submit --class bfr hw5.jar <input_path> <n_cluster> <out_file1> <out_file2>

Params:
<input_path>: the folder containing the files of data points

<n_cluster>: the number of clusters

<out_file1>: the output file of cluster results

<out_file2>: the output file of intermediate results
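The four positional arguments can be read straight from `sys.argv`; a minimal sketch (the `parse_args` helper is an illustrative assumption, not part of the spec):

```python
import sys

def parse_args(argv):
    """Parse the four positional arguments of the bfr.py command line."""
    if len(argv) != 5:
        raise SystemExit(
            "usage: python3 bfr.py <input_path> <n_cluster> <out_file1> <out_file2>")
    _, input_path, n_cluster, out_file1, out_file2 = argv
    return input_path, int(n_cluster), out_file1, out_file2

if __name__ == "__main__":
    input_path, n_cluster, out_file1, out_file2 = parse_args(sys.argv)
```

Converting `<n_cluster>` to `int` early fails fast on malformed invocations instead of deep inside the clustering loop.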
 

4.3 Output format 

a.       You must write your clustering results in JSON format (see Figure 2). The key is the data point index (as a string). The value is its corresponding cluster index.

  

Figure 2: An example of the output clustering results 
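Writing this file is a single `json.dump` of an index-to-cluster dict; a minimal sketch (the function name and dict layout are illustrative):

```python
import json

def write_cluster_results(assignments, out_path):
    """Write {point_index (str): cluster index (int)} as JSON.

    Outliers map to -1, matching the ground-truth convention.
    """
    with open(out_path, "w") as f:
        json.dump(assignments, f)
```

Using string keys keeps the output directly comparable to cluster1.json / cluster2.json, whose keys are also strings.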

b.       You must output the intermediate results in CSV format (see Figure 3). Each line represents the following components of one iteration: "round id" (starting from 1), "the number of clusters in the discard set", "the total number of discarded points", "the number of clusters in the compression set", "the total number of compressed points", and "the number of points in the retained set". The total number of rounds must equal the number of data chunks in the folder. The total number of discarded points accumulates across iterations.

  

Figure 3: An example of the intermediate results 
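One row per round can be appended with the `csv` module; a minimal sketch (the helper name and parameter names are illustrative, and whether a header row is expected should be checked against Figure 3):

```python
import csv

def append_intermediate(out_path, round_id, n_ds_clusters, n_discarded,
                        n_cs_clusters, n_compressed, n_retained):
    """Append one round's statistics as a CSV line, in the order:
    round id, DS cluster count, cumulative discarded-point total,
    CS cluster count, compressed-point total, retained-point count."""
    with open(out_path, "a", newline="") as f:
        csv.writer(f).writerow(
            [round_id, n_ds_clusters, n_discarded,
             n_cs_clusters, n_compressed, n_retained])
```

Remember that the discarded-point column is cumulative: pass the running total, not the count discarded in that round alone.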

 

 

7.   Appendix  
The implementation details of the BFR algorithm (you can/should have your own implementation; this is only for reference). Suppose the number of clusters is K and the number of dimensions is d.

a.       Load the data points from one file.

b.       Run K-Means on a small random sample of the data points to initialize the K centroids, using the Euclidean distance as the distance measure.

c.       Use the K-Means result from b to generate the DS clusters (i.e., discard points and generate statistics).

d.       The initialization of DS is finished; so far, you have K clusters in DS.

e.       Run K-Means on the rest of the data points with a large number of clusters (e.g., 5 times K) to generate the CS (clusters with more than one point) and the RS (clusters with only one point).

f.        Load the data points from the next file.

g.       For the new points, compare them to the clusters in DS using the Mahalanobis distance, and assign each point to the nearest DS cluster if the distance is < 𝛼.

h.       For the new points that are not assigned to DS clusters, use the Mahalanobis distance to assign each point to the nearest CS cluster if the distance is < 𝛼.

i.         For the new points that are not assigned to any clusters in DS or CS, assign them to RS.

j.         Merge the data points in RS by running K-Means with a large number of clusters (e.g., 5 times K) to regenerate the CS (clusters with more than one point) and the RS (clusters with only one point).

k.       Merge clusters in CS that have a Mahalanobis distance < 𝛼.

l.         Repeat steps f–k until all the files are processed.

m.     If this is the last round (after processing the last chunk of data points), merge clusters in CS with the clusters in DS that have a Mahalanobis Distance < 𝛼 .

(𝛼 is a hyper-parameter; you can choose it to be around 2, 3, or 4.)
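Because BFR assumes independent dimensions, the covariance matrix is diagonal and the Mahalanobis distance reduces to a per-dimension standardized distance computed directly from (N, SUM, SUMSQ). The sketch below is illustrative; note that the handout states the threshold as 𝛼, while many BFR treatments use 𝛼·√d (the version used here, flagged in the code):

```python
import math

def mahalanobis(point, n, s, sq, eps=1e-9):
    """Mahalanobis distance from a point to a cluster summarized by
    (N, SUM, SUMSQ), assuming a diagonal covariance (independent dims)."""
    total = 0.0
    for x, si, sqi in zip(point, s, sq):
        mu = si / n                      # per-dimension centroid
        var = sqi / n - mu ** 2          # per-dimension variance
        total += ((x - mu) ** 2) / max(var, eps)  # guard zero variance
    return math.sqrt(total)

def nearest_cluster(point, summaries, alpha):
    """Index of the nearest summarized cluster if within threshold, else None.

    Threshold is alpha * sqrt(d), a common BFR convention; the handout
    writes it simply as alpha, so adjust to match the grader if needed.
    """
    best, best_dist = None, float("inf")
    for i, (n, s, sq) in enumerate(summaries):
        dist = mahalanobis(point, n, s, sq)
        if dist < best_dist:
            best, best_dist = i, dist
    if best is not None and best_dist < alpha * math.sqrt(len(point)):
        return best
    return None
```

For a point exactly at a cluster's centroid the distance is 0; a point one standard deviation away in one dimension has distance 1, which is why thresholds of 2–4 (per-dimension standard deviations) are reasonable.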
