DSCI553 Assignment 5: Clustering

1. Overview of the Assignment
In Assignment 5, you will implement the K-Means and Bradley-Fayyad-Reina (BFR) algorithms. Doing the assignment helps you become familiar with clustering algorithms using various distance measurements. The datasets you are going to use are synthetic.

3. Dataset
Since the BFR algorithm makes the strong assumption that clusters are normally distributed with independent dimensions, we generated the synthetic datasets by the following steps: 1) initializing some random centroids; 2) for each centroid, sampling data points following a normal distribution in each dimension, where the mean in each dimension is the centroid's coordinate and the standard deviation is set manually; 3) for each centroid, adding some data points as outliers. Figure 1 shows an example of the data points (in CSV format). The first column is the data point index. The other columns represent the features/dimensions of the data points.

Figure 1: An example of the data points with 3 dimensions 
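A minimal sketch of this generation procedure, with illustrative values for the cluster count, dimensionality, standard deviation, and outlier range (none of these match the graded datasets):

import numpy as np

# Illustrative parameters only; the graded datasets use unknown values.
K, d, points_per_cluster, std = 5, 3, 1000, 1.0

rng = np.random.default_rng(0)
rows, idx = [], 0
for _ in range(K):
    centroid = rng.uniform(-100, 100, size=d)      # 1) random centroid
    # 2) normally distributed points around the centroid, independent dims
    for p in rng.normal(centroid, std, size=(points_per_cluster, d)):
        rows.append([idx, *p])
        idx += 1
    # 3) a few outliers added per centroid
    for p in rng.uniform(-1000, 1000, size=(10, d)):
        rows.append([idx, *p])
        idx += 1

np.savetxt("points.csv", rows, delimiter=",")  # first column: point index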
You can access and download the following datasets either under the directory (resource/asnlib/publicdata/) on Vocareum or Google Drive (USC email only): https://drive.google.com/drive/folders/1mZgx9vHpsjwtjwbZGdnVjrVypKoZJaP9?usp=sharing

a. Five folders are named from “test1” to “test5”. Each folder contains multiple files storing data points. We will treat these files as separate data chunks. In each iteration, you will load one file (one chunk of points) into memory and process these data points with the BFR algorithm.

b. Five files are named from “cluster1.json” to “cluster5.json”. Each file provides the ground truth cluster for the data points in the corresponding folder from “test1” to “test5”. The key is the data point index (as a string). The value is its corresponding cluster index. The cluster index of the outliers is -1. The number of clusters for each dataset is test1: 10, test2: 10, test3: 5, test4: 8, test5: 15. 

c. We have generated ten testing sets using the same method. Five of them are provided for your implementation; the rest are used for grading. Notice that the number of dimensions, the number of files, and the number of data points in each dataset may vary.


4.1 Task description
You will write the K-Means and Bradley-Fayyad-Reina (BFR) algorithms from scratch. You should implement K-Means as the in-memory clustering algorithm used inside BFR. You will iteratively load the data points from a file and process them with the BFR algorithm. See the pseudocode below for reference.
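A minimal in-memory K-Means sketch, assuming Euclidean distance and random initialization (the function name, iteration cap, and empty-cluster handling are simplifying assumptions, not the required design):

import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    # Minimal Lloyd's algorithm; points is an (n, d) NumPy array.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if a cluster empties.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids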

In BFR, there are three sets of points that you need to keep track of: the discard set (DS), the compression set (CS), and the retained set (RS). Each cluster in the DS and CS is summarized by the following statistics (a maintenance sketch follows this list):

N: the number of points

SUM: the sum of the coordinates of the points, per dimension

SUMSQ: the sum of squares of the coordinates, per dimension
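A minimal sketch of maintaining these statistics incrementally and recovering the centroid and per-dimension variance from them (the ClusterSummary name and its methods are illustrative assumptions):

import numpy as np

class ClusterSummary:
    # DS/CS summary: constant memory per cluster, independent of point count.
    def __init__(self, points):
        points = np.asarray(points, dtype=float)
        self.N = len(points)                      # number of points
        self.SUM = points.sum(axis=0)             # per-dimension sum
        self.SUMSQ = (points ** 2).sum(axis=0)    # per-dimension sum of squares

    def add(self, point):
        point = np.asarray(point, dtype=float)
        self.N += 1
        self.SUM += point
        self.SUMSQ += point ** 2

    def centroid(self):
        return self.SUM / self.N

    def variance(self):
        # E[x^2] - E[x]^2, per dimension
        return self.SUMSQ / self.N - (self.SUM / self.N) ** 2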

The conceptual steps of the BFR algorithm: Please refer to the slides. 

The implementation details of the BFR algorithm: Please refer to Section 7. 
4.2 Execution commands
Python command: $ python3 bfr.py <input_path> <n_cluster> <out_file1> <out_file2>

Scala command: $ spark-submit --class bfr hw5.jar <input_path> <n_cluster> <out_file1> <out_file2>

Params:

<input_path>: the folder containing the files of data points

<n_cluster>: the number of clusters

 <out_file1>: the output file of cluster results

<out_file2>: the output file of intermediate results


4.3 Output format
a. You must write your clustering results in JSON format (see Figure 2). The key is the data point index (as a string). The value is its corresponding cluster index. The indexes for both data points and clusters start from 0. A minimal writing sketch follows Figure 2.

 

Figure 2: An example of the output clustering results 
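A minimal sketch of writing the clustering results, assuming a dict from point index (string) to cluster index has already been built (the variable and file names are illustrative; use the <out_file1> argument in practice):

import json

cluster_result = {"0": 3, "1": 0, "2": 9}  # illustrative values

with open("out_file1.json", "w") as f:
    json.dump(cluster_result, f)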

b. You must output the intermediate results in CSV format (use the same headers as in Figure 3). Each line of the intermediate results reports the following information about each iteration/data chunk (see the writer sketch after this list):

- round id (starting from 1): the number of rounds must equal the number of data chunks in the folder

- the number of clusters in the discard set

- the total accumulated number of discarded points: this number should only go up across iterations

- the number of clusters in the compression set

- the total number of compressed points

- the number of points in the retained set
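A minimal sketch of the intermediate-results writer (the header names below are assumptions for illustration; use the exact headers shown in Figure 3):

import csv

HEADER = ["round_id", "nof_cluster_discard", "nof_point_discard",
          "nof_cluster_compression", "nof_point_compression",
          "nof_point_retained"]

with open("out_file2.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(HEADER)
    # Append one row per round, e.g. after round 1:
    writer.writerow([1, 10, 95460, 7, 41, 12])  # illustrative values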

7. Appendix  
The implementation details of the BFR algorithm (you can/should have your own implementation; this is only for reference). Suppose the number of clusters is K and the number of dimensions is d.

1. Load the data points from one file.

2. Run K-Means on the data points or a random subset of the data points. For the implementation, you will apply K-Means on a subset, since you cannot handle all the data points in the first file. Initialize the algorithm with a large number of centroids (e.g., 3 or 5 times K) and use the Euclidean distance as the similarity measurement.

3. Among the resulting clusters from step 2, move all the clusters that contain only one or very few points (you define “very few”) to the RS as outliers. You will now have two groups of data points: the outlier data points and the inlier data points.

4. Run the clustering algorithm again to cluster the inlier data points into K clusters. Use these K clusters as your DS. Discard these points and generate the DS statistics.

5. Run the clustering algorithm again to cluster the outlier data points using a large number of clusters (e.g., 3 or 5 times K). Generate CS clusters and their statistics from the clusters with more than one data point, and use the remaining points as your new RS.

6. The above steps finish the initialization of the DS. So far, you have K DS clusters (from step 4), some number of CS clusters (from step 5), and some RS points (from step 5).

7. Load the data points from another file (since step 2 only loaded a portion of the first file, first load the remaining data points from the first file).

8. For the new data points, compare them to each DS cluster using the Mahalanobis distance and assign a point to the nearest DS cluster if that distance is < 𝛼√𝑑 (e.g., 2√𝑑); see the distance sketch after this list.

9. For the new data points that are not assigned to any DS cluster, compare them to each CS cluster using the Mahalanobis distance and assign a point to the nearest CS cluster if that distance is < 𝛼√𝑑 (e.g., 2√𝑑).

10. For the new data points that are not assigned to any DS or CS cluster, add them to your RS.

11. Run the clustering algorithm on the RS with a large number of centroids (e.g., 3 or 5 times K). Generate CS clusters and their statistics from the clusters with more than one data point, add them to your existing CS list, and use the remaining points as your new RS.

12. Merge CS clusters that have a Mahalanobis distance < 𝛼√𝑑 (e.g., 2√𝑑).

13. Output your intermediate results after all the data points in the current file have been evaluated. Repeat steps 7 to 12 for each remaining file.

If this is the last run (after the last chunk of data), merge your CS and RS clusters into the closest DS clusters.
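A minimal sketch of the Mahalanobis distance used in steps 8, 9, and 12, computed from a cluster's N/SUM/SUMSQ statistics (the function names are illustrative; ClusterSummary refers to the sketch in Section 4.1, and each dimension is assumed to have nonzero variance):

import numpy as np

def mahalanobis(point, summary):
    # Distance from a point to a cluster summarized by N, SUM, SUMSQ.
    centroid = summary.SUM / summary.N
    sigma = np.sqrt(summary.SUMSQ / summary.N - centroid ** 2)
    # Normalize each dimension by that cluster's standard deviation.
    return np.sqrt((((point - centroid) / sigma) ** 2).sum())

def assign(point, summaries, alpha, d):
    # Index of the nearest cluster within alpha * sqrt(d), else None.
    dists = [mahalanobis(point, s) for s in summaries]
    j = int(np.argmin(dists))
    return j if dists[j] < alpha * np.sqrt(d) else None

For the step 12 merge, the same distance can be evaluated between one CS cluster's centroid and another CS cluster's summary.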
