INF553 Assignment 5: Clustering

In this assignment, you will implement the K-Means and Bradley-Fayyad-Reina (BFR) algorithms. The goal is to familiarize you with clustering algorithms under various distance measurements. The datasets you will use are synthetic.

 

1.  Assignment Requirements
1.1 Programming Language and Library Requirements

a.      You must use Python to implement all the tasks. You may only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed). Spark RDD is optional for Python. If you want to use Spark, please specify the required environment (see the Programming Environment section) in your code:

 

b.      Spark DataFrame and DataSet are not allowed.

 

1.2 Programming Environment

Python 3.6, Scala 2.11, and Spark 2.3.0

We will use Vocareum to run and grade your submission automatically. You must test your scripts both on your local machine and in the Vocareum terminal before submission.

 

2.  Dataset
Since the BFR algorithm makes the strong assumption that clusters are normally distributed with independent dimensions, we have generated synthetic datasets by initializing some random centroids and creating data points around these centroids with some standard deviations to form the clusters. We have also added some data points as outliers. The "cluster" number of these outliers is represented by -1 (i.e., no cluster). Figure 1 shows an example of the data points (in CSV format). The first column is the data point index. The remaining columns represent the features/dimensions of the data point.

 

Figure 1: An example of the data points with 3 dimensions

You can access and download the following datasets either from the Vocareum directory resource/asnlib/publicdata/ or from Google Drive (USC email only): https://drive.google.com/open?id=1rQJbHRBewL_W5VQuQiblsW1t1DR4k3YB

a.      Folder test1 (or test2) contains multiple files of data points. We will treat these files as separate data chunks. In each iteration, you will load one file (one chunk of points) into memory and process these data points with the BFR algorithm.

b.      Files cluster1.json and cluster2.json provide the ground-truth clusters for the data points in test1 and test2. The key is the data point index (as a string). The value is its corresponding cluster index. Outliers are assigned cluster -1.

c.      We have generated 10 testing sets using a similar method (two of them, test1 and test2, are provided here). Notice that the number of dimensions, the number of files, and the number of data points may vary across datasets.
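As a starting point, loading one chunk of data points in the format above needs only the standard library. A minimal sketch (the function names `load_chunk` and `chunk_files` are illustrative, not required by the assignment):

```python
import csv
import os

def load_chunk(file_path):
    """Load one chunk file of data points.

    Each row is: index, feature_1, ..., feature_d (CSV, no header assumed).
    Returns a dict mapping point index (as a string) -> list of floats.
    """
    points = {}
    with open(file_path) as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            points[row[0]] = [float(x) for x in row[1:]]
    return points

def chunk_files(input_path):
    """Return the chunk files in a folder, in sorted order."""
    return sorted(
        os.path.join(input_path, name) for name in os.listdir(input_path)
    )
```

Processing the files in sorted order gives a deterministic sequence of rounds, which matters for the intermediate-results output.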

 

3.  Task
You need to submit the following files on Vocareum (all lowercase):

a.      [REQUIRED] Python script: bfr.py

b.      [REQUIRED FOR SCALA] Scala script: bfr.scala; one JAR package: hw5.jar

c.      [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions).

 

3.1 Task description

You will write the K-Means and Bradley-Fayyad-Reina (BFR) algorithms from scratch. You should implement K-Means as the main-memory clustering algorithm used inside BFR. You will iteratively load the data points from a file and process them with the BFR algorithm. See the pseudocode below for reference.
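For the main-memory step, a minimal standard-library K-Means can look like the sketch below. It uses plain random initialization and stops when labels no longer change; the function names, the seeding, and the stopping rule are illustrative assumptions, not prescribed by the assignment (K-Means++ initialization is a common upgrade):

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, k, max_iter=50, seed=0):
    """Minimal K-Means over a list of coordinate lists.

    Returns a list of cluster labels aligned with `points`.
    """
    rng = random.Random(seed)
    # Initialize centroids from k distinct random points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no label changed
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                dim = len(members[0])
                centroids[c] = [
                    sum(p[i] for p in members) / len(members)
                    for i in range(dim)
                ]
    return labels
```

Seeding the random generator makes runs reproducible, which simplifies debugging against the ground-truth files.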

 

In BFR, there are three sets of points that you need to keep track of: the Discard set (DS), the Compression set (CS), and the Retained set (RS). Each cluster in the DS and CS is summarized by:

N: the number of points

SUM: the sum of the coordinates of the points

SUMSQ: the sum of the squares of the coordinates

The conceptual steps of the BFR algorithm: please refer to the slides.

The implementation details of the BFR algorithm: please refer to the Section 7 Appendix.
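The (N, SUM, SUMSQ) statistics can be maintained incrementally and suffice to compute the per-dimension mean and variance, and from those a Mahalanobis distance under the independent-dimensions assumption the dataset section states. A sketch (function names are illustrative; the variance guard is an assumption for degenerate dimensions):

```python
def summarize(points):
    """Summarize a cluster of coordinate lists by (N, SUM, SUMSQ)."""
    d = len(points[0])
    n = len(points)
    s = [sum(p[i] for p in points) for i in range(d)]
    sq = [sum(p[i] ** 2 for p in points) for i in range(d)]
    return n, s, sq

def add_point(summary, p):
    """Fold one new point into an existing (N, SUM, SUMSQ) summary."""
    n, s, sq = summary
    return (n + 1,
            [si + xi for si, xi in zip(s, p)],
            [qi + xi ** 2 for qi, xi in zip(sq, p)])

def mahalanobis(summary, p):
    """Mahalanobis distance of p from a summarized cluster.

    Assumes independent dimensions: variance_i = SUMSQ_i/N - (SUM_i/N)^2.
    """
    n, s, sq = summary
    total = 0.0
    for i, x in enumerate(p):
        mean = s[i] / n
        var = sq[i] / n - mean ** 2
        if var <= 0:
            var = 1e-12  # guard: zero-variance dimension would divide by 0
        total += (x - mean) ** 2 / var
    return total ** 0.5
```

A common acceptance rule (from the lecture material) is to merge a point into a DS/CS cluster when its Mahalanobis distance is below a threshold such as 2*sqrt(d); the exact threshold is a design choice.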

 

3.2 Execution commands

Python command: $ python3 bfr.py <input_path> <n_cluster> <out_file1> <out_file2>

Scala command: $ spark-submit --class bfr hw5.jar <input_path> <n_cluster> <out_file1> <out_file2>

Params:

<input_path>: the folder containing the files of data points

<n_cluster>: the number of clusters

<out_file1>: the output file of cluster results

<out_file2>: the output file of intermediate results
 

3.3 Output format

a.      You must write your clustering results in JSON format (see Figure 2). The key is the data point index (as a string). The value is its corresponding cluster index.

 

Figure 2: An example of the output clustering results

b.      You must output the intermediate results in CSV format (see Figure 3). Each line records the following components for one iteration: "round id", "the number of clusters in the discard set", "the total number of discarded points", "the number of clusters in the compression set", "the total number of compressed points", and "the number of points in the retained set". The total number of rounds must equal the number of data chunks in the folder. The totals of discarded and compressed points accumulate across iterations.

 

Figure 3: An example of the intermediate results
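Both output files can be produced with the standard library. A sketch, with the caveats that the helper name `write_outputs` and the CSV header field names are illustrative assumptions (check the expected grader format before including a header row):

```python
import csv
import json

def write_outputs(out_file1, out_file2, cluster_of_point, rounds):
    """Write the clustering result (JSON) and the intermediate stats (CSV).

    cluster_of_point: dict mapping point index (string) -> cluster index
    (int, -1 for outliers), e.g. {"0": 3, "1": -1}.
    rounds: one tuple per iteration:
    (round_id, n_ds_clusters, n_ds_points, n_cs_clusters, n_cs_points,
     n_rs_points).
    """
    with open(out_file1, "w") as f:
        json.dump(cluster_of_point, f)
    with open(out_file2, "w", newline="") as f:
        writer = csv.writer(f)
        # Assumed header row: drop it if the grader expects data rows only.
        writer.writerow(["round_id",
                         "nof_cluster_discard", "nof_point_discard",
                         "nof_cluster_compression", "nof_point_compression",
                         "nof_point_retained"])
        writer.writerows(rounds)
```

`newline=""` on the CSV file handle is the documented way to avoid blank lines on Windows when using `csv.writer`.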
