INF553 Assignment 4: Clustering

In this assignment, you will experiment with several clustering techniques, including K-Means, Bisecting K-Means, and Gaussian Mixture Models (GMM). Unsupervised data mining methods are fairly diverse. At one extreme are the pure statistical summarization techniques that are the mainstay of exploratory data analysis. Somewhere in between are data transformations such as principal components analysis, which have a deeper mathematical underpinning but a fairly mechanical formulation. At the other extreme are unsupervised models that make strong assumptions about the data distribution. Clustering methods fall into that last category: they make an assumption about instance similarity given attributes and use this modeling assumption to provide an interpretable division of a dataset.

 

Environment Requirements
Python 2.7, Scala 2.11, Spark 2.3.1

Algorithm 
Note: You need to develop the algorithm yourself only for Problem 1. For Problems 2 and 3 you can use existing implementations.

 

In this assignment, you will look at three different clustering algorithms on the Yelp Dataset Challenge Round 12. You will cluster the reviews data. For the first problem you will implement the K-Means algorithm and submit the code. For Problems 2 and 3, you will run K-Means, Bisecting K-Means, and Gaussian Mixture Model based clustering. You can use existing libraries for this part of the assignment.

Note that this dataset is different from the one used in the last assignment. One of the challenges in data mining is preparing the data, extracting features, and presenting results. A significant portion of this assignment focuses on those aspects.

The dataset was prepared by extracting the text from reviews in Yelp Challenge Dataset Round 12. All special characters were removed from the reviews by running sub('\W+', '', word), and the words were then converted to lower case. Only reviews with more than 50 words were selected, and one review is written per line. The small dataset (yelp_reviews_clustering_small) contains 1,000 reviews. The large dataset (yelp_reviews_clustering_large) contains 100K reviews.
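For reference, the preprocessing described above might look roughly like the following minimal sketch (the exact script used to build the dataset was not provided, so the function names here are illustrative):

import re

def clean_review(text):
    # Strip non-word characters from each word and lower-case it,
    # mirroring the sub('\W+', '', word) step described above.
    words = [re.sub(r'\W+', '', w).lower() for w in text.split()]
    return [w for w in words if w]   # drop words that became empty

def keep_review(words):
    # Only reviews with more than 50 words were retained.
    return len(words) > 50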

 

Each review should be treated as a document. You will use word count (only in Problem 1) and Term Frequency - Inverse Document Frequency (TF-IDF) features (recall that this was discussed in class). The distance measure will be Euclidean distance. For the random seed or state, use 20181031.
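As a refresher, one common formulation of these features and the distance measure is sketched below. This is illustrative only; there are several TF-IDF variants, so the definition given in class takes precedence:

import math
from collections import Counter

def word_count_vector(review_words):
    # Raw term frequencies for one review (document).
    return Counter(review_words)

def tf_idf_vectors(docs):
    # docs: list of token lists, one per review.
    n = len(docs)
    df = Counter()                      # document frequency per term
    for words in docs:
        df.update(set(words))
    vectors = []
    for words in docs:
        tf = Counter(words)
        # tf-idf with idf = log(N / df); one common variant among several.
        vectors.append({t: tf[t] * math.log(float(n) / df[t]) for t in tf})
    return vectors

def euclidean(u, v):
    # Euclidean distance between two sparse vectors (dicts).
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))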

 

You can write your program in Python or Scala. For this assignment you will need to cluster the reviews in the Yelp dataset, with which you are already familiar. First, you need to develop the K-Means clustering algorithm and run it on yelp_reviews_clustering_small (Problem 1). Next, you need to run K-Means, bisecting K-Means, and GMM clustering on the yelp_reviews_clustering_small dataset (Problem 2) and the yelp_reviews_clustering_large dataset (Problem 3).

At the end of the assignment, Appendix A specifies how to organize your Description pdf file. 

 

 

Problem 1 Develop K-Means Clustering Algorithm
You will implement the K-Means algorithm with 1) word count and 2) TF-IDF features from the reviews, using the Euclidean distance measure. The algorithm should take as input a file with reviews, the feature type, the number of clusters, and the maximum number of iterations. For each cluster you need to identify the top ten most frequently occurring features. Note that these are the terms corresponding to the top ten values in the centroid vector.
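For orientation, a high-level sketch of the required loop is shown below. This is not the reference implementation: the helper names are my own, and euclidean() refers to the distance helper sketched earlier.

import random

def mean_vector(cluster):
    # Component-wise mean of a list of sparse vectors (dicts).
    total = {}
    for v in cluster:
        for t, x in v.items():
            total[t] = total.get(t, 0.0) + x
    return {t: x / len(cluster) for t, x in total.items()}

def k_means(vectors, k, max_iterations, seed=20181031):
    random.seed(seed)
    centroids = random.sample(vectors, k)          # initial centroids
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: euclidean(v, centroids[c]))
            clusters[i].append(v)
        # Update step: recompute each centroid as the cluster mean.
        new_centroids = [mean_vector(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:             # converged early
            break
        centroids = new_centroids
    return clusters, centroids

def top_terms(centroid, n=10):
    # Terms with the ten largest centroid values, as described above.
    return sorted(centroid, key=centroid.get, reverse=True)[:n]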

 

Execution Requirements
Input Arguments:

1. Input.txt: This is the path to the input reviews file. Every line contains a word, and reviews are separated by '---'. For Problem 1 you can use the Data/yelp_reviews_clustering_small.txt file to test the correctness of your algorithm.

2. Feature=W: W for word count, T for TF-IDF

 

3. N=5: Number of clusters

4. Iterations=20: Maximum number of iterations
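For illustration, these arguments could be read positionally, matching the spark-submit examples later in this section (a minimal sketch only):

import sys

if __name__ == '__main__':
    input_file = sys.argv[1]        # path to the reviews file
    feature = sys.argv[2]           # 'W' for word count, 'T' for TF-IDF
    n_clusters = int(sys.argv[3])   # number of clusters, e.g. 5
    iterations = int(sys.argv[4])   # maximum number of iterations, e.g. 20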

Output:  

You should output the sizes of the clusters, followed by the top 10 frequent terms in each cluster, for each algorithm and feature type. The output should be a JSON file in the format shown in the snapshot in the Execution Example section below. There should be two output files: one for word count and one for TF-IDF based clustering.
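Since the snapshot is not reproduced in this copy, the following shows only a plausible shape for the JSON. The field names are assumptions, not the official format; defer to the snapshot in the original handout:

import json

# Hypothetical output structure -- field names are assumptions.
result = {
    'clusters': [
        {'size': 0, 'terms': ['term1', 'term2', 'term10']},
        # ... one entry per cluster
    ]
}
with open('firstname_lastname_KMeans_small_W_5_20.json', 'w') as f:
    json.dump(result, f)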

 

Execution Example
 

Below we present examples of how you can run your program with spark-submit, both when your application is a Java/Scala program and when it is a Python script.

 

        A.   Example of running a Java/Scala application with spark-submit:

Notice that the --class argument of spark-submit specifies the main class of your application and is followed by the jar file of the application.

$SPARK_HOME/bin/spark-submit --class Task1 Firstname_Lastname_KMeans.jar <input_file> <feature> <N> <Iterations>

 

        B.   Example of running a Python application with spark-submit:

$SPARK_HOME/bin/spark-submit Firstname_Lastname_KMeans.py <input_file> <feature> <N> <Iterations>

Example output

[Snapshot of the example output omitted.]
Deliverables for Problem 1
1. Script or Jar File and Source Code

Please name your Python script as: <firstname>_<lastname>_KMeans.py. If you submit a jar file, name it: <firstname>_<lastname>_KMeans.jar.

The Python script or the .jar file of your implementation should be inside the Solution directory of your submission, along with your source code.

2. Output Files

We need two output files for Problem 1.

Run your program against Data/yelp_reviews_clustering_small.txt.

 

The format of the output should be exactly the same as the above snapshot in both cases.

The output files should be named:

<firstname>_<lastname>_KMeans_small_W_5_20.json

<firstname>_<lastname>_KMeans_small_T_5_20.json

 

Problem 2  Implementing Clustering Algorithms using Spark with the Yelp Reviews Small Dataset
 

For this problem you will generate 8 clusters with a random seed of 42 on the Yelp Reviews dataset (Data/yelp_reviews_clustering_small.txt). You will run the K-Means, bisecting K-Means, and GMM algorithms in Spark. The maximum number of iterations should be set to 20.
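A sketch of how these three algorithms might be invoked through MLlib's RDD API in Spark 2.3, assuming you build TF-IDF vectors with HashingTF/IDF. The feature pipeline, the whitespace tokenization, and the modest feature dimension are illustrative choices, not prescribed by the assignment:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF
from pyspark.mllib.clustering import KMeans, BisectingKMeans, GaussianMixture

sc = SparkContext(appName='Cluster')

# One review per line; tokenize on whitespace.
reviews = sc.textFile('Data/yelp_reviews_clustering_small.txt') \
            .map(lambda line: line.split())

# TF-IDF via MLlib's HashingTF/IDF. A small feature dimension is used
# here only to keep the GMM tractable in this sketch.
tf = HashingTF(numFeatures=1000).transform(reviews)
tf.cache()
features = IDF().fit(tf).transform(tf)

km = KMeans.train(features, k=8, maxIterations=20, seed=42)
bkm = BisectingKMeans.train(features, k=8, maxIterations=20, seed=42)
gmm = GaussianMixture.train(features, k=8, maxIterations=20, seed=42)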

Deliverables for Problem 2
1. Script or Jar File and Source Code

Please name your Python script as: <firstname>_<lastname>_Cluster.py. If you submit a jar file, name it: <firstname>_<lastname>_Cluster.jar.

The Python script or the .jar file of your implementation should be inside the Solution directory of your submission, along with the source code.

Input Arguments:

1. Input.txt: This is the path to the input reviews file. Every line contains a review. For Problem 2 you should use the Data/yelp_reviews_clustering_small.txt file.

2. Algorithm=K: K for K-Means, B for bisecting K-Means, G for GMM

3. N=8: Number of clusters

4. Iterations=20: Maximum number of iterations

 

2. Output Files

There should be three output files, corresponding to K-Means, bisecting K-Means, and GMM. The K-Means and bisecting K-Means files will contain the cluster sizes, the within-set sum of squared errors, and the top ten terms from each cluster. For GMM, the output will contain the weights, means, and standard deviations for each component. Example output is shown below.
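Continuing the earlier sketch (km, bkm, gmm, and features as defined there), these quantities might be pulled from the trained models roughly as follows. Note that HashingTF does not retain term identities, so mapping top centroid indices back to terms requires a vocabulary index of your own:

# Within-set sum of squared errors for the two K-Means variants.
wsse_km = km.computeCost(features)
wsse_bkm = bkm.computeCost(features)

# Cluster sizes: count how many reviews land in each cluster.
sizes = km.predict(features).countByValue()

# GMM parameters: one weight and one Gaussian per component. Note that
# sigma is the covariance matrix; the "standard deviation" in the output
# corresponds to (the square roots of) its diagonal.
for weight, gaussian in zip(gmm.weights, gmm.gaussians):
    print(weight, gaussian.mu, gaussian.sigma)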

 

Execution Example
 

Below we present examples of how you can run your program with spark-submit, both when your application is a Java/Scala program and when it is a Python script.

 

        A.   Example of running a Java/Scala application with spark-submit:

Notice that the --class argument of spark-submit specifies the main class of your application and is followed by the jar file of the application.

$SPARK_HOME/bin/spark-submit --class Task2 Firstname_Lastname_Cluster.jar <input_file> <algorithm> <N> <Iterations>

 

        B.   Example of running a Python application with spark-submit:

$SPARK_HOME/bin/spark-submit Firstname_Lastname_Cluster.py <input_file> <algorithm> <N> <Iterations>
 

 

Example K-Means Output

[Snapshot omitted.]

Example Bisecting K-Means Output

[Snapshot omitted.]

Example GMM Output

[Snapshot omitted.]

The output files should be named:

<firstname>_<lastname>_Cluster_small_K_8_20.json

<firstname>_<lastname>_Cluster_small_B_8_20.json

<firstname>_<lastname>_Cluster_small_G_8_20.json

 

 

Problem 3 Implementing Clustering using Spark with the Yelp Reviews Large Dataset
 

For this problem you will generate 16 clusters with a random seed of 42 on the Yelp Reviews dataset (Data/yelp_reviews_clustering_large.txt). You will run the K-Means, bisecting K-Means, and GMM algorithms in Spark.

 

Deliverables for Problem 3
1. Output Files

 

The format of the output should be exactly the same as for Problem 2.

The output files should be named:

<firstname>_<lastname>_Cluster_large_K_16_20.json

<firstname>_<lastname>_Cluster_large_B_16_20.json

<firstname>_<lastname>_Cluster_large_G_16_20.json

The above output files should be placed inside the OutputFiles directory of your submission.

Grade breakdown

a. Three correct output files (10 pts each) 

 

Plot the clusters using your favorite visualization tool. You are free to choose how you want to visualize the clusters. The visualization should be limited to one page, including brief text describing the clusters. Include this as Firstname_LastName_Visualization.pdf.
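One possible approach, sketched below under the assumption that scikit-learn and matplotlib are available (any tool is acceptable, and plot_clusters is a hypothetical helper): project the TF-IDF vectors to two dimensions and color each point by its cluster assignment.

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

def plot_clusters(X, labels, out_file='clusters.png'):
    # X: TF-IDF matrix (e.g., scipy sparse, one row per review);
    # labels: cluster id assigned to each review.
    # Reduce the high-dimensional TF-IDF space to 2D for plotting.
    coords = TruncatedSVD(n_components=2, random_state=20181031).fit_transform(X)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10, cmap='tab10')
    plt.title('Review clusters (2D SVD projection)')
    plt.savefig(out_file)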

 

 

 

 

APPENDIX A

 

Please include the following information in your description document.

● Succinctly describe your approach to implementing the algorithm.
● Commands to execute your program

 
