CSC4760 - Big Data Programming - Assignment 6 - Solved

Implement the k-means algorithm in Spark. Please use the following dataset in your experiments.

Input Dataset :     kmeans_input.txt   (Uploaded into iCollege)

In Windows, please use “WordPad” to open it.

The dataset is in “libsvm” format. Please use the following statement to read it in Spark.

dataset = spark.read.format("libsvm").load("/home/rob/data/kmeans_input.txt")

Each row of the dataset represents one data point. There are 200 rows in total, so there are 200 data points. Each data point has two features, which represent its x and y coordinates.
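
As a rough illustration of the libsvm layout, each row is a label followed by `index:value` pairs, with feature indices starting at 1. The sketch below parses such rows in plain Python. The helper name `parse_libsvm_line` and the sample rows are made up for illustration and are not taken from kmeans_input.txt.

```python
def parse_libsvm_line(line):
    """Split one libsvm row into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# Hypothetical rows for illustration; not copied from kmeans_input.txt.
sample_rows = [
    "0 1:1.5 2:2.0",
    "0 1:8.25 2:9.0",
]

for row in sample_rows:
    label, features = parse_libsvm_line(row)
    # Feature 1 is read as the x coordinate and feature 2 as the y coordinate.
    print(label, features[1], features[2])
```

In the actual program, Spark's libsvm reader does this parsing and exposes the coordinates as a vector in a "features" column.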

The following figure visualizes the dataset.

  

We can see that there are two clusters. The following figure shows the two clusters.

  

Problem: You are required to use the k-means algorithm to compute the two clusters. The input is the raw data in “kmeans_input.txt”, and the output is the data points with their cluster labels.
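
To make the expected input and output concrete, here is a minimal plain-Python sketch of the usual k-means iteration (Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its points, and repeat until the labels stop changing. It is only meant to show what the algorithm computes and is not a substitute for the Spark library that the assignment requires. The function names, the sample points, and the choice of initial centers are assumptions made for this example.

```python
import math


def assign_labels(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]


def update_centers(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            new_centers.append(old)  # keep an empty cluster's center in place
            continue
        new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centers


def kmeans(points, initial_centers, max_iter=20):
    """Alternate assignment and update steps until the labels stop changing."""
    centers = list(initial_centers)
    labels = assign_labels(points, centers)
    for _ in range(max_iter):
        centers = update_centers(points, labels, centers)
        new_labels = assign_labels(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Made-up points forming two loose groups; not taken from kmeans_input.txt.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)
print(centers)
```

The returned `labels` list plays the role of the cluster labels requested in the problem statement: one integer per data point, in input order.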

Implementation: 

Design and implement a PySpark program to solve the problem. We did not provide a template Python file this time, so you will need to create one from scratch.

You are required to use the k-means function from the Spark Machine Learning library in your implementation. Please refer to the following webpage for more details. Spark MLlib Clustering - K-means: https://spark.apache.org/docs/latest/ml-clustering.html

Refer to the Python API docs for more details.

https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.KMean 
