Machine Learning Assignment 3: Clustering and Dimension Reduction

In this assignment you will get familiar with some common techniques used in clustering data and analyzing the resulting clusters.

A. Dataset Preparation:                                                                   [5 points]  
Download the Religious Texts Dataset from here (also to be uploaded on Moodle). Use the Labelled dataset for this assignment. The dataset contains the Document Term Matrix (DTM) from 8 different religious texts.

 

1.    The first column contains the names of the religious texts and their corresponding chapters. Replace the labels with only the names of the religious texts (remove the chapter numbers; e.g., "Buddhism_Ch1" should become "Buddhism"). These are the class labels, and there are 8 of them. The rest of the columns contain the term frequencies (the frequency of the corresponding term in each document). The 14th row in the data (index 13 when you start at 0), "Buddhism_Ch14", is all zeros because the original text is empty. Remove this row from the data and adjust the indices accordingly.
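A minimal sketch of this preprocessing with pandas, using a toy stand-in for the labelled CSV (the column names here are hypothetical; the real file has one row per chapter):

```python
import pandas as pd

# Toy stand-in for the labelled DTM (the real data comes from the Moodle CSV).
df = pd.DataFrame({
    "label": ["Buddhism_Ch1", "Buddhism_Ch2", "TaoTeChing_Ch1"],
    "god": [0, 2, 1],
    "mind": [3, 0, 4],
})

# Strip the chapter suffix: "Buddhism_Ch1" -> "Buddhism".
df["label"] = df["label"].str.split("_").str[0]

# On the real data, also drop the all-zero row 13 ("Buddhism_Ch14")
# and renumber the remaining rows:
# df = df.drop(index=13).reset_index(drop=True)
```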

 

2.    Now we want to convert the DTM to a TF-IDF matrix. Note: Do NOT use the text provided in the corpus for computation of TF-IDF.

 

To calculate the TF-IDF value of a term t in document d, use

tf-idf(d, t) = tf(t, d) x idf(t),

where tf(t, d) represents the term frequency of the term in the document (as given in the DTM), and idf(t) represents the inverse document frequency of the term.

 

Use idf(t) = log[(1 + n) / (1 + df(t))], where n is the total number of documents, and df(t) represents the document frequency (the number of documents in which the term is present).

 

Finally, we consider each document as a vector of TF-IDF scores for the different terms. Normalize each vector by dividing it by its length (L2 norm).

 

You can use scikit-learn (or any equivalent library in other programming languages) to obtain this TF-IDF matrix.
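Note that scikit-learn's TfidfTransformer adds 1 to the idf by default (its smooth_idf option), which differs slightly from the formula given above, so computing the formula directly with NumPy may be safer. A sketch on a toy DTM:

```python
import numpy as np

# Toy DTM: rows = documents, columns = terms.
dtm = np.array([[3.0, 0.0, 1.0],
                [2.0, 2.0, 0.0],
                [0.0, 1.0, 4.0]])

n = dtm.shape[0]                      # total number of documents
df_t = (dtm > 0).sum(axis=0)          # df(t): documents containing each term
idf = np.log((1 + n) / (1 + df_t))    # idf(t) = log[(1 + n) / (1 + df(t))]
tfidf = dtm * idf                     # tf-idf(d, t) = tf(t, d) * idf(t)

# Normalize each document vector to unit length (L2 norm).
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
```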

 
We define the similarity between two vectors by their cosine similarity, which is equivalent to the dot product of the two unit-normalized vectors. For this assignment, consider the distance between two vectors to be the negative exponential of their cosine similarity (NOT the Euclidean distance). Thus, if the cosine similarity of two vectors is z, the distance is e^(-z).
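This distance takes only a couple of lines; a sketch, assuming the vectors are already L2-normalized (so the dot product is the cosine similarity):

```python
import numpy as np

def distance(u, v):
    """Distance between two L2-normalized vectors: e^(-cosine similarity)."""
    z = float(np.dot(u, v))  # dot product of unit vectors = cosine similarity
    return np.exp(-z)

# Identical unit vectors: z = 1, distance = e^(-1)
d_same = distance(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
# Orthogonal unit vectors: z = 0, distance = e^0 = 1
d_orth = distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```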

 

 

B. Agglomerative Clustering:                                                          [35 points]
Implement a hierarchical agglomerative clustering algorithm to obtain 8 clusters of documents. Use the single linkage strategy to join clusters. Note: Do NOT use any ML library for this part.
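One way to organize this (a naive O(n^3) sketch over a precomputed distance matrix, not the only valid implementation) is to repeatedly merge the pair of clusters whose closest members are nearest:

```python
import numpy as np

def single_linkage(D, k):
    """Naive single-linkage agglomerative clustering.
    D: precomputed n x n distance matrix; k: target number of clusters.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters

# Toy demo: points 0,1 are close to each other, as are 2,3.
D = np.array([[0.0, 0.1, 1.0, 1.0],
              [0.1, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.1],
              [1.0, 1.0, 0.1, 0.0]])
clusters = single_linkage(D, 2)
```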

 

Print the clusters into a file named 'agglomerative.txt' in the following format: each line will represent a different cluster, and will contain a sorted, comma-separated list of the indices of the data points in that cluster. Sort the clusters by the minimum index of the data points present in each cluster.

E.g., if you obtain the clusters [1,3,5], [2], [4,0], then print out:

0,4

1,3,5

2



Here the numbers represent the indices of the corresponding documents in the dataset (excluding the header row).
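Producing this file format can be sketched as follows (the helper name write_clusters is just illustrative):

```python
def write_clusters(clusters, path):
    """Write one cluster per line as a sorted comma-separated index list,
    with lines ordered by each cluster's minimum index."""
    rows = sorted((sorted(c) for c in clusters), key=lambda c: c[0])
    with open(path, "w") as f:
        for c in rows:
            f.write(",".join(map(str, c)) + "\n")

# The example above: clusters [1,3,5], [2], [4,0].
write_clusters([[1, 3, 5], [2], [4, 0]], "agglomerative.txt")
```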

 

C. KMeans Clustering:                                                                    [35 points]
Implement the standard KMeans clustering algorithm, using the given notion of distance (as defined in Part A), to obtain K=8 clusters of documents. Initialize the cluster centers randomly.

Note: Do NOT use any ML library for this part.
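A sketch of the loop, assuming NumPy (not an ML library) is permitted and the rows of X are the unit-normalized TF-IDF vectors from Part A; the update step renormalizes each mean so the dot product remains a cosine similarity:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """KMeans with distance(u, v) = e^(-u.v) on unit-normalized rows of X."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(iters):
        # Assignment step: nearest center under the e^(-cosine) distance.
        new_labels = np.exp(-(X @ centers.T)).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
        # Update step: mean of members, renormalized to unit length.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return labels

# Toy demo: two obvious groups on the unit circle.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
labels = kmeans(X, k=2)
```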

Print the clusters into a file named 'kmeans.txt' in the same format as in Part B.

 

D. Attribute Reduction by Principal Component Analysis:            [5 points]
Reduce the number of attributes of the dataset to 100 by using PCA. You can use the implementation of PCA from scikit-learn (or from any equivalent library in other programming languages).

Use this reduced dataset and again obtain 8 clusters using your implementations of Agglomerative Clustering and KMeans Clustering. Print the clusters into files 'agglomerative_reduced.txt' and 'kmeans_reduced.txt' respectively, in the same format as specified in Part B.
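With scikit-learn the reduction itself is two lines; a sketch using random stand-in data in place of the real TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real (n_docs x n_terms) TF-IDF matrix from Part A.
tfidf = np.random.default_rng(0).random((120, 300))

pca = PCA(n_components=100)
reduced = pca.fit_transform(tfidf)
# reduced now has 100 attributes per document; note that its rows are no
# longer unit-normalized, so you may want to renormalize them before reusing
# the cosine-based distance from Part A.
```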

 
