BIA660 - Assignment 6 - Clustering and Topic Modeling Solved
In this assignment, you'll need to use the following data files:
text_train.json: a list of documents, used for training the models.
text_test.json: a list of documents and their ground-truth labels, used for testing performance. This file is in the format shown below; note that each document has a list of labels.
You can load both files using json.load().
Text                                          Labels
paraglider collides with hot air balloon ...  ['Disaster and Accident', 'Travel & Transportation']
faa issues fire warning for lithium ...       ['Travel & Transportation']
...                                           ...

Q1: K-Means Clustering

Define a function cluster_kmean() as follows:
Take two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json
Use K-Means to cluster the documents in train_file into 3 clusters by cosine similarity.
Test the clustering model performance using test_file:
Predict the cluster ID for each document in test_file.
Use only the first label in the ground-truth label list of each test document; e.g., for the first document in the table above, set the ground-truth label to "Disaster and Accident" only.
Apply the majority vote rule to dynamically map the predicted cluster IDs to the ground-truth labels in test_file. Be sure not to hardcode the mapping (e.g. by writing code like {0: "Disaster and Accident"}), because a cluster may correspond to a different topic in each run.
Calculate precision/recall/f-score for each label
This function has no return value. Print out the confusion matrix and the precision/recall/f-score for each label.
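A minimal sketch of cluster_kmean() along these lines is shown below. It assumes test_file is a list of (text, label-list) pairs as in the table above; the TF-IDF settings (English stop words, min_df=5) and repeats=25 are illustrative tuning choices rather than required values.

import json
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from nltk.cluster import KMeansClusterer, cosine_distance

def cluster_kmean(train_file, test_file):
    # assumption: train_file holds a list of text strings,
    # test_file a list of [text, label_list] pairs
    with open(train_file) as f:
        train_docs = json.load(f)
    with open(test_file) as f:
        test_data = json.load(f)
    test_docs = [item[0] for item in test_data]
    first_labels = [item[1][0] for item in test_data]   # first ground-truth label only

    # TF-IDF vectors fitted on the training documents (min_df is a tuning choice)
    tfidf = TfidfVectorizer(stop_words="english", min_df=5)
    dtm_train = tfidf.fit_transform(train_docs).toarray()
    dtm_test = tfidf.transform(test_docs).toarray()

    # K-Means with cosine distance (nltk implementation)
    clusterer = KMeansClusterer(3, cosine_distance, repeats=25)
    clusterer.cluster(dtm_train, assign_clusters=True)
    predicted = [clusterer.classify(v) for v in dtm_test]

    # majority vote: map each cluster to its most frequent ground-truth label
    confusion = pd.crosstab(index=pd.Series(predicted, name="cluster"),
                            columns=pd.Series(first_labels, name="actual_class"))
    print(confusion)
    cluster_to_label = confusion.idxmax(axis=1)
    predicted_labels = [cluster_to_label[c] for c in predicted]

    print(metrics.classification_report(first_labels, predicted_labels))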
Q2: LDA Clustering

Define a function cluster_lda() as follows:
Take two file name strings as inputs: train_file is the file path of text_train.json, and test_file is the file path of text_test.json
Use LDA to train a topic model on the documents in train_file with the number of topics K = 3.
Predict the topic distribution of each document in test_file, and select the topic with the highest probability as the predicted topic.
Evaluate the topic model performance as follows:
As in Q1, use the first label in the label list of each test document as the ground-truth label.
Apply the majority vote rule to map the topics to the labels.
Calculate and print out the precision/recall/f-score for each label.
Return the topic distribution and the original ground-truth labels of each document in test_file.
Also, provide a document which contains:
a performance comparison between Q1 and Q2
a description of how you tune the model parameters, e.g. min_df, alpha, max_iter, etc.
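A matching sketch of cluster_lda() is shown below, under the same file-format assumption as in Q1. n_components=3 and max_iter=25 follow the spec and the sample training log further below; min_df, verbose, and random_state are illustrative. The sketch returns the first ground-truth label per document; returning the full label lists is an equally valid reading of the spec.

import json
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import metrics

def cluster_lda(train_file, test_file):
    with open(train_file) as f:
        train_docs = json.load(f)
    with open(test_file) as f:
        test_data = json.load(f)
    test_docs = [item[0] for item in test_data]
    first_labels = [item[1][0] for item in test_data]   # first ground-truth label only

    # LDA works on raw term counts, so use CountVectorizer rather than TF-IDF
    cv = CountVectorizer(stop_words="english", min_df=5)
    dtm_train = cv.fit_transform(train_docs)
    dtm_test = cv.transform(test_docs)

    lda = LatentDirichletAllocation(n_components=3, max_iter=25,
                                    evaluate_every=1, verbose=1, random_state=0)
    lda.fit(dtm_train)

    # topic distribution per test document; the highest-probability topic is the prediction
    topic_dist = lda.transform(dtm_test)
    predicted = topic_dist.argmax(axis=1)

    # majority vote mapping from topic id to label, as in Q1
    confusion = pd.crosstab(index=pd.Series(predicted, name="topic"),
                            columns=pd.Series(first_labels, name="actual_class"))
    print(confusion)
    topic_to_label = confusion.idxmax(axis=1)
    predicted_labels = [topic_to_label[t] for t in predicted]
    print(metrics.classification_report(first_labels, predicted_labels))

    return topic_dist, first_labels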
Q3 (Bonus): Overlapping Clustering

In Q2, you predict one label for each document in test_file. In this question, try to discover multiple labels if appropriate. Define a function overlapping_cluster() as follows:
Take the outputs of Q2 (i.e. topic distribution and the labels of each document in test_file) as inputs
Set a threshold for each topic (i.e. TH = [th0, th1, th2]). A document is predicted to belong to topic i only if its probability for topic i is at least thi, for i ∈ {0, 1, 2}.
The thresholds are determined as follows:
Vary the threshold for each topic from 0.05 to 0.95 in steps of 0.05, and in each round evaluate the topic model performance:
Apply the majority vote rule to map the predicted topics to the ground-truth labels in test_file
Calculate f1-score for each label
For each label, pick the threshold value which maximizes the f1-score
Return the threshold and f1-score of each label
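A sketch of overlapping_cluster() under the same assumptions follows. It interprets the threshold test as topic probability >= th_i and reuses the majority-vote mapping from Q2 to pair each topic with a label before tuning that topic's threshold.

import numpy as np
import pandas as pd
from sklearn import metrics

def overlapping_cluster(topic_dist, labels):
    # topic_dist: (n_docs, 3) array returned by cluster_lda()
    # labels: the first ground-truth label of each test document (assumption)
    predicted = topic_dist.argmax(axis=1)

    # majority vote mapping from topic id to label, as in Q2
    confusion = pd.crosstab(index=predicted, columns=np.array(labels))
    topic_to_label = confusion.idxmax(axis=1)

    best_th, best_f1 = {}, {}
    for topic in range(topic_dist.shape[1]):
        label = topic_to_label[topic]
        truth = np.array([l == label for l in labels])
        # vary the threshold from 0.05 to 0.95 in steps of 0.05
        for th in np.arange(0.05, 1.0, 0.05):
            pred = topic_dist[:, topic] >= th   # assumption: membership means probability >= th
            f1 = metrics.f1_score(truth, pred)
            if f1 > best_f1.get(label, 0.0):
                best_f1[label], best_th[label] = f1, round(float(th), 2)
    return best_th, best_f1

It can be driven directly from Q2's return values, e.g. overlapping_cluster(*cluster_lda(train_file, test_file)).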
from sklearn.feature_extraction.text import CountVectorizer
from nltk.cluster import KMeansClusterer, cosine_distance
from sklearn.decomposition import LatentDirichletAllocation
# add more
actual_class  Disaster and Accident  News and Economy  Travel & Transportation
cluster
0                                70                 0                      135
1                               130                 7                        8
2                                10               199                       41
Cluster 0: Topic Travel & Transportation
Cluster 1: Topic Disaster and Accident
Cluster 2: Topic News and Economy

              precision    recall  f1-score   support
iteration: 1 of max_iter: 25
iteration: 2 of max_iter: 25
...
iteration: 25 of max_iter: 25
actual_class  Disaster and Accident  News and Economy  Travel & Transportation