The main objective of this laboratory is to put into practice what you have learned about clustering techniques and itemset mining. You will mainly work on textual data, a domain where the data preparation phase is crucial to any subsequent task. Specifically, you will try to detect topics in a set of real-world news data. Then, you will describe each cluster through frequent itemset mining.
Important note. For this laboratory, you are encouraged to upload your results to our online verification platform, even though the submission will not count toward your final exam mark. This way, you can practice with the same system that you will use for the final exam. Refer to Section 3 to read more about it.
1 Preliminary steps
1.1 Useful libraries
As you may have already understood, the Python language comes with many handy functions and third-party libraries that you need to master to avoid boilerplate code. In many cases, you should leverage them to focus on the analysis process rather than its implementation.
That said, we listed a series of libraries you can make use of in this laboratory:
• NumPy
• scikit-learn
• Natural Language Toolkit
• SciPy
We will point out their functions and classes when needed. In many cases, fully understanding them can significantly decrease your programming effort: take your time to explore their respective documentation.
Warning: we have noticed from previous laboratories that, in some cases, copying snippets of code directly from the PDF file led to wrong behaviors in Jupyter notebooks. Please consider typing them yourself.
1.2 wordcloud
Make sure you have this library installed. If it is not available, install it with pip install wordcloud (or any other package manager you may be using). The wordcloud library is a word cloud generator. You can read more about it on its official website.
1.3 Datasets
For this laboratory, a single real-world dataset will be used.
1.3.1 20 Newsgroups
The 20 Newsgroups dataset was originally collected in Lang 1995. It includes approximately 20,000 documents, partitioned across 20 different newsgroups, each corresponding to a different topic.
For the sake of this laboratory, we chose T ≤ 20 topics and sampled uniformly only documents belonging to them. As a consequence, you have K ≤ 20,000 documents uniformly distributed across T different topics. You can download the dataset at: https://github.com/dbdmg/data-science-lab/blob/master/datasets/T-newsgroups.zip?raw=true Each document is located in a different file, which contains the raw text of the news. The name of each file is an integer number and corresponds to its ID.
2 Exercises
Note that exercises marked with a (*) are optional; you should focus on completing the other ones first.
2.1 Newsgroups clustering
In this exercise you will build your first complete data analytics pipeline. More specifically, you will load, analyze and prepare the newsgroups dataset to finally identify possible clusters based on topics. Then, you will evaluate your process with a clustering quality measure of your choice.
1. Load the dataset from the root folder. Here, Python's os module comes to your aid. You can use the os.listdir function to list the files in a directory.
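For instance, a minimal loading sketch could look like the following; the folder name "T-newsgroups" is a placeholder for wherever you extracted the archive:

import os

def load_corpus(root):
    # Read every news file in `root`, keeping the file name as the document ID.
    ids, corpus = [], []
    for fname in os.listdir(root):
        with open(os.path.join(root, fname), encoding="utf8") as f:
            ids.append(fname)
            corpus.append(f.read())
    return ids, corpus

ids, corpus = load_corpus("T-newsgroups")  # placeholder path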
2. Focus now on the data preparation step. As you have learned in laboratory 2, textual data needs to be processed to obtain a numerical representation of each document. This is typically achieved via the application of a weighting scheme.
Now choose one of the weighting schemes you know and transform each news item into a numerical representation. The Python implementation of a simple TFIDF weighting scheme is provided in Section 2.1.1; you can use it as a starting point.
This preprocessing phase is likely going to influence the quality of your results the most. Pay enough attention to it. You could try to answer the following questions:
• Which weighting scheme have you used?
• Have you tried to remove stopwords?
• More generally, have you ignored words with a document frequency lower than or higher than a given threshold?
• Have you applied any dimensionality reduction strategy? This is not mandatory, but in some cases it can improve your results. You can find more details in Appendix 3.3.
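As an example for the last question, a common (though not the only) dimensionality reduction strategy for sparse TFIDF matrices is a truncated SVD. A minimal sketch, assuming tfidf_X is the matrix built as in Section 2.1.1 and taking 100 components as an arbitrary choice:

from sklearn.decomposition import TruncatedSVD

# Project the sparse TFIDF matrix onto a lower-dimensional dense space.
svd = TruncatedSVD(n_components=100, random_state=42)  # n_components is arbitrary: tune it
tfidf_X_reduced = svd.fit_transform(tfidf_X)
print(svd.explained_variance_ratio_.sum())  # fraction of variance retained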
3. Once you have your vector representation, choose one of the clustering algorithms you know and apply it to your data.
4. You can now evaluate the quality of the cluster partitioning you obtained. There exist many metrics based on distances between points (e.g. the Silhouette or the Sum of Squared Errors (SSE)) that you can explore. Choose one of those that you know and test your results on your computer.
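By way of example only, K-Means is one of the algorithms you could pick for step 3 and the Silhouette one of the metrics for step 4. A minimal sketch, where the number of clusters is a placeholder you should tune:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=20, random_state=42)  # 20 is a placeholder for the number of topics
labels = kmeans.fit_predict(tfidf_X)

# Silhouette ranges in [-1, 1]: the higher, the better the geometric separation.
print(silhouette_score(tfidf_X, labels))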
5. Consider now that our online system will evaluate your cluster quality based on the real cluster labels (a.k.a. the ground truth, which you do not have). Consequently, it could happen that a cluster subdivision achieves a high Silhouette value (i.e. geometrically close points were assigned to the same cluster) while the matching with the real labels gives a poor score (i.e. real labels are heterogeneous within your clusters).
In order to understand how close you came to the real news subdivision, upload your results to our online verification system (you can perform as many submissions as you want for this laboratory, the only limitation being a 5-minute wait between submissions). Head to Section 3 to learn more about it.
2.1.1 A basic TFIDF implementation
The transformation from texts to vectors can be simplified by means of ad-hoc libraries like Natural Language Toolkit and scikit-learn (from now on, nltk and sklearn). If you plan to use the TFIDF weighting scheme, you might want to use sklearn's TfidfVectorizer class. You can then use its fit_transform method to obtain the TFIDF representation of each document. Specifically, the method returns a SciPy sparse matrix. You are encouraged to thoroughly analyze TfidfVectorizer's constructor parameters since they can significantly impact the results. Note for now that you can specify a custom tokenizer object and a set of stop words to be used.
For the sake of simplicity, we are providing you with a simple tokenizer class. Note that TfidfVectorizer's tokenizer argument requires a callable object. Python's callable objects are instances of classes that implement the __call__ method. The class makes use of two nltk functionalities: word_tokenize and the class WordNetLemmatizer. The latter is used to lemmatize your words after the tokenization. The lemmatization process leverages a morphological analysis of the words in the corpus with the aim of removing the grammatical inflections that characterize a word in different contexts, returning its base or dictionary form (e.g. {am, are, is} ⇒ be; {car, cars, car's, cars'} ⇒ car).
For what concerns the stop words, you can again use an already-available nltk resource: stopwords. The following is a snippet of code including everything you need to get to a basic TFIDF representation:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords as sw

class LemmaTokenizer(object):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, document):
        lemmas = []
        for t in word_tokenize(document):
            t = t.strip()
            lemma = self.lemmatizer.lemmatize(t)  # reduce the token to its base form
            lemmas.append(lemma)
        return lemmas

lemmaTokenizer = LemmaTokenizer()
vectorizer = TfidfVectorizer(tokenizer=lemmaTokenizer, stop_words=sw.words('english'))
tfidf_X = vectorizer.fit_transform(corpus)  # corpus: list of raw document strings
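As a quick sanity check (an optional addition, not part of the snippet above), you can inspect the shape of the resulting matrix and a few vocabulary entries; get_feature_names_out is available in recent scikit-learn versions, while older releases expose get_feature_names instead:

print(tfidf_X.shape)                            # (number of documents, number of distinct terms)
print(vectorizer.get_feature_names_out()[:10])  # a small sample of the learned vocabulary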
2.2 Cluster characterization by means of word clouds and itemset mining
In many real cases, the real clustering subdivision is not accessible at all[1]. Indeed, it is what you want to discover by clustering your data. For this reason, it is commonplace to add a further step to the pipeline and try to characterize the clusters by inspecting their points' characteristics. This is especially true when working with news, where a cluster's description can lead to the identification of a topic shared among all the documents assigned to it (e.g. one of your clusters may contain news related to sports).
In this exercise you will exploit word clouds and frequent itemset algorithms to characterize the clusters obtained in the previous exercise.
1. Split your initial data into separate chunks according to the cluster labels obtained in the previous exercise. For each of them, generate a Word Cloud image using the wordcloud library. Take a look at the library documentation to learn how to do it. Can you figure out the topic shared among all the news of each cluster?
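A minimal sketch for a single cluster, assuming cluster_text is a hypothetical string obtained by concatenating all the news assigned to that cluster:

from wordcloud import WordCloud

# Build one image out of the concatenated text of a cluster.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(cluster_text)     # cluster_text: hypothetical concatenation of the cluster's news
wc.to_file("cluster_0.png")   # hypothetical output file name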
2. (*) Provide a comment for each word cloud and upload the images and the comments. Head to section 3 to know how.
3. (*) A further analysis can exploit frequent itemset algorithms. Choose one algorithm and run it for each cluster of news. Try to identify the most distinctive sets of words by playing around with different configurations of the chosen algorithm. Based on the results, can you identify a topic for any of your clusters?
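One possible way to run such an algorithm is sketched below. It relies on the mlxtend library, which is not required by this laboratory and would need to be installed separately (e.g. pip install mlxtend); transactions is a hypothetical list where each element is the set of distinct words of one news item in the cluster:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# One-hot encode the word sets: one row per document, one boolean column per word.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support is a placeholder: play with it until the itemsets become distinctive.
frequent_itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
print(frequent_itemsets.sort_values("support", ascending=False).head(10))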
[1] Or worse, it might not exist at all.