$25
Large-Scale Data Mining: Models and Algorithms
Introduction
Statistical classification refers to the task of identifying a category, from a predefined set, to which a data point belongs, given a training data set with known category memberships. Classification differs from the task of clustering, which concerns grouping data points with no predefined category memberships, where the objective is to seek inherent structures in data with respect to suitable measures. Classification turns out as an essential element of data analysis, especially when dealing with a large amount of data. In this project, we look into different methods for classifying textual data.
In this project, the goal includes:
1. To learn how to construct tf-idf representations of textual data.
2. To get familiar with various common classification methods.
3. To learn ways to evaluate and diagnose classification results.
4. To learn two dimensionality reduction methods: PCA & NMF.
5. To get familiar with the complete pipeline of a textual data classification task.
6. To get familiar with pre-trained word embeddings.
Getting familiar with the dataset
We work with “20 Newsgroups” dataset, which is a collection of approximately 20,000 documents, partitioned (nearly) evenly across 20 different newsgroups (newsgroups are discussion groups like forums, which originated during the early age of the Internet), each corresponding to a different topic.
One can use fetch 20newsgroups provided by scikit-learn to load the dataset. Detailed usages can be found at https://scikit-learn.org/stable/datasets/#the-20-newsgroups-text-dataset
In a classification problem one should make sure to properly handle any imbalance in the relative sizes of the data sets corresponding to different classes. To do so, one can either modify the penalty function (i.e. assign more weight to errors from minority classes), or alternatively, down-sample the majority classes, to have the same number of instances as minority classes.
QUESTION 1: To get started, plot a histogram of the number of training documents for each of the 20 categories to check if they are evenly distributed.
Note that the data set is already balanced (especially for the categories we’ll mainly work on) and so in this case we do not need to balance. But in general, as a data scientist you need to be aware of this issue.
Binary Classification
Before the following parts, to ensure consistency, please set the random seed as follows:
import numpy as np np.random.seed(42) import random random.seed(42)
To get started, we work with a well separable portion of data, and see if we can train a classifier that distinguishes two classes well. Concretely, let us take all the documents in the following classes:
Table 1: Two well-separated classes
Computer Technology
comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware
Recreational Activity
rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey
Specifically, use the settings as the following code to load the data:
categories = ['comp.graphics', 'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
'rec.autos', 'rec.motorcycles',
'rec.sport.baseball', 'rec.sport.hockey']
train_dataset = fetch_20newsgroups(subset = 'train', categories = categories,
,→ shuffle = True, random_state = None) test_dataset = fetch_20newsgroups(subset = 'test', categories = categories,
,→ shuffle = True, random_state = None)
1 Feature Extraction
The primary step in classifying a corpus of text is choosing a proper document representation. A good representation should retain enough information that enable us to perform the classification, yet in the meantime, be concise to avoid computational intractability and over fitting.
One common representation of documents is called “Bag of Words”, where a document is represented as a histogram of term frequencies, or other statistics of the terms, within a fixed vocabulary. As such, a corpus of text can be summarized into a term-document matrix whose entries are some statistic of the terms.
First a common sense filtering is done to drop certain words or terms: to avoid unnecessarily large feature vectors (vocabulary size), terms that are too frequent in almost every document, or are very rare, are dropped out of the vocabulary. The same goes with special characters, common stop words (e.g. “and”, “the” etc.), In addition, appearances of words that share the same stem in the vocabulary (e.g. “goes” vs “going”) are merged into a single term.
Further, one can consider using the normalized count of the vocabulary words in each document to build representation vectors. A popular numerical statistic to capture the importance of a word to a document in a corpus is the “Term Frequency-Inverse Document Frequency (TF-IDF)” metric. This measure takes into account count of the words in the document, as normalized by a certain function of the frequency of the individual words in the whole corpus. For example, if a corpus is about computer accessories then words such as “computer” “software” “purchase” will be present in almost every document and their frequency is not a distinguishing feature for any document in the corpus. The discriminating words will most likely be those that are specialized terms describing different types of accessories and hence will occur in fewer documents. Thus, a human reading a particular document will usually ignore the contextually dominant words such as “computer”, “software” etc. and give more importance to specific words. This is like when going into a very bright room or looking at a bright object, the human perception system usually applies a saturating function (such as a logarithm or square-root) to the actual input values before passing it on to the neurons. This makes sure that a contextually dominant signal does not overwhelm the decision-making processes in the brain. The TF-IDF functions draw their inspiration from such neuronal systems. Here we define the TF-IDF score to be
tf-idf(d,t) = tf(t,d) × idf(t)
where tf(d,t) represents the frequency of term t in document d, and inverse document frequency is defined as:
idf(
where n is the total number of documents, and df(t) is the document frequency, i.e. the number of documents that contain the term t.
QUESTION 2: Use the following specs to extract features from the textual data:
• Use the “english” stopwords of the CountVectorizer
• Exclude terms that are numbers (e.g. “123”, “-45”, “6.7” etc.)
• Perform lemmatization with nltk.wordnet.WordNetLemmatizer and pos tag
• Use min df=3
Report the shape of the TF-IDF matrices of the train and test subsets respectively.
Please refer to the official documentation of CountVectorizer as well as the discussion section notebooks for details.
2 Dimensionality Reduction
After above operations, the dimensionality of the representation vectors (TF-IDF vectors) ranges in the order of thousands. However, learning algorithms may perform poorly in highdimensional data, which is sometimes referred to as “The Curse of Dimensionality”. Since the document-term TF-IDF matrix is sparse and low-rank, as a remedy, one can select a subset of the original features, which are more relevant with respect to certain performance measure, or transform the features into a lower dimensional space.
In this project, we use two dimensionality reduction methods: Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF), both of which minimize mean squared residual between the original data and a reconstruction from its low-dimensional approximation. Recall that our data is the term-document TF-IDF matrix, whose rows correspond to TF-IDF representation of the documents, i.e.
tfidf(d1,t1) ··· tfidf(d1,tm) tfidf(d2,t1) ··· tfidf(d2,tm)
X ... ...
tfidf(dn,t1) ··· tfidf
(which is the case for the output of CountVectorizer and TfidfTransformer).
LSI
The LSI representation is obtained by computing left and right singular vectors corresponding to the top k largest singular values of the term-document TF-IDF matrix X.
We perform SVD to the matrix X, resulting in X = UΣVT, U and V orthogonal. Let the singular values in Σ be sorted in descending order, then the first k columns of U and V are called Uk and Vk respectively. Vk consists of the principle components in the feature space.
Then we use (XVk) (which is also equal to (UkΣk)) as the dimension-reduced data matrix, where rows still correspond to documents, only that they can have (far) lower dimension. In this way, the number of features is reduced. LSI is similar to Principal Component Analysis (PCA), and you can see the lecture notes for their relationships.
Having learnt U and V, to reduce the test data, we just multiply the test TF-IDF matrix Xt by Vk, i.e. Xt,reduced = XtVk. By doing so, we actually project the test TF-IDF vectors to the principle components, and use the projections as the dimension-reduced data.
NMF
NMF tries to approximate the data matrix X ∈ Rn×m (i.e. we have n docs and m terms) with WH (W ∈ Rn×r, H ∈ Rr×m). Concretely, it finds the non-negative matrices W and H s.t.
s
kX − WH is minimized. (kAkF ≡ XA2ij ) Then we use W as the dim-reduced data
i,j
matrix, and in the fit step, we calculate both W and H. The intuition behind this is that we are trying to describe the documents (the rows in X) as a (non-negative) linear combination of r topics:
X WH
Here we see h “topics”, each of which consists of m scores, indicating how important each term is in the topic. Then x .
Now how do we calculate the dim-reduced test data matrix? Again, we try to describe the document vectors (rows by our convention here) in the test data (call it Xt) with (non-negative) linear combinations of the “topics” we learned in the fit step. The “topics”, again, are the rows of H matrix, . How do we do that? Just solve the optimization problem
2
where H is fixed as the H matrix we learned in the fit step. Then Wt is used as the dim-reduced version of Xt.
QUESTION 3: Reduce the dimensionality of the data using the methods above
• Apply LSI to the TF-IDF matrix corresponding to the 8 categories with k = 50; so each document is mapped to a 50-dimensional vector.
• Also reduce dimensionality through NMF (k = 50) and compare with LSI:
Which one is larger, the kX − WH in NMF or the in LSI?
Why is the case?
3 Classification Algorithms
In this part, you are asked to use the dimension-reduced training data from LSI to train (different types of) classifiers, and evaluate the trained classifiers with test data. Your task would be to classify the documents into two classes “Computer Technology” vs “Recreational Activity”. Refer to Table 1 to find the 4 categories of documents comprising each of the two classes. In other words, you need to combine documents of those sub-classes of each class to form the set of documents for each class.
Classification measures
Classification quality can be evaluated using different measures such as precision, recall, F-score, etc. Refer to the discussion material to find their definition.
Depending on application, the true positive rate (TPR) and the false positive rate (FPR) have different levels of significance. In order to characterize the trade-off between the two quantities, we plot the receiver operating characteristic (ROC) curve. For binary classification, the curve is created by plotting the true positive rate against the false positive rate at various threshold settings on the probabilities assigned to each class (let us assume probability p for class 0 and 1 − p for class 1). In particular, a threshold t is applied to value of p to select between the two classes. The value of threshold t is swept from 0 to 1, and a pair of TPR and FPR is got for each value of t. The ROC is the curve of TPR plotted against FPR.
SVM
Linear Support Vector Machines have been proved efficient when dealing with sparse high dimensional datasets, including textual data. They have been shown to have good generalization accuracy, while having low computational complexity.
Linear Support Vector Machines aim to learn a vector of feature weights, w, and an intercept, b, given the training dataset. Once the weights are learned, the label of a data point is determined by thresholding wTx + b with 0, i.e. sign(wTx + b). Alternatively, one produce probabilities that the data point belongs to either class, by applying a logistic function instead of hard thresholding, i.e. calculating σ(wTx + b).
The learning process of the parameter w and b involves solving the following optimization problem:
n
s.t. yi(wTxi + b) ≥ 1 − ξi ξi ≥ 0, ∀i ∈ {1,...,n}
where xi is the ith data point, and yi ∈ {0,1} is the class label of it.
Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data. On the other hand, minimizing the first term, which is basically a regularization, corresponds to maximizing the margin between the two classes. Note that in the objective function, each slack variable represents the amount of error that the classifier can tolerate for a given data sample. The tradeoff parameter γ controls relative importance of the two components of the objective function. For instance, when 1, misclassification of individual points is highly penalized, which is called “Hard Margin SVM”. In contrast, a “Soft Margin SVM”, which is the case when 1, is very lenient towards misclassification of a few individual points as long as most data points are well separated.
QUESTION 4: Hard margin and soft margin linear SVMs:
• Train two linear SVMs and compare:
– Train one SVM with γ = 1000 (hard margin), another with γ = 0.0001 (soft margin).
– Plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of both SVM classifier. Which one performs better?
– What happens for the soft margin SVM? Why is the case?
∗ Does the ROC curve of the soft margin SVM look good? Does this conflict with other metrics?
• Use cross-validation to choose γ (use average validation accuracy to compare):
Using a 5-fold cross-validation, find the best value of the parameter γ in the range {10k|−3 ≤ k ≤ 3,k ∈ Z}. Again, plot the ROC curve and report the confusion matrix and calculate the accuracy, recall precision and F-1 score of this best SVM.
Logistic Regression
Although its name contains “regression”, logistic regression is a probability model that is used for binary classification.
In logistic regression, a logistic function (σ(φ) = 1/(1+exp(−φ))) acting on a linear function of the features (φ(x) = wTx+b) is used to calculate the probability that the data point belongs to class 1, and during the training process, w and b that maximizes the likelihood of the training data are learnt.
One can also add regularization term in the objective function, so that the goal of the training process is not only maximizing the likelihood, but also minimizing the regularization term, which is often some norm of the parameter vector w. Adding regularization helps prevent ill-conditioned results and over-fitting, and facilitate generalization ability of the classifier. A coefficient is used to control the trade-off between maximizing likelihood and minimizing the regularization term.
QUESTION 5: Logistic classifier:
• Train a logistic classifier without regularization (you may need to come up with some way to approximate this if you use sklearn.linear model.LogisticRegression); plot the ROC curve and report the confusion matrix and calculate the accuracy, recall precision and F-1 score of this classifier.
• Regularization:
– Using 5-fold cross-validation on the dimension-reduced-by-svd training data, find the best regularization strength in the range {10k|−3 ≤ k ≤ 3,k ∈ Z} for logistic regression with L1 regularization and logistic regression L2 regularization, respectively.
– Compare the performance (accuracy, precision, recall and F-1 score) of 3 logistic classifiers: w/o regularization, w/ L1 regularization and w/ L2 regularization (with the best parameters you found from the part above), using test data.
– How does the regularization parameter affect the test error? How are the learnt coefficients affected? Why might one be interested in each type of regularization?
– Both logistic regression and linear SVM are trying to classify data points using a linear decision boundary, then what’s the difference between their ways to find this boundary? Why their performance differ?
Na¨ıve Bayes
Scikit-learn provides a type of classifiers called “na¨ıve Bayes classifiers”. They include MultinomialNB, BernoulliNB, and GaussianNB.
Na¨ıve Bayes classifiers use the assumption that features are statistically independent of each other when conditioned by the class the data point belongs to, to simplify the calculation for the Maximum A Posteriori (MAP) estimation of the labels. That is,
P(xi | y,x1,...,xi−1,xi+1,...,xm) = P(xi | y) i ∈ {1,...,m}
where xi’s are features, i.e. components of a data point, and y is the label of the data point.
Now that we have this assumption, a probabilistic model is still needed; the difference between MultinomialNB, BernoulliNB, and GaussianNB is that they use different models.
QUESTION 6: Na¨ıve Bayes classifier: train a GaussianNB classifier; plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classifier.
Grid Search of Parameters
Now we have gone through the complete process of training and testing a classifier. However, there are lots of parameters that we can tune. In this part, we fine-tune the parameters.
QUESTION 7: Grid search of parameters:
• Construct a Pipeline that performs feature extraction, dimensionality reduction and classification;
• Do grid search with 5-fold cross-validation to compare the following (use test accuracy as the score to compare):
Table 2: Options to compare
Procedure
Options
Loading Data
remove “headers” and “footers” vs not
Feature Extraction
min df = 3 vs 5;
use lemmatization vs not
Dimensionality Reduction
LSI vs NMF
Classifier
SVM with the best γ previously found
vs
Logistic Regression: L1 regularization vs L2 regularization, with the best regularization strength previously found vs
GaussianNB
Other options
Use default
• What is the best combination?
Hint: see
https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html and http://www.davidsbatista.net/blog/2018/02/23/model_optimization/
Word Embedding
In the previous parts of this project we relied on TF-IDF vector representation of text for our feature extraction. There are two disadvantage of such a representation:
1. It’s not universal. The fixed vocabulary is generated from the corpus: the representation generation process is not well-transferable to other tasks or datasets.
2. TF-IDF does not encode the meaning of a word. It calculates the scores based on occurrences in the text rather than the context.
In this section we extract word features using a framework called GLoVE: Global Vectors for Word Representation. These vectors are trained on a multitude of large text corpora to take into account the broader contexts of words, making the final word representations informative and useful in a downstream task without any supervision or knowledge of the data used in the task.
QUESTION 8: Read the paper about GLoVE embeddings - found here and answer the following subquestions:
(a) Why are GLoVE embeddings trained on the ratio of co-occurrence probabilities rather than the probabilities themselves?
(b) In the two sentences: “James is running in the park.” and “James is running for the presidency.”, would GLoVE embeddings return the same vector for the word running in both cases? Why or why not?
(c) What do you expect for the values of,
||GLoVE["queen"] - GLoVE["king"] - GLoVE["wife"] + GLoVE["husband"]||2,
||GLoVE["queen"] - GLoVE["king"]||2 and ||GLoVE["wife"] - GLoVE["husband"]||2 ? Compare these values.
(d) Given a word, would you rather stem or lemmatize the word before mapping it to its GLoVE embedding?
You can download the text files that contain the embeddings here. Note that the zip file contains 4 different versions of GLoVE. Use glove.6B.300d.txt for this project. The code below shows how to use GLOVE embedding:
embeddings_dict = {} dimension_of_glove = 300 with open("glove/glove.6B.300d.txt", 'r') as f:
for line in f:
values = line.split() word = values[0] vector = np.asarray(values[1:], "float32") embeddings_dict[word] = vector
Now using this dictionary, for any word such as “computer”, you can find its embedding by:
print(embeddings_dict['computer'])
GLoVE embedding is a representation for each word while TF-IDF vector is per text segment. A text segment consists of many words.
Simple concatenation of all of the embeddings of words in that segment will result in a highdimensional vector representation of varying sizes. Therefore, we need to design a feature construction procedure to make use of the word embeddings. The goals include:
1. The representation of segments of varying lengths should be aligned to the same dimension to facilitate downstream applications;
2. The dimension of the representation should not be too high;
3. The representation should be a sufficient description of the content of a document, so as to support the classification.
QUESTION 9: For the binary classification task distinguishing the “Computer Technology” class and “Recreational Activity” class:
(a) Describe a feature engineering process that uses GLoVE word embeddings to represent each document. You have to abide by the following rules:
• A representation of a text segment needs to have a vector dimension that CANNOT exceed the dimension of the GLoVE embedding used per word of the segment.
• You cannot use TF-IDF scores (or any measure that requires looking at the complete dataset) as a pre-processing routine.
• In each document, there are specific phrases that have more important topical words than others. They are highlighted by “Keywords: ... ” or “Subject: ... ”. Use the average embedding of these words for each document representation.
• To aggregate these words into a single vector consider normalizing the final vectors.
(b) Select a classifier model, train and evaluate it with your GLoVE-based feature.
If your model is performing poorly, try to improve your feature engineering pipeline and also revisit your fine-tuned parameters.
QUESTION 10: Plot the relationship between the dimension of the pre-trained GLoVE embedding and the resulting accuracy of the model in the classification task. Describe the observed trend. Is this trend expected? Why or why not?
Visualize the set of normalized GLoVE-based embeddings of the documents with their labels in a 2D plane using the UMAP library. Similarly generate a set of normalized random vectors of the same dimension as GLoVE and visualize these in a 2D plane with UMAP.
QUESTION 11: Compare and contrast the two visualizations. Are there clusters formed in either or both of the plots?
Multiclass Classification
So far, we have been dealing with classifying the data points into two classes. In this part, we explore multiclass classification techniques through different algorithms.
Some classifiers perform the multiclass classification inherently. As such, na¨ıve Bayes algorithm finds the class with maximum likelihood given the data, regardless of the number of classes. In fact, the probability of each class label is computed in the usual way, then the class with the highest probability is picked; that is
cˆ= argminP(c | x)
c∈C
where c denotes a class to be chosen, and ˆc denotes the optimal class.
For SVM, however, one needs to extend the binary classification techniques when there are multiple classes. A natural way to do so is to perform a one versus one classification on all
pairs of classes, and given a document the class is assigned with the majority vote.
In case there is more than one class with the highest vote, the class with the highest total classification confidence levels in the binary classifiers is picked.
An alternative strategy would be to fit one classifier per class, which reduces the number of classifiers to be learnt to |C|. For each classifier, the class is fitted against all the other classes. Note that in this case, the unbalanced number of documents in each class should be handled. By learning a single classifier for each class, one can get insights on the interpretation of the classes based on the features.
QUESTION 12: In this part, we aim to learn classifiers on the documents belonging to the classes:
comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, soc.religion.christian
Perform Na¨ıve Bayes classification and multiclass SVM classification (with both One VS One and One VS the rest methods described above) and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of your classifiers.