Starting from:

$30

ECE219-Project 1 Classification Analysis on Textual Data Solved

QUESTION 1: To get started, plot a histogram of the number of training documents for each of the 20 categories to check if they are evenly distributed.

Note that the data set is already balanced (especially for the categories we’ll mainly work on) and so in this case we do not need to balance. But in general, as a data scientist you need to be aware of this issue.

Binary Classification
Before the following parts, to ensure consistency, please set the random seed as follows:

import numpy as np np.random.seed(42) import random random.seed(42)

To get started, we work with a well separable portion of data, and see if we can train a classifier that distinguishes two classes well. Concretely, let us take all the documents in the following classes:

Table 1: Two well-separated classes

Computer Technology
comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware
Recreational Activity
rec.autos                             rec.motorcycles                                    rec.sport.baseball                                rec.sport.hockey
Specifically, use the settings as the following code to load the data:

categories = ['comp.graphics', 'comp.os.ms-windows.misc',

'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',

'rec.autos', 'rec.motorcycles',

'rec.sport.baseball', 'rec.sport.hockey']

train_dataset = fetch_20newsgroups(subset = 'train', categories = categories,

,→ shuffle = True, random_state = None) test_dataset = fetch_20newsgroups(subset = 'test', categories = categories,

,→ shuffle = True, random_state = None)
1        Feature Extraction
The primary step in classifying a corpus of text is choosing a proper document representation. A good representation should retain enough information that enable us to perform the classification, yet in the meantime, be concise to avoid computational intractability and over fitting.

One common representation of documents is called “Bag of Words”, where a document is represented as a histogram of term frequencies, or other statistics of the terms, within a fixed vocabulary. As such, a corpus of text can be summarized into a term-document matrix whose entries are some statistic of the terms.

First a common sense filtering is done to drop certain words or terms: to avoid unnecessarily large feature vectors (vocabulary size), terms that are too frequent in almost every document, or are very rare, are dropped out of the vocabulary. The same goes with special characters, common stop words (e.g. “and”, “the” etc.), In addition, appearances of words that share the same stem in the vocabulary (e.g. “goes” vs “going”) are merged into a single term.

Further, one can consider using the normalized count of the vocabulary words in each document to build representation vectors. A popular numerical statistic to capture the importance of a word to a document in a corpus is the “Term Frequency-Inverse Document Frequency (TF-IDF)” metric. This measure takes into account count of the words in the document, as normalized by a certain function of the frequency of the individual words in the whole corpus. For example, if a corpus is about computer accessories then words such as “computer” “software” “purchase” will be present in almost every document and their frequency is not a distinguishing feature for any document in the corpus. The discriminating words will most likely be those that are specialized terms describing different types of accessories and hence will occur in fewer documents. Thus, a human reading a particular document will usually ignore the contextually dominant words such as “computer”, “software” etc. and give more importance to specific words. This is like when going into a very bright room or looking at a bright object, the human perception system usually applies a saturating function (such as a logarithm or square-root) to the actual input values before passing it on to the neurons. This makes sure that a contextually dominant signal does not overwhelm the decision-making processes in the brain. The TF-IDF functions draw their inspiration from such neuronal systems. Here we define the TF-IDF score to be

tf-idf(d,t) = tf(t,d) × idf(t)

where tf(d,t) represents the frequency of term t in document d, and inverse document frequency is defined as:

idf(

where n is the total number of documents, and df(t) is the document frequency, i.e. the number of documents that contain the term t.

QUESTION 2: Use the following specs to extract features from the textual data:

•    Use the “english” stopwords of the CountVectorizer

•    Exclude terms that are numbers (e.g. “123”, “-45”, “6.7” etc.)

•    Perform lemmatization with nltk.wordnet.WordNetLemmatizer and pos tag

•    Use min df=3

Report the shape of the TF-IDF matrices of the train and test subsets respectively.

Please refer to the official documentation of CountVectorizer as well as the discussion section notebooks for details.

2        Dimensionality Reduction
After above operations, the dimensionality of the representation vectors (TF-IDF vectors) ranges in the order of thousands. However, learning algorithms may perform poorly in highdimensional data, which is sometimes referred to as “The Curse of Dimensionality”. Since the document-term TF-IDF matrix is sparse and low-rank, as a remedy, one can select a subset of the original features, which are more relevant with respect to certain performance measure, or transform the features into a lower dimensional space.

In this project, we use two dimensionality reduction methods: Latent Semantic Indexing (LSI) and Non-negative Matrix Factorization (NMF), both of which minimize mean squared residual between the original data and a reconstruction from its low-dimensional approximation. Recall that our data is the term-document TF-IDF matrix, whose rows correspond to TF-IDF representation of the documents, i.e.

tfidf(d1,t1) tfidf(d ,t )
···

···
tfidf(d1,tm) tfidf(d ,t )
2                   1                   2                  m  X...               ...            .               



                                                                              tfidf(dn,t1) ···       tfidf(

(which is the case for the output of CountVectorizer and TfidfTransformer).

LSI
The LSI representation is obtained by computing left and right singular vectors corresponding to the top k largest singular values of the term-document TF-IDF matrix X.

We perform SVD to the matrix X, resulting in X = UΣVT, U and V orthogonal. Let the singular values in Σ be sorted in descending order, then the first k columns of U and V are called Uk and Vk respectively. Vk consists of the principle components in the feature space.

Then we use (XVk) (which is also equal to (UkΣk)) as the dimension-reduced data matrix, where rows still correspond to documents, only that they can have (far) lower dimension. In this way, the number of features is reduced. LSI is similar to Principal Component Analysis (PCA), and you can see the lecture notes for their relationships.

Having learnt U and V, to reduce the test data, we just multiply the test TF-IDF matrix Xt by Vk, i.e. Xt,reduced = XtVk. By doing so, we actually project the test TF-IDF vectors to the principle components, and use the projections as the dimension-reduced data.

NMF
NMF tries to approximate the data matrix X ∈ Rn×m (i.e. we have n docs and m terms) with WH (W ∈ Rn×r, H ∈ Rr×m). Concretely, it finds the non-negative matrices W and H s.t.

kX − WHk2F is minimized. ( ) Then we use W as the dim-reduced data

matrix, and in the fit step, we calculate both W and H. The intuition behind this is that we are trying to describe the documents (the rows in X) as a (non-negative) linear combination of r topics:

X  WH
Here we see h “topics”, each of which consists of m scores, indicating how important each term is in the topic. Then x .

Now how do we calculate the dim-reduced test data matrix? Again, we try to describe the document vectors (rows by our convention here) in the test data (call it Xt) with (non-negative) linear combinations of the “topics” we learned in the fit step. The “topics”, again, are the rows of H matrix, . How do we do that? Just solve the optimization problem



where H is fixed as the H matrix we learned in the fit step. Then Wt is used as the dim-reduced version of Xt.

QUESTION 3: Reduce the dimensionality of the data using the methods above

•    Apply LSI to the TF-IDF matrix corresponding to the 8 categories with k = 50; so each document is mapped to a 50-dimensional vector.

•    Also reduce dimensionality through NMF (k = 50) and compare with LSI:

Which one is larger, the kX − WHk2F in NMF or the  in LSI?

Why is the case?

3        Classification Algorithms
In this part, you are asked to use the dimension-reduced training data from LSI to train (different types of) classifiers, and evaluate the trained classifiers with test data. Your task would be to classify the documents into two classes “Computer Technology” vs “Recreational Activity”. Refer to Table 1 to find the 4 categories of documents comprising each of the two classes. In other words, you need to combine documents of those sub-classes of each class to form the set of documents for each class.

Classification measures
Classification quality can be evaluated using different measures such as precision, recall, F-score, etc. Refer to the discussion material to find their definition.

Depending on application, the true positive rate (TPR) and the false positive rate (FPR) have different levels of significance. In order to characterize the trade-off between the two quantities, we plot the receiver operating characteristic (ROC) curve. For binary classification, the curve is created by plotting the true positive rate against the false positive rate at various threshold settings on the probabilities assigned to each class (let us assume probability p for class 0 and 1 − p for class 1). In particular, a threshold t is applied to value of p to select between the two classes. The value of threshold t is swept from 0 to 1, and a pair of TPR and FPR is got for each value of t. The ROC is the curve of TPR plotted against FPR.

SVM
Linear Support Vector Machines have been proved efficient when dealing with sparse high dimensional datasets, including textual data. They have been shown to have good generalization accuracy, while having low computational complexity.

Linear Support Vector Machines aim to learn a vector of feature weights, w, and an intercept, b, given the training dataset. Once the weights are learned, the label of a data point is determined by thresholding wTx + b with 0, i.e. sign(wTx + b). Alternatively, one produce probabilities that the data point belongs to either class, by applying a logistic function instead of hard thresholding, i.e. calculating σ(wTx + b).

The learning process of the parameter w and b involves solving the following optimization problem:

n

s.t. yi(wTxi + b) ≥ 1 − ξi ξi ≥ 0, ∀i ∈ {1,...,n}

where xi is the ith data point, and yi ∈ {0,1} is the class label of it.

Minimizing the sum of the slack variables corresponds to minimizing the loss function on the training data. On the other hand, minimizing the first term, which is basically a regularization, corresponds to maximizing the margin between the two classes. Note that in the objective function, each slack variable represents the amount of error that the classifier can tolerate for a given data sample. The tradeoff parameter γ controls relative importance of the two components of the objective function. For instance, when 1, misclassification of individual points is highly penalized, which is called “Hard Margin SVM”. In contrast, a “Soft Margin SVM”, which is the case when 1, is very lenient towards misclassification of a few individual points as long as most data points are well separated.

QUESTION 4: Hard margin and soft margin linear SVMs:

•    Train two linear SVMs and compare:

–    Train one SVM with γ = 1000 (hard margin), another with γ = 0.0001 (soft margin).

–    Plot the ROC curve, report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of both SVM classifier. Which one performs better?

–    What happens for the soft margin SVM? Why is the case?

∗ Does the ROC curve of the soft margin SVM look good? Does this conflict with other metrics?

•    Use cross-validation to choose γ (use average validation accuracy to compare):

Using a 5-fold cross-validation, find the best value of the parameter γ in the range {10k|−3 ≤ k ≤ 3,k ∈ Z}. Again, plot the ROC curve and report the confusion matrix and calculate the accuracy, recall precision and F-1 score of this best SVM.

Logistic Regression
Although its name contains “regression”, logistic regression is a probability model that is used for binary classification.

In logistic regression, a logistic function (σ(φ) = 1/(1+exp(−φ))) acting on a linear function of the features (φ(x) = wTx+b) is used to calculate the probability that the data point belongs to class 1, and during the training process, w and b that maximizes the likelihood of the training data are learnt.

One can also add regularization term in the objective function, so that the goal of the training process is not only maximizing the likelihood, but also minimizing the regularization term, which is often some norm of the parameter vector w. Adding regularization helps prevent ill-conditioned results and over-fitting, and facilitate generalization ability of the classifier. A coefficient is used to control the trade-off between maximizing likelihood and minimizing the regularization term.

QUESTION 5: Logistic classifier:

•    Train a logistic classifier without regularization (you may need to come up with some way to approximate this if you use sklearn.linear model.LogisticRegression); plot the ROC curve and report the confusion matrix and calculate the accuracy, recall precision and F-1 score of this classifier.

•    Regularization:

–    Using 5-fold cross-validation on the dimension-reduced-by-svd training data, find the best regularization strength in the range {10k|−3 ≤ k ≤ 3,k ∈ Z} for logistic regression with L1 regularization and logistic regression L2 regularization, respectively.

–    Compare the performance (accuracy, precision, recall and F-1 score) of 3 logistic classifiers: w/o regularization, w/ L1 regularization and w/ L2 regularization (with the best parameters you found from the part above), using test data.

–    How does the regularization parameter affect the test error? How are the learnt coefficients affected? Why might one be interested in each type of regularization?

–    Both logistic regression and linear SVM are trying to classify data points using a linear decision boundary, then what’s the difference between their ways to find this boundary? Why their performance differ?

Na¨ıve Bayes
Scikit-learn provides a type of classifiers called “na¨ıve Bayes classifiers”. They include MultinomialNB, BernoulliNB, and GaussianNB.

Na¨ıve Bayes classifiers use the assumption that features are statistically independent of each other when conditioned by the class the data point belongs to, to simplify the calculation for the Maximum A Posteriori (MAP) estimation of the labels. That is,

                                             P(xi | y,x1,...,xi−1,xi+1,...,xm) = P(xi | y)                           i ∈ {1,...,m}

where xi’s are features, i.e. components of a data point, and y is the label of the data point.

Now that we have this assumption, a probabilistic model is still needed; the difference between MultinomialNB, BernoulliNB, and GaussianNB is that they use different models.

QUESTION 6: Na¨ıve Bayes classifier: train a GaussianNB classifier; plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classifier.

Grid Search of Parameters
Now we have gone through the complete process of training and testing a classifier. However, there are lots of parameters that we can tune. In this part, we fine-tune the parameters.

QUESTION 7: Grid search of parameters:

•    Construct a Pipeline that performs feature extraction, dimensionality reduction and classification;

•    Do grid search with 5-fold cross-validation to compare the following (use test accuracy as the score to compare):

Table 2: Options to compare

Procedure
Options
Loading Data
remove “headers” and “footers” vs not
Feature Extraction
min df = 3 vs 5;

use lemmatization vs not
Dimensionality Reduction
LSI vs NMF
Classifier
SVM with the best γ previously found

vs

Logistic Regression: L1 regularization vs L2 regularization, with the best regularization strength previously found vs

GaussianNB
Other options
Use default
•    What is the best combination?

Hint: see

http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html and http://www.davidsbatista.net/blog/2018/02/23/model_optimization/

Multiclass Classification
So far, we have been dealing with classifying the data points into two classes. In this part, we explore multiclass classification techniques through different algorithms.

Some classifiers perform the multiclass classification inherently. As such, na¨ıve Bayes algorithm finds the class with maximum likelihood given the data, regardless of the number of classes. In fact, the probability of each class label is computed in the usual way, then the class with the highest probability is picked; that is

cˆ= argminP(c | x)

c∈C

where c denotes a class to be chosen, and ˆc denotes the optimal class.

For SVM, however, one needs to extend the binary classification techniques when there are multiple classes. A natural way to do so is to perform a one versus one classification on all

 pairs of classes, and given a document the class is assigned with the majority vote.

In case there is more than one class with the highest vote, the class with the highest total classification confidence levels in the binary classifiers is picked.

An alternative strategy would be to fit one classifier per class, which reduces the number of classifiers to be learnt to |C|. For each classifier, the class is fitted against all the other classes. Note that in this case, the unbalanced number of documents in each class should be handled. By learning a single classifier for each class, one can get insights on the interpretation of the classes based on the features.

QUESTION 8: In this part, we aim to learn classifiers on the documents belonging to the classes:

comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, misc.forsale, soc.religion.christian

Perform Na¨ıve Bayes classification and multiclass SVM classification (with both One VS One and One VS the rest methods described above) and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of your classifiers.

More products