Overview: Experiments with LDA
In this programming project we implement Latent Dirichlet Allocation (LDA) and inspect its performance both in an unsupervised manner and when used as a preprocessing step for supervised learning. Your goals in this assignment are to (i) implement the collapsed Gibbs sampler for LDA inference, and (ii) compare the LDA topic representation to a “bag-of-words” representation with respect to how well they support document classification.
Data
Data for this assignment is provided in a zip file pp4data.zip on Canvas. Each dataset is given in a separate sub-directory.
We will use a subset of the well known 20 newsgroups dataset. The subset consists of 200 documents that have been pre-processed and “cleaned” so they do not require any further manipulation, i.e., you only have to read the space-separated strings from the ASCII text files. Each document belongs to one of two classes. The file index.csv holds the true labels of each document. Labels are ignored in task 1 but used in task 2.
We have prepared an additional small dataset, artificial, for developing your implementation. Running your sampler on this dataset with K = 2 and parameters as below, you should find that the three most frequent words in the two topics are {bank, river, water} and {dollars, bank, loan} (not necessarily in those orders).
Task 1: Gibbs Sampling
In this portion, your task is to implement the collapsed Gibbs sampler for LDA. In the case of LDA, the output represents a sample of the (hidden) topic variables for each word. Recall that in LDA we sample the hidden topic variables associated with words in the text. This sample of topic variables can be used to calculate topic representations per document. Algorithm 1 describes one possible implementation of the collapsed Gibbs sampler.
Algorithm 1 Collapsed Gibbs sampler for LDA
Require: Number of topics K, Dirichlet parameter for topic distribution α, Dirichlet parameter for word distribution β, number of iterations to run sampler Niters, array of word indices w(n), array of document indices d(n), and array of initial topic indices z(n), where n = 1...Nwords and Nwords is the total amount of words in the corpus.
1: Generate a random permutation π(n) of the set {1,2,...,Nwords}
2: Initialize a D×K matrix of topic counts per document Cd, where D is the number of documents
3: Initialize a K × V matrix of word counts per topic Ct, where V is the number of words in the vocabulary
4: Initialize a 1 × K array of probabilities P (to zero)
5: for i = 1 to Niters do 6: for n = 1 to Nwords do
7: word ← w(π(n))
8: topic ← z(π(n))
9: doc ← d(π(n))
10: Cd(doc,topic) ← Cd(doc,topic) − 1
11: Ct(topic,word) ← Ct(topic,word) − 1
12: for k = 1 to K do
13: P(k) = V β Ct+(k,wordPj Ct)+(k,jβ) KαC+dP(doc,kl Cd)+(doc,lα )
14: end for
15: P ← normalize P
16: topic ← sample from P
17: z(π(n)) ← topic
18: Cd(doc,topic) ← Cd(doc,topic) + 1
19: Ct(topic,word) ← Ct(topic,word) + 1
20: end for
21: end for
22: return {z(n)},Cd,Ct
For this project, fix the number of iterations to run the sampler at Niters = 500. The Dirichlet parameter for the topic distribution is α1 where 1 is a vector of ones with K entries (K is the number of topics), and α = K5 . The Dirichlet parameter for the word distribution is β1 where 1 is a vector of ones with V entries (V is the size of the vocabulary), and β = 0.01.
We suggest testing your implementation first on the artificial dataset with K = 2 because you know what to expect and the run time is shorter. Once you have verified that your implementation works correctly, run your sampler with K = 20 on the 20 newsgroups dataset. After the sampler has finished running, output the 5 most frequent words of each topic into a CSV file, topicwords.csv, where each row represents a topic. Include these results in both your report and submission. In your report discuss the results obtained (i.e., the topics). Do the topics obtained make sense for the dataset?
Finally, you will need the topic representations for the next part. For a document doc, this will be a vector of K values, one for each topic, where the kth value is given by and Cd is output from the sampler.
Task 2: Classification
In this portion we will evaluate the dimensionality reduction accomplished by LDA in its ability to support document classification and compare it to the bag of words representation.
The first step is to prepare the data files for the two representations. The first is given by the topic representation of the previous section, where each document is represented by a feature vector of length K. The second representation is the “bag-of-words” representation. This representation has a feature for each word in the vocabulary and the value of this feature is the number of occurrences of the corresponding word in the document.
For the evaluation we will reuse the logistic regression implementation from project 3 (that you developed as part of GLM). You should use the value α = 0.01 for the regularization parameter of logistic regression in this part.
Your task is to generate learning curves in the same way you did there: Step 1) Set aside 1/3 of the total data (randomly selected) to use as a test set. Step 2) Record performance as a function of increasing training set size (with each training set randomly selected from the other 2/3 of the total data). Repeat Steps 1 & 2 a total of 30 times to generate learning curves with error bars (i.e., ±1σ). Performance is defined as classification accuracy on the test set.
Plot the learning curve performance of the logistic regression algorithm (with error bars) on the two representations. Then discuss your observations on the results obtained.
Additional Notes
• As in previous projects you may use I/O and math libraries (e.g., from numpy) but you should implement all machine learning portions of the algorithms yourself.
• Please submit the logistic regression implementation with this project so that your code can be used and tested without further manipulation.
• The run time for this project is non-negligible. A Matlab / Python implementation of the collapsed Gibbs sampler for the 20 newsgroups dataset might take 10 or more minutes to run and might be longer if your implementation is not optimized. The many runs of logistic regression for the learning curves also require some time.