1 Utilities
File name: utilities.py
Implement: You will implement three functions listed here and detailed below.
def generate_vocab(dir, min_count, max_files)
def create_word_vector(fname, vocab)
def load_data(dir, vocab, max_files)
Write-Up: Describe your implementation concisely.
def generate_vocab(dir, min_count, max_files)
This function takes a starting directory dir (e.g. "aclImdb/train") and min_count, the minimum number of times a word must be seen before it is added to the vocabulary. If min_count=2, only words seen 2 or more times in the dataset become part of the vocabulary. The function also takes a parameter max_files, which exists purely for implementation purposes: it lets you run small tests without the full dataset. The full dataset takes a long time to turn into feature vectors, so you may want to start with only 200 files. If max_files=-1, all files are used. The function returns a list or numpy array of the vocabulary. Remember that when using max_files, you should grab an even number of positive and negative samples.
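A minimal sketch of this function, assuming the standard aclImdb layout (pos/ and neg/ subdirectories of review .txt files) and simple lowercase whitespace tokenization; the exact tokenization is up to you:

```python
import os
from collections import Counter

def generate_vocab(dir, min_count, max_files):
    """Build a vocabulary from the review files under dir/pos and dir/neg.

    Keeps words seen at least min_count times. max_files=-1 uses all files;
    otherwise max_files reviews are drawn evenly from the two classes.
    """
    counts = Counter()
    for label in ("pos", "neg"):
        subdir = os.path.join(dir, label)
        fnames = sorted(os.listdir(subdir))
        if max_files != -1:
            fnames = fnames[: max_files // 2]  # even pos/neg split
        for fname in fnames:
            with open(os.path.join(subdir, fname), encoding="utf-8") as f:
                counts.update(f.read().lower().split())
    return [w for w, c in counts.items() if c >= min_count]
```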
def create_word_vector(fname, vocab)
This function takes the vocabulary and a review file and generates a feature vector. fname is the filename of a review. Assume that the aclImdb directory is in the same directory as the test script.py. The function returns one feature vector.
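One possible sketch, assuming a bag-of-words representation (one count per vocabulary word, in vocabulary order) and the same whitespace tokenization used when building the vocabulary; a binary presence/absence vector would be an equally valid choice:

```python
import numpy as np

def create_word_vector(fname, vocab):
    """Return a bag-of-words count vector for one review file.

    The vector has one component per vocabulary word, in the same
    order as vocab; out-of-vocabulary words are ignored.
    """
    with open(fname, encoding="utf-8") as f:
        words = f.read().lower().split()
    index = {w: i for i, w in enumerate(vocab)}  # word -> position
    vec = np.zeros(len(vocab))
    for w in words:
        if w in index:
            vec[index[w]] += 1
    return vec
```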
def load_data(dir, vocab, max_files)
This function loads the data and returns a set of feature vectors with their associated labels, as two lists/arrays X, Y. max_files again exists for implementation reasons, to allow smaller tests. If max_files=-1, all files are used.
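A self-contained sketch under the same assumptions as above (pos/ and neg/ subdirectories, bag-of-words counts); labeling pos as 1 and neg as 0 is an assumption here, since the handout does not fix the label encoding:

```python
import os
import numpy as np

def load_data(dir, vocab, max_files):
    """Load reviews under dir/pos and dir/neg into feature vectors X
    and labels Y (1 for pos, 0 for neg -- an assumed encoding).

    max_files=-1 loads everything; otherwise max_files reviews are
    split evenly across the two classes.
    """
    index = {w: i for i, w in enumerate(vocab)}
    X, Y = [], []
    for label, y in (("pos", 1), ("neg", 0)):
        subdir = os.path.join(dir, label)
        fnames = sorted(os.listdir(subdir))
        if max_files != -1:
            fnames = fnames[: max_files // 2]
        for fname in fnames:
            with open(os.path.join(subdir, fname), encoding="utf-8") as f:
                words = f.read().lower().split()
            vec = np.zeros(len(vocab))
            for w in words:
                if w in index:
                    vec[index[w]] += 1
            X.append(vec)
            Y.append(y)
    return np.array(X), np.array(Y)
```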
2 ML
File name: ml.py
You will use sklearn to test many different models on the ACL IMDB data.
Implement: You will implement the functions listed here and detailed below.
def dt_train(X,Y)
def kmeans_train(X)
def knn_train(X,Y,K)
def perceptron_train(X,Y)
def nn_train(X,Y,hls)
def pca_train(X,K)
def pca_transform(X,pca)
def svm_train(X,Y,k)
def model_test(X,model)
def compute_F1(Y, Y_hat)
All of the _train functions are very similar: they take data in the form of feature vectors and labels (except K-Means and PCA, which take only feature vectors) and produce models. The models are then passed to model_test to make predictions on a set of test data. knn_train takes K, the number of neighbors to consider. nn_train takes hls, which stands for hidden layer sizes; this can be a tuple of any length, depending on how many layers you want (e.g. (5,2) or (3,)). pca_train takes K, the number of principal components to keep when learning the transformation, and pca_transform applies the learned transformation pca to the data. svm_train takes k, which stands for kernel; this is a string, see the sklearn documentation. compute_F1 should take the labels and the predictions and compute the F1 score. Write-Up: Describe your implementations concisely.
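Since each function is a thin wrapper around one sklearn estimator, the whole module can be sketched briefly. The estimator choices below (and n_clusters=2 for K-Means, reflecting the binary sentiment task) are assumptions; any non-default hyperparameters are yours to justify in the write-up:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def dt_train(X, Y):
    return DecisionTreeClassifier().fit(X, Y)

def kmeans_train(X):
    # Unsupervised: no labels. Two clusters assumed for pos/neg sentiment.
    return KMeans(n_clusters=2).fit(X)

def knn_train(X, Y, K):
    return KNeighborsClassifier(n_neighbors=K).fit(X, Y)

def perceptron_train(X, Y):
    return Perceptron().fit(X, Y)

def nn_train(X, Y, hls):
    # hls is a tuple of hidden layer sizes, e.g. (5, 2) or (3,).
    return MLPClassifier(hidden_layer_sizes=hls).fit(X, Y)

def pca_train(X, K):
    # Learn a projection onto the top K principal components.
    return PCA(n_components=K).fit(X)

def pca_transform(X, pca):
    return pca.transform(X)

def svm_train(X, Y, k):
    # k is a kernel name string, e.g. "linear" or "rbf".
    return SVC(kernel=k).fit(X, Y)

def model_test(X, model):
    return model.predict(X)

def compute_F1(Y, Y_hat):
    return f1_score(Y, Y_hat)
```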
The result of the test script gives the following results:
Decision Tree : 0.6250000000000001
Decision Tree + PCA: 0.5242718446601942
KMeans: 0.2769230769230769
KMeans + PCA: 0.2769230769230769
KNN: 0.576
KNN + PCA: 0.5736434108527131
Perceptron : 0.6095238095238096
Perceptron + PCA: 0.5454545454545454
Neural Network : 0.45161290322580644
Neural Network + PCA: 0.5000000000000001
SVM: 0.6222222222222222
SVM + PCA: 0.583941605839416