CS4395 Homework 8 - Solved

 
1. Read the CSV file with pandas. Convert the author column to categorical data. Display the first 
few rows. Display the counts by author. 
2. Divide into train and test, with 80% in train. Use random state 1234. Display the shape of train and 
test. 
3. Preprocess the text by removing stop words and applying tf-idf vectorization. Fit the vectorizer 
on the training data only, then apply it to both train and test. Output the training set shape and 
the test set shape. 
4. Try a Bernoulli Naïve Bayes model. What is your accuracy on the test set? 
5. The results from step 4 will be disappointing. The classifier just guessed the predominant class, 
Hamilton, every time. Looking at the train data shape above, there are 7876 unique words in the 
vocabulary. This may be too many, and many of those words may not be helpful. Redo the 
vectorization with the max_features option set so that only the 1000 most frequent terms are used. 
In addition to single words, add bigrams as features. Try Naïve Bayes again on the new train/test 
vectors and compare your results with step 4. 
6. Try logistic regression. Adjust at least one parameter in the LogisticRegression() model to see if you 
can improve on the results obtained with the default settings. What are your results? 
7. Try a neural network. Try different topologies until you get good results. What is your final 
accuracy? 
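
The steps above could be sketched roughly as follows with pandas and scikit-learn. This is only an illustrative outline, not the assignment's reference solution. The text column name "text", the helper names (load_data, split_and_vectorize, make_models, evaluate), and every model setting other than random state 1234, the 80/20 split, max_features=1000 and the bigram range are assumptions made for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier


def load_data(path, label_col="author"):
    """Step 1: read the CSV and make the author column categorical."""
    df = pd.read_csv(path)
    df[label_col] = df[label_col].astype("category")
    print(df.head())
    print(df[label_col].value_counts())
    return df


def split_and_vectorize(df, text_col="text", label_col="author",
                        max_features=None, ngram_range=(1, 1)):
    """Steps 2, 3 and 5: 80/20 split, then tf-idf fitted on train only."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[text_col], df[label_col], train_size=0.8, random_state=1234)
    print("train:", X_train.shape, "test:", X_test.shape)

    vectorizer = TfidfVectorizer(stop_words="english",
                                 max_features=max_features,
                                 ngram_range=ngram_range)
    X_train_vec = vectorizer.fit_transform(X_train)  # fit on train only
    X_test_vec = vectorizer.transform(X_test)        # reuse train vocabulary
    print("vectorized train:", X_train_vec.shape,
          "vectorized test:", X_test_vec.shape)
    return X_train_vec, X_test_vec, y_train, y_test


def make_models():
    """Steps 4, 6 and 7: candidate models (settings are assumptions)."""
    return {
        "bernoulli_nb": BernoulliNB(),
        "logistic_regression": LogisticRegression(
            C=1.0, class_weight="balanced", max_iter=1000),
        "neural_network": MLPClassifier(
            hidden_layer_sizes=(16, 8), solver="lbfgs",
            max_iter=1000, random_state=1234),
    }


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit on the training vectors and return accuracy on the test set."""
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))
```

A possible way to use it, assuming a hypothetical file name such as "federalist.csv": call df = load_data("federalist.csv"), then split_and_vectorize(df) for steps 2 to 4, and split_and_vectorize(df, max_features=1000, ngram_range=(1, 2)) for step 5 onward, passing the returned vectors to evaluate() for each model in make_models(). The class_weight="balanced" setting is one option worth trying for step 6 because the classes are imbalanced, but whether it helps has to be checked on the actual data.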
