CA4023 Assignment 1 - Solved

Write a program that splits a document into sentences. The input to your program should be 
a file containing text. The output should be a new file with each sentence from the first file on 
a separate line. 
For example, if the input file contains the following: 
With all the fawning end-of-the-year kudos currently circulating, it’s easy to forget that a 
sizable number of actual bad movies came out in 2012. Well, consider this a refresher! From 
failed blockbuster tentpoles (”Battleship”) to would-be hilarious comedies (“The Watch”) to 
lame scare-challenged horror flicks (“The Apparition”) to...uh, well, pretty much anything 
involving Mr. Tyler Perry, there’s no doubt that the last 366 days have come with a heaping 
helping of truly heinous cinematic stinkers. So what better time for an accounting of the 
year’s most outrageous big-screen abominations than on the eve of the coming apocalypse? 
The output file should contain the following: 
With all the fawning end-of-the-year kudos currently circulating, it’s easy to forget that a 
sizable number of actual bad movies came out in 2012. 
Well, consider this a refresher! 
From failed blockbuster tentpoles (”Battleship”) to would-be hilarious comedies (“The 
Watch”) to lame scare-challenged horror flicks (“The Apparition”) to...uh, well, pretty much 
anything involving Mr. Tyler Perry, there’s no doubt that the last 366 days have come with a 
heaping helping of truly heinous cinematic stinkers. 
So what better time for an accounting of the year’s most outrageous big-screen 
abominations than on the eve of the coming apocalypse? 
Note that your solution should NOT make use of machine learning. 
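One possible rule-based approach is sketched below. It is an illustration under stated assumptions, not a reference solution: the abbreviation list, the boundary pattern, and the function names `split_sentences` and `split_file` are all hypothetical choices. The sketch treats `.`, `!` or `?` followed by whitespace as a candidate boundary, then rejects candidates that end in a known abbreviation (such as "Mr.") or that are followed by a lowercase letter. A full stop that belongs to an ellipsis is never treated as a boundary.

```python
import re

# Hypothetical, deliberately small abbreviation list; extend it for real text.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "st.", "vs.", "etc.", "e.g.", "i.e."}

# Candidate boundary: ., ! or ? that is not preceded by another full stop,
# then optional closing quotes or brackets, then whitespace.
BOUNDARY = re.compile(r'(?<!\.)[.!?]["\'”’)\]]*\s+')


def split_sentences(text):
    """Split text into sentences using hand-written rules (no machine learning)."""
    text = " ".join(text.split())  # undo hard line wraps in the input file
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        next_char = text[match.end():match.end() + 1]
        if last_token in ABBREVIATIONS:
            continue  # "Mr. Tyler" is not a sentence boundary
        if next_char and not (next_char.isupper() or next_char.isdigit()
                              or next_char in '"“\'('):
            continue  # a sentence is assumed to start with a capital, digit or quote
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # whatever follows the last boundary
    return sentences


def split_file(in_path, out_path):
    """Read in_path and write one sentence per line to out_path."""
    with open(in_path, encoding="utf-8") as src:
        sentences = split_sentences(src.read())
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sentences) + "\n")
```

On the example text above, this sketch yields the four sentences shown in the expected output. It will still fail on cases the rules do not cover, such as abbreviations missing from the list or a sentence that ends with an ellipsis.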
PART TWO (5 marks) 
Language Modelling 
Implement an unsmoothed bigram language model. Train your model on the following toy 
corpus:
<s> a b </s> 
<s> b b </s> 
<s> b a </s> 
<s> a a </s> 
Calculate and print out the probability of each of the following strings: 
<s> b </s> 
<s> a </s> 
<s> a b </s> 
<s> a a </s> 
<s> a b a </s> 
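A minimal sketch of one way to do this follows. It assumes the usual maximum-likelihood estimate, P(w | prev) = count(prev, w) / count(prev), and takes the probability of a string to be the product of its bigram probabilities. The names `CORPUS`, `train_bigram` and `sentence_probability` are illustrative, not part of the assignment.

```python
from collections import Counter

CORPUS = [
    "<s> a b </s>",
    "<s> b b </s>",
    "<s> b a </s>",
    "<s> a a </s>",
]


def train_bigram(sentences):
    """Count bigrams and the contexts (first elements of bigrams) they start from."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        contexts.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Unsmoothed probability: product of count(prev, w) / count(prev)."""
    tokens = sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no smoothing, so the string gets zero
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob


if __name__ == "__main__":
    contexts, bigrams = train_bigram(CORPUS)
    for s in ["<s> b </s>", "<s> a </s>", "<s> a b </s>",
              "<s> a a </s>", "<s> a b a </s>"]:
        print(s, sentence_probability(s, contexts, bigrams))
```

Under these assumptions, every context (`<s>`, `a`, `b`) occurs four times in the toy corpus, so the five strings come out at 0.25, 0.25, 0.0625, 0.0625 and 0.015625 respectively. Because the model is unsmoothed, any string containing an unseen bigram receives probability zero.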
PART THREE (10 marks) 
Naive Bayes Sentiment Polarity Classifier 
Write a sentiment polarity classifier which uses the Naive Bayes algorithm to assign a 
sentiment polarity of positive or negative to a review. 
Your program should accept as input a training file and a test file. The training file contains a 
list of reviews and their actual sentiment labels (positive or negative). The test file 
contains either a list of reviews with the actual sentiment labels or a list of the reviews on their 
own. Your program should output the predictions of the NB classifier (positive or 
negative) for each of the reviews in the test file. If the actual labels (sometimes referred to 
as gold labels or ground truth) are also available for the test reviews, your program should 
also print the accuracy of the classifier. 
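The core of such a classifier might look like the sketch below. Several choices here are assumptions that the assignment does not specify: a multinomial model over lowercased, whitespace-separated tokens, add-one (Laplace) smoothing, log probabilities to avoid underflow, and ignoring test words never seen in training. The function names `train_nb`, `predict_nb` and `accuracy` are hypothetical, and reading the training and test files is left out.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents):
    """Collect counts from an iterable of (text, label) pairs."""
    doc_counts = Counter()              # number of reviews per label
    word_counts = defaultdict(Counter)  # token counts per label
    vocab = set()
    for text, label in documents:
        tokens = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict_nb(text, model):
    """Return the label with the highest log posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # skip words never seen in training
                # add-one smoothed log likelihood of the token given the label
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

A typical use would be `model = train_nb(train_pairs)` followed by `predict_nb(review_text, model)` for each test review, calling `accuracy` only when gold labels are present in the test file.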
You should use the following training data: 
https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz 
described in the following paper: 
Pang and Lee 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity 
Summarization Based on Minimum Cuts. Proceedings of the 42nd ACL. 
https://www.aclweb.org/anthology/P04-1035/ 
There are 1000 positive reviews and 1000 negative reviews. Reserve the last 100 of each 
type for testing (files starting with CV9) and the first 900 for training (files starting with 
CV[0-8]). 
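The split described above could be implemented along the following lines. This is a sketch that assumes the extracted archive contains one directory of positive reviews and one of negative reviews, named `pos` and `neg` here, each holding one `.txt` file per review. Check the directory layout after extracting the archive. The prefix is matched case-insensitively in case the file names use lowercase `cv`. The function name `split_reviews` is hypothetical.

```python
from pathlib import Path


def split_reviews(root):
    """Return (train, test) lists of (text, label) pairs split by file name prefix."""
    train, test = [], []
    # Assumed layout: <root>/pos/*.txt and <root>/neg/*.txt
    for label, subdir in (("positive", "pos"), ("negative", "neg")):
        for path in sorted(Path(root, subdir).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            # Files starting with cv9 are held out; cv0 to cv8 are used for training.
            bucket = test if path.name.lower().startswith("cv9") else train
            bucket.append((text, label))
    return train, test
```

With 1000 reviews per class, this should give 1800 training reviews and 200 test reviews, provided the files are numbered as the assignment describes.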
Analyse the output of your classifier on 5 correct and 5 incorrect samples chosen at random 
from the test set. For each example, say why you think your classifier made the correct or 
incorrect decision.
Points to Note 
● You may implement the solutions in a programming language of your choice. 
● Note that you may NOT make use of external NLP libraries. 