Write a program that splits a document into sentences. The input to your program should be
a file containing text. The output should be a new file with each sentence from the first file on
a separate line.
For example, if the input file contains the following:
With all the fawning end-of-the-year kudos currently circulating, it’s easy to forget that a
sizable number of actual bad movies came out in 2012. Well, consider this a refresher! From
failed blockbuster tentpoles (”Battleship”) to would-be hilarious comedies (“The Watch”) to
lame scare-challenged horror flicks (“The Apparition”) to...uh, well, pretty much anything
involving Mr. Tyler Perry, there’s no doubt that the last 366 days have come with a heaping
helping of truly heinous cinematic stinkers. So what better time for an accounting of the
year’s most outrageous big-screen abominations than on the eve of the coming apocalypse?
The output file should contain the following:
With all the fawning end-of-the-year kudos currently circulating, it’s easy to forget that a
sizable number of actual bad movies came out in 2012.
Well, consider this a refresher!
From failed blockbuster tentpoles (”Battleship”) to would-be hilarious comedies (“The
Watch”) to lame scare-challenged horror flicks (“The Apparition”) to...uh, well, pretty much
anything involving Mr. Tyler Perry, there’s no doubt that the last 366 days have come with a
heaping helping of truly heinous cinematic stinkers.
So what better time for an accounting of the year’s most outrageous big-screen
abominations than on the eve of the coming apocalypse?
Note that your solution should NOT make use of machine learning.
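As a starting point, one possible rule-based approach is sketched below in Python. It is only a sketch under stated assumptions: the abbreviation list, the names split_sentences and split_file, and the rule that a sentence boundary must be followed by a capital letter or digit are all illustrative choices, not part of the assignment specification.

```python
# Assumed, deliberately short list of abbreviations that end in a period
# without ending a sentence; a real solution would need a longer list.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text into sentences using hand-written rules (no machine learning)."""
    # The input is line-wrapped, so split on any whitespace and rejoin later.
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        # Ignore trailing quotes and brackets when looking for end punctuation.
        core = token.rstrip("\"'”’)]")
        if not core.endswith((".", "!", "?")):
            continue
        if core.lower() in ABBREVIATIONS:
            continue  # e.g. "Mr." does not end a sentence
        if i + 1 < len(tokens):
            # Heuristic: the next token must start with a capital or a digit.
            nxt = tokens[i + 1].lstrip("\"'“‘([")
            if not nxt or not (nxt[0].isupper() or nxt[0].isdigit()):
                continue
        sentences.append(" ".join(current))
        current = []
    if current:
        # Keep any trailing text that lacks end punctuation.
        sentences.append(" ".join(current))
    return sentences


def split_file(in_path, out_path):
    """Read in_path and write one sentence per line to out_path."""
    with open(in_path, encoding="utf-8") as src:
        sentences = split_sentences(src.read())
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sentences) + "\n")
```

On the example above, these rules keep "to...uh," and "Mr. Tyler Perry" inside one sentence, because the ellipsis is not token-final and "Mr." is in the abbreviation list. They will still fail on cases such as a sentence that ends in an unlisted abbreviation.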
PART TWO (5 marks)
Language Modelling
Implement an unsmoothed bigram language model. Train your model on the following toy
corpus:
<s> a b </s>
<s> b b </s>
<s> b a </s>
<s> a a </s>
Calculate and print out the probability of each of the following strings:
<s> b </s>
<s> a </s>
<s> a b </s>
<s> a a </s>
<s> a b a </s>
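For reference, an unsmoothed bigram model estimates P(w | v) as count(v w) / count(v), and the probability of a string is the product of its bigram probabilities. A minimal Python sketch follows; the function names train_bigram_model and sentence_probability are hypothetical, and whitespace tokenisation is assumed.

```python
from collections import Counter

CORPUS = ["<s> a b </s>", "<s> b b </s>", "<s> b a </s>", "<s> a a </s>"]


def train_bigram_model(sentences):
    """Count bigrams and the contexts (first words of bigrams) they start from."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Multiply the maximum-likelihood bigram probabilities; no smoothing."""
    tokens = sentence.split()
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: an unsmoothed model assigns zero
        prob *= bigram_counts[(prev, curr)] / context_counts[prev]
    return prob


contexts, bigrams = train_bigram_model(CORPUS)
for s in ["<s> b </s>", "<s> a </s>", "<s> a b </s>",
          "<s> a a </s>", "<s> a b a </s>"]:
    print(s, sentence_probability(s, contexts, bigrams))
# <s> b </s> 0.25
# <s> a </s> 0.25
# <s> a b </s> 0.0625
# <s> a a </s> 0.0625
# <s> a b a </s> 0.015625
```

For example, P(<s> a b </s>) = P(a | <s>) x P(b | a) x P(</s> | b) = 2/4 x 1/4 x 2/4 = 0.0625.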
PART THREE (10 marks)
Naive Bayes Sentiment Polarity Classifier
Write a sentiment polarity classifier that uses the Naive Bayes algorithm to assign a
sentiment polarity of positive or negative to a review.
Your program should accept as input a training file and a test file. The training file contains a
list of reviews and their actual sentiment labels (positive or negative). The test file
contains either a list of reviews with the actual sentiment labels or a list of the reviews on their
own. Your program should output the predictions of the NB classifier (positive or
negative) for each of the reviews in the test file. If the actual labels (sometimes referred to
as gold labels or ground truth) are also available for the test reviews, your program should
also print the accuracy of the classifier.
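The core of such a classifier can be sketched as follows. This is one possible formulation, not a required design: it assumes a multinomial model over whitespace-separated tokens, add-one (Laplace) smoothing, and log probabilities to avoid underflow. The names train_nb, predict and accuracy are hypothetical, and reading the training and test files is left out because their exact format is not fixed here.

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled_docs):
    """Train on a list of (tokens, label) pairs; returns the model as a tuple."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = sum(doc_counts.values())
    # Log prior: fraction of training documents carrying each label.
    log_prior = {c: math.log(n / total_docs) for c, n in doc_counts.items()}
    log_likelihood = {}
    for c in doc_counts:
        # Add-one smoothing (an assumption, not mandated by the assignment).
        denom = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(tokens, model):
    """Return the label with the highest log posterior; unknown words are skipped."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Skipping words that never occurred in training is a common simplification. Without smoothing, a single word unseen in one class would drive that class's probability to zero.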
You should use the following training data:
https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
described in the following paper:
Pang and Lee 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts. Proceedings of the 42nd ACL.
https://www.aclweb.org/anthology/P04-1035/
There are 1000 positive reviews and 1000 negative reviews. Reserve the last 100 of each
type for testing (files starting with cv9) and the first 900 for training (files starting with
cv0 to cv8).
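The split described above can be done from the file names alone. The sketch below assumes the archive has been extracted so that data_dir contains a pos and a neg subdirectory of .txt files; that layout, the function name split_reviews, and the whitespace tokenisation are assumptions to check against your local copy.

```python
from pathlib import Path


def split_reviews(data_dir):
    """Return (train, test) lists of (tokens, label) pairs, split by file name."""
    train, test = [], []
    for label, subdir in (("positive", "pos"), ("negative", "neg")):
        # Sorting keeps the order of the reviews deterministic across runs.
        for path in sorted((Path(data_dir) / subdir).glob("cv*.txt")):
            tokens = path.read_text(encoding="utf-8", errors="replace").split()
            # Files whose names start with cv9 form the held-out test set.
            target = test if path.name.lower().startswith("cv9") else train
            target.append((tokens, label))
    return train, test
```

With the full data set this should yield 1800 training reviews and 200 test reviews, which is a quick sanity check before training.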
Analyse the output of your classifier on 5 correct and 5 incorrect samples chosen at random
from the test set. For each example, say why you think your classifier made the correct or
incorrect decision.
Points to Note
● You may implement the solutions in a programming language of your choice.
● Note that you may NOT make use of external NLP libraries.