CS7401 - Introduction to NLP | Assignment 1

1          Tokenization
You have been given a Twitter corpus for cleaning. Your task is to design a tokenizer using regular expressions (regex), which you will later reuse for the smoothing task.

1.    Create a tokenizer to handle the following cases:

(a)     Word Tokenizer

(b)     Punctuation

(c)      URLs

(d)     Hashtags (#manchesterisred)

(e)     Mentions (@john)

2.    For the following cases, replace the tokens with appropriate placeholders:

(a)     URLs: <URL>

(b)     Hashtags: <HASHTAG>

(c)      Mentions: <MENTION>

Original: #ieroween THE STORY OF IEROWEEN! THE VIDEO -»»»»»»»»»»» http://bit.ly/2VFPAV «« JUST FOR FRANK !!! ç

Cleaned: <HASHTAG> THE STORY OF IEROWEEN! THE VIDEO -> <URL> < JUST FOR FRANK ! ç

Example output after placeholder substitution
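One possible sketch of such a tokenizer, using Python's `re` module. The pattern set, their ordering (URLs before hashtags/mentions before generic punctuation), and the function names are our own choices, not prescribed by the assignment:

```python
import re

# Substitution patterns, applied in priority order; the placeholder names
# (<URL>, <HASHTAG>, <MENTION>) are the ones the assignment specifies.
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "<URL>"),
    (re.compile(r"#\w+"), "<HASHTAG>"),
    (re.compile(r"@\w+"), "<MENTION>"),
]

# Tokens are placeholders, runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"<\w+>|\w+|[^\w\s]")

def substitute(text):
    """Replace URLs, hashtags and mentions with their placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

def tokenize(text):
    """Substitute placeholders, then split into word/punctuation tokens."""
    return TOKEN_RE.findall(substitute(text))
```

Applying URL substitution before the hashtag rule matters, since a `#` inside a URL fragment would otherwise be mistaken for a hashtag.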

Apart from these, you are also encouraged to try other tokenization and placeholder-substitution schemes based on your observations from the corpora used in the smoothing task, in order to achieve a better language model. You will find percentages, age values, expressions indicating time, and time periods in the data. You are free to explore and add multiple such reasonable tokenization schemes in addition to the ones listed above. Specify any such schemes you use in the final README.
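As an illustration of such extra schemes, the patterns below cover percentages, times, and ages. The placeholder names (`<PERCENT>`, `<TIME>`, `<AGE>`) and the exact regexes are our own illustrative choices, not mandated by the assignment:

```python
import re

# Extra substitution schemes; placeholder names below are our own choices.
EXTRA_PATTERNS = [
    (re.compile(r"\d+(?:\.\d+)?\s?%"), "<PERCENT>"),                     # 42%, 3.5 %
    (re.compile(r"\d{1,2}:\d{2}(?:\s?[ap]\.?m\.?)?", re.I), "<TIME>"),   # 9:30, 9:30 pm
    (re.compile(r"\d+\s?(?:years?\s?old|y/?o)\b", re.I), "<AGE>"),       # 25 years old
]

def substitute_extra(text):
    """Apply the additional placeholder substitutions in order."""
    for pattern, placeholder in EXTRA_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```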

2          Smoothing
You have been given two corpora: the EuroParl corpus and the Medical Abstracts corpus. Your task is to design language models for both of these corpora using smoothing. Ensure that you use the tokenizer created in Task 1 for this task.

1.    Create language models with the following parameters:

(a)     On EuroParl corpus:

i.      LM 1: tokenization + 4-gram LM + Kneser-Ney smoothing

ii.    LM 2: tokenization + 4-gram LM + Witten-Bell smoothing

(b)     On Medical Abstracts corpus:

i.      LM 3: tokenization + 4-gram LM + Kneser-Ney smoothing

ii.    LM 4: tokenization + 4-gram LM + Witten-Bell smoothing
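To illustrate the Witten-Bell idea, here is a minimal bigram sketch. It is a simplification under our own assumptions: the assignment asks for 4-gram models, which interpolate recursively through the lower orders in the same way, and the class and method names are our own:

```python
from collections import defaultdict

class WittenBellBigram:
    """Bigram Witten-Bell interpolation (sketch).

    P(w | h) = lam(h) * c(h, w) / c(h) + (1 - lam(h)) * P_uni(w)
    lam(h)   = c(h) / (c(h) + T(h))
    where T(h) is the number of distinct word types seen after history h.
    """

    def __init__(self, sentences):
        self.bigram = defaultdict(lambda: defaultdict(int))
        self.unigram = defaultdict(int)
        self.total = 0
        for sent in sentences:
            for prev, word in zip(sent, sent[1:]):
                self.bigram[prev][word] += 1
            for word in sent:
                self.unigram[word] += 1
                self.total += 1

    def p_unigram(self, word):
        # Unsmoothed MLE at the lowest order, kept simple for the sketch.
        return self.unigram.get(word, 0) / self.total if self.total else 0.0

    def prob(self, prev, word):
        followers = self.bigram.get(prev, {})
        c_hist = sum(followers.values())      # c(h)
        types = len(followers)                # T(h): distinct continuations
        if c_hist == 0:
            return self.p_unigram(word)       # unseen history: back off fully
        lam = c_hist / (c_hist + types)
        mle = followers.get(word, 0) / c_hist
        return lam * mle + (1 - lam) * self.p_unigram(word)
```

Because the interpolation weights for a fixed history sum the lower-order distribution into the leftover mass, the conditional probabilities over the vocabulary still sum to 1.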

2.    For each of these corpora, create a test set by randomly selecting 1000 sentences. This set will not be used for training the LM.

(a)     Calculate the perplexity score for each sentence of the EuroParl corpus and the Medical Abstracts corpus under each of the above models, and also compute the average perplexity score per corpus per LM on the training corpus.

(b)     Report the perplexity scores for all sentences in the test set in the same manner, including the average perplexity score per corpus per LM.

3.    Compare and analyse the behaviour of the different LMs, and include your analysis and visualisations in the report.
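The perplexity of a sentence of N tokens is P(w_1 … w_N)^(-1/N). A minimal sketch of the sentence- and corpus-level computation, assuming a `prob_fn(history, word)` supplied by one of the LMs above (the function names and the zero-probability floor are our own choices):

```python
import math

def sentence_perplexity(sentence, prob_fn, n=4):
    """Perplexity of one tokenized sentence under an n-gram model.

    `prob_fn(history, word)` is assumed to return P(word | history),
    where history is a tuple of up to n-1 preceding tokens.
    """
    log_sum = 0.0
    for i, word in enumerate(sentence):
        history = tuple(sentence[max(0, i - n + 1):i])
        p = prob_fn(history, word)
        log_sum += math.log(p if p > 0 else 1e-12)  # floor guards against log(0)
    return math.exp(-log_sum / len(sentence))

def average_perplexity(sentences, prob_fn, n=4):
    """Average of per-sentence perplexities over a corpus."""
    scores = [sentence_perplexity(s, prob_fn, n) for s in sentences]
    return sum(scores) / len(scores)
```

As a sanity check, a model that assigns every token probability 1/V yields perplexity exactly V.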

3          Corpus
The following corpora have been given to you for training:

1.    Tweets corpus (A set of 2000 tweets)

2.    Europarl Corpus (A subset of 20000 lines from English side of the dataset for English-French translation)

3.    Medical Abstracts Corpus (A subset of 20000 sentences from English side of the English-German translation task for medical abstracts in WMT 2014)

The first corpus is to be used for working out rules for the tokenization task. Use corpora 2 and 3 for training your LMs. Please download the corpus files from this link.
