CSI5180: Building a paraphrase classifier

GOALS

Given the important role that paraphrases play in Virtual Assistants, and in Natural Language Processing in general, the purpose of this assignment is to:

•       further explore paraphrases and their variations

•       get hands-on experience with a paraphrase dataset

•       program a simple paraphrase classifier

•       follow a scientific methodology  

Even if most intent classification approaches do not include an explicit paraphrase detection step, I find it important to have an assignment in which you explore paraphrases.  It will make you look at real data and better understand the main challenges in language understanding, in particular the fact that the same intent (or idea) can be conveyed in many different ways.

 
 

•       In Brightspace, there is a link for Assignment 2 where you make your submission.

•       Submit a short report (5 pages max, excluding your title page) in PDF format, describing your approach, your experiment, and your results.

•       If you work in a group of 2, make sure both students submit the same report, and that both your names are on the title page.

 

INSTRUCTIONS

 

EXPLORE THE DATASET 
1.       Download the dataset SemEval-PIT2015 from this site: https://github.com/cocoxu/SemEvalPIT2015.  The dataset was used for an international competition at SemEval 2015, called Paraphrase and Semantic Similarity in Twitter.

2.       Explore the format of the dataset.  From the GitHub site we see (as in the figure below) that the data is tab-separated and contains not only the raw sentences, but also the POS tags and NE tags.  There is also an indicator (label) of whether the two sentences are paraphrases or not, as judged by 5 judges.  Look into the file train.data or dev.data (within SemEval-PIT2015github.zip).
 

 

3.       The number of sentences in the dataset is quite large.  If you are not comfortable programming with large datasets, you can work with a sample (say 800 train, 100 dev, 100 test) instead of the full dataset, which has train (13063 sentences), dev (4727 sentences) and test (972 sentences).
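The loading and sampling step can be sketched in Python as follows. This is a minimal sketch, not the official tooling: it assumes each line has at least five tab-separated fields with the two sentences in the third and fourth columns and the label in the fifth, which you should check against the README and the actual files. The helper names `parse_rows` and `sample_rows` and the file path in the comment are hypothetical.

```python
import random


def parse_rows(lines):
    """Split tab-separated lines into dicts; skip lines with too few fields.

    Assumed column order (verify against the dataset README):
    topic id, topic name, sentence 1, sentence 2, label, ...
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed or empty line
        rows.append({"sent1": fields[2], "sent2": fields[3], "label": fields[4]})
    return rows


def sample_rows(rows, n, seed=0):
    """Return up to n rows chosen at random, reproducibly for a fixed seed."""
    if n >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, n)


# Usage with a real file (the path is an assumption):
# with open("train.data", encoding="utf-8") as f:
#     train = sample_rows(parse_rows(f), 800)
```

Fixing the seed keeps the sample identical between runs, so results on the Dev Set stay comparable while you change your algorithm.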

 
                  

FOLLOW A SCIENTIFIC METHODOLOGY 
You are free to develop any sentence comparison approaches you want.  The purpose is to have as input two sentences and as output a yes/no classification for paraphrase or not*.  I want you to follow a scientific approach, as given below:

1.       Develop a baseline algorithm (Algo A) for paraphrase detection (it could be as simple as answering Yes on a full exact string match and No otherwise).

2.       Evaluate the results of your baseline on the Dev Set.  Calculate precision and recall.

3.       Develop another approach (Algo B).  As one idea, your approach could be based on edit distance**, with different penalties for different POS.  That would be an unsupervised approach.  Or, if you are familiar with binary classifiers, such as the ones in sklearn for Python or in other packages for Java, you can use such classifiers.

4.       Evaluate the results of Algo B on the Dev Set.

5.       Redo steps 3 and 4 to improve your results on the Dev Set.  This means modifying your Algo B slightly, or thinking of another algorithm, and retesting.

6.       Choose the best method you have developed and evaluate it on the Test Set.
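Steps 1 and 2 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `exact_match` and `precision_recall` are hypothetical, gold labels are assumed to be booleans already, and the lowercasing and whitespace trimming in the baseline is an added normalization you may prefer to drop for a strict string match.

```python
def exact_match(sent1, sent2):
    """Baseline (Algo A): Yes only if the two strings are identical
    after lowercasing and trimming surrounding whitespace."""
    return sent1.strip().lower() == sent2.strip().lower()


def precision_recall(gold, predicted):
    """Compute precision and recall for the positive (paraphrase) class.

    gold and predicted are equal-length sequences of booleans.
    A ratio with an empty denominator is reported as 0.0.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # true positives
    fp = sum(1 for g, p in pairs if not g and p)    # false positives
    fn = sum(1 for g, p in pairs if g and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

An exact-match baseline will typically give very few Yes answers, so expect low recall; the same `precision_recall` function can then be reused unchanged to compare Algo B against it on the Dev Set.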

ATTENTION:  You may use the TRAIN data if you build a supervised model.  If not (if you code rules), you do not need the TRAIN data.  I do not require a supervised model.  I want you to follow a scientific method, and to program at the level of your actual programming competency.  The purpose of this assignment is not complex programming; it is (a) an exploration of paraphrases and (b) an exercise in following a scientific method.

* The labels provided are not directly “yes”/“no” for paraphrase or not paraphrase.  The figure above says to use the results (5,0), (4,1) and (3,2) as “yes”, meaning respectively that 5, 4 or 3 judges said paraphrase while 0, 1 or 2 said not paraphrase.  You can decide something different, or even program a multi-class classifier if you prefer.
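One possible way to turn the vote counts into a yes/no label is sketched below in Python. It assumes the label is stored as text of the form "(yes_votes, no_votes)", which should be checked against the files, and it treats every pair with fewer than 3 yes votes as "no", which is a design choice rather than a requirement. The function name `votes_to_label` is hypothetical.

```python
import re


def votes_to_label(raw):
    """Map a vote string such as "(4, 1)" to True (paraphrase) or False.

    Assumes the first number is the count of judges who said paraphrase.
    Raises ValueError when the text does not contain exactly two numbers.
    """
    counts = re.findall(r"\d+", raw)
    if len(counts) != 2:
        raise ValueError(f"unexpected label format: {raw!r}")
    yes_votes = int(counts[0])
    return yes_votes >= 3  # (5,0), (4,1), (3,2) count as "yes"
```

If you decide on a different rule, for example setting aside the closest votes or keeping several classes, this function is the single place to change it; state the rule you chose in your report.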

** When we discussed Speech Recognition, you explored WER (Word Error Rate) in your assignment.  WER is a flavor of Edit Distance which looks at insertion, deletion and substitution. 
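As an illustration of the edit distance idea, here is a minimal Python sketch of a word-level edit distance with unit costs for insertion, deletion and substitution, followed by a simple threshold rule. The names `word_edit_distance` and `is_paraphrase`, the whitespace tokenization, the normalization by the longer sentence (WER normalizes by the reference length instead) and the default threshold of 0.5 are all assumptions made for this example; the threshold in particular is something to tune on the Dev Set.

```python
def word_edit_distance(words1, words2):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn words1 into words2 (unit cost for each operation)."""
    previous = list(range(len(words2) + 1))  # distances from an empty prefix
    for i, w1 in enumerate(words1, start=1):
        current = [i]
        for j, w2 in enumerate(words2, start=1):
            substitution = previous[j - 1] + (0 if w1 == w2 else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def is_paraphrase(sent1, sent2, threshold=0.5):
    """Answer Yes when the normalized word edit distance is at most threshold."""
    words1, words2 = sent1.lower().split(), sent2.lower().split()
    longest = max(len(words1), len(words2))
    if longest == 0:
        return True  # two empty sentences are trivially identical
    return word_edit_distance(words1, words2) / longest <= threshold
```

To add POS-specific penalties, the fixed substitution cost of 1 could be replaced by a cost that depends on the tags of the two words being compared.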

 

 

•       You are free to use any programming language you want.

•       The programming does not need to be complex.  I am more interested in you following a scientific approach than in the results you get.  Still, if you are at ease with deep learning models such as BERT, which we discussed in class, and want to use them, that would be good hands-on practice for you.  But again, this is an assignment, not a large project.

 

 
