In this competition, you have to perform a sentiment analysis task on users' textual reviews, determining whether each review expresses a positive or a negative sentiment.
In practice, you are required to build a robust classification model that predicts the sentiment expressed in a text.
1.1 Dataset
The dataset for this competition has been scraped specifically from the Italian website tripadvisor.it. It contains 41077 textual reviews written in Italian.
The dataset is provided as multi-line text files. Each line is composed of two fields: text and class. The text field contains the review written by the user, while the class field contains a label that takes one of the following values:
• pos: if the review shows a positive sentiment.
• neg: if the review shows a negative sentiment.
Dataset tree hierarchy
The data have been distributed in two separate collections. Each collection is in a different file.
The dataset archive is organized as follows:
• development.csv (Development set): a collection of reviews with the class column. This collection of data has to be used during the development of the classification model.
• evaluation.csv (Evaluation set): a collection of reviews without the class column. This collection of data has to be used to produce the submission file.
• sample_submission.csv: a sample submission file.
The dataset is located at:
http://dbdmg.polito.it/wordpress/wp-content/uploads/2020/01/dataset_winter_2020.zip
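As a minimal sketch of how the development set could be read once the archive is extracted, the snippet below parses rows with the two fields described above and counts the labels. It assumes the file is comma-separated with a header row naming the columns text and class; that layout is an assumption, not something stated here, so check the actual file first. The two in-memory rows and the helper name read_reviews are hypothetical stand-ins used only to keep the example self-contained.

```python
import csv
import io
from collections import Counter


def read_reviews(handle):
    """Return one dict per CSV row, keyed by the header names."""
    return list(csv.DictReader(handle))


# Hypothetical two-row stand-in for development.csv.
# Assumed layout: comma-separated, header row "text,class".
sample_development = io.StringIO(
    "text,class\n"
    '"Ottimo hotel, personale gentile",pos\n'
    '"Camera sporca e rumorosa",neg\n'
)

rows = read_reviews(sample_development)

# Count how many reviews carry each label (pos / neg).
label_counts = Counter(row["class"] for row in rows)
print(label_counts)
```

With the real data, the same helper could be called on a file opened with `open("development.csv", newline="", encoding="utf-8")`. Checking the label counts early is useful because a skewed class distribution affects how the model should be evaluated.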
1.2 Task
You are required to build a classification pipeline to assign a label to each record in the Evaluation set. The label specifies the sentiment of the review.
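To make the idea of a classification pipeline concrete, here is a small illustrative sketch: tokenize each review, fit a multinomial Naive Bayes model with add-one smoothing on labelled texts, then assign pos or neg to unlabelled texts. Naive Bayes is chosen here only as a simple baseline that needs no external libraries; it is not the method prescribed by the task. The class NaiveBayesSentiment, the tokenize helper, and the toy reviews are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)          # documents per label
        self.word_counts = defaultdict(Counter)    # token counts per label
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict_one(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(t) for t in texts]


# Toy stand-ins for the development and evaluation sets (hypothetical data).
train_texts = [
    "ottimo hotel personale gentile",
    "colazione ottima e camera pulita",
    "camera sporca e rumorosa",
    "pessimo servizio personale scortese",
]
train_labels = ["pos", "pos", "neg", "neg"]
eval_texts = ["hotel ottimo e pulita", "servizio pessimo camera sporca"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
predictions = model.predict(eval_texts)
print(predictions)
```

In a real submission, part of the Development set would normally be held out to estimate performance before labelling the Evaluation set, and a stronger feature representation or classifier would likely replace this baseline.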