1. Data cleaning
The tweets, as given, are not in a form amenable to analysis: there is too much ‘noise’.
Therefore, the first step is to “clean” the data. Design a procedure that prepares the
Twitter data for analysis by satisfying the requirements below.
o All html tags and attributes (i.e., /<[^>]+>/) are removed.
o Html character codes (i.e., &...;) are replaced with an ASCII equivalent.
o All URLs are removed.
o All characters in the text are in lowercase.
o All stop words are removed. Be clear in what you consider as a stop word.
o If a tweet is empty after pre-processing, it should be preserved as such.
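The requirements above can be sketched as a single cleaning function. This is a minimal illustration using only the Python standard library, not a prescribed solution: the function name `clean_tweet`, the short `STOP_WORDS` set, and the token pattern (which also drops punctuation, a step the requirements do not ask for) are all assumptions made for the example.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately short stop-word list. A real submission would
# likely use a fuller list (e.g. from an NLP library) and document that choice.
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "are", "to", "of",
              "in", "on", "for", "it", "this", "that"}

TAG_RE = re.compile(r"<[^>]+>")                # html tags and their attributes
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # http(s) and www links
TOKEN_RE = re.compile(r"[a-z0-9#@']+")          # assumption: keep hashtags and mentions


def clean_tweet(text: str) -> str:
    """Return the cleaned tweet; an empty string is returned as-is, not dropped."""
    text = TAG_RE.sub(" ", text)                 # remove html tags
    text = html.unescape(text)                   # &amp; -> &, &eacute; -> e with accent
    # Map accented characters to an ASCII base letter, drop anything else non-ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = URL_RE.sub(" ", text)                 # remove URLs
    text = text.lower()                          # lowercase everything
    tokens = [t for t in TOKEN_RE.findall(text) if t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet('<a href="x">Hello</a> &amp; the WORLD http://t.co/abc'))  # hello world
```

Tags are stripped before character codes are decoded so that an escaped `&lt;` in the tweet text is not mistaken for markup; the reverse order would be an equally defensible choice as long as it is stated.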
2. Exploratory analysis
o Design a simple procedure that determines the political party (Liberal, Conservative or New Democratic Party (NDP)) of a given tweet and apply this procedure to all the tweets in the Canadian Elections dataset. A suggestion would be to look at relevant words and hashtags in the tweets that are associated with particular political parties or candidates. What can you say about the distribution of the political affiliations of the tweets?
o Present a graphical figure (e.g. chart, graph, histogram, boxplot, word cloud, etc.) that visualizes some aspect of the generic tweets in sentiment_analysis.csv and another figure for the 2019 Canadian Elections tweets. All graphs and plots should be readable and have all axes appropriately labelled.
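One possible shape for the party-assignment procedure is a keyword count per party. The sketch below is only an illustration: the keyword sets in `PARTY_KEYWORDS`, the function name `tag_party`, and the "Mixed" and "None" fallback labels are assumptions, and a real keyword list should be built by inspecting the frequent words and hashtags in the dataset.

```python
import re
from collections import Counter

# Hypothetical keyword lists (party names, abbreviations, 2019 leader surnames).
PARTY_KEYWORDS = {
    "Liberal": {"liberal", "liberals", "lpc", "trudeau"},
    "Conservative": {"conservative", "conservatives", "cpc", "scheer"},
    "NDP": {"ndp", "singh"},
}


def tag_party(text: str) -> str:
    """Assign the party whose keywords occur most often in the tweet."""
    # Tokenising on letters and digits also strips '#', so '#cpc' matches 'cpc'.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    hits = {party: sum(tok in words for tok in tokens)
            for party, words in PARTY_KEYWORDS.items()}
    top = max(hits.values())
    if top == 0:
        return "None"            # no party keyword found
    leaders = [party for party, n in hits.items() if n == top]
    return leaders[0] if len(leaders) == 1 else "Mixed"


# Toy tweets, invented for the example.
sample = ["trudeau and the #lpc", "scheer vs singh", "nice weather today"]
distribution = Counter(tag_party(t) for t in sample)
print(distribution)
```

The resulting counts are what a bar chart of political affiliation would plot. How many tweets fall into "None" or "Mixed" is itself worth reporting when discussing the distribution.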
3. Model preparation
Split the generic tweets randomly into training data (70%) and test data (30%).
Prepare the data to try seven classification algorithms: logistic regression, k-NN, Naive Bayes, SVM, decision trees, Random Forest and XGBoost, where each tweet is considered a single observation/example. In these models, the target variable is the sentiment value, which is either positive or negative. Try two different types of features, Bag of Words (word frequency) and TF-IDF, with all seven models. (Hint: Be careful about when to split the dataset into training and testing sets.)
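One common reading of the hint is that the split should happen before the features are computed, so that the vocabulary and IDF weights are learned from the training tweets only. The sketch below illustrates that order with scikit-learn. The toy tweets and labels are invented for the example, and the use of scikit-learn is an assumption, since the handout does not name a library.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned generic tweets (1 = positive, 0 = negative).
texts = ["love great day", "happy good fun", "awesome nice win",
         "great happy smile", "good love best",
         "hate bad day", "sad awful loss", "terrible worst fail",
         "bad sad angry", "awful hate cry"]
labels = [1] * 5 + [0] * 5

# Split the raw text first: 70% training, 30% test.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

# Bag of Words: fit the vocabulary on the training text only, then reuse it.
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train_txt)
X_test_bow = bow.transform(X_test_txt)

# TF-IDF: same pattern, so IDF weights never see the test tweets.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train_txt)
X_test_tfidf = tfidf.transform(X_test_txt)

print(X_train_bow.shape, X_test_bow.shape)
```

Words that appear only in the test tweets are ignored by `transform`, which mirrors what happens when a trained model meets unseen tweets.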
4. Model implementation and tuning
Train the models on the training data from the generic tweets and apply each model to the test data to obtain an accuracy value. Then evaluate the best-performing trained model on the Canadian Elections data. How well do your predictions match the sentiment labels in the Canadian Elections data?
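The train, compare, and re-evaluate loop described above might look like the following sketch. Only two of the seven algorithms are shown, the toy tweets are invented, and the assumption that both datasets encode sentiment as 1 (positive) and 0 (negative) would need to be checked against the actual files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the cleaned generic and election tweets.
generic_texts = ["love great day", "happy good fun", "awesome nice win",
                 "great happy smile", "good love best",
                 "hate bad day", "sad awful loss", "terrible worst fail",
                 "bad sad angry", "awful hate cry"]
generic_labels = [1] * 5 + [0] * 5
election_texts = ["great win for the party", "awful terrible debate"]
election_labels = [1, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    generic_texts, generic_labels, test_size=0.3, random_state=0,
    stratify=generic_labels)

candidates = {"logreg": LogisticRegression(max_iter=1000),
              "naive_bayes": MultinomialNB()}
scores, fitted = {}, {}
for name, clf in candidates.items():
    # The pipeline fits the vectorizer on the training text only.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    fitted[name] = model

# Reuse the best generic-tweet model, unchanged, on the election tweets.
best_name = max(scores, key=scores.get)
election_acc = accuracy_score(election_labels,
                              fitted[best_name].predict(election_texts))
print(best_name, scores[best_name], election_acc)
```

Wrapping the vectorizer and classifier in one pipeline means the election tweets are transformed with the vocabulary learned from the generic training tweets, which is what "the same trained model" requires.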
Choose the model that has the best performance and visualize the sentiment prediction results and the true sentiment for each of the 3 parties/candidates. Discuss whether NLP analytics based on tweets is useful for political parties during election campaigns.
Split the negative Canadian elections tweets into training data (70%) and test data (30%). Use the true sentiment labels in the Canadian elections data instead of your predictions from the previous part. Choose any three classification algorithms from logistic regression, k-NN, Naive Bayes, SVM, decision trees and ensembles (RF, XGBoost), and train multi-class classification models to predict the reason for the negative tweets. Tune the hyperparameters and choose the model with the best score to test your predictions of the reason for negative-sentiment tweets. There are 5 different negative reasons labelled in the dataset.
Feel free to combine similar reasons into fewer categories as long as you justify your reasoning. You are free to define input features of your model using word frequency analysis or other techniques.
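Hyperparameter tuning for the negative-reason model can be sketched with a cross-validated grid search. The example below shows one of the three algorithms only. The reason labels (`reason_a`, `reason_b`, `reason_c`), the repeated toy tweets, and the parameter grid are all placeholders invented for illustration and do not reflect the labels in the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for negative election tweets with placeholder reason labels.
texts = (["scandal corruption lies"] * 6
         + ["tax economy jobs cut"] * 6
         + ["climate pipeline oil"] * 6)
reasons = ["reason_a"] * 6 + ["reason_b"] * 6 + ["reason_c"] * 6

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, reasons, test_size=0.3, random_state=0, stratify=reasons)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: n-gram range for the features, C for the classifier.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Cross-validation runs on the training split only; the test split is
# touched once, after the best setting has been chosen.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)
test_acc = search.score(X_te, y_te)
print(search.best_params_, test_acc)
```

Repeating the same search for the other two chosen algorithms and comparing `best_score_` gives a like-for-like basis for picking the final model. With imbalanced reason categories, a metric other than accuracy may be worth considering.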
5. Results
Answer the research question stated above based on the outputs of your first model. Describe the results of the analysis and discuss your interpretation of the results. Explain how each party is viewed in the public eye based on the sentiment value. For the second model, based on the model that worked best, provide a few reasons why your model may fail to predict the correct negative reasons. Back up your reasoning with examples from the test sets. For both models, suggest one way you can improve the accuracy of your models.