The objective of this project is to apply NLP techniques to improve Arabic dialect identification on a user-generated-content dataset.
In this assignment, you are provided with a large-scale collection of parallel sentences in the travel domain covering the dialects of 25 cities from the Arab World, plus Modern Standard Arabic (MSA). The task is to build systems that predict, for each given sentence, one of the 26 dialect labels (25 cities + MSA).
Dataset
The data for this assignment is the same as that reported on in the following papers.
Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., et al. (2018). The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the 11th International Conference on Language Resources and Evaluation. (PDF: http://www.lrec-conf.org/proceedings/lrec2018/pdf/351.pdf)
Salameh, M., Bouamor, H. & Habash, N. (2018). Fine-Grained Arabic Dialect Identification. In Proceedings of the 27th International Conference on Computational Linguistics. (PDF: http://aclweb.org/anthology/C18-1113)
Systems’ Evaluation Details
Evaluation will be done by calculating the micro-averaged F1 score (F1µ) over all dialect classes on the submitted predicted class of each sample in the Test set. To be precise, we define the scoring as follows:
Pµ = Σi TPi / Σi (TPi + FPi)
Rµ = Σi TPi / Σi (TPi + FNi)
where i ranges over the 26 dialect labels, TPi is the number of samples of class i that are correctly predicted, and FPi and FNi are the counts of Type-I errors (false positives) and Type-II errors (false negatives), respectively, for the samples of class i.
The final metric F1µ will be calculated as the harmonic mean of Pµ and Rµ.
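Concretely, the scoring can be sketched as follows (note that for single-label multi-class prediction, Pµ, Rµ, and F1µ all reduce to plain accuracy):

```python
from collections import Counter

def micro_scores(gold, pred):
    """Micro-averaged precision, recall and F1 over all classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # predicted p, but the gold label was g
            fn[g] += 1   # missed an instance of class g
    tp_sum = sum(tp.values())
    p_mu = tp_sum / (tp_sum + sum(fp.values()))
    r_mu = tp_sum / (tp_sum + sum(fn.values()))
    f1_mu = 2 * p_mu * r_mu / (p_mu + r_mu)
    return p_mu, r_mu, f1_mu

# Toy example using three of the 26 labels (city codes are illustrative):
gold = ["MSA", "CAI", "CAI", "BEI"]
pred = ["MSA", "CAI", "BEI", "BEI"]
p_mu, r_mu, f1_mu = micro_scores(gold, pred)  # all 0.75: 3 of 4 correct
```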
Implementation
Step 1: Data preprocessing
Download the Training and Development Data from Canvas, then pre-process the dataset (cleaning, tokenization, etc.).
You can find an Arabic text tokenizer (Farasa) within the Project_1 folder.
You can also use Camel Tools (already installed on the C127 machines).
https://github.com/CAMeL-Lab/camel_tools
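As a minimal illustration of the cleaning and tokenization step, the sketch below uses only the Python standard library; Farasa or Camel Tools are more complete replacements for real use:

```python
import re

# Simplified stand-in for a proper Arabic tokenizer: strip diacritics
# and tatweel, then split punctuation off words.
AR_NOISE = re.compile(r"[\u064B-\u0652\u0670\u0640]")  # harakat, dagger alif, tatweel

def preprocess(text):
    text = AR_NOISE.sub("", text)                # remove diacritics/tatweel
    text = re.sub(r"([^\w\s])", r" \1 ", text)   # space out punctuation
    return text.split()                          # whitespace tokenization

tokens = preprocess("كيف الحال؟")  # → ['كيف', 'الحال', '؟']
```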
Step 2: System implementation
Your task is to implement and compare three different text classification methods as follows.
1- Feature-Based Classification for Dialectal Arabic
Use and compare two different feature-based classification methods (classical machine-learning techniques) to implement your Arabic dialect identification system. Your models should use various n-gram features as follows:
• Word n-gram features: uni-grams, bi-grams, and tri-grams;
• Character n-gram features, with and without word-boundary consideration, from bi-grams up to 5-grams.
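A sketch of one such feature-based system, assuming scikit-learn (an assumption; any classical-ML toolkit works, and e.g. MultinomialNB or logistic regression can be swapped in as the second classifier to compare):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Word uni/bi/tri-grams:
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
# Character 2- to 5-grams; "char_wb" respects word boundaries,
# plain "char" is the variant without word-boundary consideration:
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))

# Toy training data; replace with the MADAR corpus (labels are city codes):
texts = ["شلونك اليوم", "ازيك عامل ايه", "كيفك اليوم"]
labels = ["BAG", "CAI", "BEI"]

model_w = make_pipeline(word_vec, LinearSVC()).fit(texts, labels)
model_c = make_pipeline(char_vec, LinearSVC()).fit(texts, labels)
preds = model_c.predict(texts)
```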
2- LSTM Deep Network
Use the Long Short-Term Memory (LSTM) architecture with the AraVec pre-trained word embedding models. These models were built using the gensim Python library. Here are the steps for using them:
1. Install gensim >= 3.4 and nltk >= 3.2 using either pip or conda:
pip install gensim nltk
conda install gensim nltk
2. Extract the compressed model files to a directory (e.g. Twittert-CBOW).
3. Keep the .npy files; you will load the companion file that has no extension.
4. Run the Python code to load and use the model.
You can find simple example code for loading and using one of the models by following these steps at the following link:
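Once a model is loaded, its vectors are typically copied into an embedding matrix that initialises the LSTM's embedding layer. A self-contained sketch of this step (the dict `wv` is a stand-in for the loaded AraVec model's `model.wv` after `gensim.models.Word2Vec.load`; 300 is AraVec's usual dimensionality):

```python
import numpy as np

# Stub for the AraVec word-vector store (model.wv in the real setup):
wv = {"كيف": np.ones(300), "الحال": np.full(300, 0.5)}

vocab = ["<pad>", "<unk>", "كيف", "الحال", "شلونك"]
word2idx = {w: i for i, w in enumerate(vocab)}

emb_matrix = np.zeros((len(vocab), 300), dtype=np.float32)
for word, idx in word2idx.items():
    if word in wv:
        emb_matrix[idx] = wv[word]  # copy the pre-trained vector
    # OOV words and <pad>/<unk> stay zero-initialised

# emb_matrix can then initialise the embedding layer of an LSTM
# classifier (e.g. Keras: Embedding(..., weights=[emb_matrix]))
# whose output layer has 26 units, one per dialect label.
```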
3- BERT for Dialectal Arabic
BERT, or Bidirectional Encoder Representations from Transformers, was recently introduced by Google AI Language researchers (Devlin et al., 2018). It replaces the sequential nature of RNNs (LSTM & GRU) with a much faster attention-based approach. The model is also pre-trained on two unsupervised tasks, masked language modeling and next-sentence prediction. This allows you to take a pre-trained BERT model and fine-tune it on downstream tasks such as Dialectal Arabic classification.
You can employ multilingual BERT, which has been pre-trained on MSA (among many other languages), and then fine-tune it for Dialectal Arabic.
I recommend reading the following blog post: Multi-label Text Classification using BERT.
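A hedged sketch of the fine-tuning setup, assuming the Hugging Face `transformers` library (an assumption; any BERT fine-tuning toolkit works, and the model weights download on first use):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "bert-base-multilingual-cased" is the standard multilingual checkpoint.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=26)

batch = tokenizer(["كيف حالك"], return_tensors="pt", padding=True, truncation=True)
out = model(**batch, labels=torch.tensor([0]))  # 0 = some gold dialect index
out.loss.backward()  # one fine-tuning step (wrap in an optimizer loop)
```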
4- Evaluation
Use the MADAR-DID-Scorer.py script to evaluate your systems.
Streaming and Classifying Arabic Tweets from Twitter
For those of you who might be interested in applying the Dialectal Arabic classifier to real Arabic tweets, it is possible to create a developer account with Twitter.
You can use the Twitter API to stream tweets based on a specific keyword.
See the following link for a guide: https://pythonprogramming.net/twitter-api-streaming-tweets-python-tutorial/
Please note it can take a little time to get this working. You can then pipe these tweets through your model to classify their dialect. For example, you could raise the classifier's probability threshold so that only confident predictions are counted, then monitor/graph the distribution of predicted dialects throughout an event such as a sporting match.