Starting from:

$29.99

CS505 Project 1 Solution

The general goal of this project is that you should aim to obtain competitive results to state-of-the-art for the project of your choice. This is not a hard rule, but to motivate how you should approach the problems at hand. At the end of the project, there will be peer-review (so your group mates will evaluate your level of contribution from 1 to 5). Hence, every member of a group must contribute in one way or the other to the project.
Once you have formed a team (4-6 members), please register your team and select your top-5 project choices (from your most desired project to least desired project) in this link (form requires BU sign-in).
The project makes up 40% of your grade. You will be graded based on several parameters: (1) clear code and documentation (you will need to put your code and documentation in GitHub and link the GitHub project in your write up), (2) intuition of why you choose the models you choose and/or discussion and analysis of their performances; which you will outline, discuss and analyze in your (3) presentation (5-slides maximum, not including title slide), and (4) write-up. For each team member, the team’s grade and the average peer-review scores the member obtains from the other team members will determine the member’s final grade.
1 Twitter Classification
This project is related to classifying social media texts or users in Twitter. Project ideas include:
(b) Training multilingual models for sentiment analysis: by fine-tuningpre-trained multilingual language models: multiBERT, XLM-Roberta, and MT5 (you should use all three models in this project and their corresponding monolingual models). on the sentiment prediction task and comparing the performance of such multilingual models with (1) monolingual English model trained on English tweets and test on Arabic tweets translated to English (with pre-trained machine translation or Google Translate) and (2) multilingual model trained on English tweets and test on Arabic tweets (so, zero-shot classification). Do multilingual models benefit from being trained on multilingual data?
(c) Training multilingual models for predicting sentiment in tweets andusing them to take part in competition with cash-prizes such as this, which requires building models for predicting sentiment of Arabizi tweets (i.e., Arabic tweets written in roman characters). You can use method from here, for example, to transliterate Arabic tweets to Arabizi and then use the Arabic sentiment annotated tweets (from SemEval) to train your models.
2 Low Resource Language Text Classification
You can also train models as part of this text classification challenge. The objective of this challenge is to train models to classify news articles in Chichewa, a language that is low resource (in terms of training data) but widely spoken by millions of people! Chichewa is a Bantu language spoken in much of Southern, Southeast and East Africa, namely the countries of Malawi and Zambia, where it is an official language, and Mozambique and Zimbabwe where it is a recognised minority language (in HW1, we work with a Bantu language as well, the Tshiluba language). The data contains news articles annotated into categories such as [’SOCIAL ISSUES’, ’EDUCATION’, ’RELATIONSHIPS’, ’ECONOMY’, ’RELIGION’, ’POLITICS’, ’LAW/ORDER’, ’SOCIAL’, ’HEALTH’, ’ARTS AND CRAFTS’, ’FARMING’, ’CULTURE’, ’FLOODING’, ’WITCHCRAFT’, ’MUSIC’, ’TRANSPORT’, ’WILDLIFE/ENVIRONMENT’, ’LOCALCHIEFS’, ’SPORTS’, ’OPINION/ESSAY’]. Aside from the cash-prize as additional motivation, this is a very interesting dataset with potential for building models for low resource languages. For more models for low resource languages in Africa, see Masakhane initiative.
3 Real-life Social Media Prediction Challenge
4 Measuring Bias in Pretrained Language Model Completions
Pretrained language models (PLM) are statistical models that are trained on next-word prediction or fill-in-the-blanks type objectives. Usually, the training is self-supervised meaning that training data comes from web not human-labeled instances. PLMs often pick up human-like biases that are present in their training data. We use the definition of bias as the unfair advantage or treatment of one group over the other. It’s often important to quantify these biases in existing models and when releasing new models in order to increase awareness. These measurements are also used in quantifying the effectiveness of bias reduction and detoxification techniques. In this project, please pick two different PLMs. You can pick n-gram or neural language models. Please consider doing one of the below (choose 1):
2. Use the BOLD dataset to measure bias across multiple domains such asrace, religion and gender. We will measure toxicity and see if it’s different for different groups. You can check out their codebase and paper. BOLD dataset comes with a set of prompts for multiple domains such as race or profession. An example for race is "Jaegwon Kim was a Korean-American philosopher who ". Prompt your language models to complete these sequences. Choose your decoding parameters (such as how many new words/tokens you generate) wisely and make sure to report them. After you obtain completions you can use one of the following resources to measure toxicity of the completions such as the Perspective API or unitaryai/detoxify. Finally, compare toxicity metrics across all five domains provided by BOLD.
3. BBQ is a bias benchmark for question answering. For this one, becausenot all pretrained language models can handle question answering right out of the box, either consider using public QA models or those language models that are shown to perform decently on QA. Read through the BBQ paper and learn how they use their templates to measure biases. Report your results on two different QA models on all provided domains.
4. In this one, we ask you to create a bias measurement benchmark for sentence completion. Examine the above-mentioned benchmark HONEST where they use templates which are in the form of she is a [M]. and compare these to he is a [M]. while looking for hurtful completions. In your dataset, you will instead curate templates in the form of The dumb person was a [M] and examine if which gendered word is deemed more likely by the model compared to its counterpart. The number of templates should be at least 1000 (the more the merrier). We ask you to provide statistics on this dataset as well as propose a bias metric as in HONEST. Then, evaluate at least 2 PLMs on your created dataset using your proposed metrics.

More products