CS6603 Project #3 Solution

Your shopping cart is empty.

Homework Project #3

Readings:

In this assignment, you’ll continue the process of exploring relationships in data. You’ll accomplish this task by computing some basic inferential statistical measures on a natural language-based dataset.

Natural language processing is concerned with the ability to process and analyze large amounts of natural language data, whether for automated sentence completion in emails, conversational agents and chatbots, or AI tools to help journalists. In this assignment, we will work with data from a classifier built to identify toxicity in comments from Wikipedia Talk Pages. The model is built from a dataset of 127,820 Talk Page comments, each labeled by human raters as toxic or non-toxic. A toxic comment is defined as a “rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion.”

Step 1 – Download the modified dataset available on CANVAS – toxity_per_attribute.csv:
● <Wiki_ID> is unique identifier associated with Wikipedia comment
● <TOXICITY> is a toxicity value from 1 if the comment was considered toxic and value 0 if the comment was considered neutral or healthy
● < subgroup > columns: One column per human attribute; True if the comment mentioned this identity.

Step 2: Identify the protected class categories and members associated with each protected class category.
● For each protected class category, identify its relevant protected class members (e.g. christian + muslim + X -> Religion)
● Provide the classification results (i.e. list of protected class categories and their associated protected class members)

Step 3:
● Create a reduced data set by deleting any rows that have all FALSE values for every column in that row. Note: This is the reduced data set that you will use in all subsequent steps.
about similarities among the group members (e.g. gender identify: FALSE = 0; all others = 1; female = 2).
● As guidance, can use (Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Brooks/Cole Publishing) which suggests the following related to the absolute value of the correlation coefficient:
o .00-.19 “very weak” correlation o .20-.39 “weak” correlation o .40-.59 “moderate” correlation o .60-.79 “strong” correlation
o .80-1.0 “very strong” correlation

Example Output (for illustrative purposes only):

Classification Results - Protected Class Variables:
● Religion: christian, muslim
● Age: younger, older

Correlation Coefficients:
Religion (Protected Class Variable) Age (Protected Class Variable)
TOXICITY 0.03 0.7
CORRELATION STRENGTH Very weak correlation Strong correlation

TOXICITY and Age are strongly correlated.

Age

For these questions, choose only one of the protected class categories.

Step 5: Using your reduced data set, calculate the mean and standard deviation of TOXICITY associated with the protected class category (Hint: TOXICITY values should only be included in the calculation when the associated protected class value is not FALSE). Run the random sampling method using 10% and 60% of the data. For each, what is the mean and standard deviation? Calculate MoE using each sample size and Indicate (yes/no) if the mean values lie within the associated population margin of error.

Step 6: Using your reduced data set, calculate the mean and standard deviation of TOXICITY associated with each subgroup that is a member of the protected class category (Hint: TOXICITY values should only be included in the calculation when the associated protected class value is not FALSE). Run the random sampling method using 10% and 60% of the data. For each subgroup, what is the mean and standard deviation? Calculate MoE using each sample size and indicate (yes/no) if the mean values lie within the associated population margin of error.

Step 7: Plot (on one graph) - 1) the computed population mean/standard deviation (Step 4), (2) the computed mean/standard deviation for the protected class category (Step 5), and (3) the computed mean/standard deviation for each subgroup of the protected class category (Step 6). Which of the subgroups has the highest TOXICITY value? Which of the subgroups has the lowest TOXICITY value? Which of the subgroups has the largest difference in TOXICITY value when compared to the population mean? Does there seem to be any human bias in the data? Explain (in no more than 3-5 sentences).

Step 8: Turn in a report (in PDF format) documenting your outputs in each Step. The report should follow the JDF format. Jupyter notebook (ipynb files) submission is optional, but a final PDF document per JDF format is required. The file name for submission is GTuserName_Assignment_3, for example, Joyner03_Assignment_3. Reports that are not neat and well organized will receive up to a 10-point deduction. All charts, graphs, and tables should be generated in Python or Excel, or any other suitable software application, else appropriate points will be deducted, which could be the maximum.

Shopping cart

US$0

CS6603 Project #3 Solution

More products