In this assignment, you’ll begin the process of exploring relationships in data. You’ll accomplish this task by computing some basic statistical measures on one of three datasets. This is a good time to learn or reboot your Python coding skills.
Step 1 - Select one of the datasets for completion of this assignment:
• [mental-health-in-tech-survey.csv] Mental Health in Tech Survey: Survey on Mental Health in the Tech Workplace in 2014 - https://osmihelp.org/research/
Dependent Variables:
o treatment: Have you sought treatment for a mental health condition? (Yes/No)
o mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences? (Yes/Maybe/No)
o phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences? (Yes/Maybe/No)
• [diabetic_data.csv] Diabetes 130 US hospitals for years 1999-2008: Diabetes – readmission - https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Dependent Variables:
o time_in_hospital: a numeric value representing number of days between admission and discharge
o readmitted: Days to inpatient readmission - “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.
• [compas-scores-two-years.csv] COMPAS Recidivism Racial Bias: Racial Bias in inmate COMPAS reoffense risk scores for Florida (ProPublica) - https://github.com/propublica/compas-analysis
Dependent Variables:
o decile_score: a numeric value between 1 and 10 corresponding to the recidivism risk score generated by COMPAS software (a small number corresponds to a low risk, a larger number corresponds to a high risk).
o two_year_recid: a numeric indicator of whether the defendant recidivated two years after previous charge (0: no, did not recidivate, 1: yes, did recidivate)
Step 2 - Explore the data by answering the following questions:
• Which dataset did you select?
• How many observations are in the dataset?
• How many variables are in the dataset?
• Does this dataset seem to belong to a regulated domain in law as discussed in the lectures? If yes, which one?
• How many variables in the dataset are associated with a legally recognized protected class? In a table, list the variables associated with a protected class, and for each one identify the protected class and the associated legal precedent/law as discussed in the lectures.
Example Output (associated with a different dataset):
Dataset: Housing Decisions in Metro-Atlanta
Number of Observations: 1,400
Number of Variables: 16
Regulated Domain in Law: Housing (Fair Housing Act)
Number of Protected Class Variables: 2
Variable | Protected Class | Law
nationality | National origin | Civil Rights Act of 1964, 1991
pregnant (y/n) | Pregnancy | Pregnancy Discrimination Act
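The observation and variable counts asked for in Step 2 can be read straight from the CSV file. The following is a minimal sketch using only the Python standard library; the inline sample data, its column names, and the `describe` helper are hypothetical stand-ins, and a real run would open the chosen dataset file instead.

```python
import csv
import io

# Hypothetical three-row extract; a real run would open the chosen CSV file instead,
# e.g. open("diabetic_data.csv", newline="").
SAMPLE_CSV = """applicant_id,nationality,pregnant,decision
1,US,N,Y
2,MX,Y,N
3,US,N,Y
"""

def describe(csv_file):
    """Return (number of observations, number of variables) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)                      # one dict per observation
    return len(rows), len(reader.fieldnames)  # header names are the variables

n_obs, n_vars = describe(io.StringIO(SAMPLE_CSV))
print(f"Number of Observations: {n_obs:,}")
print(f"Number of Variables: {n_vars}")
```

Each data row counts as one observation and each header column as one variable; whether an ID column should count as a variable is a judgment call left to the analyst.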
Step 3 - Determine the relationships between dependent and independent variables
The frequency of a value represents the number of times a value occurs in a data set. Compute the frequency of each value associated with each dependent variable (listed in Step 1) as a function of all of the protected class variables (independent variables) identified in Step 2. Create histogram(s) comparing the frequency values of the dependent variable as a function of the independent variable. Hint: For variables that are continuous, you might consider creating intervals that represent the data. For categorical/ordinal/nominal values, you might consider converting to numerical values.
Example Output for One Dependent-Independent Variable Combination:
Independent Variable - Protected Class Variable | Dependent Variable - Housing Decision (Y/N)
Pregnant – Y | Frequency of Y: 50; Frequency of N: 120
Pregnant – N | Frequency of Y: 130; Frequency of N: 20
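One way to compute such frequencies is to count each (independent value, dependent value) pair. The sketch below rebuilds records from the example counts above and also shows the interval hint for continuous variables; the helper names `frequency_table` and `to_interval`, and the list of ages, are illustrative assumptions rather than part of the assignment.

```python
from collections import Counter

# Records rebuilt from the example counts above: (pregnant, housing_decision).
records = (
    [("Y", "Y")] * 50 + [("Y", "N")] * 120 +
    [("N", "Y")] * 130 + [("N", "N")] * 20
)

def frequency_table(pairs):
    """Count each (independent value, dependent value) combination."""
    return Counter(pairs)

def to_interval(value, width):
    """Map a whole-number value to an interval label, e.g. 37 with width 10 -> '30-39'."""
    low = int(value // width) * width
    return f"{low}-{low + width - 1}"

table = frequency_table(records)
for group in ("Y", "N"):
    print(f"Pregnant - {group}: Y={table[(group, 'Y')]}, N={table[(group, 'N')]}")

# Binning a continuous variable (hypothetical ages) into 10-year intervals before counting.
ages = [23, 37, 31, 45, 38]
age_bins = Counter(to_interval(a, 10) for a in ages)
print(dict(age_bins))
```

The resulting counts can be passed to any plotting tool to draw the grouped histogram; the counting step itself does not depend on the plotting library.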
Step 4 - Show how data can be manipulated
Select one protected class variable (independent variable) and one dependent variable. 1) Create a graph to support the “fairness” hypothesis: The system is fair. There is no difference in the outcomes. 2) Create a graph to support the bias hypothesis: The system is biased. There is a difference in the outcomes. For each, provide a brief description of your manipulations.
Example Output:
1) Fair Hypothesis: As seen from this graph, housing decisions are not dependent on the pregnancy status of women. [Manipulations: Used line graph; Increased scale to ±50; Mapped the ratio of positive Y decisions (i.e., 50/180 versus 130/180); No label on the Y-axis].
[Graph: Difference in Housing Decisions Based on Pregnancy]
2) Bias Hypothesis: As seen from this graph, housing decisions are significantly dependent on the pregnancy status of women. [This hypothesis was easily supported by the data, so it did not require much manipulation: Used stacked bar graph; Reduced scale; Reworded labels].
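The effect of the scale manipulation described above can be illustrated without a plotting library. The sketch below draws text bars for the two example ratios (50/180 and 130/180) on a very wide axis and on a tight axis; the `bar` and `chart` helpers and the tight axis limits of 0.25 to 0.75 are assumptions chosen for illustration.

```python
def bar(value, axis_min, axis_max, width=40):
    """Draw a text bar whose length is the value's position on the chosen axis."""
    fraction = (value - axis_min) / (axis_max - axis_min)
    fraction = min(max(fraction, 0.0), 1.0)  # clip to the visible axis range
    return "#" * round(fraction * width)

# Ratios of positive decisions from the example output (50/180 versus 130/180).
ratios = {"Pregnant - Y": 50 / 180, "Pregnant - N": 130 / 180}

def chart(axis_min, axis_max):
    """Return one text bar per group for the given axis limits."""
    return {label: bar(v, axis_min, axis_max) for label, v in ratios.items()}

wide = chart(-50, 50)       # very wide axis: the two bars look the same
tight = chart(0.25, 0.75)   # tight axis: the same gap fills most of the chart

for name, bars in (("wide axis (-50 to 50)", wide), ("tight axis (0.25 to 0.75)", tight)):
    print(name)
    for label, drawn in bars.items():
        print(f"  {label:13s} {drawn}")
```

The underlying numbers are identical in both charts; only the axis limits change, which is the kind of manipulation the brief description for each graph should record.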
Step 5: Given your selected protected class variable (independent variable), calculate the average (mean, median, and mode) values of the protected class group (Hint: Variables might need to be converted to numerical values). Run the random sampling method using 50% of the data to create a reduced dataset. Calculate the same averages (mean, median, and mode) for the protected class group in the reduced dataset. Indicate whether there is a difference between the original dataset and the reduced dataset for any of the averages. Provide all results.
Protected Class Variable (Pregnant) | Mean | Median | Mode
Original Data Set | 0 (NO) | 0 (NO) | 0 (NO)
Reduced Data Set | 0 (NO) | 1 (YES) | 0 (NO)
Difference | No Difference | Difference | No Difference
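A minimal sketch of the Step 5 calculation, using only the standard library, might look like the following. The encoded column (30 ones and 70 zeros) is hypothetical data, not taken from the example table, and the seed value is an arbitrary choice made so that the 50% sample is reproducible.

```python
import random
import statistics

def averages(values):
    """Return mean, median, and mode of a numerically encoded variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

# Hypothetical protected class column encoded as numbers: Y -> 1, N -> 0.
encoded = [1] * 30 + [0] * 70

rng = random.Random(42)  # fixed seed (arbitrary) so the sample can be reproduced
reduced = rng.sample(encoded, k=len(encoded) // 2)  # 50% sample without replacement

original_avgs = averages(encoded)
reduced_avgs = averages(reduced)
for key in original_avgs:
    changed = "Difference" if original_avgs[key] != reduced_avgs[key] else "No Difference"
    print(f"{key}: original={original_avgs[key]}, reduced={reduced_avgs[key]} -> {changed}")
```

Sampling without replacement is assumed here; if the lectures define the random sampling method differently (for example, with replacement), `rng.choices` would be the closer match.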
Step 6: Given your reduced dataset from Step 5, repeat Step 3 (frequency and histogram) using your selected independent variable as a function of your selected dependent variable (from Step 4). Explain any differences (in no more than 2 sentences). If you used the random sampling method, would members associated with the protected class variable benefit or be harmed? Explain your reasoning (in no more than 2 sentences).
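To compare the original and reduced datasets, it can help to look at the share of positive outcomes per group rather than raw counts, since the reduced dataset has half as many rows. The sketch below reuses the Step 3 example counts; the `positive_rate` helper and the seed are illustrative assumptions.

```python
import random
from collections import Counter

def positive_rate(pairs, group):
    """Share of rows in `group` whose dependent value is 'Y'."""
    counts = Counter(pairs)
    total = counts[(group, "Y")] + counts[(group, "N")]
    return counts[(group, "Y")] / total if total else float("nan")

# (pregnant, housing_decision) rows rebuilt from the Step 3 example counts.
original = (
    [("Y", "Y")] * 50 + [("Y", "N")] * 120 +
    [("N", "Y")] * 130 + [("N", "N")] * 20
)
# 50% random sample without replacement; the seed is arbitrary.
reduced = random.Random(0).sample(original, k=len(original) // 2)

for group in ("Y", "N"):
    before = positive_rate(original, group)
    after = positive_rate(reduced, group)
    print(f"Pregnant - {group}: original={before:.3f}, reduced={after:.3f}, "
          f"shift={after - before:+.3f}")
```

A shift in a group's positive rate between the two datasets is one possible basis for arguing whether sampling would benefit or harm that group.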