CS8803-Assignment 2 Solved

In this assignment, you’ll begin the process of exploring relationships in data. You’ll accomplish this task by computing some basic statistical measures on one of three datasets. This is a good time to learn or reboot your Python coding skills.

 

Step 1 - Select one of the datasets for completion of this assignment:

•        [mental-health-in-tech-survey.csv] Mental Health in Tech Survey: Survey on Mental Health in the Tech Workplace in 2014 - https://osmihelp.org/research/

 

Dependent Variables:  

o   treatment: Have you sought treatment for a mental health condition? (Yes/No)

o   mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences? (Yes/Maybe/No)

o   phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences? (Yes/Maybe/No)

 

•        [diabetic_data.csv] Diabetes 130 US hospitals for years 1999-2008: Diabetes – readmission - https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

 

Dependent Variables:  

o   time_in_hospital: a numeric value representing number of days between admission and discharge

o   readmitted: Days to inpatient readmission - “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.

 

•        [compas-scores-two-years.csv] COMPAS Recidivism Racial Bias: Racial Bias in inmate COMPAS reoffense risk scores for Florida (ProPublica) - https://github.com/propublica/compasanalysis

 

Dependent Variables:

o   decile_score: a numeric value between 1 and 10 corresponding to the recidivism risk score generated by COMPAS software (a small number corresponds to a low risk, a larger number corresponds to a high risk).

o   two_year_recid: a numeric indicator of whether the defendant recidivated two years after previous charge (0: no, did not recidivate, 1: yes, did recidivate)

 

 

Step 2 - Explore the data by answering the following questions:

•        Which dataset did you select?

•        How many observations are in the dataset?

•        How many variables are in the dataset?

•        Does this dataset seem to belong to a regulated domain in law as discussed in the lectures? If yes, which one?

•        How many variables in the dataset are associated with a legally recognized protected class? In a table format, list those variables associated with a protected class, identify the protected class, and give the associated legal precedent/law as discussed in the lectures.

 

Example Output (associated with a different dataset) - Dataset: Housing Decisions in Metro-Atlanta

Number of Observations: 1,400

Number of Variables: 16

Regulated Domain in Law: Housing (Fair Housing Act)

Number of Protected Class Variables: 2

 
Variable | Protected Class | Law
nationality | National origin | Civil Rights Act of 1964, 1991
pregnant (y/n) | Pregnancy | Pregnancy Discrimination Act
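
The first two questions in Step 2 (number of observations and number of variables) can be answered with a few lines of Python. The following is a minimal sketch using only the standard library. SAMPLE_CSV and its column names are hypothetical stand-ins for whichever dataset file you select, and describe is an illustrative helper name, not part of the assignment.

```python
import csv
import io

# Hypothetical stand-in for a real file such as diabetic_data.csv;
# the columns and rows below are made up purely for illustration.
SAMPLE_CSV = """nationality,pregnant,housing_decision
US,N,Y
MX,Y,N
US,N,Y
"""

def describe(csv_file):
    """Return (number of observations, number of variables) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    n_observations = sum(1 for _ in reader)       # one observation per data row
    n_variables = len(reader.fieldnames or [])    # one variable per header column
    return n_observations, n_variables

n_obs, n_vars = describe(io.StringIO(SAMPLE_CSV))
print(f"Number of Observations: {n_obs:,}")
print(f"Number of Variables: {n_vars}")
```

For a real dataset, replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline=""). Identifying the regulated domain and the protected class variables still requires reading the column descriptions by hand.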
 

 

Step 3 - Determine the relationships between dependent and independent variables

The frequency of a value represents the number of times a value occurs in a data set. Compute the frequency of each value associated with each dependent variable (listed in Step 1) as a function of all of the protected class variables (independent variables) identified in Step 2. Create histogram(s) comparing the frequency values of the dependent variable as a function of the independent variable. Hint: For variables that are continuous, you might consider creating intervals that represent the data. For categorical/ordinal/nominal values, you might consider converting to numerical values.

 

Example Output for One Dependent-Independent Variable Combination:   

Independent Variable - Protected Class Variable | Dependent Variable - Housing Decision (Y/N)
Pregnant – Y | Frequency of Y: 50; Frequency of N: 120
Pregnant – N | Frequency of Y: 130; Frequency of N: 20
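
One possible way to compute such a frequency table is sketched below with the standard library. The rows are rebuilt from the example counts above (50, 120, 130, 20) rather than read from a real dataset, and frequency_table is an illustrative helper name. The text bars are only a rough stand-in for a histogram; a plotting library of your choice would normally be used for the actual figures.

```python
from collections import Counter

# Rows rebuilt from the example table: (pregnant, housing_decision).
rows = (
    [("Y", "Y")] * 50 + [("Y", "N")] * 120
    + [("N", "Y")] * 130 + [("N", "N")] * 20
)

def frequency_table(pairs):
    """Count dependent-variable values within each independent-variable group."""
    table = {}
    for group, outcome in pairs:
        table.setdefault(group, Counter())[outcome] += 1
    return table

freq = frequency_table(rows)
for group in ("Y", "N"):
    counts = freq[group]
    print(f"Pregnant - {group}: Frequency of Y: {counts['Y']}; Frequency of N: {counts['N']}")
    for outcome in ("Y", "N"):
        # Crude text histogram: one '#' per 10 observations.
        print(f"  {outcome} {'#' * (counts[outcome] // 10)}")
```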
 

 

 

Step 4 - Show how data can be manipulated

Select one protected class variable (independent variable) and one dependent variable. 1) Create a graph to support the “fairness” hypothesis: The system is fair. There is no difference in the outcomes. 2) Create a graph to support the bias hypothesis: The system is biased. There is a difference in the outcomes. For each, provide a brief description of your manipulations.

 

Example Output: 

 

1)     Fair Hypothesis: As seen from this graph, housing decisions are not dependent on the pregnancy status of women. [Manipulations: Used line graph; Increased Scale to +-50; Mapped the ratio of positive Y decisions (i.e. 50/180 versus 130/180); No label on the Y-Axis].

[Figure: Difference in Housing Decisions Based on Pregnancy]

 

2)     Bias Hypothesis: As seen from this graph, housing decisions are significantly dependent on the pregnancy status of women. [This hypothesis was easily supported by the data, so it did not require much manipulation: Used stacked bar graph; Reduced Scale; Reworded labels].
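
The ratio mentioned in the fair-hypothesis example (50/180 versus 130/180) divides each group's Y count by the total number of Y decisions. Dividing by the size of each group instead gives a different pair of numbers from the same counts. The sketch below works through both choices using the Step 3 example counts; the function names are illustrative only, and which ratio (and which axis scale) you plot is part of the manipulation you are asked to describe.

```python
# Counts from the Step 3 example table: counts[pregnant][housing_decision].
counts = {"Y": {"Y": 50, "N": 120}, "N": {"Y": 130, "N": 20}}

def share_of_positive_decisions(table):
    """Fraction of all Y decisions that went to each group (50/180 vs 130/180)."""
    total_yes = sum(group["Y"] for group in table.values())
    return {name: group["Y"] / total_yes for name, group in table.items()}

def approval_rate_within_group(table):
    """Fraction of each group that received a Y decision (50/170 vs 130/150)."""
    return {name: group["Y"] / (group["Y"] + group["N"]) for name, group in table.items()}

share = share_of_positive_decisions(counts)
rate = approval_rate_within_group(counts)
for name in counts:
    print(f"Pregnant - {name}: share of Y decisions = {share[name]:.3f}, "
          f"approval rate within group = {rate[name]:.3f}")
```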

 

  

 

 

Step 5: Given your selected protected class variable (independent variable), calculate the average (mean, median, and mode) values of the protected class group (Hint: Variables might need to be converted to numerical values). Run the random sampling method using 50% of the data to create a reduced dataset. Calculate the average (mean, median, and mode) values of the protected class group in the reduced dataset. Indicate whether there is a difference between the original dataset and the reduced dataset for each of the averages. Provide all results.

 

Protected Class Variable (Pregnant) | Mean | Median | Mode
Original Data Set | 0 (NO) | 0 (NO) | 0 (NO)
Reduced Data Set | 0 (NO) | 1 (YES) | 0 (NO)
Difference | No Difference | Difference | No Difference
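
A comparison like the one in this table could be produced along the following lines. This is a sketch under stated assumptions: the values list is a hypothetical 0/1 encoding (0 = NO, 1 = YES) with a made-up 70/30 split, the seed is arbitrary, and averages and random_half are illustrative helper names. Simple random sampling without replacement is assumed to be what "the random sampling method" refers to.

```python
import random
import statistics

# Hypothetical 0/1 encoding of a protected class variable (0 = NO, 1 = YES);
# the 70/30 split is made up for illustration, not taken from any dataset.
values = [0] * 70 + [1] * 30

def averages(data):
    """Return the mean, median, and mode of a list of numeric values."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
    }

def random_half(data, seed=0):
    """Draw 50% of the data without replacement; the seed is an arbitrary choice."""
    rng = random.Random(seed)
    return rng.sample(data, len(data) // 2)

original = averages(values)
reduced = averages(random_half(values))
for name in ("mean", "median", "mode"):
    verdict = "No Difference" if original[name] == reduced[name] else "Difference"
    print(f"{name}: original={original[name]}, reduced={reduced[name]} -> {verdict}")
```

Fixing the seed makes the reduced dataset reproducible, so the same sample can be reused when repeating the frequency analysis on it.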
 

Step 6: Given your reduced dataset from Step 5, repeat Step 3 (frequency and histogram) using your selected independent variable as a function of your selected dependent variable (from Step 4). Explain any differences (in no more than 2 sentences). If you used the random sampling method, would members associated with the protected class variable benefit or be harmed? Explain your reasoning (in no more than 2 sentences).

 
