1 [Constrained optimization] Show that the distance from the hyperplane g(x) = w·x + w0 = 0 to a point xa is |g(xa)|/‖w‖ by minimizing the squared distance ‖x − xa‖² subject to the constraint g(x) = 0.
2 [A better output guarantee for the Perceptron algorithm] In class, we saw that when the training sample S is linearly separable with a maximum margin γ > 0, then the Perceptron algorithm run cyclically over S is guaranteed to converge after at most R²/γ² updates, where R is the radius of the sphere containing the sample points. This does not guarantee, however, that the hyperplane solution returned by Perceptron, i.e. wT, achieves a margin close to γ.
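For reference, a sketch of the classical argument behind this R²/γ² bound, under the standard assumptions that ‖xi‖ ≤ R for every sample point and that some unit vector w* attains yi(w*·xi) ≥ γ for all i (these assumptions just restate the definitions of R and γ above):

\[
w_T \cdot w^* \ \ge\ T\gamma \quad\text{and}\quad \|w_T\|^2 \ \le\ T R^2
\;\;\Longrightarrow\;\;
T\gamma \ \le\ w_T \cdot w^* \ \le\ \|w_T\| \ \le\ \sqrt{T}\, R
\;\;\Longrightarrow\;\;
T \ \le\ R^2/\gamma^2 .
\]

The first inequality holds because every update adds yi(xi · w*) ≥ γ to wt · w*, and the second because an update is made only when y(wt · x) ≤ 0, so ‖wt+1‖² ≤ ‖wt‖² + ‖x‖² ≤ ‖wt‖² + R².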
(i) Show an example training dataset S in R² that has margin γ, and an order of updates made by the Perceptron algorithm where the hyperplane solution returned has arbitrarily bad margin on S.
(ii) Consider the following modification to the Perceptron algorithm (a brief code sketch follows the pseudocode):
Modified Perceptron Algorithm
Input: training dataset S = (xi,yi)i=1,...,n
Output: learned vector w
- Initialize w0 := 0, t := 0
- while there exists an example (x, y) ∈ S such that 2y(wt · x) ≤ γ‖wt‖
- set wt+1 := wt + yx
- set t := t + 1
- return wt.
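For concreteness, here is a minimal NumPy sketch of the pseudocode above, assuming labels y ∈ {−1, +1}; the function name modified_perceptron and the safety cap max_iter are illustrative choices, not part of the problem statement.

import numpy as np

def modified_perceptron(X, y, gamma, max_iter=100000):
    """Sketch of the MPA: X has shape (n, d), labels y are in {-1, +1}, gamma > 0."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # Examples violating the margin condition 2*y_i*(w . x_i) <= gamma*||w||.
        # When w = 0 every example violates it, so the first update always fires.
        violations = np.where(2 * y * (X @ w) <= gamma * np.linalg.norm(w))[0]
        if violations.size == 0:
            return w  # every point now clears a margin of gamma/2 with respect to w
        i = violations[0]
        w = w + y[i] * X[i]  # same additive update as the standard Perceptron
    return w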
(a) If the Modified Perceptron Algorithm (MPA) terminates after T rounds, what margin guarantee is achieved by the hyperplane wT returned by MPA? Justify your answer.
(b) We will now prove, step by step, the mistake bound for the Modified Perceptron Algorithm (MPA).
i. Show that after T rounds Tγ ≤ ‖wT‖, and observe that if ‖wT‖ < 4R²/γ, then T < 4R²/γ².
In what follows, we will assume that ‖wT‖ ≥ 4R²/γ.
ii. Show that for any iteration t at which a mistake was made, the following holds:
‖wt‖² ≤ (‖wt−1‖ + γ/2)² + R².
iii. Infer from this that for any iteration t, we have
‖wt‖ ≤ ‖wt−1‖ + γ/2 + R²/(2‖wt−1‖ + γ).
iv. Using the previous question, show that for any iteration t such that either ‖wt−1‖ ≥ … or … , we have … .
v. Show that ‖w0‖ ≤ R ≤ 4R²/γ. Since by assumption we have ‖wT‖ ≥ 4R²/γ, conclude that there must exist some largest iteration t0 such that ‖wt0−1‖ ≤ … and … .
vi. Show that … , and finally deduce the mistake bound.
3 [Making data linearly separable by feature space mapping] Consider the infinite dimensional feature space mapping
.
(It may be helpful to sketch the function to understand the mapping and to answer the questions below.)
(i) Show that for any n distinct points x1,...,xn, there exists a σ > 0 such that the mapping Φσ can linearly separate any binary labeling of the n points.
(ii) Show that one can efficiently compute dot products in this feature space by giving an analytical formula for Φσ(x) · Φσ(x′) for arbitrary points x and x′.
4 [Designing socially aware classifiers] Traditional Machine Learning research focuses simply on improving accuracy. However, the model with the highest accuracy may be discriminatory and thus may have undesirable social impacts that unintentionally hurt minority groups[1]. To overcome such undesirable impacts, researchers have in recent years put a lot of effort into the field of Computational Fairness.
Two central problems of Computational Fairness are: (1) What is an appropriate definition of fairness that works under different settings of interest? (2) How can we achieve the proposed definitions without sacrificing prediction accuracy?
In this problem, we will focus on some of the ways we can address the first problem. There are two categories of fairness definitions: individual fairness[2] and group fairness[3]. Most works in the literature focus on group fairness. Here we will study some of the most popular group fairness definitions and explore them empirically on a real-world dataset.
Generally, group fairness is concerned with ensuring that group-level statistics are the same across all groups. A group is usually formed with respect to a feature called the sensitive attribute. Common sensitive attributes include gender, race, age, religion, and income level. Thus, group fairness ensures that statistics across the sensitive attribute (such as across, say, different age groups) remain the same.
For simplicity, we only consider the setting of binary classification with a single sensitive attribute. Unless stated otherwise, we also consider the sensitive attribute to be binary. (Note that the binary assumption is only for convenience and results can be extended to non-binary cases as well.) Notations:
Denote by X ∈ Rᵈ, A ∈ {0,1} and Y ∈ {0,1} three random variables: the non-sensitive features of an instance, the instance’s sensitive feature, and the target label of the instance, respectively, such that (X, A, Y) ∼ D. Denote a classifier by f : Rᵈ → {0,1} and let Ŷ := f(X).
For simplicity, we also use the following abbreviations:
P := P(X,A,Y)∼D and Pa := P(X,a,Y)∼D (i.e. Pa denotes probabilities computed with the sensitive attribute fixed to A = a)
We will explore the following three fairness definitions (a short Python sketch for estimating them empirically follows the list).
- Demographic Parity (DP)
P0[Ŷ = ŷ] = P1[Ŷ = ŷ]   ∀ŷ ∈ {0,1}
(equal positive rate across the sensitive attribute)
- Equalized Odds (EO)
P0[Ŷ = ŷ | Y = y] = P1[Ŷ = ŷ | Y = y]   ∀y, ŷ ∈ {0,1}
(equal true positive- and true negative-rates across the sensitive attribute)
- Predictive Parity (PP)
P0[Y = y | Ŷ = ŷ] = P1[Y = y | Ŷ = ŷ]   ∀y, ŷ ∈ {0,1}
(equal positive predictive- and negative predictive-value across the sensitive attribute)
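As a point of reference for the empirical part below, the following is a small Python sketch of how these group-level statistics could be estimated from arrays of true labels y, predictions y_hat, and sensitive-attribute values a; the function names and the absolute-gap summary are illustrative choices rather than part of the definitions.

import numpy as np

def group_rates(y, y_hat, a, group):
    """Empirical rates within one sensitive group: positive rate, TPR, TNR, PPV, NPV."""
    mask = (a == group)
    yg, yh = y[mask], y_hat[mask]
    rate = lambda num, den: num.sum() / max(den.sum(), 1)  # guard against empty denominators
    return {
        "pos_rate": yh.mean(),                          # P_a[Yhat = 1]         (DP)
        "tpr": rate((yh == 1) & (yg == 1), yg == 1),    # P_a[Yhat = 1 | Y = 1] (EO)
        "tnr": rate((yh == 0) & (yg == 0), yg == 0),    # P_a[Yhat = 0 | Y = 0] (EO)
        "ppv": rate((yg == 1) & (yh == 1), yh == 1),    # P_a[Y = 1 | Yhat = 1] (PP)
        "npv": rate((yg == 0) & (yh == 0), yh == 0),    # P_a[Y = 0 | Yhat = 0] (PP)
    }

def fairness_gaps(y, y_hat, a):
    """Absolute differences of the group statistics between the groups A = 0 and A = 1."""
    g0, g1 = group_rates(y, y_hat, a, 0), group_rates(y, y_hat, a, 1)
    return {k: abs(g0[k] - g1[k]) for k in g0}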
Part 0: The basics.
(i) Why is it not enough to just remove the sensitive attribute A from the dataset to achieve fairness as per the definitions above? Explain with a concrete example.
Part 1: Sometimes, people write the same fairness definition in different ways.
(ii) Show that the following two definitions of Demographic Parity are equivalent under our setting:
P0[Ŷ = 1] = P1[Ŷ = 1] ⇐⇒ P[Ŷ = 1] = Pa[Ŷ = 1] ∀a ∈ {0,1}
Part 2: In this part, we will explore the COMPAS dataset (available in hw2data.zip). The task is to predict two-year recidivism. Download the COMPAS dataset from the class’s website. In this dataset, the target label Y is twoyearrecid and the sensitive feature A is race.
(iii) Develop the following classifiers for the given dataset: (1) an MLE-based classifier, (2) a nearest neighbor classifier, and (3) a decision tree classifier.
For the MLE classifier, you can model the class-conditional densities with a multivariate Gaussian distribution. For the nearest neighbor classifier, you should consider different values of k and different distance metrics (e.g. L1, L2, L∞).
(You may use built-in functions to develop your classifiers; you do not need to write them from scratch. One possible starting point is sketched below.)
You do not need to submit any code on Courseworks.
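Since built-in functions are allowed, one possible starting point is scikit-learn. The snippet below is only a sketch: the file name compas.csv, the train/test split, and the specific hyperparameters (k = 5 with the L2 metric, tree depth 5) are placeholder assumptions to be adapted to the actual contents of hw2data.zip.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder file and column names; adjust to the files in hw2data.zip.
df = pd.read_csv("compas.csv")
y = df["twoyearrecid"].values
a = df["race"].values
# Dropping the sensitive attribute from the features is one modelling choice, not a requirement.
X = df.drop(columns=["twoyearrecid", "race"]).values

X_tr, X_te, y_tr, y_te, a_tr, a_te = train_test_split(X, y, a, test_size=0.3, random_state=0)

models = {
    # QDA fits one multivariate Gaussian per class, i.e. an MLE-style generative classifier.
    "mle_gaussian": QuadraticDiscriminantAnalysis(store_covariance=True),
    # n_neighbors and the metric (p=2 is the L2 distance) are values to sweep, not final choices.
    "knn": KNeighborsClassifier(n_neighbors=5, p=2),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test accuracy:", model.score(X_te, y_te))

The held-out predictions and the corresponding a_te values can then be fed to the fairness-gap computation sketched in the previous part.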
(iv) Which classifier (discussed in the previous part) is better for this prediction task? You must justify your answer with appropriate performance graphs demonstrating the superiority of one classifier over the others. Example things to consider: how does the training sample size affect the classification performance?
(v) To what degree are the fairness definitions satisfied by each of the classifiers you developed? Show your results with appropriate performance graphs.
For each fairness measure, which classifier is the most fair? How would you summarize the differences between these algorithms?
(vi) Choose any one of the three fairness definitions. Describe a real-world scenario where this definition is most reasonable and applicable. What are the potential disadvantage(s) of this fairness definition?
(You are free to reference online and published materials to understand the strengths and weaknesses of each of the fairness definitions. Make sure to cite all your sources.)
[1] see e.g. Machine Bias by Angwin et al. for bias in recidivism prediction, and Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification by Buolamwini and Gebru for bias in face recognition
[2] see e.g. Fairness Through Awareness by Dwork et al.
[3] see e.g. Equality of Opportunity in Supervised Learning by Hardt et al.