Learning Conjunctions in the SQ Model: In this problem, you will see a simple example of a machine learning algorithm that works in the statistical query (SQ) model, and will evaluate how it performs under both local and centralized DP.
The set-up is that we are given a data set D = ((x1,y1),...,(xn,yn)) where xi ∈ {0,1}d and yi ∈ {0,1}. We want a (local or centralized) differentially private algorithm M(D) that outputs a subset Sˆ ⊆ {1,...,d} of the variables such that the conjunction of the x-variables in Sˆ predicts the y variable well.
Specifically, following Valiant’s PAC model, we will measure utility as follows. Suppose that D consists of n iid draws from an (unknown) distribution P on {0,1}d × {0,1}. Furthermore, suppose that there is an (unknown) set S ⊆ {1,...,d} such that
Pr_{(x,y)∼P}[ y = ∧_{j∈S} x[j] ] = 1.
That is, y can be perfectly predicted by some conjunction.[1] (Note that here and below, the notation (x,y) refers to a single labelled example drawn from P, not a dataset of n such values. We will use D = ((x1,y1),...,(xn,yn)) for the dataset.)
Then the goal of M is to output Sˆ that minimizes the expected classification error:
err(Sˆ) = Pr_{(x,y)∼P}[ y ≠ ∧_{j∈Sˆ} x[j] ].
(This is an instantiation of the “risk minimization” problem described in the April 8 lecture, though you will not need anything from that lecture for this homework problem.)
The SQ algorithm is based on estimating the following quantity for each variable j = 1,...,d:

pj = Pr_{(x,y)∼P}[ x[j] = 0 ∧ y = 1 ].
As shown in section, we have:
For each j ∈ S, we have pj = 0.
If Sˆ ⊇ S, then the false positive rate of Sˆ is 0.
The false negative rate of Sˆ is at most Σ_{j∈Sˆ} pj.
Based on these observations, an SQ algorithm M for learning conjunctions works as follows:
Using the dataset D, obtain an estimate pˆj of pj for j = 1,...,d.
Output Sˆ = {j ∈ [d] : pˆj ≤ t} for an appropriate (small) threshold t.
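The two steps above can be sketched as follows; this is a minimal non-private baseline, where the function name `sq_conjunction` and the NumPy array interface are illustrative choices, not part of the assignment:

```python
import numpy as np

def sq_conjunction(X, y, t):
    """Non-private SQ learner: estimate pj = Pr[x[j] = 0 and y = 1]
    from the sample, then keep the variables whose estimate is at most t."""
    p_hat = np.mean((X == 0) & (y[:, None] == 1), axis=0)
    return {j for j in range(X.shape[1]) if p_hat[j] <= t}
```

In the realizable case, taking t = 0 always yields Sˆ ⊇ S, since pˆj = 0 exactly for every j ∈ S; any extra variables in Sˆ are those that happen never to be 0 on a positive example in the sample.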
You should do the following:
Describe and implement centralized and local DP versions of the above SQ algorithm, dividing the privacy budget equally among the d estimates pˆj. Keep the threshold t as a free parameter that you can choose.
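One way such implementations could look is sketched below. This is not the required solution: it assumes the Laplace mechanism for the centralized version (each pˆj has sensitivity 1/n, with per-estimate budget ε/d by basic composition) and randomized response on each derived bit b_ij = 1[x_i[j] = 0 ∧ y_i = 1] for the local version; the function names and interfaces are illustrative:

```python
import numpy as np

def centralized_dp_sq(X, y, eps, t, rng):
    """Centralized DP: add Laplace noise of scale (1/n)/(eps/d) = d/(n*eps)
    to each empirical estimate p_hat_j, then threshold."""
    n, d = X.shape
    p_hat = np.mean((X == 0) & (y[:, None] == 1), axis=0)
    noisy = p_hat + rng.laplace(scale=d / (n * eps), size=d)
    return {j for j in range(d) if noisy[j] <= t}

def local_dp_sq(X, y, eps, t, rng):
    """Local DP: each user applies randomized response to each bit
    b_ij = 1[x_i[j] = 0 and y_i = 1] with per-bit budget eps/d;
    the aggregator debiases the reported means, then thresholds."""
    n, d = X.shape
    b = ((X == 0) & (y[:, None] == 1)).astype(float)
    p = np.exp(eps / d) / (1 + np.exp(eps / d))  # prob. of reporting truthfully
    flip = rng.random((n, d)) > p
    reports = np.where(flip, 1 - b, b)
    # E[report] = (1 - p) + b*(2p - 1), so invert the bias:
    p_hat = (reports.mean(axis=0) - (1 - p)) / (2 * p - 1)
    return {j for j in range(d) if p_hat[j] <= t}
```

Note the much larger noise in the local model: the debiased estimate has standard deviation on the order of d/(ε√n) for small ε/d, versus Laplace scale d/(nε) in the centralized model.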
For both the centralized and local DP algorithms you give in Part 1a, analytically estimate Pr[Sˆ ⊉ S] as a function of t, n, ε, and |S|. Feel free to use a normal approximation for the binomial distribution, and to describe your estimate in terms of the CDF of a standard normal distribution. Use this as a guide to determine a setting of t to ensure that Pr[Sˆ ⊉ S] ≤ .1 as a function of n, d, and ε in each of the two settings, using the fact that |S| ≤ d.
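To illustrate the shape of this calculation (not the required derivation), suppose the centralized version adds Laplace noise of scale d/(nε) to each pˆj and the local version uses randomized response with per-attribute budget ε/d. For j ∈ S we have pj = 0, so pˆj is pure noise, and a union bound over |S| ≤ d variables gives a threshold in each setting:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def threshold_centralized(n, d, eps, fail=0.1):
    """For j in S, p_hat_j is Laplace noise with scale b = d/(n*eps) and
    Pr[p_hat_j > t] = 0.5*exp(-t/b). A union bound over at most d such
    variables gives d * 0.5 * exp(-t/b) <= fail, i.e. t = b*log(d/(2*fail))."""
    return (d / (n * eps)) * log(d / (2 * fail))

def threshold_local(n, d, eps, fail=0.1):
    """For j in S, the debiased randomized-response estimate of pj = 0 is
    approximately normal with mean 0 and sd sqrt(p*q/n)/(p - q), where
    p = e^{eps/d}/(1 + e^{eps/d}) and q = 1 - p. Set t at the
    (1 - fail/d) quantile so the union bound over d variables holds."""
    p = exp(eps / d) / (1 + exp(eps / d))
    q = 1 - p
    sd = sqrt(p * q / n) / (p - q)
    return NormalDist().inv_cdf(1 - fail / d) * sd
```

For moderate n, d, and ε, the local threshold is orders of magnitude larger than the centralized one, reflecting the d/(ε√n) versus d/(nε) noise scales.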
An interest group targets a particular online political advertisement to users of a social media platform based on a conjunction of demographic characteristics. The users who are targeted with the advertisement are recorded. You get differentially private access to the user data and attempt to learn which demographic combination was being targeted.
In the dataset CaPUMS5Full.csv are a number of potential demographic characteristics that might have been used for targeting: (sex, married, black, asian, collegedegree, employed, militaryservice, uscitizen, disability, englishability). The variable targeted records those individuals who received the ad. Use your DP algorithms to learn a classifier to predict targeted=1 (which we have artificially created) from these variables. Demonstrate whether your centralized and local algorithms would succeed. Also, use a bootstrap to compare the false positive rates and false negative rates of the two methods as a function of n.
If useful, also included in this dataset is the variable blackfemale that you can use to test if your implementation correctly picks up black and female as the constituent parts. There is also a test dataset, hw4testdata.csv, that contains x1 ··· x10, and y=x1∧x2∧x3.
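The bootstrap comparison could be organized along the following lines; the `learn` callable standing in for either DP algorithm, and the scoring on the full dataset, are hypothetical design choices rather than something the assignment specifies:

```python
import numpy as np

def error_rates(S_hat, X, y):
    """FP/FN rates of the conjunction over the variables in S_hat.
    np.all over zero columns is True, so the empty set predicts 1 everywhere."""
    pred = np.all(X[:, sorted(S_hat)] == 1, axis=1)
    fp = np.mean(pred[y == 0])   # predicted 1 on a true 0
    fn = np.mean(~pred[y == 1])  # predicted 0 on a true 1
    return fp, fn

def bootstrap_rates(learn, X, y, n, reps, rng):
    """learn(X, y) -> S_hat is either DP algorithm. Resample n rows with
    replacement, relearn S_hat, and score against the full dataset;
    sweeping n traces out the rates as a function of sample size."""
    rates = []
    for _ in range(reps):
        idx = rng.integers(0, len(y), size=n)
        rates.append(error_rates(learn(X[idx], y[idx]), X, y))
    return np.mean(rates, axis=0)  # (mean FP rate, mean FN rate)
```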
[1] This is known as the “realizable” case of learning, as opposed to the “agnostic” case, where no conjunction classifies perfectly, but the goal is to find one that does as well as possible.