Instructions: Submit a single PDF file containing your solutions, plots, analyses, and documented code. Also include a link to a public repository with your code (such as GitHub/GitLab). Make sure to list all collaborators and references.
Mechanisms: Consider the following mechanisms M that takes a dataset x ∈ [0,1]n and returns an estimate of the mean ¯ ., for Z ∼ Lap(2/n), where for real numbers y and a ≤ b, [y]ba denotes the
“clamping” function:
a
b
[y]a = y
b
if y < a if a ≤ y ≤ b .
if y b
, for Z ∼ Lap(2/n). iii
(
1 w.p. x
M(x) =.
0 w.p. 1 − x.
iv M(x) = Y where Y has probability density function fY given as follows:
if y ∈ [0,1].
0 if y /∈ [0,1].
(This is an instantiation of a continuous version of the exponential mechanism.)
Which of the above mechanisms meet the definition of (,0)-differential privacy for a finite value of , and what is the smallest value of (possibly as a function of n) for which they do?
For those that do not meet the definition, calculate the smallest value of δ (again possibly as a function of n) for which they satisfy (,δ) differential privacy for a finite value of .
Describe how you would modify the algorithms to have tunable privacy parameters (and δ in case of mechanisms that require it) and tunable data domain [a,b] (rather than [0,1]).
Which of these algorithms do you consider to be “best” for releasing a mean and why?
(There is not a single “right” answer for this problem.)
Evaluating DP Algorithms with Synthetic Data: Consider a dataset x ∈ Nn drawn from a Poisson process, which has probability distribution Pr[xi = k] = 10ke−[1]0/k! for natural numbers k (where we consider k = 0 to be a natural number and define 0! = 1).Write a data generating process (DGP) function that generates a dataset x ∈ Nn according the above Poisson process.
Pick one of your differentially private mechanisms from question 1 (generalized to allow for arbitrary choices of and data range [a,b] as parameters) that releases an estimate of ¯x. Implement this mechanism as a function which is given a vector of values x ∈ [a,b]n and an and makes a differentially private release. You can assume the sample size n is public knowledge. To apply your mechanism to unbounded data x ∈ Rn, you will have to clamp x to a chosen range [a,b]. For simplicity, we will fix a = 0 and only consider the effect of varying b.
Recall the discussion on clamping from class; if the range is large, the sensitivity increases, so noise increases and utility drops. However, if you clip the values too aggressively the answer will be biased, and again utility will drop. For n = 200 and 5, plot the root mean squared error as a function of the upper bound b. Identify the approximate optimal value b∗ of b for this data distribution.
Suppose we have an actual (not synthetic) dataset x ∈ Nn for which we want to release a differentially private mean, and we don’t know the underlying distribution of x. Again, we need to select the parameter b and want to do so in a way that minimizes the error. A natural idea is to use a (nonparametric) bootstrap1 to generate many datasets that are “similar” to x in place of the data-generating process above, and optimize the choice of b as above. Once we find an optimal value b∗, we then do our differentially private release on the dataset x
Explain why this approach is not safe in general and may violate differential privacy.
Propose some alternative methods for determining a good upper bound b for a given sensitive dataset x, while continuing to satisfying the definition of differential privacy.
Regression: Consider a dataset where each of its n rows is a pair of real numbers (xi,yi), where xi is drawn from a Poisson distribution as in Problem 2, and yi is then a noisy linear function of xi, specifically:
yi = βxi + α + νi; νi ∼ N(0,σ[2]) (1)
for unknown parameters α,β,σ.
One simple way to estimate the parameters α and β without privacy is by a simple linear regression, using the following estimators:
(2)
αˆ = ¯y − βˆx¯; (3)
In class and section, we saw an algorithm and implementation to produce a differentially private version of βˆ.2 Augment this implementation to also produce a differentially private version of ˆα, so that the overall method for computing both ˆα and -DP, for an input parameter . If your algorithms use several basic differentially private algorithms as subroutines, divide the overall privacy budget of among them evenly. You can clamp both the xi’s and yi’s using reasonable values of your choosing (perhaps following your choice from Problem 2 for x). Describe the reasoning behind your choice in your write up.
Evaluate the performance of your algorithm using a Monte Carlo simulation with synthetic data as in Problem 2 Parts (a)–(c), for the parameters = 1 and n = 1000.
Measure utility by the mean-squared residuals:
.
(The non-private method for simple linear regression described above is chosen to minimize this quantity, hence the term “least-squares regression”.) Plot and compare the distributions of mean-squared residuals you get with your differentially private simple linear regression and with a non-private simple linear regression.
Now, run experiments to see if there is a different partition of your privacy budget in your algorithm that yields better utility. Use a grid search[3] to explore different partitions of and see if you find one that is convincingly better (in terms of mean-squared residuals) than an equal partition under the given data distribution. Show and explain your results.
DP vs. Reconstruction Attacks: Suppose M : {0,1}n → Y is an (,δ)-DP mechanism and A : Y → {0,1}n is an adversary that is trying to reconstruct the sensitive bits in the dataset x ∈ {0,1}n from the output M(x). Suppose the dataset is a random variable X = (X1,...,Xn) consisting of n iid draws from a Bernoulli(p) distribution, for a known value of p. Prove that the expected fraction of bits that the adversary successfully reconstructs is not much larger than the trivial bound of max{p,1 − p} (which can be achieved by guessing the all-zeroes or all-ones dataset). Specifically:
E[#
(Hint: write the quantity inside the expectation as an average of indicator random variables, and for each i, consider running M on the dataset X(i) where we replace the i’th row of X with the fixed value 0.)
Final Project Next Step: You will have received comments on your initial project sketch from homework 1. Using these comments, and rereading the “Final Project Guidelines”