1. Reidentification Attack
In the GitHub repo,[1] you will find the Public Use Micro Sample (PUMS) dataset from the 2000 US Census FultonPUMS5full.csv. This is a sample from the “Long Form” from Georgia residents, which contained many more questions than the regular questionnaire, and was randomly assigned to some individuals during the decennial Census. (It has since been replaced by a continuously collected survey known as the American Community Survey.)
Also in that folder is the codebook file for the PUMS dataset that lists the variables available in the release. Note this is the 5% sample which means that five percent of records are randomly sampled and released.
In the style of Latanya Sweeney’s record linkage reidentification attack,[2] propose a reidentification attack on the PUMS dataset by identifying demographic variables that, if known from another auxiliary source, could uniquely identify individuals. Note that while Sweeney used zipcodes as the geographic indicator, individuals in this Census release are identified by Public Use Microdata Areas (PUMAs) which are Census constructed geographic areas that contain at least 100,000 individuals. State the variables you would use, and provide an approximate back-ofthe-envelope calculation of the number of individuals who would be unique in that combination of variables in a PUMA region.
2. Reconstruction Attack
Among the variables in the 2000 PUMS dataset above is USCITIZEN, which asks the resident about their US Citizenship status. This is a sensitive piece of information, and including this question on the regular Census questionnaire has been a topic of recent controversy.[3] This PUMS dataset is public, but makes a good stand-in for a database that might be secured behind a query interface. We’ve provided a sample of size n = 100.
In this problem, you will run experiments to evaluate the performance of the reconstruction attack on determining individuals’ citizenship status. Treat the following variables in the dataset as public (so as an attacker you know them for all of the individuals in the dataset):
PUB =(SEX,AGE,EDUC,AGE,MARRIED,DIVORCED,LATINO,BLACK,ASIAN,CHILDREN, EMPLOYED,MILITARYSERVICE,DISABILITY,ENGLISHABILITY).
Each query in your attack should specify a boolean predicate p(PUB) ∈ {0,1} on the public variables (e.g. p(AGE/EDUC 4 && SEX == 0)), and receive as an answer an approximation to the value:
X
USCITIZENi,
i:p(PUBi)=1
where i ranges over the n = 100 individuals in the PUMS dataset sample, FultonPUMS5sample100.csv, that we have provided.
Your attack should make 2n queries, where each query corresponds to a different predicate pj, j = 1,...,2n.[4] Using the description of these predicates, the public data PUB1,...,PUBn, and the noisy answers to the queries, you should try to reconstruct the USCITIZENi bits for as many users as possible.
You will run experiments on how your attack performs against the following defenses:
Rounding: round each result to the nearest multiple of R for a parameter R
Noise addition: add Gaussian noise of mean zero and variance σ2, for a parameter σ, independently for each query.
Subsampling: randomly subsample a set T consisting of t out of the n rows, for a paremeter t, and calculate the answer using only the rows in T (scaling up by a factor of n/t).
Varying parameters R, σ, and T as integers from 0 to n, produce plots showing and comparing the trade-off between the accuracy of the statistics (measured by root-mean-squared-error between answers and exact values) and the average fraction of values USCITIZENi that are successfully reconstructed. For each parameter setting, run 10 experiments with fresh randomness and plot the average data points.
Make sure to identify the regime where your attack transitions from near-perfect reconstruction (fraction close to 1) to near-unsuccessful reconstruction (fraction close to 1/2). Add additional data points so that your graph is detailed around that transition point.
Note that you will be coding both the release mechanisms for each defense as well as the attack. The GitHub repo contains the code from the regression-based reconstruction attack from Monday’s class[5]. (Be sure to pull the most recent copy.) You can directly expand from this code if you are working in R, or use it as a template if you are working in Python.
BONUS: The above attack requires knowledge of all of the PUBi’s. Here we will sketch a version of the attack that only requires knowledge of a single PUBi and reconstructs USCITIZENi for that particular individual. For extra credit, fill in the details and implement the attack and measure its performance.
Above, we suggested using a random hash function of the form p(v) = ((Pd rdvd) mod P) mod 2 to select subsets of the dataset. Instead, consider taking P to be a prime of magnitude larger than n by a small constant factor (somewhere between 2 and 10), choosing r1,...,rd once and for all, and defining the hash function h(v) = Pd rdvd mod P. Since P is significantly larger than n, there will be few collisions of the PUBi’s under the hash function h. Now for each query pj, pick a random number sj ∈ {0,...,P − 1}, and define pj(v) = ((sj · h(v)) mod P) mod 2. Now you can do your linear regression with P variables, one for each possible value of h(PUB), since h is not changing across the queries. To attack a particular individual i, we look at the result of the regression for the variable associated with h(PUBi). (If the regression is too slow, feel free to use smaller values of n and P)
3. Membership Attack
Run a similar experiment to evaluate the effectiveness of the membership attack covered in class on the same sample of n = 100 from the PUMS dataset above. Specifically, find the highest level of accuracy (i.e. lowest RMSE) at which the expected fraction of bits that the reconstruction attack fails against all three defenses, where failure means reconstructing approximately 50% of the bits. Fix parameters for each of the three defenses that correspond to this level of accuracy, and produce a graph of the number of queries issued vs. the true positive probability of the membership attack (i.e. the probability that the attack says “IN” when Alice is a randomly chosen member of the dataset). You can use membershipAttackCompleted.r as a template, which contains the membership attack from lecture including all the modifications made during lecture. Here are guidelines for carrying out the attack:
We can think of the binary values in the membership attack described in class either as actual attributes or the results of Boolean predicates applied to the attributes. Since there are not enough actual attributes in the PUMS dataset to run a membership attack, create derived attributes in the following way. For the jth “attribute” of user i in the membership attack, use the predicate pj(PUBi), where pj is a random predicate generated in the same way that you did in the reconstruction attack.
Feel free use to use counts or means as your statistics, as they are equivalent up to a scaling by a factor of n. If you use means, be sure to scale the accuracy threshold you use accordingly.
Increase the number of queries/attributes until either the true positive probabilities start to converge or it becomes computationally infeasible.
Below we will mostly use notation from the membership attacks lecture, but we’ll use m for the number of queries (since above we used d for the number of attributes in PUB) and ρ = (ρ1,...,ρm) for the population probabilities (since above pj denotes the j’th predicate).
To calculate the vector ρ of population probabilities, you can either use the full Fulton Georgia PUMS dataset that we have provided (csv consisting of 25,766 individuals) or do an analytic calculation based on the random predicates you use.
Set the false positive probability to be δ = 1/10n. To determine the corresponding threshold T = Tρ,a, you can approximate the null distribution of your test statistic either using the resampling method shown in class on 2/11 or a normal approximation N(0,σ2) where σ2 is the variance of your test statistic. [6] Check that you are indeed achieving a small enough false positive probability by running your membership attack on some randomly chosen members of the full population dataset.
4. Final Project Ideas
The final projects are a important focus of this course, and we want you to start thinking about yours as soon as possible. Please read the “Final Project Guidelines” (http://seas.harvard. edu/~salil/cs208/spring19/project-guidelines.pdf) document on the course website and submit about a paragraph as described in the “Topic Ideas” bullet.
Codebook for Census PUMS 5 Percent CS208 Datasets
X.1
Deprecated, removed from dataset
state
The US State of residence.
puma
The Public Use Microdata Area, a Census constructed region of about 100,000 persons.
jpumarow
Deprecated, removed from dataset
serialno.household
Deprecated, removed from dataset
sex
0: Male,
1: Female.
age
Age in years.
educ
1: No schooling completed,
2: Nursery school to 4th grade,
3: 5th grade or 6th grade,
4: 7th grade or 8th grade,
5: 9th grade,
6: 10th grade,
7: 11th grade,
8: 12th grade, no diploma,
9: High school graduate,
10: Some college, but less than 1 year,
11: One or more years of college, no degree,
12: Associate degree,
13: Bachelor’s degree,
14: Master’s degree,
15: Professional degree, 16: Doctorate degree.
income
Person’s total income.
latino
0:
Not Hispanic or Latino,
1:
Hispanic or Latino.
black
0:
Not Black or African American,
1:
Black or African American, alone or in combination with one or more other races.
asian
0:
Not Asian,
1:
Asian, alone or in combination with one or more other races.
married
0:
Presently married, not separated,
1:
Widowed, divorced, separated, never married.
divorced
0:
Married or not married but not divorced,
1:
Divorced and not remarried.
uscitizen
0:
Not a citizen of the United States,
1:
Citizen of United States.
children
0:
No own minor children living in residence,
1:
Lives with own minor children.
disability
0:
Without a disability,
1:
With a disability (sensory, physical, mental)
militaryservice
0:
No military service,
1:
Past or present active duty service, or training for reserves or National Guard.
employed
0:
Unemployed or not in labor force,
1:
Employed, including armed services.
englishability fips
0: 1:
Spoken English ability is ”First Language”, ”Very Well” or ”Well”,4 Spoken English ability categorized as ”Not Well” or ”Not at all”.
Federal Information Processing Standards County Code.