LARS VILHUBER: Welcome to J-PAL's IDEA Handbook webinar series. I'm Lars Vilhuber, an economist at Cornell University and co-chair of J-PAL's IDEA Initiative. These webinars accompany the release of the Handbook on Using Administrative Data for Research and Evidence-based Policy, funded by the Alfred P. Sloan Foundation.
The Handbook provides a series of technical guides and case studies on how to make administrative data more accessible for research and policy. Today it is my pleasure to introduce Alexandra Wood, one of the co-authors of the chapter, Designing Access with Differential Privacy. Alex is a fellow at the Berkman Klein Center for Internet and Society at Harvard, and her work explores the interface between privacy and regulatory frameworks, both existing and emerging. In today's presentation, Alex will discuss the use of differential privacy tools to protect subjects in administrative data, as well as survey data, collected, analyzed, and published for research purposes.
In particular, she will discuss how to determine whether the use of differential privacy is an appropriate solution in a given setting, as well as the design considerations one should take into account when implementing differential privacy. Differential privacy is embedded within a framework here that can and should accommodate other tools to protect privacy, such as consent mechanisms, data use agreements, and secure environments. With that, I'd like to invite you to listen to Alex's talk.
ALEXANDRA WOOD: In this presentation, I'll be discussing joint work with my colleagues Micah Altman, Kobbi Nissim, and Salil Vadhan on designing access with differential privacy. My goal is to provide a brief conceptual overview of our chapter from the Handbook on Using Administrative Data for Research and Evidence-based Policy.
For anyone interested in a more in-depth discussion of the design and deployment of differential privacy, real-world case studies, and resources for implementation, I hope the Q&A after this talk and the full text of the Handbook will be helpful resources. The chapter describes how administrative data containing personal information can be collected, analyzed, and published for research and evidence-based policy-making in a way that ensures the individuals in the data will be afforded the strong protections of differential privacy. So why focus on differential privacy? Starting with the work of Latanya Sweeney in the late 1990s, numerous examples show that the traditional approach of removing pieces of information considered to be personally identifying often leaves data sets still vulnerable to re-identification, where an adversary can use auxiliary data sources to link a record to an identifiable person. Another example is what is called a database reconstruction attack, where an adversary reconstructs almost all of the sensitive data from the released statistics. As an example, a few years ago the Census Bureau ran internal experiments and determined that the billions of statistics they released in the form of statistical tables from the 2010 Decennial Census could be combined to carry out a database reconstruction. They showed it was possible to use the statistical publications to narrow down the possible values of individual-level records and reconstruct with perfect accuracy the responses reported by 46% of the US population.
They were also able to accurately re-identify the records of at least 17% of the population. This demonstration led the Bureau to decide to adopt differential privacy for the 2020 Census. Another example is a membership inference attack, where an adversary can determine with high statistical confidence whether someone is in a sensitive data set. The possibility of such attacks led NIH to take down some of the genetic statistics that it had been making available. And what may be particularly surprising is that the latter two examples show that releases of what appear to be simply aggregate statistics can also be subject to privacy attacks. We can take away some lessons from this growing body of evidence showing that traditional approaches to privacy fail to provide adequate protection.
First, we know that the redaction of identifiers such as names, addresses, and Social Security numbers is insufficient. Examples have repeatedly shown that attributes initially believed to be non-identifying may, in fact, make it possible to re-identify individuals in the future. This is also true for other traditional statistical disclosure limitation techniques like aggregation and noise addition. We also know that auxiliary information needs to be taken into account, including hard-to-anticipate future data sources that could be used for re-identification. And we see that existing approaches in regulation and practice have considered a limited scope of privacy failures, like re-identification through record linkage using identifiers known to be present in auxiliary data sources such as public records.
But they typically ignore other attacks, like reconstruction attacks, membership inference attacks, and other attacks that seek to infer characteristics of individuals based on information about them in the data. We now know that any useful analysis of personal data must leak some information about individuals, and that these leakages of personal data accumulate with multiple analyses and releases.
This is known as the effect of composition. It's a fundamental law of information that privacy risk grows with the repeated use of data, and this applies to any disclosure limitation technique. Moreover, these are unavoidable mathematical facts.
They're not matters of policy. And they underscore the need for privacy technologies that are immune not only to linkage attacks but to any potential attack, including those currently unknown or unforeseen. One might take away from this long series of attacks, and from the theoretical understanding that any analysis of personal data leaks some information about individuals, the conclusion that it's impossible to enable access to administrative data while protecting privacy. Fortunately, that's not the case. Rather, what we learn from all of this is that we need a rigorous approach to privacy if we want to guarantee protection in a dynamically changing data ecosystem in which new data sources and analytical techniques are continually emerging.
Research in this direction has yielded a new formal mathematical model of privacy called differential privacy, which was introduced in 2006. It's supported by a rich body of theoretical research and incorporates new privacy concepts that are distinct from heuristic approaches that have traditionally been used. And it provides a mathematically provable guarantee of privacy protection.
It's now in its first stages of implementation and real-world use by statistical agencies, such as the US Census Bureau, and by tech companies such as Google, Apple, and Uber. So what is differential privacy? The first thing to note is that it's not a single tool, but a definition or standard for privacy in statistical and machine learning analysis, which many tools have been devised to satisfy. This means that with differential privacy, statements about risk are proved mathematically. Any analysis meeting the standard provides provable protection against a wide range of privacy attacks, including the reconstruction and membership inference attacks I mentioned, as well as cumulative loss through composition effects. It also has a compelling intuitive interpretation: any information-related risk to a person should not change significantly as a result of that person's information being included in, or excluded from, the analysis.
And here's an intuitive illustration of the definition. Consider first a real-world computation in which an analysis is being run on a data set and produces an outcome. Then, as a thought experiment, imagine an ideal world in which an individual who is very concerned about privacy would want their information to be omitted from the data set entirely, to ensure with absolute certainty that the outcome of the analysis does not leak any personal information specific to them. But in order to satisfy each individual's ideal-world scenario, and therefore provide perfect privacy protection to everyone in the data set, the personal information of every individual would need to be removed.
And this would make the outcome of the resulting analysis useless. Instead, differential privacy aims to protect any arbitrary individual's privacy in the real-world computation in a way that mimics the privacy protection they're afforded in their ideal-world scenario. It allows for a deviation between the output of the real-world analysis and that of each individual's ideal-world scenario, and a privacy loss parameter, epsilon, quantifies and limits the extent of this deviation. Differentially private analysis can be deployed in settings such as research and evidence-based policy-making, in which an analyst seeks to learn about a population, not individuals.
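For reference, the formal definition behind this intuition can be stated as follows (this is the standard formulation from the differential privacy literature, paraphrased rather than quoted from the chapter): a randomized algorithm $M$ is $\varepsilon$-differentially private if, for every pair of data sets $D$ and $D'$ that differ in the data of a single individual, and for every set $S$ of possible outputs,
$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].$$
The smaller the value of epsilon, the closer the real-world computation must stay to each individual's ideal-world scenario.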
To achieve differential privacy, carefully crafted random statistical noise must be injected into the computation. Differential privacy is robust to auxiliary information, meaning an attacker utilizing arbitrary auxiliary information cannot learn much more about an individual in a database than they could if that individual's information were not in the database at all. It's also robust to composition, meaning it provides provable bounds with respect to the cumulative risk from multiple data releases. And it's robust to post-processing, meaning it cannot be made ineffective through further analysis or manipulation. Differentially private computations are used in statistical releases, like publications of statistical tables, training machine learning models, and generating synthetic data sets that reflect the statistical properties of the original data. Some differentially private systems field queries from external analysts, allowing them to perform custom, previously unanticipated analyses, like running a regression on variables of their choice.
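As a concrete illustration of the noise addition described above, here is a minimal sketch of the Laplace mechanism applied to a counting query. The code is hypothetical and illustrative, not drawn from the chapter or any production system; it assumes NumPy is available, and the function name and simulated data are made up for this example.

```python
# Minimal sketch: the Laplace mechanism for an epsilon-differentially
# private count. (Hypothetical example code; names are illustrative.)
import numpy as np

def dp_count(values, predicate, epsilon, rng=None):
    """Differentially private count of records satisfying `predicate`.

    A counting query has sensitivity 1: adding or removing one person's
    record changes the true count by at most 1, so Laplace noise with
    scale 1/epsilon suffices for epsilon-differential privacy.
    """
    if rng is None:
        rng = np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: a noisy count of simulated incomes above 50,000 at epsilon = 0.1.
incomes = np.random.default_rng(0).lognormal(mean=10.5, sigma=0.5, size=10_000)
print(dp_count(incomes, lambda x: x > 50_000, epsilon=0.1))
```

Because the answer is released with noise rather than exactly, nothing an analyst sees depends too strongly on any single person's record.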
Statistical query systems like these are in use by both government agencies and industry as a means of allowing the public to study sensitive data. Examples include releases of survey data from the US Census Bureau, education data from the National Center for Education Statistics, genomic data from the National Institutes of Health, and search trend data from Google. A wide variety of analyses can be performed using tools for differentially private analysis.
Examples include various types of descriptive statistics, such as counts, means, medians, histograms, contingency tables, and cumulative distribution functions, as well as supervised and unsupervised machine learning tasks like classification, regression, clustering, and distribution learning. In many cases, it's also possible to generate differentially private synthetic data that preserves a large collection of statistical properties of the original data set. However, due to the addition of noise, differentially private algorithms work best when the number of data records being analyzed is large.
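To see why a large number of records helps, consider this rough sketch of a differentially private mean (again hypothetical code, not taken from the chapter): the noise scale depends on the value range and on epsilon, but not on the number of records, so its effect on the reported mean shrinks as the sample grows.

```python
# Sketch of a differentially private mean over values clipped to [lower, upper].
# The clipped sum has sensitivity (upper - lower), so Laplace noise with scale
# (upper - lower) / epsilon protects it; dividing by n is post-processing.
# (Assumes the number of records n is public; hypothetical example code.)
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    noisy_sum = clipped.sum() + rng.laplace(scale=(upper - lower) / epsilon)
    return noisy_sum / len(values)

rng = np.random.default_rng(2)
for n in (100, 10_000, 1_000_000):
    data = rng.normal(loc=50_000, scale=15_000, size=n)
    print(n, round(dp_mean(data, 0, 200_000, epsilon=0.1, rng=rng), 2))
```

With the same epsilon, the noise added to the sum is the same in each case, so the error in the reported mean is roughly 10,000 times smaller for the one-million-record data set than for the 100-record one.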
We illustrate how differential privacy is achieved here with a stylized example. These figures show the outcome of a differentially private computation of the cumulative distribution function of income in a fictional District Q. The graph on the left shows the original CDF without noise. On the right, we see the result of applying a differentially private computation of the CDF with an epsilon value of 0.005.
A smaller value of epsilon like this implies better privacy protection, but also less accuracy due to noise addition compared to larger values of epsilon. Notice how as epsilon is increased we see improved accuracy. It's important to note that every computation leaks some information about the individual records used as input regardless of the protection method used. The inevitability of privacy loss implies that there is an inherent trade-off between privacy and utility. Differential privacy provides explicit formal methods for defining and managing this cumulative loss, referred to as the privacy loss budget. It can be thought of as a tuning knob for balancing privacy and accuracy.
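A rough sketch of how a figure like this might be produced (hypothetical code, not the algorithm actually used for the chapter's illustration): add Laplace noise to histogram bin counts and then accumulate and normalize them. Because each person contributes to exactly one bin, the histogram has sensitivity 1, and the accumulation and normalization are post-processing that consumes no additional privacy budget.

```python
# Sketch of a differentially private CDF via a noisy histogram.
# (Hypothetical example code; the simulated incomes are illustrative.)
import numpy as np

def dp_cdf(values, bin_edges, epsilon, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    counts, _ = np.histogram(values, bins=bin_edges)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Cumulative sum and normalization are post-processing of the noisy counts.
    cdf = np.cumsum(noisy) / np.sum(noisy)
    return cdf  # may be slightly non-monotone; real systems often correct this

incomes = np.random.default_rng(1).lognormal(mean=10.5, sigma=0.5, size=100_000)
edges = np.linspace(0, 200_000, 101)
for eps in (0.005, 0.05, 0.5):  # larger epsilon: less noise, more accuracy
    print(eps, np.round(dp_cdf(incomes, edges, eps)[::25], 3))
```

Running a sketch like this at different epsilon values makes the privacy-accuracy trade-off visible directly: the curve computed at epsilon 0.5 will typically track the true CDF much more closely than the one computed at 0.005.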
Each differentially private analysis can be tuned to provide more or less privacy, resulting in less or more accuracy, respectively, by changing the value of this parameter. Also, as the number of observations in a data set grows sufficiently large, the loss in accuracy due to differential privacy can become much smaller than other sources of error, such as statistical sampling error. Further, the combination of multiple epsilon-differentially private computations still satisfies differential privacy, albeit with a larger epsilon, which is what enables differential privacy to account for composition effects. It's important to note that the use of a privacy loss budget is a key feature, not a bug.
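The composition property mentioned here is a standard result in the differential privacy literature (paraphrased, not quoted from the chapter): if one computation satisfies $\varepsilon_1$-differential privacy and another satisfies $\varepsilon_2$-differential privacy, then releasing the outputs of both satisfies $(\varepsilon_1 + \varepsilon_2)$-differential privacy, and more generally a sequence of $k$ releases satisfies differential privacy with
$$\varepsilon_{\text{total}} \;\le\; \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_k.$$
This is what makes a privacy loss budget workable in practice: an institution can fix a total epsilon in advance and track how much of it each release consumes.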
Ignoring this unavoidable fact of the cumulative effect of privacy loss, as traditional privacy approaches have done, does not prevent this effect-- much like ignoring your car's fuel gauge does not prevent you from ever running out of fuel. The differential privacy guarantee can be understood in reference to other privacy concepts. For example, differential privacy can be understood as providing individuals with an automatic opt-out.
They're protected essentially as if their information were not used in the computation at all. It can also be understood as ensuring that an individual incurs limited risk. For example, contributing my real information to an analysis can increase the probability I'll be denied insurance by at most 1%, compared with not participating or contributing fake information. These guarantees are provided independent of the methods used by a potential attacker and in the presence of arbitrary auxiliary information.
They are also future-proof and avoid the penetrate-and-patch cycle of heuristic approaches to privacy protection. Differential privacy also has some key benefits in terms of transparency. It's not necessary to maintain secrecy around a differentially private computation or its parameters.
This offers benefits such as the possibility of accounting for differential privacy in statistical inference, the accumulation of knowledge to improve differentially private algorithms, and enabling scrutiny by the scientific community. Tools that achieve differential privacy can be used to provide broad public access to data or data summaries in a privacy-preserving way. Used appropriately, these tools can in some cases also enable access to data that could not otherwise be shared due to privacy concerns, and do so with a guarantee of privacy protection that substantially increases the ability of the institution to protect the individuals in the data, whereas traditional techniques would more often require applying additional controls on top of de-identification. To understand how the use of differential privacy makes it possible to reason about risk, consider this stylized example. Gertrude, who is represented here as Pablo Picasso's portrait of Gertrude Stein, is a 65-year-old woman.
And she's considering whether to participate in a medical research study. She's also concerned that the personal information she discloses over the course of the study could lead to an increase in the premium on her $100,000 life insurance policy. Her life insurance company has set her annual premium at $1,000, or 1% of the value of her policy, based on actuarial tables that show that someone of her age and gender has a 1% chance of dying in the next year.
Imagine that she decides not to participate in the study, but the study finds that coffee drinkers are more likely to suffer a stroke than non-coffee drinkers. Gertrude's life insurance company might learn about this study and update its assessment, concluding that, as a 65-year-old woman who drinks coffee, Gertrude has a 2% chance of dying in the next year. The insurance company might accordingly decide to increase her annual premium from $1,000 to $2,000.
In this hypothetical, the results of the study led to an increase in Gertrude's life insurance premium, even though she didn't contribute any personal information to the study. This is her baseline risk. It's likely unavoidable because she can't prevent other people from participating in the study. But what she's concerned about is whether her risk can increase above this baseline risk. She's worried that the researchers can conclude, based on the results of medical tests over the course of the study, that she specifically has a 50% chance of dying from a stroke in the next year.
If the data from the study were to be made available to her insurance company, it might decide to increase her insurance premium from $2,000 to more than $50,000 in light of this discovery. In an alternative scenario, imagine that instead of releasing the full data set from the study, the researchers release only a differentially private summary. Differential privacy guarantees that if the researchers use a value of epsilon of 0.01, then the insurance company's estimate of the probability that Gertrude will die in the next year can increase from 2% to at most 2.04%. This means that Gertrude's insurance premium can increase from $2,000 to, at most, $2,040. So the first-year cost of participating in the research study, in terms of a potential increase in her insurance premium, is at most $40.
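For readers who want to trace the arithmetic, the figures above are consistent with bounding the change in the insurer's estimate by a multiplicative factor of roughly $e^{2\varepsilon}$ (the exact form of the bound is discussed in the differential privacy literature):
$$2\% \times e^{2 \times 0.01} \approx 2\% \times 1.0202 \approx 2.04\%, \qquad 2.04\% \times \$100{,}000 = \$2{,}040, \qquad \$2{,}040 - \$2{,}000 = \$40.$$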
One can also generalize from this scenario and view differential privacy as a framework for reasoning about the increased risk that's incurred when an individual's information is included in an analysis. Of course, outside of a stylized example like this, calculating one's baseline is very complex, if it's even possible at all. Consider, for example, how the 2% baseline risk, which is the insurer's belief about Gertrude's chance of dying in the next year, is dependent on the results from the study, which Gertrude has no way of knowing at the time she makes her decision about whether to participate. The baseline may also depend on many other factors Gertrude doesn't know. However, differential privacy provides guarantees relative to every baseline risk.
This means Gertrude can know for certain how participating in the study might increase her risks relative to opting out, even if she doesn't know a priori all of the privacy risks posed by the data release. This enables Gertrude to make a more informed decision about whether to take part in the study. For example, she can calculate how much additional risk she might incur by participating in the study over a range of possible baseline risk values and decide whether she's comfortable with taking on the risks entailed by these different scenarios.
Also, the guarantee covers not only changes in her life insurance premiums but also her health insurance and more. Our chapter also covers topics relevant to practitioners considering implementing differentially private solutions at their institutions. It's important to note that a number of challenges come with making this transition.
For example, deploying differential privacy often changes how data is accessed by data users. It requires the addition of noise and attention to the privacy-utility trade-off. And the privacy loss budget may require shifting from static data publications to interactive modes of access, especially in light of certain utility considerations. There are also implications for the data lifecycle.
The accumulation of privacy loss is an unavoidable mathematical fact, but setting the privacy loss budget is a policy question. These considerations have potential implications for collection, storage, transformation, and retention. They also implicate legal requirements as well as technical ones. The chapter uses modern privacy frameworks to explain how to determine whether the use of differential privacy is an appropriate solution in a given setting. And it characterizes principles for selecting differential privacy in conjunction with other controls. These principles include calibrating privacy and security controls to the intended uses and privacy risks associated with the data, and anticipating, regulating, monitoring, and reviewing interactions with data across all stages of the lifecycle, including the post-access stages, as risks and methods will evolve over time.
The chapter also examines the design considerations one should take into account when implementing differential privacy. This covers tools designed for tiered access systems, in which differential privacy is combined with other controls such as consent mechanisms, data use agreements, and secure environments, and it discusses the implications of these design choices for regulatory and policy compliance. We also look at a series of use cases. The chapter provides three case studies to illustrate real-world implementations of differential privacy and the design choices that can be made to suit a variety of different contexts. The first case study looks at the deployment of differential privacy in the disclosure avoidance mechanism for the 2020 Decennial Census.
In this case study, the data products are low-dimensional and the sample size is very large, making it a good fit for the use of differential privacy. At the same time, the implementation is challenging because the data need to be able to support a wide and diverse range of possible applications, and making the noise addition transparent, when sources of error in the decennial census data products have historically not been made explicit, creates concerns for data users. The Census Bureau has decided to produce differentially private data products that have the same form as the traditional products, including tables that are exactly consistent with an underlying synthetic data set, along with other information that needs to be published exactly, such as the state population totals.
This required the design of custom differentially private algorithms by experts at the Bureau. The second case study is the Opportunity Atlas, which is a web-based visualization tool for exploring social mobility data based in part on administrative records. In this case study, there was a single set of analyses to perform, sample sizes were small, sometimes on the order of tens, hundreds, or thousands of observations, and the privacy budget was not reserved for future analyses.
Researchers were able to produce good results using a method inspired by differential privacy. In terms of accuracy, it even performed better than some traditional statistical disclosure limitation techniques. The third case study is the Dataverse project. This case study looks at work to incorporate differential privacy into research data repositories to help human subjects researchers safely share and analyze sensitive data. In this case study, the vision is for differential privacy to enable exploration of sensitive data sets deposited in a research repository, to facilitate reproducibility of research with sensitive data sets, and to enable statistical analysis of sensitive data sets accessible through the repository.
More technical discussions of several topics are included in an extensive online appendix. It includes a discussion of different technical approaches to disseminating data with differential privacy and a characterization of the key design choices and trade-offs among them, including a review of trust models, privacy loss settings, privacy granularity, static versus interactive publications, and estimating and communicating uncertainty. It also includes a discussion of the technical and legal implications for data collection, use, and dissemination, with a special emphasis on how differential privacy affects data collection and data repository practice and policy. Finally, it includes a list of selected tools and resources for implementing differential privacy, including open-source software for differentially private analysis, enterprise software solutions, and further readings.
In summary, we know from accumulating failures that anonymization and traditional statistical disclosure limitation techniques are not enough to adequately protect privacy. Differential privacy is a standard providing a rigorous framework for developing privacy technologies with provable, quantifiable guarantees. When moving to practice, differential privacy works best when combined with other technical and policy tools. The full chapter provides guidance for design and deployment.