The United States census has always required that its respondents cannot be identified from any tabulation published during the lifetime of those respondents. Over time, the level of geographic detail published by the Census Bureau has been greatly enhanced – with the introduction of census tracts, block groups, and the highly detailed block level. With nearly three million populated blocks nationwide, it is easy to imagine a block where just one enumerated person fits a particular category – which makes identification a rather easy task. The effect is felt most in the tables we all rely upon – households by income, age by sex, or householder age by householder income.
The seriousness of the problem is easily demonstrated. If the average block has about 150 people, and we have an age-sex table of 19 age groups by 2 sex groups, the expected value in any particular cell (e.g. age 25-29, Male) is just under 4 (150 / 38) if the population were evenly distributed across the cells. We all know that it is not, and in the upper age brackets we would be surprised not to find cells with values of 0 or 1. The zero value is obviously not a problem unless you are into some weird existential philosophy, but a value of 1 means that it may be possible to identify an individual.
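The arithmetic above is easy to check. The sketch below computes the expected cell count and then, under an added assumption that is not in the text (that cell counts scatter around that mean roughly like a Poisson variable), estimates how often a cell would hold exactly 0 or 1 people. The function names are illustrative only.

```python
import math

BLOCK_POPULATION = 150  # average block size used in the text
AGE_GROUPS = 19
SEX_GROUPS = 2


def expected_cell_count(population, rows, cols):
    """Mean count per cell if people were spread evenly over the table."""
    return population / (rows * cols)


def poisson_pmf(k, mean):
    """Probability of exactly k under a Poisson model (illustrative assumption)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


mean = expected_cell_count(BLOCK_POPULATION, AGE_GROUPS, SEX_GROUPS)
p_zero = poisson_pmf(0, mean)
p_one = poisson_pmf(1, mean)

print(f"expected count per cell: {mean:.2f}")
print(f"P(cell = 0) = {p_zero:.3f}, P(cell = 1) = {p_one:.3f}")
```

Even with a perfectly even spread, this rough model suggests that a few of the 38 cells in a typical block would contain exactly one person. A realistic age distribution only makes that more likely in the upper age brackets.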
These days, with the sheer power of computing and advanced machine learning classification algorithms, it is increasingly difficult to ensure that a data release will not allow the identification of a particular individual, especially when combined with ancillary data such as public lists or mobile phone records.
Several disclosure-avoidance methods commonly used on cross-tabulations are:
- Data suppression, where a cell is marked as suppressed if it contains fewer than a threshold count. For many tables, this means you can distinguish only between 0, the threshold (say, 5) or more, and “something in between”.
- Table aggregation, where the details of the tabulation are reduced as you become more geographically focused by aggregating selected rows and/or columns of the table.
- Random rounding, a technique widely used by Statistics Canada, where all numbers in a table are pseudo-randomly rounded up or down to a multiple of 5, so a number ending in 4 would have a 20% probability of being rounded down and an 80% probability of being rounded up. Tables do not necessarily sum to the totals published at higher levels of geography, or even to totals published for the same level of geography, although in a “controlled rounding” the cell values can be forced to sum to the totals.
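Two of the methods in the list above are simple enough to sketch in a few lines. The code below is a minimal illustration, not any agency's actual procedure: the threshold of 5, the choice to publish zeros unsuppressed, and the function names are all assumptions made for the example. The rounding rule follows the probabilities described above, rounding up with probability equal to the remainder divided by the base.

```python
import random


def suppress(count, threshold=5):
    """Return None (suppressed) for nonzero counts below the threshold."""
    if 0 < count < threshold:
        return None
    return count


def random_round(count, base=5, rng=random):
    """Unbiased random rounding to an adjacent multiple of `base`."""
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    # Round up with probability remainder/base, so the expected
    # value of the rounded number equals the original count.
    if rng.random() < remainder / base:
        return lower + base
    return lower


rng = random.Random(2020)  # fixed seed so the example is repeatable
cells = [0, 1, 4, 7, 12, 150]
suppressed = [suppress(c) for c in cells]
rounded = [random_round(c, rng=rng) for c in cells]

print("original:  ", cells)
print("suppressed:", suppressed)
print("rounded:   ", rounded)
```

Because each cell is rounded independently, the rounded cells will generally not add up to a separately rounded total, which is the inconsistency noted above.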
These methods share several weaknesses. Privacy can be easily defeated by good clustering or iterative proportional fitting algorithms that use the higher order geographic totals to estimate the suppressed or modified values for a more detailed area such as a block group. Further, information in the table is differentially impacted, in that large values are left alone while small values are simply not shown or are disproportionately adjusted. Finally, it is impossible to tell how much error has been injected into the table in the first place.
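The simplest version of this attack needs nothing as elaborate as iterative proportional fitting: plain subtraction from published totals is often enough. The sketch below uses a made-up 3 x 3 table (rows as hypothetical small areas, columns as categories) in which every count under 5 has been suppressed, but the row and column totals are still published. The function name and all figures are invented for illustration.

```python
def fill_suppressed(table, row_totals, col_totals):
    """Fill None cells that are the only unknown in their row or column."""
    table = [row[:] for row in table]  # work on a copy
    progress = True
    while progress:
        progress = False
        # A row with a single unknown is solved by its published total.
        for i, row in enumerate(table):
            unknown = [j for j, v in enumerate(row) if v is None]
            if len(unknown) == 1:
                j = unknown[0]
                row[j] = row_totals[i] - sum(v for v in row if v is not None)
                progress = True
        # The same logic applies down each column.
        for j in range(len(col_totals)):
            column = [row[j] for row in table]
            unknown = [i for i, v in enumerate(column) if v is None]
            if len(unknown) == 1:
                i = unknown[0]
                table[i][j] = col_totals[j] - sum(
                    v for v in column if v is not None
                )
                progress = True
    return table


true_table = [[12, 3, 4], [8, 20, 30], [15, 9, 2]]
published = [[12, None, None], [8, 20, 30], [15, 9, None]]
row_totals = [19, 58, 26]
col_totals = [35, 32, 36]

recovered = fill_suppressed(published, row_totals, col_totals)
print(recovered)
```

Here all three suppressed cells are recovered exactly. Agencies counter this with complementary suppression of additional cells, at which point an attacker moves on to estimation methods such as the iterative proportional fitting mentioned above.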
The approach being taken by the statistical agencies of the United States, including the Census Bureau for the 2020 Census, is based on the concept of “differential privacy”. A good brief overview of the issues and the proposed resolution can be found at https://www2.census.gov/about/policies/2019-11-paper-differential-privacy.pdf. The crux of the matter is nicely summarized as “every time you release any statistic calculated from a confidential data source you ‘leak’ a small amount of private information. If you release too many statistics, too accurately, you will eventually reveal the entire underlying confidential data source.”
The goal is to maximize the usability and accuracy of a data release while minimizing the privacy risk, by setting a privacy risk budget. Each release of a statistic related to a confidential record uses some of that budget. Deliberately perturbing the data is not new: record swapping was introduced in the Census several decades ago. Noise will now be applied on a larger scale, but in a way that is quantifiable in terms of the amount of error injected into the results. The main benefit for users is that we will now know exactly how much “noise” has been injected into a table. If a table would require so much noise injection that it becomes unusable, the table would be suppressed.
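To make the budget idea concrete, here is a minimal sketch using the textbook Laplace mechanism. It is not the Census Bureau's production algorithm, which is considerably more elaborate; the class name, function names, and the budget of 1.0 are assumptions chosen for the example. Each released count spends part of the budget (epsilon), and a smaller epsilon means more noise.

```python
import random


class PrivacyBudget:
    """Tracks how much of a total epsilon budget has been spent."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted; statistic suppressed")
        self.spent += epsilon


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, budget, rng):
    """Release a count with Laplace noise; a simple count has sensitivity 1."""
    budget.spend(epsilon)
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the example is repeatable
budget = PrivacyBudget(total_epsilon=1.0)

release_a = noisy_count(4, 0.5, budget, rng)
release_b = noisy_count(150, 0.5, budget, rng)
print(f"released: {release_a:.1f}, {release_b:.1f}; spent {budget.spent}")

try:
    noisy_count(37, 0.5, budget, rng)
except RuntimeError as err:
    print("third release refused:", err)
```

Unlike suppression or rounding, the noise scale here (1 / epsilon) is known in advance, so the amount of injected error can be stated alongside the table. Note also that the same absolute noise distorts the small count of 4 far more, in relative terms, than the count of 150.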
For most users, the practical implications are threefold:
- Most tables published at detailed geographic levels will still contain injected error or noise, as they have in the past.
- The amount of noise injected into a table will be published, expressed as a percentage.
- Detailed cross-tabulations (such as age by sex by race) or highly detailed tables (such as ancestry) may be suppressed, either because the amount of noise to be injected exceeds the thresholds established by the Census or because releasing them would exhaust the privacy budget for one or more elements or respondents.
For a complete treatment of how data privacy concerns are being addressed by federal agencies, take a look at https://nces.ed.gov/FCSM/pdf/spwp22.pdf?#.