Run the Disclosure Risk Assessment
There are a number of different methods that can be used to evaluate the probability of individuals within a dataset being correctly re-identified. Watch the video to learn more about these different methods and how they are applied.
Key Takeaways
Use different assessment methods for continuous and categorical variables.
There are different disclosure risk assessment methods for continuous and categorical key variables. For categorical key variables, the assessment is based on the concept of uniqueness: the rarer a combination of key variable values (for example: 15, female, widowed), the higher the risk of disclosure. For continuous variables, which can take an infinite number of values, the concept of uniqueness of a key is not helpful because every respondent could have a unique value for these variables. Most disclosure risk measures for continuous variables are a posteriori measures, meaning they compare the treated data with the original data. For this reason, they are not useful for assessing the initial disclosure risk but can instead be used to evaluate disclosure risk after the data has been treated.
Don’t forget continuous key variables.
We focus on categorical variables because they are more prevalent in humanitarian datasets, but that doesn’t mean that you should ignore continuous key variables. One way to include these variables in a disclosure risk assessment is to transform each continuous variable into a categorical variable by creating intervals (income brackets, age ranges, etc.). If you don’t want to do this, outlier detection is one way to assess the disclosure risk of continuous key variables. You can also apply Statistical Disclosure Control (SDC) methods for continuous variables and then use risk assessment techniques like record linkage to evaluate the difference between the original and treated data.
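The interval approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended set of brackets: the edges and labels in AGE_EDGES and AGE_LABELS, and the helper name to_age_range, are hypothetical and should be adapted to your own data.

```python
from bisect import bisect_right

# Hypothetical bracket edges (assumption, not from the source):
# [0, 18), [18, 40), [40, 60), [60, +inf)
AGE_EDGES = [18, 40, 60]
AGE_LABELS = ["0-17", "18-39", "40-59", "60+"]

def to_age_range(age, edges=AGE_EDGES, labels=AGE_LABELS):
    """Map a continuous age onto a categorical age range."""
    # bisect_right counts how many edges are <= age, which is the bracket index.
    return labels[bisect_right(edges, age)]

ages = [15, 17.5, 18, 39, 40, 72]
age_ranges = [to_age_range(a) for a in ages]
print(age_ranges)
```

Wider brackets place more respondents in each category, which generally lowers disclosure risk at the cost of detail.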
Calculate the Individual Disclosure Risk.
The Individual Disclosure Risk is the probability that an individual within a dataset could be correctly re-identified. The main factors influencing the individual risk calculation are the sample frequencies (the number of individuals that share a combination of key variables in the sample) and the sample weights (the number of individuals in the population that each sampled record represents). When individuals with rare combinations of key variables also have small sample weights, they will have a relatively high individual disclosure risk. In other words, if the number of individuals with this specific combination of key variables is expected to be low in the population, this increases the risk that they can be correctly re-identified.
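To make the role of sample weights concrete, the sketch below uses a deliberately naive approximation: it estimates the population frequency of each key as the sum of the sample weights of the records sharing it, and takes the individual risk as one over that estimate. This simplification is an assumption made for illustration; dedicated SDC tools typically use more elaborate model-based estimators. The records, field names, and the function individual_risks are hypothetical.

```python
from collections import defaultdict

# Hypothetical survey records; "weight" is the sample weight.
records = [
    {"age": "15-19", "sex": "female", "marital": "widowed", "weight": 20.0},
    {"age": "30-39", "sex": "male", "marital": "married", "weight": 100.0},
    {"age": "30-39", "sex": "male", "marital": "married", "weight": 150.0},
    {"age": "40-49", "sex": "female", "marital": "married", "weight": 500.0},
]
KEY_VARS = ("age", "sex", "marital")

def individual_risks(records, key_vars, weight_field="weight"):
    """Naive individual risk: 1 / estimated population frequency of the key."""
    population_freq = defaultdict(float)
    for record in records:
        key = tuple(record[v] for v in key_vars)
        population_freq[key] += record[weight_field]
    # Cap at 1.0 so the result remains a valid probability.
    return [
        min(1.0, 1.0 / population_freq[tuple(r[v] for v in key_vars)])
        for r in records
    ]

risks = individual_risks(records, KEY_VARS)
for record, risk in zip(records, risks):
    print(tuple(record[v] for v in KEY_VARS), round(risk, 4))
```

In this toy data, the first record has both a rare key and a small weight, so it receives the highest estimated risk.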
Review k-anonymity, a common risk measure for categorical data.
To achieve k-anonymity, there need to be at least k individuals in the dataset that share each combination of values for the selected key variables. A record that has the same key as two other individuals in the dataset would satisfy 3-anonymity because there are at least three (k) individuals in the dataset with that key. A record that violates 2-anonymity is said to be a unique record because it is the only record in the dataset with that specific key.
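The check described above amounts to counting how many records share each key and flagging those whose count falls below k. The following is a minimal sketch; the data, key variables, and the function name records_violating_k are hypothetical.

```python
from collections import Counter

# Hypothetical records described by two categorical key variables.
records = [
    {"sex": "female", "region": "north"},
    {"sex": "female", "region": "north"},
    {"sex": "female", "region": "north"},
    {"sex": "male", "region": "south"},
    {"sex": "male", "region": "south"},
    {"sex": "male", "region": "north"},
]
KEY_VARS = ("sex", "region")

def records_violating_k(records, key_vars, k):
    """Return the indices of records whose key is shared by fewer than k records."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]

violating_2 = records_violating_k(records, KEY_VARS, k=2)
violating_3 = records_violating_k(records, KEY_VARS, k=3)
print(violating_2)  # unique records
print(violating_3)
```

Here only the last record is unique, while the two (male, south) records satisfy 2-anonymity but not 3-anonymity.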
Calculate the sample and population frequency of keys.
Each distinct combination of key variable values is called a key. One way to assess disclosure risk is to calculate the frequency of each key within the dataset and, if working with a sample, to estimate its frequency within the population. As a general rule, the more respondents share a key, the lower the risk of a disclosure taking place.
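As an illustration of the two frequencies, the sketch below counts records per key for the sample frequency and sums sample weights per key as a simple estimate of the population frequency. Using the weight sum as the population estimate is an assumption made here for clarity; the records, field names, and the function key_frequencies are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sampled records; "weight" is the sample weight.
records = [
    {"age": "15-19", "sex": "female", "marital": "widowed", "weight": 20.0},
    {"age": "30-39", "sex": "male", "marital": "married", "weight": 100.0},
    {"age": "30-39", "sex": "male", "marital": "married", "weight": 150.0},
    {"age": "40-49", "sex": "female", "marital": "married", "weight": 500.0},
]
KEY_VARS = ("age", "sex", "marital")

def key_frequencies(records, key_vars, weight_field="weight"):
    """Return (sample frequency, estimated population frequency) per key."""
    sample_freq = Counter()
    population_freq = defaultdict(float)
    for record in records:
        key = tuple(record[v] for v in key_vars)
        sample_freq[key] += 1                           # records in the sample
        population_freq[key] += record[weight_field]    # weighted estimate
    return sample_freq, dict(population_freq)

sample_freq, population_freq = key_frequencies(records, KEY_VARS)
for key in sample_freq:
    print(key, sample_freq[key], population_freq[key])
```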
Calculate the Global Disclosure Risk.
Individual disclosure risk measures are useful for identifying high-risk records. These individual risk measures can also be aggregated to obtain a global disclosure risk measure for the entire file. A straightforward way of calculating global risk is to take the average (mean) of the individual risks.
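A short sketch of this aggregation is shown below. The individual risk values are made-up numbers for illustration only. The sum of the individual risks, which can be read as the expected number of re-identifications in the file, is included as a related aggregate.

```python
from statistics import fmean

# Hypothetical individual disclosure risks, one per record.
individual_risks = [0.05, 0.004, 0.004, 0.002]

# Global risk as the mean of the individual risks.
global_risk = fmean(individual_risks)

# Related aggregate: expected number of re-identified records.
expected_reidentifications = sum(individual_risks)

print(f"Global risk: {global_risk:.4f}")
print(f"Expected re-identifications: {expected_reidentifications:.4f}")
```

Because a mean can hide a small number of very risky records, it is worth reviewing the highest individual risks alongside the global figure.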
General Questions
Given long computation times for some methods, it is recommended, where possible, to first test the SDC methods on a subset or sample of the microdata, and then choose the appropriate SDC methods.
We recommend that you develop more than one disclosure risk scenario and conduct the assessment on each. To develop a disclosure scenario, you will need to think through the motivations of malicious actors, describe the data that they may have access to, and articulate how this and other publicly available data could be linked to your data and lead to disclosure.