Selecting Your Key Variables
The first step in a disclosure risk assessment is the selection of key variables. These are the variables, or the columns in your dataset, that are most likely to lead to the disclosure of confidential information, including an individual’s identity. Watch this video to learn more about different types of variables and how to select your key variables.
Key Takeaways
Classify your variables as identifying and non-identifying.
Identifying variables contain information that can lead to the identification of respondents in the dataset. These can be further categorized as either direct identifiers or indirect identifiers (also referred to as quasi-identifiers). Remember, direct identifiers such as full names, addresses, phone numbers and GPS coordinates should always be removed from the microdata before starting the risk assessment.
Select your key variables.
A key variable is typically an indirect identifier that could be used to re-identify individuals within a dataset or to link records between different datasets. Common examples of key variables are age, marital status, geographical variables, gender, and religion. Removing all indirect identifiers from a dataset is likely to severely limit the analytical value of the dataset. The SDC process is intended to assess the disclosure risk presented by the indirect identifiers and to take steps to limit that risk while maintaining the analytic power of the data.
Note whether your key variables are continuous or categorical.
You will use different techniques to assess the disclosure risk of continuous and categorical variables. Categorical variables take values from a finite set (e.g. gender), whereas continuous variables are numeric variables that can take an infinite number of values (e.g. income). Continuous variables can be transformed into categorical variables by creating intervals (e.g. income brackets).
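As a minimal sketch of turning a continuous variable into a categorical one, the Python snippet below assigns incomes to brackets. The bracket edges, labels, and sample incomes are hypothetical and chosen only for illustration; suitable intervals depend on your own data and context.

```python
from bisect import bisect_right

# Hypothetical bracket edges (lower bounds of every bracket after the first).
BRACKET_EDGES = [1000, 5000, 20000]
BRACKET_LABELS = [
    "under 1000",
    "1000 to under 5000",
    "5000 to under 20000",
    "20000 or more",
]


def income_bracket(income):
    """Return the label of the interval that contains `income`."""
    # bisect_right counts how many edges are <= income, which is
    # exactly the index of the bracket the value falls into.
    return BRACKET_LABELS[bisect_right(BRACKET_EDGES, income)]


# Made-up incomes, used only to show the recoding.
incomes = [250, 1000, 4999.5, 20000, 87000]
brackets = [income_bracket(x) for x in incomes]
```

Wider brackets place more respondents in each category, at the cost of losing detail for analysis.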
The sensitivity of indirect identifiers depends on the context.
Direct identifiers are always considered sensitive while the sensitivity of indirect identifiers is often context-specific. This is why it is important to understand both the data environment and the real-life situation when selecting your key variables. Keep in mind that even when indirect identifiers are not themselves sensitive, it may be possible to combine them with other variables to lead to the disclosure of sensitive information.
Pay close attention to exclusive or partial variables.
While you do not want to remove all indirect identifiers, it may be important to remove some. For example, you may want to consider removing variables with many missing values, such as a variable recorded only for a select group.
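One way to spot such variables is to compute the share of missing values per column. The sketch below is illustrative only: the records, the variable names, and the 50% threshold are all assumptions, and a flagged variable is a candidate for review rather than for automatic removal.

```python
# Made-up records; `school_grade` is recorded only for school-age respondents.
records = [
    {"age_group": "5-14", "gender": "Female", "school_grade": "3"},
    {"age_group": "30-39", "gender": "Male", "school_grade": None},
    {"age_group": "30-39", "gender": "Female", "school_grade": None},
    {"age_group": "60+", "gender": "Male", "school_grade": None},
]

# Hypothetical cut-off: flag variables missing in more than half of the records.
MISSING_THRESHOLD = 0.5


def missing_share(rows, variable):
    """Return the fraction of rows in which `variable` is missing (None)."""
    missing = sum(1 for row in rows if row.get(variable) is None)
    return missing / len(rows)


# Variables whose missing share exceeds the threshold.
flagged = [
    variable
    for variable in records[0]
    if missing_share(records, variable) > MISSING_THRESHOLD
]
```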
General Questions
Key variables are the indirect identifiers that are most likely to lead to a disclosure, whereas keys are the unique combinations of values that those indirect identifiers take. For the key variables ‘Marital Status’ and ‘Gender’ you could have keys such as ‘Married, Female’, ‘Married, Male’ and ‘Single, Female’. The number of times a given key appears in a dataset, or its frequency, is the basis for many disclosure risk measures.
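To make the distinction concrete, the following Python sketch counts how often each key occurs for the key variables ‘Marital Status’ and ‘Gender’. The four records are invented for illustration, and the helper name `key_of` is hypothetical.

```python
from collections import Counter

# Made-up microdata with two key variables.
records = [
    {"marital_status": "Married", "gender": "Female"},
    {"marital_status": "Married", "gender": "Male"},
    {"marital_status": "Married", "gender": "Female"},
    {"marital_status": "Single", "gender": "Female"},
]
key_variables = ["marital_status", "gender"]


def key_of(row, variables):
    """Return the key of a record: its values on the key variables, in order."""
    return tuple(row[v] for v in variables)


# Frequency of each key, i.e. how many records share that combination of values.
key_frequencies = Counter(key_of(row, key_variables) for row in records)
```

Here the key ‘Married, Female’ has a frequency of 2, while ‘Married, Male’ and ‘Single, Female’ each appear once.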
Selecting key variables takes some practice. When in doubt, we recommend working with a few colleagues to do the selection. You can also select different sets of key variables and run a disclosure risk assessment on each. Finally, remember that it is important to understand the data environment before selecting the key variables. Selecting key variables correctly requires you to make assumptions about the data that others are likely to have access to, as well as about whether specific data is sensitive in your context (even if it might not be considered sensitive in another context).
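Comparing candidate sets of key variables could look like the sketch below. It uses one deliberately simple frequency-based indicator, the number of records whose key appears only once, as a stand-in for a full risk assessment. The records, the set names "A" and "B", and the function name are assumptions made for illustration.

```python
from collections import Counter

# Made-up microdata with three indirect identifiers.
records = [
    {"age_group": "20-29", "gender": "Female", "region": "North"},
    {"age_group": "20-29", "gender": "Female", "region": "South"},
    {"age_group": "30-39", "gender": "Male", "region": "North"},
    {"age_group": "30-39", "gender": "Male", "region": "North"},
    {"age_group": "30-39", "gender": "Female", "region": "South"},
]

# Two hypothetical candidate sets of key variables to compare.
candidate_sets = {
    "A": ["age_group", "gender"],
    "B": ["age_group", "gender", "region"],
}


def count_unique_records(rows, variables):
    """Count records whose key (values on `variables`) occurs exactly once."""
    keys = [tuple(row[v] for v in variables) for row in rows]
    frequencies = Counter(keys)
    return sum(1 for key in keys if frequencies[key] == 1)


# Number of records with a unique key under each candidate set.
unique_counts = {
    name: count_unique_records(records, variables)
    for name, variables in candidate_sets.items()
}
```

In this toy data, adding the region variable raises the number of records with a unique key from 1 to 3, which illustrates how the choice of key variables changes the assessed risk.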