The mosaic effect occurs when multiple datasets are linked to reveal significant new information. While such information could be used to gain insight, it could also be used by bad actors to do harm. In a humanitarian context, this could happen through the combination of key variables, such as age and gender, from different surveys to reveal the identity and location of people from an ethnic minority, for instance. The challenge is to understand when this can occur and what to do about it.

Given that the Humanitarian Data Exchange (HDX) now hosts almost 20,000 public datasets, we have been considering how the mosaic effect might apply to the platform. The HDX team checks all datasets for personal or sensitive data, and uses a process to assess the disclosure risk of individual datasets. If there is a risk of re-identifying a person or group, we offer to work with the contributing organization to anonymize the data so that it can be shared. This generally involves microdata, that is, data collected from surveys or needs assessments. The HDX team does not currently assess the potential for information disclosure across multiple datasets.

In partnership with the Technical University of Delft HumTech Lab, we took a deeper look at this issue, starting with a literature review. The term ‘mosaic effect’ was first referenced in legal documents pertaining to data use in intelligence agencies. The method that allows for the mosaic effect is referred to as a ‘data linkage attack’. 

The following are four types of linkage attacks, as applied to humanitarian operations:

  • A database cross match where two datasets are combined through several variables. This could occur when two humanitarian organizations publish datasets related to the same refugee population.
  • A specific individual match where the intention is to obtain additional information about a targeted individual. This could take place when an actor already has some information about an individual (such as their date of arrival in a camp).
  • An arbitrary individual match where the goal is to discredit a humanitarian organization by showing that affected people can be identified in their data.
  • A specific group match where a group is identified through the combination of datasets. For example, insurgents might look to recruit young men within a certain age range.
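
To make the first attack type concrete, here is a minimal sketch of a database cross match in Python. The two surveys, their column names and their values are invented for illustration and do not come from HDX. The sketch assumes both datasets share the key variables age, gender and district, and it links only those records whose combination of key values is unique in both datasets.

```python
from collections import defaultdict


def group_by_keys(rows, keys):
    """Group records by their combination of key-variable values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return groups


def cross_match(left, right, keys):
    """Link records whose key combination is unique in both datasets."""
    left_groups = group_by_keys(left, keys)
    right_groups = group_by_keys(right, keys)
    linked = []
    for combo, left_rows in left_groups.items():
        right_rows = right_groups.get(combo, [])
        # A one-to-one match suggests both records describe the same person.
        if len(left_rows) == 1 and len(right_rows) == 1:
            linked.append({**left_rows[0], **right_rows[0]})
    return linked


# Hypothetical survey extracts published by two different organizations.
survey_a = [
    {"age": 34, "gender": "F", "district": "North", "ethnicity": "group_x"},
    {"age": 27, "gender": "M", "district": "North", "ethnicity": "group_y"},
    {"age": 27, "gender": "M", "district": "North", "ethnicity": "group_x"},
]
survey_b = [
    {"age": 34, "gender": "F", "district": "North", "shelter_block": "B-12"},
    {"age": 27, "gender": "M", "district": "North", "shelter_block": "C-03"},
    {"age": 51, "gender": "F", "district": "South", "shelter_block": "A-07"},
]

linked = cross_match(survey_a, survey_b, ["age", "gender", "district"])
for record in linked:
    print(record)
```

In this toy example only the 34-year-old woman is linked, because the two 27-year-old men in the first survey cannot be told apart. Neither dataset alone connects ethnicity to a shelter location, but the linked record does.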

Understanding these attack types helps identify risks related to the publication of humanitarian data. As part of our collaboration, master's students from the Technical University of Delft examined the potential linkages of over 200 datasets on HDX related to one humanitarian crisis. The students explored the data through ‘specific group matching’ and ‘database cross matching’. They also developed a network graph showing areas of likely disclosure based on the density of the linked datasets.

One of the Centre’s 2020 Data Fellows, Carol McInerney, is taking this work forward by exploring methods to assess and manage the mosaic effect on HDX. She has focused on a data environment analysis, which enables the identification of datasets that have information in common, a prerequisite for data linkage. This type of analysis is typically the most resource-intensive step of a disclosure risk assessment, and can be done by studying networks of linked datasets.

The following image shows a network analysis of data from one humanitarian response context. Each node represents one dataset, and an edge indicates that two datasets have column headers in common. To create this network, over 400 datasets were analysed. Over 90% of them (377) share information with at least one other dataset.

Network Image: Carol McInerney
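
A network like this can be approximated with a few lines of Python. The sketch below is not the code behind the image. It assumes each dataset is described only by its list of column headers, that headers match after trimming whitespace and lowercasing, and the dataset names and headers are invented for illustration.

```python
from itertools import combinations


def normalise(header):
    """Treat headers as equal if they differ only in case or spacing."""
    return header.strip().lower()


def build_network(datasets):
    """Return edges as {(name_a, name_b): shared_headers}.

    `datasets` maps a dataset name to its list of column headers.
    An edge exists when two datasets share at least one header.
    """
    columns = {
        name: {normalise(h) for h in headers}
        for name, headers in datasets.items()
    }
    edges = {}
    for a, b in combinations(sorted(columns), 2):
        shared = columns[a] & columns[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def linked_share(datasets, edges):
    """Fraction of datasets connected to at least one other dataset."""
    connected = {name for pair in edges for name in pair}
    return len(connected) / len(datasets)


# Hypothetical datasets, each reduced to its column headers.
datasets = {
    "needs_assessment": ["Admin1", "Age", "Gender", "Shelter type"],
    "health_survey": ["age", "gender", "Diagnosis"],
    "camp_registry": ["admin1", "Arrival date"],
    "market_prices": ["Commodity", "Price"],
}

edges = build_network(datasets)
for pair, shared in edges.items():
    print(pair, sorted(shared))
print(f"{linked_share(datasets, edges):.0%} of datasets are linked")
```

Matching on header names alone is a rough proxy. Two datasets can hold the same variable under different names, or unrelated variables under the same name, so a real analysis would likely need a mapping of equivalent headers.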

As a deliverable of her fellowship, Carol is creating a prototype tool and methodology to help the HDX team analyse the existing connections between relevant datasets already on the platform. The tool will then show how this network changes with the addition of a new dataset. The methodology will be validated on a subset of datasets relating to migrant communities. A blog about her work will be released in September. 

To join us at the Data Fellows Showcase on Thursday 30 July and hear Carol discuss more of her work, please register here. If this topic is of interest to you or you have questions about the mosaic effect, write to us at centrehumdata@un.org