This new survey is the first to comprehensively set out how synthetic data is enabling the use of data without sharing data.
Data about individuals, their unique characteristics, preferences, and behaviours, is increasingly abundant. As modern society runs on information flows, the power to deliver data-driven insights using this information is rapidly accelerating. The unprecedented availability of data, coupled with new abilities to utilise data, drives the frontiers of research and innovation, helping us address the most pressing issues of our time—from the climate crisis to the COVID-19 pandemic.
With increased data comes greater data governance responsibility: when using data that is personal or otherwise sensitive, there are inevitable obstacles to navigate, whether legal, technical, ethical or practical. To address these tensions, the Royal Society investigated the role of technology in responsible, privacy-preserving data governance with its 2019 landmark report Protecting privacy in practice: The current use, development and limits of Privacy Enhancing Technologies in data analysis. In that work we highlighted the role of Privacy Enhancing Technologies (PETs) in enabling the derivation of useful results from data without providing wider access to datasets.
Now, in partnership with The Alan Turing Institute, the national institute for data science and artificial intelligence, and under the guidance of an expert Working Group, we are refreshing our work on PETs by considering new developments in both the technologies at play and the policy environment underpinning their implementation. Our new report will consider, with use cases, how these potentially disruptive technologies may play a role in addressing tensions around privacy and utility, enabling data use for wider public benefit. We also expand the remit of our PETs work to include synthetic data for its privacy-preserving potential.
Synthetic data, real insights
While real-world data is generated by real systems (such as medical tests or banking transactions), synthetic data is generated by a mathematical model or algorithm. Privacy-preserving synthetic data often begins with real data: a model is fitted to the real dataset, and new records are then sampled from that model. What makes synthetic data potentially privacy-preserving is that the synthetic dataset does not contain the exact datapoints of the original dataset (technically, real datapoints may be reproduced, but only by chance). Rather, synthetic data retains the statistical properties of the original dataset, or its ‘shape’ (distribution).
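As a minimal sketch of this idea, the Python snippet below fits a very simple model (a normal distribution) to a made-up ‘real’ dataset and samples a synthetic one from it. The data values, the choice of model and all variable names are assumptions for illustration only; they are not taken from the report, and practical generators are considerably more sophisticated.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "real" data: 1,000 test results (illustrative values only).
real = [random.gauss(120.0, 15.0) for _ in range(1000)]

# Fit a very simple model: a normal distribution with the real mean and spread.
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# Sample the synthetic dataset from the fitted model, not from the real records.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

# Count synthetic values that exactly match a real value.
exact_copies = len(set(real) & set(synthetic))

print(f"real:      mean={mu:.1f}, stdev={sigma:.1f}")
print(f"synthetic: mean={statistics.mean(synthetic):.1f}, "
      f"stdev={statistics.stdev(synthetic):.1f}")
print(f"exact copies of real values: {exact_copies}")
```

The two datasets share roughly the same mean and spread (the ‘shape’), while the synthetic values are freshly drawn from the model rather than copied from individual records.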
Synthetic data can be generated so that it preserves information useful to data scientists asking specific questions (eg the relationship between medical diagnoses and a patient’s geolocation). At the same time, in a well-generated synthetic dataset the anonymity of the original dataset is protected because it should not be possible to determine which synthetic datapoints coincide with real ones (in fact, it is most likely that none of the ‘real’ datapoints are reproduced synthetically). This allows a researcher to make queries of the synthetic dataset (for example, trends in medical tests or banking transactions) without seeing sensitive information (such as an individual’s actual medical test results or financial transactions).
Synthetic data can be better for privacy preservation than ‘anonymised’ datasets because removing identifiers is not always enough to safeguard confidentiality. Data scientists and adversaries alike use increasingly sophisticated methods to link (often publicly available) anonymised datasets, showing how the identity of data subjects can be revealed, even by those with little prior knowledge.
Synthetic data is thus said to hold a great deal of promise to enable insights where data is scarce, incomplete or where the privacy of data subjects needs to be preserved. It may also be ‘layered’ with other PETs. When used in Trusted Research Environments, for example, synthetic data may help researchers to refine their queries and build provisional models, thereby enabling experimentation while keeping safe any sensitive data (such as patient data in healthcare settings). So-called synthetic ‘dummy’ data can be used in hackathon-style events and help accelerate the development of innovations (such as in financial services or health research) without risking any breaches of privacy.
New report on synthetic data: What, why, and how?
To investigate the potential for synthetic data in research and innovation, we commissioned a team of synthetic data experts at The Alan Turing Institute to prepare a survey detailing how synthetic data relates to ground truth data, as well as how it could be used in privacy-preserving data analysis and beyond. They also considered how synthetic data may be used to address gaps or biases in datasets, or to mitigate other limitations of ‘real’ data.
We are pleased to announce that this report—the first of its kind to provide a broad survey of synthetic data, its possibilities and limitations—is now in preprint and open access.
The report makes several novel observations about synthetic data. It finds that there are many possibilities for using synthetic data beyond privacy. For example, it has the potential to address issues of fairness, representation and bias through data augmentation. In a similar fashion, synthetic data can also be used to improve the robustness of machine learning systems by diversifying their training data.
However, there are trade-offs and challenges to be further examined, particularly in its use as a privacy-enhancer. The privacy potential of synthetic data depends on the way a synthetic dataset is generated. In other words, synthetic data is not necessarily private by default; rather, it must be generated in a way that sufficiently obscures specific information and individual datapoints. This entails a trade-off between privacy and utility/fidelity: the more accurate a synthetic dataset is, the more it becomes like the real dataset—and so becomes less anonymised.
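The trade-off described above can be illustrated with a deliberately crude toy generator that simply adds random noise to each real value. This is an assumption made for illustration, not a method from the report; the function names `perturb` and `gap_to_source`, the noise levels and the data are all hypothetical. Fidelity is measured here as the error in the standard deviation, and the privacy proxy is the average distance between each synthetic value and the real value it was derived from.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical "real" values (illustrative only).
real = [random.gauss(50.0, 10.0) for _ in range(500)]
real_sd = statistics.stdev(real)


def perturb(values, noise_sd):
    """Toy generator: each real value plus independent Gaussian noise."""
    return [v + random.gauss(0.0, noise_sd) for v in values]


def gap_to_source(source, derived):
    """Privacy proxy: mean distance between each value and its source value."""
    return statistics.mean(abs(d - s) for s, d in zip(source, derived))


results = {}
for noise_sd in (0.0, 5.0, 20.0):
    synthetic = perturb(real, noise_sd)
    fidelity_error = abs(statistics.stdev(synthetic) - real_sd)
    results[noise_sd] = (fidelity_error, gap_to_source(real, synthetic))

for noise_sd, (fidelity_error, gap) in results.items():
    print(f"noise_sd={noise_sd:>4}: stdev error={fidelity_error:.2f}, "
          f"mean gap to source record={gap:.2f}")
```

With no noise the ‘synthetic’ data is simply the real data: perfectly accurate and not private at all. As the noise grows, the values move further from their source records, but the dataset’s spread is increasingly distorted.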
Best practice could entail using a ‘real’ dataset to generate not one, but multiple synthetic datasets, each one to answer a different research question pertaining to fewer variables. Generating a single, all-purpose synthetic dataset based on an original dataset might appeal for simplicity and utility, but a synthetic dataset with many variables (that is, high dimensionality) is more vulnerable to attacks on privacy and may leak information about the original data. A synthetic dataset with the complete utility of the original dataset (one which could answer all the same questions, and to the same degree of accuracy) would essentially be the original dataset, and would raise the same privacy concerns.
Synthetic datasets are distorted versions of original datasets, meaning that outliers (or low-probability events) are difficult to capture in a privacy-preserving way: a synthetic data generator cannot accurately represent these cases while at the same time obscuring them. This trade-off between privacy and accuracy in the representation of the real data is a significant challenge. With currently available technologies, synthetic data must be produced on a use-by-use basis to optimise such trade-offs. Measuring privacy and accuracy to allow for quality guarantees is also difficult; there is no general way to do this and so it must also be done on a case-by-case basis.
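The difficulty with outliers can be seen in a small sketch. Here a simple normal model is fitted to made-up transaction amounts that include one extreme value; the amounts, the model and the variable names are illustrative assumptions, not drawn from the report.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical transaction amounts: 1,000 typical values plus one extreme outlier.
typical = [random.gauss(50.0, 10.0) for _ in range(1000)]
outlier = 500.0
real = typical + [outlier]

# Fit a simple normal model to the real data and sample a synthetic dataset.
mu = statistics.mean(real)
sigma = statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(len(real))]

largest_synthetic = max(synthetic)

print(f"largest real value:      {max(real):.1f}")
print(f"largest synthetic value: {largest_synthetic:.1f}")
print(f"stdev of typical values: {statistics.stdev(typical):.1f}")
print(f"stdev of fitted model:   {sigma:.1f}")
```

The generator obscures the outlier, which helps the individual behind it, but the rare event is then lost to anyone analysing the synthetic data. The outlier also inflates the fitted spread, so the synthetic data is a slightly distorted picture of the typical cases too.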
Due to this reduced accuracy (that is, utility and fidelity), the research finds that synthetic data should not be considered a replacement for ‘real-world’ data, but that it is well placed for use as a ‘tool to accelerate the “research pipeline”’. This could include testing algorithms or first-stage mobile app development, for example, where the risk of using personal data outweighs the benefits.
These challenges demonstrate how, in some cases, PETs not only mitigate the tensions inherent to data use, but also entail their own challenges to be addressed.
Conclusions
While data analytics enable new insights that can benefit society, the use of sensitive data entails new risks. It is increasingly possible to reveal personal information or commercially sensitive material, leading to significant potential harms for individuals and organisations. Clearly, synthetic data has a role to play in navigating the tension between maximising data utility and safeguarding privacy, though not without trade-offs.
At the same time, synthetic data should be considered beyond its role as a privacy-preserver; as with other PETs, it can be a tool for accelerating research, increasing transparency and enhancing representation, enabling the fairer use of data more broadly. We look forward to exploring these themes, including the potential for synthetic data in real-world applications, in our forthcoming synthetic data explainer and wider report on the role of technology in data governance.