The Double-Edged Sword of Synthetic Data in AI Training

Training data serves as the foundation on which AI and machine learning models learn to perform tasks. However, obtaining diverse, representative real-world data can be costly, time-consuming, and in some cases legally or ethically fraught. As a result, organizations have increasingly turned to synthetic data to train their models.

In this blog, we explore the implications of using synthetic data to train AI models by looking at the paper “Self-Consuming Generative Models Go MAD”, published in July 2023 by researchers from Rice University and Stanford University (1). We also identify several scenarios where the use of synthetic data may be more practical than the alternatives, or even unavoidable.

Overall, the referenced paper shows that AI models progressively trained on synthetic data – without enough fresh, real data – ultimately degrade in both quality (precision) and diversity (the range of images or text they can generate). When real data is scarce, or risky or unethical to obtain, synthetic data may be necessary or more practical, and the benefit of reducing model error on specific tasks may outweigh the associated challenges. In these cases, practitioners have options for getting the most out of synthetic data. First, the paper found that the speed of this degradation depends on how much fresh, real data is supplemented: including a fixed real dataset delays the effect, and supplementing with enough fresh, real data prevents it entirely. Additionally, introducing certain sampling biases to select higher-quality training data can help preserve model quality, but at the expense of model diversity (meaning less representation of edge cases).

Examining Synthetic Data 

Synthetic data refers to data that is artificially generated on a computer (typically through statistical methods or AI models) rather than obtained by direct measurement or real-world collection. It is created algorithmically, often using modeling techniques designed to mimic the statistical properties of real data, and is typically used to train AI models. Examples include images for computer vision and text for natural language processing. A recent Gartner study predicts that by 2024, 60% of all data used in AI and analytics projects will be synthetically generated (2).
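At its simplest, “mimicking the statistical properties of real data” can mean fitting a distribution to real measurements and sampling from it. The following minimal Python sketch is our own illustration (the numbers are invented, and it assumes a single Gaussian feature is an adequate model):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for real measurements we want to mimic (invented for illustration).
real_data = rng.normal(loc=5.0, scale=2.0, size=1_000)

# "Fit" a simple statistical model to the real data: here, estimate the
# mean and standard deviation of a Gaussian.
mu, sigma = real_data.mean(), real_data.std()

# Sample synthetic records that share those statistical properties.
synthetic_data = rng.normal(loc=mu, scale=sigma, size=10_000)

print(f"real:      mean={real_data.mean():.2f}, std={real_data.std():.2f}")
print(f"synthetic: mean={synthetic_data.mean():.2f}, std={synthetic_data.std():.2f}")
```

Real generators (GANs, diffusion models, LLMs) are far more sophisticated, but the principle is the same: learn the properties of real data, then sample new records from the learned model.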

As synthetic data has rapidly proliferated with the increasing use of GenAI, researchers are closely examining the significance of “ground truth” data in data science – and the ramifications of deviating from it. How closely does synthetic data mimic real-world data in terms of accuracy and diversity, and can models trained on it generalize effectively to real-world scenarios? What happens to the quality of a model when it is trained on substantial amounts of synthetic data? What happens if AI models are trained on their own outputs? Although this article approaches the subject from a technical perspective, there is also a more fundamental challenge to our understanding of what constitutes truth in a world where more and more content is AI-generated.

GenAI Goes ‘MAD’ – A Recent Study

In July 2023, researchers at Rice University and Stanford University published the paper referenced above, titled “Self-Consuming Generative Models Go MAD”, examining the impact of training AI models on synthetic vs. real data. The paper discusses three variations of data loops, each representing a different way synthetic and real data are combined in training sets over generations: the fully synthetic loop, the synthetic augmentation loop, and the fresh data loop.

A) Fully synthetic loop: each generation's training dataset is composed exclusively of synthetic data produced by the preceding generations' models.

B) Synthetic augmentation loop: each generation's training dataset mixes synthetic data from earlier models with a fixed (as opposed to fresh) set of real training data; the same real data is reused across generations.

C) Fresh data loop: each generation's training dataset combines synthetic data from previous generations' models with a fresh, new batch of real training data.

Overall, the study found that for each of these variations, without enough fresh, ground-truth data in each generation of model training, future generative models face a decrease in their quality (precision) or diversity (recall). In other words, the researchers observed a significant, progressive reduction in core capabilities of the models through each generation in loops without enough fresh, real data. The authors coined the term Model Autophagy Disorder (MAD) to describe this state, where a self-consuming loop leads to a decline in the generative models' capabilities. “Autophagy” is a term borrowed from biology, referring to a process by which cells consume their own components.
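To build intuition for why a fully synthetic loop collapses, here is a toy, one-dimensional simulation (our own illustrative sketch, not code from the paper): the “generative model” is just a Gaussian fitted to the previous generation's data, and each new generation trains only on samples drawn from it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N = 50            # samples per generation (kept small to make the effect visible)
GENERATIONS = 100

# Generation 0 trains on real, ground-truth data.
data = rng.normal(loc=0.0, scale=1.0, size=N)

for gen in range(1, GENERATIONS + 1):
    # "Train" the model: estimate mean and std from the current training set.
    mu, sigma = data.mean(), data.std()
    # Fully synthetic loop: the next generation trains only on samples
    # drawn from the previous generation's model, with no fresh real data.
    data = rng.normal(loc=mu, scale=sigma, size=N)
    if gen % 10 == 0:
        print(f"generation {gen:3d}: std = {data.std():.3f}")

# In expectation the variance shrinks every generation (by a factor of
# (N - 1) / N in this toy setup), so diversity collapses over time:
# a simple analogue of a model going MAD.
```

In the paper's terms, supplementing each generation with enough fresh draws from the true distribution (a fresh data loop) prevents this collapse.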

The study also found that the quality of a training dataset can be controlled through a sampling bias, at the expense of data diversity. When building a training dataset, practitioners or algorithms introduce a sampling bias by selecting the samples deemed higher quality. This bias helps maintain the quality of the model, yet inevitably limits the range of images and text it can produce. The authors state that “without sampling bias, autophagy can lead to a rapid decline of both quality and diversity, whereas, with sampling bias, quality can be maintained but diversity degrades even more rapidly.”
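The effect of such a bias can be seen in a toy extension of the sketch above (again our own illustration, using an assumed, simplistic quality score): at each generation we draw candidate samples and keep only those closest to the mean, treating them as “higher quality.”

```python
import numpy as np

rng = np.random.default_rng(seed=1)

N_CANDIDATES, KEEP = 200, 100   # generate 200 candidates, keep the "best" 100

data = rng.normal(loc=0.0, scale=1.0, size=N_CANDIDATES)

for gen in range(1, 11):
    mu, sigma = data.mean(), data.std()
    candidates = rng.normal(loc=mu, scale=sigma, size=N_CANDIDATES)
    # Sampling bias: treat samples closest to the mean as "higher quality"
    # and keep only those for the next generation's training set.
    quality = -np.abs(candidates - mu)    # assumed, simplistic quality score
    data = candidates[np.argsort(quality)[-KEEP:]]
    if gen % 2 == 0:
        print(f"generation {gen:2d}: std = {data.std():.3f}")

# "Quality" (closeness to the mode) stays high, but the tails are truncated
# every generation, so diversity (the std) collapses far faster than before.
```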

Note that any model that is not updated with new data (fresh or synthetic) will degrade over time as the real-world data distribution shifts. This effect, referred to as model drift, is caused by factors other than those described here.

When the Benefits of Synthetic Data May Outweigh Risks

While we’ve seen that relying only on synthetic data can cause major problems, it's important to acknowledge the practicality and possible necessity of synthetic data in certain contexts.

Some specific examples include:

  • Filling in model blind spots: Models often have blind spots due to a lack of diversity in the training data or an overrepresentation of certain classes or outcomes, which can lead to biased predictions or model errors. Synthetic data can be used to create samples for underrepresented scenarios, increasing model accuracy and giving the model a more balanced dataset to learn from (a minimal balancing sketch follows this list). Medical studies, for example, may be performed on the population at a particular institution, but that population may not be representative of the population at large, which can lead to incorrect predictions. Often, the benefit of filling these blind spots (and reducing errors) outweighs the risks associated with synthetic data.

  • Scenarios involving dangerous conditions: There are situations where using synthetic data avoids exposing individuals to risk, such as training autonomous vehicles for emergency responses or simulating natural disasters for urban planning.

  • Sensitive or difficult-to-obtain data: Certain data types – such as in highly specialized medical research or rare astronomical phenomena – may be difficult or impossible to obtain, necessitating the use of synthetic data.

  • Population studies: A technique known as Bayesian Improved Surname Geocoding (BISG) is often used by U.S. organizations to produce accurate, cost-effective estimates of racial and ethnic disparities by probabilistically imputing demographic attributes from census datasets. This approach is used in voting analysis, health studies, and insurance underwriting.
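As a concrete illustration of the blind-spot case above, the sketch below uses SMOTE from the open-source imbalanced-learn library to synthesize minority-class samples for a hypothetical imbalanced dataset (the data and class sizes are invented for illustration):

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

rng = np.random.default_rng(seed=0)

# Hypothetical imbalanced dataset: 950 "common" cases vs. 50 "rare" cases,
# e.g. an underrepresented patient group in a single-institution study.
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 4)),
               rng.normal(3.0, 1.0, size=(50, 4))])
y = np.array([0] * 950 + [1] * 50)

print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between
# real minority-class neighbors, yielding a balanced training set.
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)

print("after: ", Counter(y_balanced))
```

Because the synthetic samples are interpolations of real minority-class records, this kind of augmentation stays anchored to ground truth while still filling the blind spot.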

In Conclusion

An exploration of synthetic data's role in training AI models reveals that training models only on synthetic data carries real risks, as evidenced by the “Self-Consuming Generative Models Go MAD” paper. The benefits of synthetic data are enticing, but organizations should understand that model degradation is a real possibility when training on synthetic data unless it is combined with enough fresh, real data. We discussed clear scenarios where synthetic data is hard to avoid – including when real data is scarce, when its collection is risky or unethical, or when a lack of data is driving model error. Organizations should implement general best practices around model management when training with synthetic data to minimize these performance challenges.

(1) https://arxiv.org/abs/2307.01850
(2) https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/


