Synthetic Data: How AI Is Learning from Imaginary Worlds
The internet has a finite amount of high-quality text. We may have already used most of it. The response from AI labs is to generate training data from models themselves. This is either a breakthrough or a dangerous feedback loop — possibly both.
The scaling hypothesis that drove AI progress from 2018 to 2024 rested on a simple premise: more data, more compute, better models. The compute side scaled. The data side has a ceiling. There is only so much high-quality text on the internet, and the estimates of when frontier models will exhaust that supply range from "already" to "within a few years."[1]
The industry's response to this constraint has been synthetic data: generating training examples using AI systems themselves. The approach ranges from simple data augmentation — paraphrasing existing examples — to fully synthetic dataset generation, where a capable model produces thousands of reasoning examples, code solutions, or dialogue transcripts that are then used to train the next model.
This is simultaneously one of the most promising and most theoretically fraught developments in current AI research.
What synthetic data is and how it is generated
Synthetic data for AI training encompasses several distinct techniques:
- Augmentation — modifying existing real examples: paraphrasing, back-translation, noise injection, synonym replacement. Long-established in NLP and computer vision.
- Model-generated completions — prompting a capable model to generate answers, solutions, or continuations that can serve as training labels for a smaller model.
- Self-play and self-improvement — a model generating its own training signal by solving problems, checking its answers, and learning from its corrections. AlphaGo/AlphaZero's self-play approach is the archetypal example; its extension to language reasoning is more recent.
- Fully synthetic datasets — generating entire datasets from scratch using templates, samplers, or generative models, with no real-world source data.
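The first technique in the list, augmentation, is simple enough to sketch in a few lines. The example below is a minimal illustration of synonym replacement and noise injection (random word deletion), not a production pipeline. The `SYNONYMS` table and the function names `synonym_replace`, `drop_words`, and `augment` are hypothetical, introduced only for this sketch.

```python
import random

# Hypothetical toy synonym table. A real pipeline would more likely use a
# thesaurus resource or a paraphrasing model.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "model": ["system"],
    "improves": ["enhances", "boosts"],
}


def synonym_replace(text: str, rng: random.Random, prob: float = 0.5) -> str:
    """Replace each word that has a listed synonym with probability `prob`."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def drop_words(text: str, rng: random.Random, prob: float = 0.1) -> str:
    """Noise injection: delete each word with probability `prob`.

    If every word would be deleted, the first word is kept instead.
    """
    words = text.split()
    kept = [w for w in words if rng.random() >= prob]
    return " ".join(kept or words[:1])


def augment(text: str, n_variants: int, seed: int = 0) -> list[str]:
    """Produce `n_variants` perturbed copies of one real example."""
    rng = random.Random(seed)
    return [drop_words(synonym_replace(text, rng), rng) for _ in range(n_variants)]


if __name__ == "__main__":
    for variant in augment("The quick model improves with more data", 3):
        print(variant)
```

Every variant here stays anchored to one real example, which is what separates augmentation from the fully synthetic end of the spectrum.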
Microsoft's Phi series of small language models demonstrated that synthetic data could be transformative.[2] Phi-1, a 1.3 billion parameter model, was trained on a carefully filtered mixture of real code from public repositories and "textbook-quality" synthetic Python textbooks and exercises generated by GPT-3.5. Despite its small size, it reached 50.6% pass@1 on the HumanEval code benchmark, ahead of several much larger models. The key variable was not parameter count but data quality, and synthetic generation had produced data good enough to teach programming at a level exceeding what raw internet code could provide.
Self-improvement: the AlphaCode moment
DeepMind's AlphaCode 2 (2023) performed better than an estimated 85% of human competitors on Codeforces, a competitive programming platform.[3] A central component of the system is generating large numbers of candidate solutions, executing them against tests, and discarding the ones that fail. When the surviving solutions are fed back as training data, the same generate-and-verify loop becomes a form of synthetic self-improvement grounded in verifiable external feedback.
The key property that makes code and mathematics particularly amenable to synthetic self-improvement is verifiability: you can check whether a program produces the correct output, or whether a formal mathematical proof is valid, without human judgment. This external verification signal guards against the most dangerous failure mode of pure synthetic generation, which is the model learning from its own errors. The guard is only as strong as the checker: a weak test suite will still let some wrong solutions through.
Google DeepMind's Gemini 1.5 and subsequent models, along with Anthropic's Claude 3 family, are widely believed to use variants of this approach for their mathematical and coding capabilities, though the precise training details are not fully disclosed.[4]
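The verification step can be sketched as a filter that keeps only candidates passing every test. This is a minimal illustration under stated assumptions, not any lab's actual pipeline: the candidates are hand-written stand-ins for model samples, and the names `passes_all`, `filter_verified`, `CANDIDATES`, and `TESTS` are hypothetical.

```python
from typing import Callable, Iterable

# Each case pairs a tuple of input arguments with the expected output.
Case = tuple[tuple, object]


def passes_all(candidate: Callable, tests: Iterable[Case]) -> bool:
    """Return True only if the candidate gives the expected output on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True


def filter_verified(candidates, tests) -> list[str]:
    """Return the names of candidates that pass every test, in input order."""
    tests = list(tests)
    return [name for name, fn in candidates if passes_all(fn, tests)]


# Hand-written stand-ins for sampled solutions to "add two integers".
# Real model-generated code should be run in a sandbox, not in-process.
CANDIDATES = [
    ("correct", lambda a, b: a + b),
    ("off_by_one", lambda a, b: a + b + 1),
    ("crashes", lambda a, b: a / 0),
    ("wrong_op", lambda a, b: a * b),
]
TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0), ((2, 2), 4)]

if __name__ == "__main__":
    # Only verified candidates would be admitted to the training set.
    print(filter_verified(CANDIDATES, TESTS))
```

Note that `wrong_op` passes the single test `((2, 2), 4)` on its own, which is why the quality of the filter depends on the coverage of the tests.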
The model collapse problem
The most significant theoretical concern with synthetic data is model collapse — the progressive degradation of model capabilities when trained on AI-generated data without sufficient grounding in real human-produced content.
A 2023 preprint from researchers at Oxford, Cambridge, and Toronto, among other institutions, later published in Nature in 2024, formalised this concern.[5] Their analysis showed that models trained on data generated by previous model generations exhibit two failure modes. First, the tails of the distribution are progressively lost, as each generation underrepresents low-frequency but important phenomena. Second, errors and biases in the generated data are amplified across generations. The result, over multiple synthetic training cycles, is a model that has "forgotten" the diversity of the original training distribution.
"Learning from data produced by other neural networks causes a model to misperceive the world, in a way that compounds over generations of training."
— Shumailov et al., The Curse of Recursion, Nature 2024[5]
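The tail-loss mechanism can be reproduced in a toy setting. The sketch below is not the paper's experiment: it assumes the "model" is just a table of token frequencies, refit each generation by counting a finite sample drawn from the previous generation. The function names and the Zipf-like starting distribution are assumptions made for illustration.

```python
import random
from collections import Counter


def next_generation(
    weights: dict[str, float], n_samples: int, rng: random.Random
) -> dict[str, float]:
    """Sample a finite corpus from the current model, then refit by counting."""
    tokens = list(weights)
    corpus = rng.choices(tokens, weights=[weights[t] for t in tokens], k=n_samples)
    counts = Counter(corpus)
    # Tokens that were never sampled get no entry, so they can never return.
    return {tok: c / n_samples for tok, c in counts.items()}


def simulate(
    n_generations: int = 20, n_samples: int = 200, vocab_size: int = 50, seed: int = 0
) -> list[int]:
    """Return the number of distinct tokens the model can produce at each generation."""
    rng = random.Random(seed)
    # Zipf-like "real" distribution: token i has weight proportional to 1 / (i + 1).
    raw = {f"tok{i}": 1.0 / (i + 1) for i in range(vocab_size)}
    total = sum(raw.values())
    weights = {t: w / total for t, w in raw.items()}
    support = [len(weights)]
    for _ in range(n_generations):
        weights = next_generation(weights, n_samples, rng)
        support.append(len(weights))
    return support


if __name__ == "__main__":
    print(simulate())
```

Because a token that draws zero samples in one generation has zero probability in every later one, the vocabulary size in this toy model can only stay flat or shrink, and the rare tokens are the first to go.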
The implications are significant. If the internet is increasingly populated by AI-generated text, a trend already observable in certain domains, then future models trained on web-scraped data may be training on synthetic data without knowing it. The model collapse paper studied mainly the recursive case, in which each generation trains largely on its predecessor's output, and found that retaining some of the original human-written data slowed the degradation. How much unlabelled synthetic content a web-scale corpus can absorb before quality suffers remains an open question.
Mitigations and research directions
The research response to model collapse has been active. Key mitigations identified in the literature include:
- Maintaining access to original data — retaining a substantial fraction of real human-generated data in training mixtures across all generations. The decay is significantly reduced when the original distribution is not fully replaced.
- Verifiable synthetic data — restricting synthetic self-improvement to domains with external ground truth (code, mathematics, formal logic) where errors can be caught before they enter the training distribution.
- Diversity-preserving generation — explicitly sampling from the tails of the distribution during synthetic generation, using techniques like temperature scaling, diverse beam search, or rejection sampling to maintain distributional coverage.
- Provenance tracking — marking AI-generated content at the data level, so training pipelines can control the proportion and source of synthetic content. The C2PA standard for content provenance, supported by Adobe, Microsoft, and others, is one framework being developed for this purpose.[6]
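The first and last mitigations combine naturally: once each example carries a provenance tag, a pipeline can cap the synthetic share of a training mixture. The sketch below is a minimal illustration under stated assumptions. The `Example` record, its `source` tag values, and `build_mixture` are hypothetical and are not part of C2PA or any real data-loading library.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    source: str  # hypothetical provenance tag: "human" or "synthetic"


def build_mixture(
    pool: list[Example], max_synthetic_fraction: float, rng: random.Random
) -> list[Example]:
    """Keep every human example and add synthetic ones up to the given share.

    Examples with any other tag are dropped, on the assumption that
    untagged data cannot be trusted to be human-written.
    """
    if not 0.0 <= max_synthetic_fraction < 1.0:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    human = [ex for ex in pool if ex.source == "human"]
    synthetic = [ex for ex in pool if ex.source == "synthetic"]
    # With h human and s synthetic examples, s / (h + s) <= f
    # rearranges to s <= f * h / (1 - f).
    budget = int(max_synthetic_fraction * len(human) / (1.0 - max_synthetic_fraction))
    chosen = rng.sample(synthetic, min(budget, len(synthetic)))
    mix = human + chosen
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    pool = [Example(f"h{i}", "human") for i in range(30)]
    pool += [Example(f"s{i}", "synthetic") for i in range(50)]
    mix = build_mixture(pool, 0.25, random.Random(0))
    n_syn = sum(ex.source == "synthetic" for ex in mix)
    print(len(mix), n_syn)
```

Keeping all human data and sampling only the synthetic side is one design choice among several. A pipeline short on human data might instead fix the total size and downsample both sources.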
The data provenance crisis
Synthetic data does not exist in isolation from the broader crisis in AI training data provenance. Multiple ongoing legal cases — including cases brought by The New York Times, Getty Images, and class-action suits by authors — contest whether AI training on copyrighted material constitutes infringement.[7] The legal landscape is unsettled and jurisdiction-dependent.
Synthetic data potentially sidesteps some of these issues — content generated by a model does not carry the same copyright status as content authored by a human — but creates new ones. If a model generates synthetic training data that reflects patterns learned from copyrighted material, does that constitute derivative use? The courts have not answered this question. The industry has not waited for them to do so.
The trajectory
Synthetic data is not a stopgap. For specific domains — mathematics, code, formal reasoning, instruction following — it is already producing results that exceed what real data alone can provide. The challenge is preventing the feedback loops that cause collapse, maintaining contact with the distributional richness of real human-generated content, and building the provenance infrastructure that lets researchers understand what their models were actually trained on.
The broader implication is that the simple "more data" paradigm of the early scaling era is over. Data quality, data curation, and data generation have become the central engineering challenges of AI training. The models that will define the next capability threshold will be trained not on a bigger crawl of the internet but on carefully constructed synthetic datasets designed to teach specific capabilities at high efficiency.
References
[1] Villalobos, P. et al. (2022). Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning. arXiv:2211.04325. arxiv.org/abs/2211.04325
[2] Gunasekar, S. et al. (2023). Textbooks Are All You Need. arXiv:2306.11644. arxiv.org/abs/2306.11644
[3] AlphaCode Team, Google DeepMind. (2023). AlphaCode 2 Technical Report. deepmind.com
[4] Anthropic. (2024). The Claude 3 Model Family: Opus, Sonnet, Haiku. Anthropic Technical Report. anthropic.com
[5] Shumailov, I. et al. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755–759. nature.com/articles/s41586-024-07566-y
[6] Coalition for Content Provenance and Authenticity (C2PA). (2023). C2PA Specification v1.3. c2pa.org
[7] Henderson, P. et al. (2023). Foundation Models and Fair Use. arXiv:2303.15715. arxiv.org/abs/2303.15715