June 19, 2023 report

AI models feeding on AI data may face death spiral

by Peter Grad, Tech Xplore

Large language models are generating verbal pollution that threatens to undermine the very data such models are trained on.

That's the conclusion reached by a team of British and Canadian researchers exploring what happens when successive generations of ChatGPT-generated text are culled to train future models.

In a paper published on the arXiv preprint server and titled, "The Curse of Recursion: Training on Generated Data Makes Models Forget," the team predicted that the recursive nature of AI training will eventually lead to "model collapse."

"We discover that learning from data produced by other models causes model collapse—a degenerative process whereby, over time, models forget the true underlying data distribution," the team said.

Team member Ross Anderson, of the University of Cambridge and the University of Edinburgh, likened the effect to the diminishing quality of musical output.

"If you train a music model on Mozart," he said in a personal blog, "you can expect output that's a bit like Mozart but without the sparkle … and if [that version] trains the next generation, and so on, what will the fifth or sixth generation sound like?"

The authors note that model collapse is a threat similar to catastrophic forgetting and data poisoning.

In catastrophic forgetting, a model "forgets" previous data, sometimes abruptly, when learning new information. The impact is compounded over time.

In model collapse, by contrast, the team said, models don't forget previously learned data "but rather start misinterpreting what they believe to be real, by reinforcing their own beliefs."

Data poisoning is the malicious insertion of false information. Of course, this practice predated the use of large language models. But with the use of large-scale web crawls, the insertion of even a small amount of malicious data, the team said, can lead to widespread contamination.

"What is different with the arrival of large language models is the scale at which such poisoning can happen once it is automated," the team said.

Researcher Ilia Shumailov, of the University of Oxford, warned that "major degradation happens within just a few iterations, even when some of the original data is preserved."

"Errors from optimization imperfections, limited models and finite data," he continued, "ultimately cause synthetic data to be of low[er] quality. Over time mistakes compound and ultimately force models that learn from generated data to misperceive reality even further."

The researchers said that the nature of recursive learning is to dispense with low-probability events, referred to by statisticians as "tails of the distribution."

In his blog, Anderson warned, "using model-generated content in training causes irreversible defects. The tails of the original content distribution disappear. Within a few generations, text becomes garbage."
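The loss of tails can be illustrated with a toy simulation. The sketch below is not the researchers' code or experimental setup. It assumes that a one-dimensional Gaussian stands in for the "model," and the function name simulate_collapse, the sample size and the number of generations are illustrative choices. Each generation fits a mean and standard deviation to a small sample drawn from the previous generation's fit, then becomes the data source for the next generation.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns a list of (mean, std) pairs, one per generation, starting
    with the "true" distribution (mean 0.0, std 1.0).
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    mu, sigma = 0.0, 1.0       # generation 0: the original data distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # Draw a finite training set from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model: maximum-likelihood fit of mean and std.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 200):
        mean, std = history[generation]
        print(f"generation {generation:3d}: mean={mean:+.4f} std={std:.6f}")
```

Because every fit sees only a finite sample, the estimated spread tends to shrink from one generation to the next, so values far from the mean become steadily less likely to be generated at all. A real language model is far more complex, but the sketch shows how finite sampling alone can erase rare events.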

"Low-probability events are … vital to understand complex systems," the report noted.

The first large language models were trained on human-generated text. But with the rapid adoption of ChatGPT by industry and general users, enormous amounts of machine-generated text are populating online sites.

The researchers urged that steps be taken to distinguish AI content from human-generated content and that efforts be made to preserve original content for future training purposes.

"Large language models are like fire," team member Anderson said, "a useful tool, but one that pollutes the environment. How will we cope with it?"

More information: Ilia Shumailov et al, The Curse of Recursion: Training on Generated Data Makes Models Forget, arXiv (2023). DOI: 10.48550/arxiv.2305.17493

Journal information: arXiv

Citation: AI models feeding on AI data may face death spiral (2023, June 19) retrieved 17 July 2024 from https://techxplore.com/news/2023-06-ai-death-spiral.html

