July 30, 2024

Breaking MAD: Generative AI could break the internet

by Silvia Cernea Clark, Rice University

Generative artificial intelligence (AI) models like OpenAI's GPT-4o or Stability AI's Stable Diffusion are surprisingly capable at creating new text, code, images and videos. Training them, however, requires such vast amounts of data that developers are already running up against supply limitations and may soon exhaust training resources altogether.

Against this backdrop of data scarcity, using synthetic data to train future generations of AI models may seem like an alluring option to big tech for several reasons: AI-synthesized data is cheaper than real-world data and virtually limitless in supply; it poses fewer privacy risks (as in the case of medical data); and in some cases, it may even improve AI performance.

However, recent work by the Digital Signal Processing group at Rice University has found that a diet of synthetic data can have significant negative impacts on generative AI models' future iterations.

"The problems arise when this synthetic data training is, inevitably, repeated, forming a kind of a feedback loop--what we call an autophagous or 'self-consuming' loop," said Richard Baraniuk, Rice's C. Sidney Burrus Professor of Electrical and Computer Engineering. "Our group has worked extensively on such feedback loops, and the bad news is that even after a few generations of such training, the new models can become irreparably corrupted. This has been termed 'model collapse' by some--most recently by colleagues in the field in the context of large language models (LLMs). We, however, find the term 'Model Autophagy Disorder' (MAD) more apt, by analogy to mad cow disease."

Mad cow disease is a fatal neurodegenerative illness that affects cows and has a human equivalent caused by consuming infected meat. A major outbreak in the 1980s and 1990s drew attention to the fact that mad cow disease proliferated as a result of the practice of feeding cows the processed leftovers of their slaughtered peers--hence the term "autophagy," from the Greek auto-, which means "self," and phagy--"to eat."

"We captured our findings on MADness in a paper presented in May at the International Conference on Learning Representations (ICLR)," Baraniuk said.

The study, titled "Self-Consuming Generative Models Go MAD," is the first peer-reviewed work on AI autophagy and focuses on generative image models like the popular DALL·E 3, Midjourney and Stable Diffusion.

"We chose to work on visual AI models to better highlight the drawbacks of autophagous training, but the same mad cow corruption issues occur with LLMs, as other groups have pointed out," Baraniuk said.

The internet is usually the source of generative AI models' training datasets, so as synthetic data proliferates online, self-consuming loops are likely to emerge with each new generation of a model. To get insight into different scenarios of how this might play out, Baraniuk and his team studied three variations of self-consuming training loops designed to provide a realistic representation of how real and synthetic data are combined into training datasets for generative models:

Fully synthetic loop--Successive generations of a generative model were fed a fully synthetic data diet sampled from prior generations' output.
Synthetic augmentation loop--The training dataset for each generation of the model included a combination of synthetic data sampled from prior generations and a fixed set of real training data.
Fresh data loop--Each generation of the model was trained on a mix of synthetic data from prior generations and a fresh set of real training data.
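
The three loops can be imitated with a toy model. The sketch below is an illustration only, not the study's code or method: it assumes a one-dimensional Gaussian as a stand-in for a generative model, where "training" means estimating a mean and standard deviation and "generating" means drawing samples from them. The names fit, sample and run_loop, and all default values, are hypothetical.

```python
import random
import statistics


def fit(samples):
    """'Train' the toy model: estimate a Gaussian's mean and standard deviation."""
    return statistics.fmean(samples), statistics.stdev(samples)


def sample(model, n, rng):
    """'Generate' n synthetic data points from the fitted toy model."""
    mean, std = model
    return [rng.gauss(mean, std) for _ in range(n)]


def run_loop(kind, generations=1000, n=20, seed=0):
    """Run one self-consuming loop and return the final (mean, std).

    Assumption: real data is standard normal, so a healthy model has std near 1.
    """
    rng = random.Random(seed)

    def draw_real():
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    fixed_real = draw_real()
    model = fit(fixed_real)  # generation 0 is trained on real data only
    for _ in range(generations):
        synthetic = sample(model, n, rng)
        if kind == "fully_synthetic":
            train = synthetic                # synthetic data only
        elif kind == "synthetic_augmentation":
            train = synthetic + fixed_real   # same real set every generation
        elif kind == "fresh_data":
            train = synthetic + draw_real()  # new real samples every generation
        else:
            raise ValueError(f"unknown loop kind: {kind}")
        model = fit(train)
    return model


if __name__ == "__main__":
    for kind in ("fully_synthetic", "synthetic_augmentation", "fresh_data"):
        mean, std = run_loop(kind)
        print(f"{kind}: mean={mean:.4f}, std={std:.4g}")
```

In this toy setting the standard deviation plays the role of diversity. Comparing the final spread across the three variants gives a rough feel for why the supply of real data matters, though a Gaussian cannot reproduce the image artifacts the researchers observed.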

Progressive iterations of the loops revealed that over time and in the absence of sufficient fresh real data, the models would generate increasingly warped outputs lacking either quality, diversity or both. In other words, the more fresh data, the healthier the AI.

Side-by-side comparisons of image datasets resulting from successive generations of a model paint an eerie picture of potential AI futures. Datasets consisting of human faces become increasingly streaked with gridlike scars--what the authors call "generative artifacts"--or look more and more like the same person. Datasets consisting of numbers morph into indecipherable scribbles.

"Our theoretical and empirical analyses have enabled us to extrapolate what might happen as generative models become ubiquitous and train future models in self-consuming loops," Baraniuk said. "Some ramifications are clear: Without enough fresh real data, future generative models are doomed to MADness."

To make these simulations even more realistic, the researchers introduced a sampling bias parameter to account for "cherry picking"--the tendency of users to favor data quality over diversity, i.e., to trade off variety in the types of images and texts in a dataset for images or texts that look or sound good.

The incentive for cherry picking is that data quality is preserved over a greater number of model iterations, but this comes at the expense of an even steeper decline in diversity.
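
A rough way to picture this trade-off is to add a bias factor to a toy fully synthetic loop. The sketch below is an assumption-laden illustration, not the researchers' implementation: it models cherry picking as a hypothetical factor lam between 0 and 1 that narrows the spread of each generation's samples around the model's mean, with lam = 1 meaning no cherry picking. It tracks only diversity (the spread of the data); quality is not modeled.

```python
import random
import statistics


def biased_loop(lam, generations=50, n=100, seed=0):
    """Fully synthetic toy loop with a hypothetical sampling bias factor lam.

    Returns the standard deviation of the training data at each generation.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generation 0: real data
    spreads = []
    for _ in range(generations):
        mean = statistics.fmean(data)
        std = statistics.stdev(data)
        spreads.append(std)
        # Cherry picking (assumed form): sample closer to the mean than the
        # fitted model would on its own.
        data = [rng.gauss(mean, lam * std) for _ in range(n)]
    return spreads


if __name__ == "__main__":
    for lam in (1.0, 0.8):
        spreads = biased_loop(lam)
        print(f"lam={lam}: first std={spreads[0]:.3f}, last std={spreads[-1]:.3g}")
```

Running both settings side by side shows how much faster the spread shrinks once samples are drawn preferentially from near the center of the distribution.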

"One doomsday scenario is that if left uncontrolled for many generations, MAD could poison the data quality and diversity of the entire internet," Baraniuk said. "Short of this, it seems inevitable that as-to-now-unseen unintended consequences will arise from AI autophagy even in the near term."

In addition to Baraniuk, study authors include Rice Ph.D. students Sina Alemohammad, Josue Casco-Rodriguez, Ahmed Imtiaz Humayun and Hossein Babaei; Rice Ph.D. alumnus Lorenzo Luzi; Rice Ph.D. alumnus and current Stanford postdoctoral student Daniel LeJeune; and Simons Postdoctoral Fellow Ali Siahkoohi.

More information: Self-Consuming Generative Models Go MAD (2024)

Provided by Rice University

Citation: Breaking MAD: Generative AI could break the internet (2024, July 30) retrieved 30 July 2024 from https://techxplore.com/news/2024-07-mad-generative-ai-internet.html
