AI models trained on AI-generated data could spiral into unintelligible nonsense, scientists warn

Published Aug 10, 2024

Evan Walker

Artificial Intelligence (AI) systems could slowly trend toward filling the internet with incomprehensible nonsense, new research has warned.

AI models such as GPT-4, which powers ChatGPT, or Claude 3 Opus rely on the many trillions of words shared online to get smarter, but as they gradually colonize the internet with their own output they may create self-damaging feedback loops.

The end result, called "model collapse" by a team of researchers that investigated the phenomenon, could leave the internet filled with unintelligible gibberish if left unchecked. They published their findings July 24 in the journal Nature.

"Imagine taking a picture, scanning it, then printing it out, and then repeating the process. Through this process the scanner and printer will introduce their errors, over time distorting the image," lead author Ilia Shumailov, a computer scientist at the University of Oxford, told Live Science. "Similar things happen in machine learning — models learning from other models absorb errors, introduce their own, over time breaking model utility."
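The copy-of-a-copy effect Shumailov describes can be reproduced in miniature. The Python sketch below is a toy illustration, not the researchers' experiment: the "model" is just a bell curve fitted to a handful of numbers, and each new generation is trained only on numbers produced by the previous one. The function name `simulate_collapse` and its parameters are hypothetical and chosen for this example.

```python
import random
import statistics


def simulate_collapse(generations=200, samples_per_generation=10, seed=0):
    """Refit a Gaussian to its own samples and record the spread each time."""
    rng = random.Random(seed)
    # Generation 0: stand-in for human-made data, drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(samples_per_generation)]
    spreads = []
    for _ in range(generations + 1):
        # "Train" a model: estimate the mean and spread of the current data.
        mean = statistics.fmean(data)
        spread = statistics.pstdev(data)
        spreads.append(spread)
        # The next generation sees only the previous model's output.
        data = [rng.gauss(mean, spread) for _ in range(samples_per_generation)]
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    for generation in (0, 50, 100, 200):
        print(f"generation {generation}: spread = {spreads[generation]:.3g}")
```

Each refit carries a small sampling error, and because later generations never see the original data, those errors compound. In this toy setup the fitted spread tends to shrink toward zero, meaning the "model" gradually forgets the variety present in the data it started from.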

AI systems are built using training data taken from human input, which lets their neural networks learn probabilistic patterns and draw on them when given a prompt. GPT-3.5 was trained on roughly 570 gigabytes of text data from the repository Common Crawl, amounting to roughly 300 billion words, taken from books, online articles, Wikipedia and other web pages.

But this human-generated data is finite and will most likely be exhausted by the end of this decade. Once this has happened, the alternatives will be to begin harvesting private data from users or to feed AI-generated "synthetic" data back into models.