AI models could devour all of the internet’s written knowledge by 2026
Artificial intelligence (AI) systems could devour all of the internet's free knowledge as soon as 2026, a new study has warned.
AI models such as GPT-4, which powers ChatGPT, and Claude 3 Opus rely on the many trillions of words shared online to get smarter, but new projections suggest they will exhaust the supply of publicly available data sometime between 2026 and 2032.
This means that to build better models, tech companies will need to look elsewhere for data. Options include producing synthetic data, turning to lower-quality sources or, more worryingly, tapping into private data on servers that store messages and emails. The researchers published their findings June 4 on the preprint server arXiv.
"If chatbots consume all of the available data, and there are no further advances in data efficiency, I would expect to see a relative stagnation in the field," study first author Pablo Villalobos, a researcher at the research institute Epoch AI, told Live Science. "Models [will] only improve slowly over time as new algorithmic insights are discovered and new data is naturally produced."
Training data fuels AI systems' growth, enabling them to pick out increasingly complex patterns and encode them in their neural networks. For example, ChatGPT was trained on roughly 570 GB of text data, amounting to roughly 300 billion words, taken from books, online articles, Wikipedia and other online sources.
Algorithms trained on insufficient or low-quality data produce sketchy outputs. Google's Gemini AI, which infamously recommended that people add glue to their pizzas or eat rocks, sourced some of its answers from Reddit posts and articles from the satirical website The Onion.
To estimate how much text is available online, the researchers used Google's web index, calculating that there are about 250 billion web pages containing roughly 7,000 bytes of text per page. Then, they used follow-up analyses of internet protocol (IP) traffic (the flow of data across the web) and the activity of users online to project the growth of this available data stock.
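To give a sense of scale, the two figures quoted above can be multiplied to get a rough size for the indexed text stock. The short Python sketch below only illustrates that arithmetic and is not the researchers' own method. The function and variable names are hypothetical, and the conversion assumes decimal petabytes (10^15 bytes).

```python
# Figures quoted in the article (both are approximations).
WEB_PAGES = 250_000_000_000   # about 250 billion indexed web pages
BYTES_PER_PAGE = 7_000        # about 7,000 bytes of text per page

BYTES_PER_PETABYTE = 10**15   # assumption: decimal (SI) petabytes


def estimate_text_stock_bytes(pages: int, bytes_per_page: int) -> int:
    """Return the total text size in bytes as pages times bytes per page."""
    return pages * bytes_per_page


total_bytes = estimate_text_stock_bytes(WEB_PAGES, BYTES_PER_PAGE)
total_petabytes = total_bytes / BYTES_PER_PETABYTE

print(f"Estimated indexed text: {total_petabytes:.2f} PB")
```

Under these assumptions the estimate comes to about 1.75 petabytes of text, which is the starting stock that the study's growth projections build on.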