Steps must be taken to distinguish AI-generated content from human-generated content, scientists say
Using the output of AI systems to train subsequent generations of AI models could result in “irreversible defects” and junk content, according to a new, yet-to-be-peer-reviewed study.
AI models like ChatGPT are trained using vast amounts of data pulled from across internet platforms, which have remained mostly human-generated until now.
But data generated by such models has a growing presence on the internet.
New generations of AIs
Researchers, including those from the University of Oxford in the UK, attempted to understand what happens when several successive generations of AIs are trained off each other.
They found that the widespread use of large language models (LLMs) to publish content on the internet on a large scale “will pollute the collection of data to train them” and lead to “model collapse”.
“We discover that learning from data produced by other models causes model collapse – a degenerative process whereby, over time, models forget the true underlying data distribution,” scientists wrote in the study, posted as a preprint on arXiv.
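The degenerative process described in that quote can be illustrated with a toy simulation. The sketch below is not the study's experimental setup: it assumes the simplest possible "model" (a single Gaussian fitted to its training data), and the function name `simulate_collapse` and its parameters are hypothetical choices made for illustration. Each generation is fitted only to samples drawn from the previous generation's fit.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Hypothetical toy model: returns a list of (mean, std) pairs,
    one per generation, starting with the original distribution.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for human-generated data
    history = [(mean, std)]
    for _ in range(generations):
        # Each new "model" sees only data generated by its predecessor.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append((mean, std))
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 200):
        mean, std = history[generation]
        print(f"generation {generation:3d}: mean={mean:+.3f} std={std:.3f}")
```

Under these assumptions the fitted standard deviation typically drifts toward zero as generations pass, because each finite sample loses a little of the spread of the one before it. In this toy setting, that loss of spread is what it looks like for a model to "forget" the tails of the original distribution.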
The new findings suggest there is a “first mover advantage” when it comes to training LLMs.
Scientists liken this change to what happens when AI models are trained on music created by human composers and played by human musicians. The subsequent AI output then trains other models, leading to a diminishing quality of music.
With subsequent generations of AI models likely to encounter poorer-quality data at their source, they may start misinterpreting information and inserting false information, a process scientists call “data poisoning”.
They warned that the scale at which data poisoning can happen has changed drastically with the advent of LLMs.
Just a few iterations of training on generated data can lead to major degradation, even when the original data is preserved, scientists said.
Over time, this could cause mistakes to compound, forcing models that learn from generated data to misunderstand reality.
“This in turn causes the model to misperceive the underlying learning task,” researchers said.
Scientists cautioned that steps must be taken to label AI-generated content so it can be distinguished from human-generated content, along with efforts to preserve original human-made data for future AI training.
“To make sure that learning is sustained over a long time period, one needs to make sure that access to the original data source is preserved and that additional data not generated by LLMs remain available over time,” they wrote in the study.
“Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that was crawled from the Internet prior to the mass adoption of the technology, or direct access to data generated by humans at scale.”