Training artificial intelligence (AI) models on data generated by other AI models can lead to a serious problem known as "model collapse." The issue arises when AI models, particularly large language models (LLMs) such as GPT-3 and GPT-4, are trained on text produced by earlier models rather than on human-generated content.
Model collapse occurs because, over successive rounds of such training, models gradually lose the ability to represent the full diversity of the original data: rare or unusual information disappears first, while errors and misunderstandings introduced by earlier models are repeated and amplified. This is especially concerning as LLMs contribute an ever-growing share of the text on the internet, creating a feedback loop in which future models are trained on increasingly flawed data.
Researchers from Oxford, Cambridge, and other institutions demonstrated this by training language models over several generations on AI-generated data. They observed that with each new generation, the models became less accurate and more prone to mistakes. The degradation continued even when some human-written data was included in the training set, indicating that the problem is hard to avoid once synthetic data enters the training mix.
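To build some intuition for how this generational degradation unfolds, consider a toy simulation. This is a deliberately simplified sketch, not the researchers' actual language-model experiments: each "generation" fits a simple distribution to the previous generation's output and then produces the training data for the next one. Over many generations, the fitted spread tends to drift and rare tail values tend to vanish first.

```python
import numpy as np

rng = np.random.default_rng(0)

GENERATIONS = 30
N = 200  # samples per "generation" of training data

# Generation 0: "human" data from a wide distribution with real tails.
data = rng.normal(loc=0.0, scale=1.0, size=N)

for gen in range(1, GENERATIONS + 1):
    # "Train" a model on the current data: here, just fit a mean and std.
    mu, sigma = data.mean(), data.std()
    # The next generation sees only samples drawn from that fitted model,
    # so estimation error compounds and rare (tail) values are lost first.
    data = rng.normal(loc=mu, scale=sigma, size=N)
    rare = np.mean(np.abs(data) > 2.0)  # fraction of "rare" events surviving
    if gen % 5 == 0:
        print(f"gen {gen:2d}: fitted std={sigma:.3f}, tail mass={rare:.3f}")
```

Even in this tiny example, no new information enters the loop after generation zero, which is the same reason the researchers stress keeping genuinely human-produced data in the training pipeline.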
The findings emphasize the importance of using real human-generated data to train AI models. Without this, future models risk becoming less useful and more biased, especially when it comes to rare or unique information. To combat this, the researchers suggest developing better methods to track and verify the sources of data used in AI training, ensuring human-produced content remains a key component.
Understanding this risk is important for the future development of AI, helping to ensure that models remain reliable and beneficial as they become more deeply integrated into everyday life.