Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data
Introduction
Scaling has been a key factor driving progress in AI. Models are growing in parameter count and being trained on increasingly large datasets, leading to exponential growth in training compute and dramatic increases in performance. For example, five years and four orders of magnitude of compute separate the barely coherent GPT-2 from the powerful GPT-4.
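To make that gap concrete, here is a back-of-the-envelope sketch (not a figure from the text) of the annual growth rate it implies, assuming compute grew smoothly and exponentially over the period:

```python
# Back-of-the-envelope: annual growth in training compute implied by
# a gap of four orders of magnitude spanned over five years,
# assuming smooth exponential growth (an illustrative simplification).
orders_of_magnitude = 4   # stated compute gap between GPT-2 and GPT-4
years = 5                 # stated time span

annual_growth_factor = 10 ** (orders_of_magnitude / years)
print(f"Implied compute growth: ~{annual_growth_factor:.1f}x per year")
# Output: Implied compute growth: ~6.3x per year
```

In other words, sustaining this trend means multiplying training compute by roughly a factor of six every year.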
So far, AI developers have not faced major limits to scaling beyond simply procuring AI chips, which are scarce but rapidly growing in supply.