Training Language Models via Neural Cellular Automata
Research Paper Blog
What if the path to smarter language models doesn't require more text — but synthetic data from abstract dynamical systems?
Paper
Code
01 — The Problem
We're running out of text
Large language models are hungry. They require exponentially more data to keep improving, and the supply of high-quality natural-language data is projected to run out by 2028. Worse, internet text carries human biases and entangles knowledge with reasoning, making it hard to control what models actually learn.
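To make the "synthetic data from abstract dynamical systems" idea concrete, here is a toy sketch of my own (an elementary cellular automaton, not the paper's neural CA): a random seed state is deterministically unrolled by a fixed update rule, and the rollout is flattened into a long, structured sequence that could serve as synthetic pretraining "text".

```python
import random

def step(state, rule=110):
    """One update of an elementary cellular automaton (wrap-around boundary)."""
    n = len(state)
    out = []
    for i in range(n):
        # Left, center, right cells form a 3-bit index into the rule table.
        idx = (state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n]
        out.append((rule >> idx) & 1)
    return out

def sample_sequence(width=16, steps=8, rule=110, seed=0):
    """Flatten a CA rollout into a single 0/1 token sequence."""
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(width)]
    rows = [state]
    for _ in range(steps):
        state = step(state, rule)
        rows.append(state)
    return "".join(str(b) for row in rows for b in row)

print(sample_sequence())  # 144 structured binary "tokens" from one seed
```

The appeal of such generators is control: the rule fixes the structure of the data exactly, with no human bias and no entangled world knowledge.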
This rais...
Read more at hanseungwook.github.io