Study Reveals LLM Performance Degrades with Longer Inputs, Despite High Scores on Simple Retrieval Tasks

Context Rot: How Increasing Input Tokens Impacts LLM Performance

Recent developments in LLMs show a trend toward longer context windows, with the input token count of the latest models reaching the millions. Because these models achieve near-perfect scores on widely adopted benchmarks like Needle in a Haystack (NIAH) [1], it’s often assumed that their performance is uniform across long-context tasks. However, NIAH is fundamentally a simple retrieval task, in which a known sentence (the “needle”) is placed in a long document of unrelated text (the “haystack”), ...
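To make the NIAH setup concrete, here is a minimal sketch of how such a prompt could be assembled and scored. The function names (`build_niah_prompt`, `needle_retrieved`), the `depth` parameter, the prompt template, and the substring-based scoring are illustrative assumptions, not the benchmark's actual implementation.

```python
def build_niah_prompt(haystack_sentences, needle, depth, question):
    """Place `needle` at a relative `depth` in the haystack (0.0 = start, 1.0 = end).

    Hypothetical helper: the prompt template is an assumption for illustration.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Convert the relative depth into a sentence index (truncates toward zero).
    position = int(depth * len(haystack_sentences))
    sentences = haystack_sentences[:position] + [needle] + haystack_sentences[position:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def needle_retrieved(model_answer, expected):
    """Lenient scoring (an assumption): case-insensitive substring match."""
    return expected.lower() in model_answer.lower()


if __name__ == "__main__":
    # Toy haystack of unrelated filler text; a real haystack would be far longer.
    haystack = [
        "The harbor was quiet that morning.",
        "Trains ran late across the valley.",
        "A new bakery opened downtown.",
        "Rain was expected by the weekend.",
    ]
    needle = "The secret passphrase is 'blue heron'."
    prompt = build_niah_prompt(haystack, needle, 0.5, "What is the secret passphrase?")
    print(prompt)
    # A model's reply would be scored like this:
    print(needle_retrieved("The passphrase is Blue Heron.", "blue heron"))
```

Sweeping `depth` and the haystack length over a grid gives one retrieval score per combination. Because the needle is known in advance and unrelated to the surrounding text, a high score here measures direct lookup only.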