LLMs Double Capabilities Every 7 Months; AI Could Complete Month-Long Human Tasks by 2030, METR Study Finds

Large Language Model Performance Raises Stakes

Benchmarking large language models presents some unusual challenges. For one, the main purpose of many LLMs is to provide compelling text that’s indistinguishable from human writing. And success in that task may not correlate with metrics traditionally used to judge processor performance, such as instruction execution rate.

But there are solid reasons to persevere in attempting to gauge the performance of LLMs. Otherwise, it’s im...