LLM coding abilities stagnate for over a year as merge rates show no improvement since early 2025, despite better test-passing performance

Are LLMs not getting better?

I was reading the METR article on how LLM-written code passes tests much more often than it is of mergeable quality. They look at the performance of LLMs on programming tasks when the success criterion is “passes all tests” and compare it to when the success criterion is “would get approved by the maintainer”. Unsurprisingly, LLM performance is much worse under the more stringent criterion: their 50% success horizon (the task length at which the model succeeds half the time) drops from 50 minutes to 8 minutes. As part of this they have included figures...