Are LLMs not getting better?
I was reading the METR article on how LLM code passes tests much more often
than it is of mergeable quality. They look at the performance of LLMs doing
programming when the success criterion is “passes all tests” and compare it to
when the success criterion is “would get approved by the maintainer”.
Unsurprisingly, LLM performance is much worse under the more stringent success
criterion. Their 50 % success horizon moves from 50 minutes down to 8 minutes.
As part of this they have included figures...
Read more at entropicthoughts.com