Open weight LLMs exhibit inconsistent performance across providers
15th August 2025
Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model—OpenAI’s gpt-oss-120b—performs across different hosted providers.
The results showed some surprising differences. Here’s the benchmark with the greatest variance: the 2025 AIME (American Invitational Mathematics Examination), with scores averaged across 32 runs against each provider, using gpt-oss-120b with a reasoning effort of “high” (a sketch of what such a run might look like follows the results):
These are some varied results!
93.3%: Cerebras, Nebius B...
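Artificial Analysis haven’t published their harness here, but to make the methodology concrete, this is a minimal Python sketch of that kind of repeated-run measurement against a single OpenAI-compatible endpoint. The `base_url`, the model identifier, the naive answer check and the assumption that the provider honors a `reasoning_effort` parameter are all illustrative guesses, not details taken from the benchmark:

```python
# A minimal sketch (not Artificial Analysis's actual harness) of averaging
# a benchmark score over repeated runs against one hosted provider.
# Assumes an OpenAI-compatible endpoint and that the provider accepts the
# reasoning_effort parameter - check your provider's docs for both.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.com/v1",  # hypothetical provider URL
    api_key=os.environ["PROVIDER_API_KEY"],
)

def run_once(question: str) -> str:
    """Ask the model one AIME-style question, return its raw answer text."""
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # model identifier varies by provider
        reasoning_effort="high",      # assumption: provider honors this flag
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def average_score(questions: list[str], answers: list[str], runs: int = 32) -> float:
    """Average the fraction of correct answers over `runs` full passes."""
    scores = []
    for _ in range(runs):
        correct = sum(
            1 for q, a in zip(questions, answers)
            if a in run_once(q)  # naive check; real harnesses parse answers properly
        )
        scores.append(correct / len(questions))
    return sum(scores) / len(scores)

# Example usage with a single toy question/answer pair:
# print(average_score(["What is 2 + 2?"], ["4"], runs=2))
```

Pointing the same script at different `base_url` values is the basic shape of the comparison: same model weights, same questions, same settings, different hosting stacks.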