News Score: Score the News, Sort the News, Rewrite the Headlines

Open weight LLMs exhibit inconsistent performance across providers

15th August 2025

Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model, OpenAI’s gpt-oss-120b, performs across different hosted providers. The results showed some surprising differences. Here’s the one with the greatest variance, a run of the 2025 AIME (American Invitational Mathematics Examination) averaging 32 runs against each model, using gpt-oss-120b with a reasoning effort of “high”:

These are some varied results!

93.3%: Cerebras, Nebius B...

Read more at simonwillison.net
