Open weight LLMs exhibit inconsistent performance across providers
15th August 2025
Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model—OpenAI’s gpt-oss-120b—performs across different hosted providers.
The results showed some surprising differences. Here’s the benchmark with the greatest variance: the 2025 AIME (American Invitational Mathematics Examination), with scores averaged across 32 runs against each provider, using gpt-oss-120b with a reasoning effort of “high” (a sketch of what such a run might look like follows the results):
These are some varied results!
93.3%: Cerebras, Nebius B...
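Artificial Analysis haven’t published their harness here, but to make the methodology concrete, this is a minimal Python sketch of that kind of repeated-run measurement against a single OpenAI-compatible endpoint. The `base_url`, the model identifier, the naive answer check and the assumption that the provider honors a `reasoning_effort` parameter are all illustrative guesses, not details taken from the benchmark:

```python
# A minimal sketch (not Artificial Analysis's actual harness) of averaging
# a benchmark score over repeated runs against one hosted provider.
# Assumes an OpenAI-compatible endpoint and that the provider accepts the
# reasoning_effort parameter - check your provider's docs for both.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.com/v1",  # hypothetical provider URL
    api_key=os.environ["PROVIDER_API_KEY"],
)

def run_once(question: str) -> str:
    """Ask the model one AIME-style question, return its raw answer text."""
    response = client.chat.completions.create(
        model="openai/gpt-oss-120b",  # model identifier varies by provider
        reasoning_effort="high",      # assumption: provider honors this flag
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def average_score(questions: list[str], answers: list[str], runs: int = 32) -> float:
    """Average the fraction of correct answers over `runs` full passes."""
    scores = []
    for _ in range(runs):
        correct = sum(
            1 for q, a in zip(questions, answers)
            if a in run_once(q)  # naive check; real harnesses parse answers properly
        )
        scores.append(correct / len(questions))
    return sum(scores) / len(scores)

# Example usage with a single toy question/answer pair:
# print(average_score(["What is 2 + 2?"], ["4"], runs=2))
```

Pointing the same script at different `base_url` values is the basic shape of the comparison: same model weights, same questions, same settings, different hosting stacks.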