How we run GPT OSS 120B at 500+ tokens per second on NVIDIA GPUs
The day an open-source model like OpenAI’s new gpt-oss-120b is released, we race to make the model as performant as possible for our customers. As a launch partner for OpenAI’s first open-source LLM since 2019, we wanted to give developers a great experience with the new LLMs. By the end of launch day, we were the clear leader running on NVIDIA GPUs for both latency and throughput, per public data from real-world use on OpenRouter. What matters is having the inference optimization muscle to immedi...
Read more at baseten.co