How we run GPT OSS 120B at 500+ tokens per second on NVIDIA GPUs
The day an open-source model like OpenAI’s new gpt-oss-120b is released, we race to make the model as performant as possible for our customers. As a launch partner for OpenAI’s first open-source LLM since 2019, we wanted to give developers a great experience with the new LLMs. By the end of launch day, we were the clear leader running on NVIDIA GPUs for both latency and throughput, per public data from real-world use on OpenRouter. What matters is having the inference optimization muscle to immedi...
Read more at baseten.co