Stanford Researchers Launch Tokasaurus: New LLM Engine Boosts Throughput 3x for High-Volume AI Tasks

Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

TL;DR: We’re releasing Tokasaurus, a new LLM inference engine optimized for throughput-intensive workloads. With small models, Tokasaurus benefits from very low CPU overhead and from dynamic Hydragen grouping, which exploits prefixes shared across sequences. For larger models, Tokasaurus supports async tensor parallelism for GPUs with NVLink and a fast implementation of pipeline parallelism for GPUs without it. On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by 3x or more in the best cases.
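To make the shared-prefix idea concrete, here is a minimal sketch of how a batch of token sequences might be grouped by a common prefix, so that work on that prefix could be done once per group rather than once per sequence. This is a toy illustration only, not Tokasaurus's or Hydragen's actual grouping algorithm; the function names and the `min_prefix_len` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def longest_common_prefix(seqs: List[List[int]]) -> List[int]:
    """Return the longest token prefix shared by every sequence."""
    prefix: List[int] = []
    # zip stops at the shortest sequence, so the prefix never overruns one.
    for tokens in zip(*seqs):
        if all(t == tokens[0] for t in tokens):
            prefix.append(tokens[0])
        else:
            break
    return prefix


def group_by_shared_prefix(
    seqs: List[List[int]], min_prefix_len: int = 2
) -> List[Tuple[List[int], List[int]]]:
    """Bucket sequences by their first `min_prefix_len` tokens (hypothetical
    heuristic), then report each bucket's full shared prefix and members."""
    buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for idx, seq in enumerate(seqs):
        buckets[tuple(seq[:min_prefix_len])].append(idx)

    groups = []
    for key, members in buckets.items():
        # Skip singletons and sequences too short to form a full key.
        if len(members) < 2 or len(key) < min_prefix_len:
            continue
        shared = longest_common_prefix([seqs[i] for i in members])
        groups.append((shared, members))
    return groups


batch = [[1, 2, 3, 4], [1, 2, 3, 9], [7, 8, 9], [1, 2, 5]]
groups = group_by_shared_prefix(batch, min_prefix_len=2)
```

In this sketch, sequences 0, 1, and 3 land in one group with shared prefix `[1, 2]`, while sequence 2 stays ungrouped. A real engine would also need to weigh whether a group is large enough, and its prefix long enough, for the grouping to pay off.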
