Tokasaurus: An LLM Inference Engine for High-Throughput Workloads
TL;DR
We’re releasing Tokasaurus, a new LLM inference engine optimized for throughput-intensive workloads. With small models, Tokasaurus benefits from very low CPU overhead and dynamic Hydragen grouping to exploit shared prefixes. For larger models, Tokasaurus supports async tensor parallelism for GPUs with NVLink and a fast implementation of pipeline parallelism for GPUs without it. On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by more than 3x in some cases.
Table of Contents
Intro
Op...
Read more at scalingintelligence.stanford.edu