News Score: Score the News, Sort the News, Rewrite the Headlines

Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

TL;DR: We’re releasing Tokasaurus, a new LLM inference engine optimized for throughput-intensive workloads. With small models, Tokasaurus benefits from very low CPU overhead and dynamic Hydragen grouping to exploit shared prefixes. For larger models, Tokasaurus supports async tensor parallelism for GPUs with NVLink, and a fast implementation of pipeline parallelism for GPUs without it. On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by over 3x.
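To make the shared-prefix idea concrete: when many requests start with the same tokens (e.g. a common system prompt), Hydragen-style batching lets the engine compute attention over that prefix once per group instead of once per request. The sketch below is a minimal, hypothetical illustration of the grouping step only; it is not Tokasaurus code, and the function name and fixed `prefix_len` parameter are assumptions for the example.

```python
from collections import defaultdict

def group_by_shared_prefix(requests, prefix_len):
    """Group request token lists by their first `prefix_len` tokens.

    Hypothetical sketch: requests in the same group share a prefix, so
    prefix attention (and its KV cache) could be computed once per group.
    """
    groups = defaultdict(list)
    for tokens in requests:
        key = tuple(tokens[:prefix_len])
        groups[key].append(tokens)
    return dict(groups)

# Example: three requests, two share the same 3-token system prompt.
reqs = [[1, 2, 3, 7], [1, 2, 3, 9], [4, 5, 6, 9]]
grouped = group_by_shared_prefix(reqs, prefix_len=3)
# Two groups: work on the (1, 2, 3) prefix is shared by two requests.
```

In a real engine the grouping is dynamic (prefix lengths vary and groups change as requests arrive and complete), but the batching payoff is the same: shared work is amortized across the group.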

Read more at scalingintelligence.stanford.edu
