UC Berkeley's vLLM: New Open-Source Library Boosts LLM Serving Throughput up to 24x with PagedAttention Algorithm

vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

LLMs promise to fundamentally change how we use AI across all industries. However, serving these models is challenging and can be surprisingly slow, even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM uses PagedAttention, our new attention algorithm that effectively manages attention keys and values. Equipped with PagedAttention, vLLM sets a new state of the art in LLM...