Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Abstract: Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the transformer architecture consumes a significant amount of memory, especially when the number of layers is large for deep language models. In this paper, we propose a novel method that only computes and caches the KVs of a small number of layers, thus significantly saving memory consumption and improving inference throughput.
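To make the memory argument concrete, here is a back-of-the-envelope sketch of how KV-cache size scales with the number of cached layers. The model configuration is a hypothetical 7B-class setup chosen for illustration, not taken from the paper, and the 2-layer condensed cache is likewise an assumed setting:

```python
def kv_cache_bytes(num_layers, seq_len, num_heads, head_dim,
                   batch_size=1, bytes_per_param=2):
    # The cache stores one key and one value vector per token,
    # per attention head, per cached layer (factor of 2 = K and V).
    return (2 * num_layers * seq_len * num_heads * head_dim
            * batch_size * bytes_per_param)

# Hypothetical 7B-class configuration, fp16 (2 bytes per parameter).
full = kv_cache_bytes(num_layers=32, seq_len=4096, num_heads=32, head_dim=128)
condensed = kv_cache_bytes(num_layers=2, seq_len=4096, num_heads=32, head_dim=128)

print(f"full cache (32 layers):     {full / 2**30:.3f} GiB")   # 2.000 GiB
print(f"condensed cache (2 layers): {condensed / 2**30:.3f} GiB")  # 0.125 GiB
```

Because cache size is linear in the number of cached layers, caching the KVs of 2 layers instead of 32 cuts this term of the memory budget by 16x in the sketch above; the freed memory can then go toward larger batch sizes and thus higher throughput.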

Read more at arxiv.org
