ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs
Abstract: The linear growth of key-value (KV) cache memory and quadratic computational complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often suffer from irreversible information loss or require costly parameter retraining. We propose ZeroMerge, a dynamic zero-shot compression framework that achieves e...
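To make the trade-off concrete, here is a toy sketch of feature-merging KV cache compression under a fixed memory budget. This is a hypothetical illustration of the general idea the abstract contrasts with pruning (merging evicted slots instead of discarding them), not the ZeroMerge algorithm itself; the function name, scoring input, and merging rule are all assumptions.

```python
import numpy as np

def compress_kv_cache(K, V, scores, budget):
    """Toy zero-shot KV cache compression by feature merging.

    Keeps the `budget` highest-scoring token slots and merges all
    remaining keys/values into one score-weighted average slot, so
    evicted tokens are summarized rather than irreversibly dropped.
    Hypothetical sketch only -- not the paper's actual method.

    K, V   : (seq_len, head_dim) arrays of cached keys/values
    scores : (seq_len,) non-negative importance scores (e.g. from
             accumulated attention weights)
    budget : number of full-fidelity slots to retain
    """
    order = np.argsort(scores)[::-1]          # descending importance
    keep, drop = order[:budget], order[budget:]
    if drop.size == 0:
        return K[keep], V[keep]
    w = scores[drop] / scores[drop].sum()      # merge weights
    k_merged = (w[:, None] * K[drop]).sum(axis=0, keepdims=True)
    v_merged = (w[:, None] * V[drop]).sum(axis=0, keepdims=True)
    return (np.concatenate([K[keep], k_merged]),
            np.concatenate([V[keep], v_merged]))
```

With a budget of `b`, the compressed cache holds `b + 1` slots regardless of sequence length, turning linear memory growth into a constant; the extra slot retains a lossy summary of the evicted context.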