ZeroMerge: Parameter-Free KV Cache Compression for Memory-Efficient Long-Context LLMs
Abstract: The linear growth of key-value (KV) cache memory and quadratic computational complexity pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often suffer from irreversible information loss or require costly parameter retraining. We propose ZeroMerge, a dynamic zero-shot compression framework that achieves e...
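To make the trade-off concrete, here is a toy sketch of feature-merging KV cache compression under a fixed memory budget. This is a hypothetical illustration of the general idea the abstract contrasts with pruning (merging evicted slots instead of discarding them), not the ZeroMerge algorithm itself; the function name, scoring input, and merging rule are all assumptions.

```python
import numpy as np

def compress_kv_cache(K, V, scores, budget):
    """Toy zero-shot KV cache compression by feature merging.

    Keeps the `budget` highest-scoring token slots and merges all
    remaining keys/values into one score-weighted average slot, so
    evicted tokens are summarized rather than irreversibly dropped.
    Hypothetical sketch only -- not the paper's actual method.

    K, V   : (seq_len, head_dim) arrays of cached keys/values
    scores : (seq_len,) non-negative importance scores (e.g. from
             accumulated attention weights)
    budget : number of full-fidelity slots to retain
    """
    order = np.argsort(scores)[::-1]          # descending importance
    keep, drop = order[:budget], order[budget:]
    if drop.size == 0:
        return K[keep], V[keep]
    w = scores[drop] / scores[drop].sum()      # merge weights
    k_merged = (w[:, None] * K[drop]).sum(axis=0, keepdims=True)
    v_merged = (w[:, None] * V[drop]).sum(axis=0, keepdims=True)
    return (np.concatenate([K[keep], k_merged]),
            np.concatenate([V[keep], v_merged]))
```

With a budget of `b`, the compressed cache holds `b + 1` slots regardless of sequence length, turning linear memory growth into a constant; the extra slot retains a lossy summary of the evicted context.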