From 16-Bit to 1-Bit: Visual KV Cache Quantization for Memory-Efficient Multimodal Large Language Models
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success across various applications, yet their computational overhead during deployment remains a critical challenge. While Key-Value (KV) caching improves inference efficiency by trading memory for computation, the growing memory footprint from storing extensive KV caches reduces throughput and limits long-term execution on devices with constrained GPU memory. Existing approaches primarily fo...
Read more at arxiv.org
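The title's "16-bit to 1-bit" refers to compressing visual KV cache entries from fp16 down to a single bit per element. As a rough illustration only, and not the paper's actual method, the sketch below shows a generic BitNet-style binarization: each cached key/value vector is reduced to its sign bits plus one fp16 scale per token. The function names and the per-token scaling choice are assumptions made for exposition.

```python
# Minimal sketch of 1-bit KV cache quantization (sign bits + per-token scale).
# A generic binarization scheme for illustration, not the paper's method.
import torch

def quantize_kv_1bit(kv: torch.Tensor):
    """kv: [batch, heads, seq_len, head_dim] in fp16.
    Returns sign bits (bool) and a per-token fp16 scale."""
    scale = kv.abs().mean(dim=-1, keepdim=True)  # one scale per token vector
    signs = kv >= 0  # 1 bit of information per element (stored as bool here)
    return signs, scale

def dequantize_kv_1bit(signs: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reconstruct an approximation for attention: +scale where sign is set,
    # -scale otherwise (scale broadcasts over the head dimension).
    return torch.where(signs, scale, -scale)

# Usage: quantize a (hypothetical) visual slice of the cache, dequantize at attention time.
kv = torch.randn(1, 8, 256, 64, dtype=torch.float16)
signs, scale = quantize_kv_1bit(kv)
kv_hat = dequantize_kv_1bit(signs, scale)
```

Note that the sign tensor is kept as a bool array for clarity; realizing the ~16x memory reduction over fp16 in practice would require bit-packing the signs (eight elements per byte).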