Multi-Head Latent Attention and Other KV Cache Tricks
January 21, 2025

Overview:
Introduction: We'll explore how Key-Value (KV) caches make language models like ChatGPT faster at generating text by trading extra memory usage for reduced computation time.
MLA and Other Tricks: We'll then look at 11 recent research papers that build on this basic idea to make language models even more efficient.
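To make the memory-for-computation trade-off concrete, here is a minimal sketch of KV caching during autoregressive decoding. It is an illustrative toy (single head, NumPy, random weights named `Wq`/`Wk`/`Wv`), not code from the article: at each step only the new token is projected, and its key/value vectors are appended to a growing cache instead of re-projecting the whole sequence.

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # weighted sum of cached values

d = 8
rng = np.random.default_rng(0)
# Illustrative projection matrices (assumed, not from the article).
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(5):
    x = rng.standard_normal(d)              # embedding of the newest token
    q = Wq @ x
    # Project only the new token; reuse everything already cached.
    K_cache = np.vstack([K_cache, Wk @ x])
    V_cache = np.vstack([V_cache, Wv @ x])
    out = attention(q, K_cache, V_cache)

print(K_cache.shape)                        # cache grows by one row per step
```

The cache grows linearly with sequence length, which is exactly the memory cost that MLA and the other techniques below try to shrink.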
Understanding the Problem: Why Text Generation is Slow
Let's start with a simple analogy. Imagine you're writing a story, and for eac...
Read more at pyspur.dev