TransMLA: New Method Converts GQA-Based LLMs to MLA, Boosting Efficiency and Expressiveness Without Increasing Cache Size

TransMLA: Multi-Head Latent Attention Is All You Need

Abstract: Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA e...
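
To make the caching idea in the abstract concrete, the following is a minimal sketch of low-rank KV compression: a hidden state is projected down to a small latent vector, only that latent is cached, and keys and values are rebuilt from it with up-projections. All names (`matvec`, `W_DOWN`, `W_UP_K`, `W_UP_V`) and all sizes are hypothetical and chosen for illustration; none of them come from the paper, and details such as positional encodings are left out.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy weights, assumed for illustration: hidden size 4, latent size 2.
W_DOWN = [[1, 0, 0, 0],
          [0, 1, 0, 0]]                   # hidden (4) -> latent (2)
W_UP_K = [[1, 0], [0, 1], [1, 1], [1, -1]]  # latent (2) -> key (4)
W_UP_V = [[2, 0], [0, 2], [1, 0], [0, 1]]   # latent (2) -> value (4)

hidden = [1, 2, 3, 4]

# Only this latent vector would be stored in the cache for the token.
latent = matvec(W_DOWN, hidden)

# Keys and values are reconstructed from the cached latent when needed.
key = matvec(W_UP_K, latent)
value = matvec(W_UP_V, latent)


def full_kv_floats_per_token(n_kv_heads, head_dim):
    """Numbers cached per token per layer when full keys and values are stored."""
    return 2 * n_kv_heads * head_dim


# Hypothetical model sizes, not taken from the paper.
N_HEADS, HEAD_DIM, LATENT_DIM = 32, 128, 512
full_cache = full_kv_floats_per_token(N_HEADS, HEAD_DIM)
reduction = full_cache / LATENT_DIM
```

Under these assumed sizes, caching full keys and values for every head costs 2 × 32 × 128 numbers per token per layer, while the latent cache costs only `LATENT_DIM`. The actual savings depend on the latent dimension a given model uses.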