New Tensor Product Attention Model Outperforms Transformers, Enabling Longer Sequences with Less Memory

Tensor Product Attention Is All You Need

Abstract: Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank compone...
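
To make the memory argument in the abstract concrete, here is a minimal NumPy sketch of one way a per-token key could be stored as low-rank factors instead of a full heads-by-head-dimension matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names, the projection matrices `W_a` and `W_b`, the averaging over rank-1 terms, and all sizes are hypothetical choices made for this example.

```python
import numpy as np

def factorize_token(x, W_a, W_b, rank, n_heads, head_dim):
    """Project one token into `rank` pairs of factors (hypothetical layout).

    a has shape (rank, n_heads), b has shape (rank, head_dim).
    Only a and b would need to be cached for this token.
    """
    a = (x @ W_a).reshape(rank, n_heads)
    b = (x @ W_b).reshape(rank, head_dim)
    return a, b

def reconstruct(a, b):
    """Rebuild the (n_heads, head_dim) matrix as an average of outer products."""
    rank = a.shape[0]
    return a.T @ b / rank

# Illustrative sizes only; none of these numbers come from the paper.
rng = np.random.default_rng(0)
d_model, n_heads, head_dim, rank = 64, 8, 16, 2

W_a = rng.standard_normal((d_model, rank * n_heads))   # assumed learned projection
W_b = rng.standard_normal((d_model, rank * head_dim))  # assumed learned projection
x = rng.standard_normal(d_model)                       # hidden state of one token

a, b = factorize_token(x, W_a, W_b, rank, n_heads, head_dim)
k_full = reconstruct(a, b)

# Cache cost per token for this one tensor (keys), in stored floats.
full_floats = n_heads * head_dim               # store the full matrix
factored_floats = rank * (n_heads + head_dim)  # store only the factors

print(k_full.shape, full_floats, factored_floats)
```

Because the factors depend on the token's hidden state `x`, they are contextual rather than fixed weights. With these toy sizes the factored form stores 48 floats per token instead of 128; the saving grows as the number of heads and the head dimension grow relative to the rank.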