News Score: Score the News, Sort the News, Rewrite the Headlines

Tensor Product Attention Is All You Need

Abstract: Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank compone...
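Based on the abstract alone, the core idea is to store small low-rank factors per token instead of full key/value tensors. The sketch below is a minimal illustration of that memory trade-off, not the paper's actual method: the rank `R`, head count `h`, head dimension `d`, and the random factors standing in for learned contextual projections are all assumptions.

```python
import numpy as np

# Hedged sketch: rank-R tensor-product factorization of per-token keys,
# illustrating the KV-cache saving described in the abstract.
# All shapes/names here are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
h, d, R, seq = 8, 64, 2, 16   # heads, head dim, factorization rank, tokens

# Per-token contextual factors (in TPA these would come from learned
# linear maps of the token's hidden state; random here for illustration).
A = rng.standard_normal((seq, R, h))   # head-axis factors
B = rng.standard_normal((seq, R, d))   # feature-axis factors

# Reconstruct full keys from the factors: K_t = (1/R) * sum_r a_r (outer) b_r
K = np.einsum('trh,trd->thd', A, B) / R        # shape (seq, h, d)

# Cache comparison: the factored cache stores R*(h+d) floats per token
# instead of h*d, a win whenever R*(h+d) << h*d.
full_cache = seq * h * d          # 16 * 8 * 64 = 8192 floats
factored_cache = seq * R * (h + d)  # 16 * 2 * 72 = 2304 floats
print(K.shape, full_cache, factored_cache)
```

Values are only stored once as factors and materialized on the fly at attention time, which is where the inference-memory saving would come from under this reading.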

Read more at arxiv.org
