BASED: Simple linear attention language models balance the recall-throughput tradeoff
Introducing Based, a simple, efficient architecture that combines two familiar primitives – sliding window attention and linear attention – to offer high-quality language modeling with strong associative recall capabilities! At inference time, Based decodes without a KV-cache, enabling a 24x throughput improvement over Transformers with FlashAttention-2!

Overview

In an ICLR paper (and blogpost) we posted towards the end of last year, we shared the finding that many efficient architectures (e.g. Mamba) trade off recall quality for efficiency: the smaller a model's recurrent state, the higher its inference throughput, but the worse its ability to recall information seen earlier in the context.
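To make the no-KV-cache claim concrete, here is a minimal sketch of how linear attention decodes: instead of appending keys and values to a cache that grows with sequence length, it folds each new token into a constant-size recurrent state. All names here are illustrative assumptions, and the simple feature map used below is a stand-in – Based's actual feature map is a Taylor approximation of the softmax exponential, paired with a small sliding window of exact attention.

```python
import torch

def linear_attention_decode_step(state, normalizer, q, k, v, phi):
    """One decoding step of linear attention with a constant-size state.

    state:      (d_feat, d_v) running sum of phi(k_t) v_t^T -- replaces the KV-cache
    normalizer: (d_feat,)     running sum of phi(k_t)
    q, k, v:    (d,)          current token's query, key, and value
    phi:        feature map approximating the softmax exponential
    """
    fk = phi(k)
    state = state + torch.outer(fk, v)             # S_t = S_{t-1} + phi(k_t) v_t^T
    normalizer = normalizer + fk                   # z_t = z_{t-1} + phi(k_t)
    fq = phi(q)
    out = (fq @ state) / (fq @ normalizer + 1e-6)  # y_t = phi(q_t)^T S_t / phi(q_t)^T z_t
    return out, state, normalizer

# Illustrative usage with elu(x)+1 as a stand-in feature map:
d = 8
phi = lambda x: torch.nn.functional.elu(x) + 1
state, normalizer = torch.zeros(d, d), torch.zeros(d)
for _ in range(16):
    q, k, v = torch.randn(3, d)
    y, state, normalizer = linear_attention_decode_step(state, normalizer, q, k, v, phi)
```

Note that `state` and `normalizer` have the same shape at every step: memory stays constant as the sequence grows, which is what removes the KV-cache bottleneck during generation.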
Read more at together.ai