BASED: Simple linear attention language models balance the recall-throughput tradeoff
Introducing Based, a simple, efficient architecture that combines two familiar primitives – sliding window attention and linear attention – to offer high-quality language modeling with strong associative recall capabilities! At inference time, Based decodes without a KV-cache, enabling a 24x throughput improvement over Transformers with FlashAttention-2!

Overview

In an ICLR paper (and blogpost) we posted towards the end of last year, we shared the finding that many efficient architectures (e.g. Mamba) trade off recall quality for efficiency: the smaller a model's recurrent state, the higher its inference throughput, but the worse its ability to recall information seen earlier in the context.
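To make the no-KV-cache claim concrete, here is a minimal sketch of how linear attention decodes: instead of appending keys and values to a cache that grows with sequence length, it folds each new token into a constant-size recurrent state. All names here are illustrative assumptions, and the simple feature map used below is a stand-in – Based's actual feature map is a Taylor approximation of the softmax exponential, paired with a small sliding window of exact attention.

```python
import torch

def linear_attention_decode_step(state, normalizer, q, k, v, phi):
    """One decoding step of linear attention with a constant-size state.

    state:      (d_feat, d_v) running sum of phi(k_t) v_t^T -- replaces the KV-cache
    normalizer: (d_feat,)     running sum of phi(k_t)
    q, k, v:    (d,)          current token's query, key, and value
    phi:        feature map approximating the softmax exponential
    """
    fk = phi(k)
    state = state + torch.outer(fk, v)             # S_t = S_{t-1} + phi(k_t) v_t^T
    normalizer = normalizer + fk                   # z_t = z_{t-1} + phi(k_t)
    fq = phi(q)
    out = (fq @ state) / (fq @ normalizer + 1e-6)  # y_t = phi(q_t)^T S_t / phi(q_t)^T z_t
    return out, state, normalizer

# Illustrative usage with elu(x)+1 as a stand-in feature map:
d = 8
phi = lambda x: torch.nn.functional.elu(x) + 1
state, normalizer = torch.zeros(d, d), torch.zeros(d)
for _ in range(16):
    q, k, v = torch.randn(3, d)
    y, state, normalizer = linear_attention_decode_step(state, normalizer, q, k, v, phi)
```

Note that `state` and `normalizer` have the same shape at every step: memory stays constant as the sequence grows, which is what removes the KV-cache bottleneck during generation.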
Read more at together.ai