News Score: Score the News, Sort the News, Rewrite the Headlines

BASED: Simple linear attention language models balance the recall-throughput tradeoff

Introducing Based, a simple, efficient architecture that combines two familiar primitives – sliding window attention and linear attention – to offer high-quality language modeling with strong associative recall capabilities. At inference time, Based decodes without a KV-cache, enabling a 24x throughput improvement over Transformers with Flash-Attention 2.

Overview

In an ICLR paper (and blog post) we posted toward the end of last year, we shared the finding that many efficient architectures (e.g. Mam...
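The KV-cache-free decoding mentioned above comes from the recurrent form of linear attention: instead of storing every past key and value, the model folds them into a fixed-size state. Below is a minimal sketch of that recurrence, assuming a toy polynomial feature map `phi` (Based itself uses a Taylor approximation of the softmax exponential; the function names here are illustrative, not the library's API):

```python
import numpy as np

def phi(x):
    # Toy positive-ish feature map for illustration only; Based uses a
    # 2nd-order Taylor expansion of exp(q.k), which is more involved.
    return np.concatenate([np.ones(1), x, 0.5 * x**2])

def linear_attention_decode(qs, ks, vs):
    """Causal decoding with a constant-size recurrent state.

    The state (S, z) replaces the growing KV-cache: memory is O(1) in
    sequence length, which is what enables the throughput gains.
    """
    d_feat = phi(ks[0]).shape[0]
    d_v = vs.shape[1]
    S = np.zeros((d_feat, d_v))   # running sum of phi(k) v^T
    z = np.zeros(d_feat)          # running sum of phi(k), for normalization
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        S += np.outer(fk, v)      # absorb this step's key/value into the state
        z += fk
        fq = phi(q)
        outs.append(fq @ S / (fq @ z + 1e-6))
    return np.stack(outs)
```

Each decode step touches only the fixed-size `(S, z)` state, so the cost per token does not grow with context length, unlike a Transformer's KV-cache lookup.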

Read more at together.ai
