
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Abstract: The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exp...
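The core contrast in the abstract is between full attention, whose cost grows quadratically with sequence length, and sub-quadratic alternatives such as linear attention. A minimal illustrative sketch of that gap follows; it is not code from the Megalodon paper, and the kernel function and shapes are assumptions chosen for demonstration only.

```python
# Illustrative sketch (not from the paper): full attention materializes an
# n x n score matrix (quadratic in sequence length n), while kernelized
# "linear attention" reorders the matrix products to stay linear in n.
import numpy as np

def full_attention(Q, K, V):
    # Builds the (n, n) attention matrix explicitly -> O(n^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                       # (n, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Kernel trick: compute phi(Q) @ (phi(K)^T V), so the (n, n) matrix is
    # never formed -> O(n) in sequence length. phi here is an assumed
    # positive feature map, not the one used by any specific method.
    Qf, Kf = phi(Q), phi(K)                                  # (n, d)
    KV = Kf.T @ V                                            # (d, d)
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T                 # (n, 1) normalizer
    return (Qf @ KV) / Z                                     # (n, d)

# Toy usage: same input shapes, same output shape, very different scaling.
n, d = 2048, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

As the abstract notes, such sub-quadratic reformulations have tended to trade away pretraining efficiency and downstream accuracy relative to Transformers, which is the gap Megalodon targets.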

Read more at arxiv.org
