Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
View PDF
HTML (experimental)
Abstract:The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exp...
Read more at arxiv.org