
How has DeepSeek improved the Transformer architecture?

DeepSeek has recently released DeepSeek v3, which is currently state-of-the-art in benchmark performance among open-weight models, alongside a technical report describing the training of the model in some detail. Impressively, they’ve achieved this SOTA performance using only about 2.8 million H800 GPU-hours of training hardware time—equivalent to roughly 4e24 FLOP if we assume 40% MFU. This is about ten times less training compute than the similarly performing Llama 3.1 405B. In this issue, I’ll cover s...
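As a rough sanity check on that figure, here is a minimal back-of-the-envelope sketch of how GPU-hours, an assumed per-GPU peak throughput, and MFU combine into total training FLOP. The 2.8 million H800 hours and 40% MFU come from the text above; the peak throughput value (~989 TFLOP/s dense BF16, matching the H100's tensor-core rate) is an assumption, not a number from the report.

```python
# Back-of-the-envelope training-compute estimate for DeepSeek v3.
# GPU-hours and MFU are taken from the text; the per-GPU peak throughput
# is an assumed value (~989 TFLOP/s dense BF16 for an H800).

GPU_HOURS = 2.8e6        # H800 hours reported for the training run
PEAK_FLOPS = 989e12      # assumed per-GPU peak, FLOP per second
MFU = 0.40               # assumed model FLOP utilization

SECONDS_PER_HOUR = 3600
total_flop = GPU_HOURS * SECONDS_PER_HOUR * PEAK_FLOPS * MFU

print(f"Estimated training compute: {total_flop:.1e} FLOP")
# -> about 4.0e24 FLOP, consistent with the figure quoted above
```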

Read more at epoch.ai
