
How has DeepSeek improved the Transformer architecture?

DeepSeek has recently released DeepSeek v3, which is currently state-of-the-art in benchmark performance among open-weight models, alongside a technical report describing the training of the model in some detail. Impressively, they’ve achieved this SOTA performance using only about 2.8 million H800 GPU-hours of training hardware time—equivalent to roughly 4e24 FLOP if we assume 40% MFU. This is about ten times less training compute than the similarly performing Llama 3.1 405B. In this issue, I’ll cover s...
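As a rough sanity check on that figure, here is a minimal back-of-the-envelope sketch of how GPU-hours, an assumed per-GPU peak throughput, and MFU combine into total training FLOP. The 2.8 million H800 hours and 40% MFU come from the text above; the peak throughput value (~989 TFLOP/s dense BF16, matching the H100's tensor-core rate) is an assumption, not a number from the report.

```python
# Back-of-the-envelope training-compute estimate for DeepSeek v3.
# GPU-hours and MFU are taken from the text; the per-GPU peak throughput
# is an assumed value (~989 TFLOP/s dense BF16 for an H800).

GPU_HOURS = 2.8e6        # H800 hours reported for the training run
PEAK_FLOPS = 989e12      # assumed per-GPU peak, FLOP per second
MFU = 0.40               # assumed model FLOP utilization

SECONDS_PER_HOUR = 3600
total_flop = GPU_HOURS * SECONDS_PER_HOUR * PEAK_FLOPS * MFU

print(f"Estimated training compute: {total_flop:.1e} FLOP")
# -> about 4.0e24 FLOP, consistent with the figure quoted above
```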

Read more at epoch.ai
