LLM Architectures Evolve: DeepSeek-V3 Uses Multi-Head Latent Attention and Mixture-of-Experts in 2024-2025

The Big LLM Architecture Comparison

It has been seven years since the original GPT architecture was developed. At first glance, looking back at GPT-2 (2019) and forward to DeepSeek-V3 and Llama 4 (2024-2025), one might be surprised at how structurally similar these models still are. Sure, positional embeddings have evolved from absolute to rotary (RoPE), Multi-Head Attention has largely given way to Grouped-Query Attention, and the more efficient SwiGLU has replaced activation functions like GELU. But beneath these minor refine...

Read more at magazine.sebastianraschka.com
