Attention Wasn't All We Needed
Many modern transformer techniques have been developed since the original Attention Is All You Need paper. Let's look at some of the most important ones and try to implement their basic ideas as succinctly as possible. We'll use the PyTorch framework for most of the examples.
Group Query Attention
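Grouped-query attention lets several query heads share a single key/value head, which shrinks the KV cache with little quality loss. A minimal sketch (not the article's exact code), assuming `n_heads` is a multiple of `n_kv_heads`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Grouped-query attention: n_heads query heads share n_kv_heads K/V heads."""
    def __init__(self, d_model, n_heads, n_kv_heads):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each K/V head so it is shared by a group of query heads.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```

With `n_kv_heads=1` this degenerates to multi-query attention; with `n_kv_heads=n_heads` it is ordinary multi-head attention.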
Multi-head Latent Attention
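Multi-head latent attention (introduced in DeepSeek-V2) compresses keys and values into a small shared latent vector, so the KV cache stores the latent instead of full per-head keys and values. A deliberately simplified sketch, assuming a latent width `d_latent` (a name introduced here for illustration) and omitting DeepSeek's query compression and decoupled rotary keys:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    """Simplified MLA: keys and values are reconstructed from a small shared
    latent, so only the latent needs to be cached at inference time.
    Omits the query compression and decoupled RoPE path of the full design."""
    def __init__(self, d_model, n_heads, d_latent):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # latent -> keys
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # latent -> values
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_down(x)  # (B, T, d_latent) -- this is what would be cached
        k = self.k_up(latent).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```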
Flash Attention
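FlashAttention is an IO-aware exact-attention kernel: it tiles over blocks of keys and values and keeps running softmax statistics, so the full T×T score matrix is never materialized in slow memory. The real thing is a fused CUDA kernel (reachable in PyTorch through `F.scaled_dot_product_attention`); the sketch below only illustrates the online-softmax recurrence in plain PyTorch:

```python
import torch

def flash_attention_reference(q, k, v, block_size=64):
    """Tile over key/value blocks, keeping a running max and running softmax
    denominator per query row. A readability-first sketch of the idea,
    not the fused kernel, and without a causal mask."""
    B, H, T, D = q.shape
    scale = D ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((B, H, T, 1), float("-inf"), device=q.device)
    row_sum = torch.zeros((B, H, T, 1), device=q.device)
    for start in range(0, T, block_size):
        kb = k[:, :, start:start + block_size]
        vb = v[:, :, start:start + block_size]
        scores = (q @ kb.transpose(-2, -1)) * scale          # (B, H, T, block)
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        # Rescale the previous accumulators to the new running max.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

# Sanity check against the naive computation.
q, k, v = (torch.randn(1, 2, 128, 16) for _ in range(3))
expected = torch.softmax((q @ k.transpose(-2, -1)) * 16 ** -0.5, dim=-1) @ v
assert torch.allclose(flash_attention_reference(q, k, v), expected, atol=1e-5)
```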
Ring Attention
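Ring attention scales the same trick across devices: the sequence is sharded, each device keeps its own query block, and key/value blocks rotate around a ring of devices while every device folds each incoming block into the online-softmax accumulator above. A single-process simulation of the pattern (no causal mask, no overlap of compute and communication):

```python
import torch

def ring_attention_simulated(q_shards, k_shards, v_shards):
    """Single-process simulation of ring attention: each 'device' owns one query
    shard and receives key/value shards one at a time, as if passed around a
    ring, so no device ever needs the whole sequence at once."""
    n_dev = len(q_shards)
    outputs = []
    for dev in range(n_dev):
        q = q_shards[dev]
        D = q.shape[-1]
        out = torch.zeros_like(q)
        row_max = torch.full((*q.shape[:-1], 1), float("-inf"), device=q.device)
        row_sum = torch.zeros((*q.shape[:-1], 1), device=q.device)
        for step in range(n_dev):
            # In a real ring, this shard would arrive from the neighbouring device.
            src = (dev + step) % n_dev
            kb, vb = k_shards[src], v_shards[src]
            scores = (q @ kb.transpose(-2, -1)) * D ** -0.5
            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)
            p = torch.exp(scores - new_max)
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            out = out * correction + p @ vb
            row_max = new_max
        outputs.append(out / row_sum)
    return torch.cat(outputs, dim=-2)

# e.g. 4 "devices", each holding a 32-token shard of a 128-token sequence
q, k, v = (torch.randn(1, 2, 128, 16) for _ in range(3))
out = ring_attention_simulated(q.chunk(4, dim=-2), k.chunk(4, dim=-2), v.chunk(4, dim=-2))
```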
Pre-normalization
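Pre-normalization moves the normalization from after each residual sublayer (as in the original transformer) to before it, which makes deep stacks much easier to train. A minimal Pre-LN block built from standard PyTorch modules:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-LN transformer block: normalize *before* each sublayer, so the
    residual stream itself is never passed through a norm."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```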
RMSNorm
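RMSNorm drops LayerNorm's mean-centering and bias and rescales only by the root-mean-square of the activations, which is a little cheaper. A minimal version (recent PyTorch releases also ship a built-in `nn.RMSNorm`):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the last dimension;
    unlike LayerNorm there is no mean subtraction and no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```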
SwiGLU
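SwiGLU replaces the plain two-layer ReLU MLP with a gated unit: one projection is passed through SiLU and used to gate a second projection before the down-projection. A minimal sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit in place of the
    original transformer's ReLU MLP."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```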
Rotary Positional Embedding
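Rotary positional embeddings encode position by rotating pairs of query/key channels through position-dependent angles, so the attention score between two tokens depends only on their relative offset. A minimal sketch using the rotate-half pairing convention (implementations differ in how they pair channels):

```python
import torch

def rope(x, base=10000.0):
    """Apply rotary embeddings to x of shape (..., seq_len, head_dim):
    each channel pair (x1, x2) is rotated by an angle that grows with
    position and shrinks with channel index."""
    *_, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]   # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys before attention (head_dim must be even):
q = torch.randn(1, 8, 16, 64)   # (batch, heads, seq, head_dim)
q_rot = rope(q)
```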
Mixture of Experts
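A mixture-of-experts layer replaces the dense feed-forward block with many expert MLPs plus a router that sends each token to only its top-k experts, so parameter count grows without a matching increase in per-token compute. A sketch that keeps the routing math clear by running every expert densely (real implementations dispatch tokens so only the selected experts run):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE: a linear router picks top_k experts per token and the output
    is the weighted sum of those experts' outputs. Dense-compute sketch."""
    def __init__(self, d_model, d_hidden, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (batch, seq, d_model)
        logits = self.router(x)                         # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Per-token weight for expert e (zero wherever it was not selected).
            w = (weights * (idx == e)).sum(dim=-1, keepdim=True)
            out = out + w * expert(x)
        return out
```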
Learning...
Read more at stephendiehl.com