Writing an LLM from scratch, part 13 -- the 'why' of attention, or: attention heads are dumb
Now that I've finished Chapter 3 of
Sebastian Raschka's book
"Build a Large Language Model (from Scratch)" --
having worked my way through multi-head attention in the last post --
I thought it would be worth pausing to take stock before moving on to Chapter 4.
There are two things I want to cover: the "why" of self-attention, and some thoughts
on context lengths. This post is on the "why" -- that is, why do the particular
set of matrix multiplications described in t...
Read more at gilesthomas.com
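For context, here is a minimal sketch of the matrix multiplications the post is referring to -- single-head causal self-attention in PyTorch, with made-up names and dimensions. This is an illustrative sketch of the standard mechanism, not the book's or the post's exact code.

```python
import torch

torch.manual_seed(0)

d_in, d_out, seq_len = 8, 4, 6
x = torch.randn(seq_len, d_in)   # one sequence of token embeddings (illustrative sizes)

# Trainable projection matrices for queries, keys, and values
W_q = torch.randn(d_in, d_out)
W_k = torch.randn(d_in, d_out)
W_v = torch.randn(d_in, d_out)

queries = x @ W_q                # (seq_len, d_out)
keys    = x @ W_k
values  = x @ W_v

scores = queries @ keys.T        # attention scores, (seq_len, seq_len)

# Causal mask: each token may only attend to itself and earlier tokens
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

# Scale by sqrt(d_out) and normalise each row into attention weights
weights = torch.softmax(scores / d_out ** 0.5, dim=-1)

context = weights @ values       # (seq_len, d_out) context vectors
print(context.shape)
```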