Writing an LLM from scratch, part 13 -- the 'why' of attention, or: attention heads are dumb
Now that I've finished Chapter 3 of
Sebastian Raschka's book
"Build a Large Language Model (from Scratch)" --
having worked my way through multi-head attention in the last post --
I thought it would be worth pausing to take stock before moving on to Chapter 4.
There are two things I want to cover: the "why" of self-attention, and some thoughts
on context lengths. This post is on the "why" -- that is, why do the particular
set of matrix multiplications described in t...
Read more at gilesthomas.com
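For context, here is a minimal sketch of the matrix multiplications the post is referring to -- single-head causal self-attention in PyTorch, with made-up names and dimensions. This is an illustrative sketch of the standard mechanism, not the book's or the post's exact code.

```python
import torch

torch.manual_seed(0)

d_in, d_out, seq_len = 8, 4, 6
x = torch.randn(seq_len, d_in)   # one sequence of token embeddings (illustrative sizes)

# Trainable projection matrices for queries, keys, and values
W_q = torch.randn(d_in, d_out)
W_k = torch.randn(d_in, d_out)
W_v = torch.randn(d_in, d_out)

queries = x @ W_q                # (seq_len, d_out)
keys    = x @ W_k
values  = x @ W_v

scores = queries @ keys.T        # attention scores, (seq_len, seq_len)

# Causal mask: each token may only attend to itself and earlier tokens
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

# Scale by sqrt(d_out) and normalise each row into attention weights
weights = torch.softmax(scores / d_out ** 0.5, dim=-1)

context = weights @ values       # (seq_len, d_out) context vectors
print(context.shape)
```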