Writing an LLM from scratch, part 13 -- the 'why' of attention, or: attention heads are dumb

Posted on May 11, 2025 · Last update 3 days ago

So, that is (right now) my understanding of how scaled dot product attention works. We're just doing simple pattern matching: each token's input embedding is projected by the query weights into a (learned) embedding space in a way that represents what it is "looking for", in some sense. It's also projected by the key weights into the same space, but this time in a way that makes it point to what it "is", in that same sense. Then the dot product matches those up, so that we can associate input embeddings with each other and work out our attention scores.
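To make that a bit more concrete, here's a minimal sketch of just the query/key side of that in PyTorch. The dimensions, the random weights and the variable names are all placeholder assumptions for illustration, not anything from a real model -- the point is only to show the two projections into the shared space and the dot product that matches them up.

```python
import torch

torch.manual_seed(0)

# Toy sizes -- hypothetical values, chosen just for illustration.
d_in, d_k = 8, 4      # input embedding size, query/key space size
seq_len = 5           # number of tokens in the sequence

# Learned projection matrices (random stand-ins here).
W_query = torch.rand(d_in, d_k)
W_key = torch.rand(d_in, d_k)

# One sequence of input embeddings.
x = torch.rand(seq_len, d_in)

# Project each token's embedding into the shared space twice:
# the query says what the token is "looking for", the key says what it "is".
queries = x @ W_query      # shape (seq_len, d_k)
keys = x @ W_key           # shape (seq_len, d_k)

# Dot products match queries against keys: a high value at [i, j] means
# token i is "looking for" something that token j "is".
scores = queries @ keys.T  # shape (seq_len, seq_len)

# Scale by sqrt(d_k) and softmax each row to get attention weights.
attn_weights = torch.softmax(scores / d_k ** 0.5, dim=-1)
print(attn_weights)
```

Each row of `attn_weights` sums to one, so for each token it's a set of proportions saying how much of every other token's information to blend in at the next step.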
