# Understanding Self-Attention: A Hands-On Exploration
## Quick Take
While working through “Build a Large Language Model (From Scratch)” by Sebastian Raschka, I struggled to understand the self-attention mechanism beyond just copying code. By collaborating with Claude Code to explore three key improvements—meaningful values, weight visualizations, and step-by-step computation—I gained a satisfactory understanding in under an hour.
## Introduction
I’m currently working my way through Build a Large Language Model (From Scratch) by Sebastian Raschka, and yesterday I was focused on understanding and implementing the self-attention mechanism. While the book does a great job of tackling a huge project, it can’t go deep into every single concept. I could have just copied and experimented with the code in the book, but I didn’t feel like I actually understood what the attention mechanism was or how it worked, even at a high level.
I worked with Claude Code, a markdown version of the chapter, and my initial code to explore a few ideas. I could have worked through them myself given enough time, but this way I reached a satisfactory understanding in under an hour.
## The Exploration Process
First, I asked for ideas: “let’s think about how @book/ch3/self_attention.py could be updated to better demonstrate self-attention.” It came up with ten ideas, three of which I asked it to implement:
- Use more meaningful values
- Add visualizations for the weights
- Step-by-step attention computation
### Use More Meaningful Values
The original code used random values for the word embeddings, which is standard when initializing a model for training but not very helpful for illustrating what is actually happening.
```python
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     (x^1)
     [0.55, 0.87, 0.66],  # journey  (x^2)
     [0.57, 0.85, 0.64],  # starts   (x^3)
     [0.22, 0.58, 0.33],  # with     (x^4)
     [0.77, 0.25, 0.10],  # one      (x^5)
     [0.05, 0.80, 0.55]]  # step     (x^6)
)
```
Once the code used more descriptive embeddings, it became much clearer how to think about what was happening.
```python
# Create more meaningful embeddings where similar concepts have similar vectors
# Dimension 0: entity/concept (high for nouns/pronouns)
# Dimension 1: action/movement (high for verbs/action nouns)
# Dimension 2: grammatical/functional (high for function words)
inputs = torch.tensor(
    [
        [0.9, 0.1, 0.2],  # Your    - pronoun (entity)
        [0.8, 0.7, 0.1],  # journey - noun (entity + action)
        [0.2, 0.9, 0.2],  # starts  - verb (action)
        [0.1, 0.1, 0.9],  # with    - preposition (functional)
        [0.7, 0.2, 0.3],  # one     - number (entity-like)
        [0.6, 0.5, 0.2],  # step    - noun (entity + some action)
    ]
)
```
### Add Visualizations for the Weights
The visualizations were an interesting addition. While not critical to my understanding in this instance, they provided a helpful at-a-glance view of the attention weights.
```
Attention Weight Visualization:
(Higher weights shown with more filled blocks)

              Your   journey   starts     with      one     step
----------------------------------------------------------------
Your           ▓▓▓       ▓▓▓        ░        ░       ▒▒       ▒▒
journey         ▒▒       ▓▓▓       ▒▒        ·       ▒▒       ▒▒
starts           ░       ▓▓▓      ▓▓▓        ░        ░       ▒▒
with             ░         ░        ░      ▓▓▓       ▒▒        ░
one             ▒▒        ▒▒        ░        ░       ▒▒       ▒▒
step            ▒▒       ▓▓▓       ▒▒        ░       ▒▒       ▒▒

Legend: · (0-0.1)  ░ (0.1-0.15)  ▒▒ (0.15-0.2)  ▓▓▓ (0.2-0.3)  ████ (>0.3)
```
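The exact rendering code isn't important, but the idea is simple enough to sketch: bucket each attention weight into one of the legend's block characters and print the matrix row by row. This is a rough reconstruction, not the code Claude generated, and it reuses `attn_weights` from the earlier sketch:

```python
# Rough sketch of the rendering idea, not the exact generated code:
# bucket each attention weight into a block character matching the legend above.
def block(w: float) -> str:
    if w > 0.3:
        return "████"
    if w > 0.2:
        return "▓▓▓"
    if w > 0.15:
        return "▒▒"
    if w > 0.1:
        return "░"
    return "·"

tokens = ["Your", "journey", "starts", "with", "one", "step"]
print(" " * 10 + "".join(f"{t:>9}" for t in tokens))  # header row
for token, row in zip(tokens, attn_weights):          # attn_weights from the earlier sketch
    print(f"{token:<10}" + "".join(f"{block(w.item()):>9}" for w in row))
```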
### Step-by-Step Attention Computation
Going over the step-by-step computation, I was starting to get a better feel for everything, but I couldn't grasp why most words paid the most attention to themselves while “one” and “step” did not; they paid more attention to “journey” instead.
```
=== Analyzing Attention Patterns ===

Highest attention weights for each token:
  Your pays most attention to itself (weight=0.2108)
  journey pays most attention to itself (weight=0.2349)
  starts pays most attention to itself (weight=0.2279)
  with pays most attention to itself (weight=0.2550)
  one pays most attention to 'journey' (weight=0.1948)
      (self-attention weight: 0.1746)
  step pays most attention to 'journey' (weight=0.2109)
      (self-attention weight: 0.1726)

Why do 'one' and 'step' attend more to 'journey' than themselves?
Let's examine the self-similarity scores (diagonal of attention score matrix):
  Your self-similarity: 0.8600
  journey self-similarity: 1.1400
  starts self-similarity: 0.8900
  with self-similarity: 0.8300
  one self-similarity: 0.6200
  step self-similarity: 0.6500

Key insights:
- 'journey' has the highest self-similarity (1.14) due to strong entity+action values
- 'one' (0.62) and 'step' (0.65) have weaker self-similarity
- After softmax normalization, 'journey' becomes an attention magnet
- This demonstrates how attention helps tokens 'borrow' context from semantically rich words
```
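Those numbers check out directly against the embeddings: “one” overlaps more with “journey” (dot product 0.73) than with itself (0.62), and softmax preserves that ordering, so “journey” ends up with the larger share of the attention paid by “one”. A quick verification, reusing `inputs` and `attn_weights` from the earlier sketch:

```python
one, journey = inputs[4], inputs[1]    # embedding rows for "one" and "journey"

print(torch.dot(one, one).item())      # 0.62 -> "one"'s similarity to itself
print(torch.dot(one, journey).item())  # 0.73 -> higher than its self-similarity

# Softmax keeps the ordering of the raw scores, so "journey" wins:
print(attn_weights[4, 1].item())       # ~0.1948  ("one" -> "journey")
print(attn_weights[4, 4].item())       # ~0.1746  ("one" -> itself)
```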
## Key Insights
Working through these improvements with Claude Code helped me understand several critical aspects of self-attention:
- Meaningful embeddings matter: Using semantically meaningful values instead of random numbers made the attention patterns much clearer and more intuitive.
- Attention as context borrowing: The most enlightening moment was discovering why “one” and “step” attended more to “journey” than to themselves. This demonstrated how attention allows tokens to “borrow” context from other semantically rich words in the sequence.
- Self-similarity drives attention: The self-similarity scores (the diagonal of the attention score matrix) revealed why certain words become “attention magnets” after softmax normalization.
## Conclusion
This exploration reinforced my belief in the value of hands-on learning with AI assistance. What could have taken hours of solo experimentation was condensed into a productive hour of guided exploration. The combination of having a knowledgeable coding partner and the ability to quickly test ideas allowed me to move past surface-level understanding to genuine comprehension of how self-attention works.
The code from this exploration is available at https://github.com/jjshanks/llm_from_scratch/blob/main/book/ch3/simple_self_attention.py.