Understanding Self-Attention: A Hands-On Exploration

Quick Take

While working through “Build a Large Language Model (From Scratch)” by Sebastian Raschka, I struggled to understand the self-attention mechanism beyond just copying code. By collaborating with Claude Code to explore three key improvements—meaningful values, weight visualizations, and step-by-step computation—I gained a satisfactory understanding in under an hour.

Introduction

I’m currently working my way through Build a Large Language Model (From Scratch) by Sebastian Raschka, and yesterday I was focused on understanding and implementing the self-attention mechanism. While the book does a great job of tackling a huge project, it can’t go deep into every single concept. I could have just copied and experimented with the code in the book, but I didn’t feel like I actually understood what the attention mechanism was or how it worked, even at a high level.

I worked with Claude Code, a markdown version of the chapter, and my initial code to explore a few ideas that I could have worked through on my own given enough time, but that together got me to a satisfactory understanding in under an hour.

The Exploration Process

First, I asked for ideas: “let’s think about how @book/ch3/self_attention.py could be updated to better demonstrate self-attention.” It came up with ten ideas, three of which I asked it to implement:

  1. Use more meaningful values
  2. Add visualizations for the weights
  3. Step-by-step attention computation

Use More Meaningful Values

The original code used random values for word embeddings, which is standard for initializing a model for training but not very helpful for illustrating what is actually happening.

import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

After switching to these more descriptive embeddings, it became much easier to reason about what the attention mechanism was actually doing.

# Create more meaningful embeddings where similar concepts have similar vectors
# Dimension 0: entity/concept (high for nouns/pronouns)
# Dimension 1: action/movement (high for verbs/action nouns)
# Dimension 2: grammatical/functional (high for function words)
inputs = torch.tensor(
    [
        [0.9, 0.1, 0.2],  # Your     - pronoun (entity)
        [0.8, 0.7, 0.1],  # journey  - noun (entity + action)
        [0.2, 0.9, 0.2],  # starts   - verb (action)
        [0.1, 0.1, 0.9],  # with     - preposition (functional)
        [0.7, 0.2, 0.3],  # one      - number (entity-like)
        [0.6, 0.5, 0.2],  # step     - noun (entity + some action)
    ]  
)
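
With the inputs tensor above in place, the rest of the simplified self-attention computation is only a few lines: dot-product scores, a softmax over each row, and context vectors built as weighted sums of the inputs. The sketch below is my own condensed version of what the chapter's code does, with variable names of my choosing.

import torch

# Pairwise dot products: how similar each token's embedding is to every other token's
attn_scores = inputs @ inputs.T           # shape (6, 6)

# Normalize each row with softmax so the weights for a given token sum to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# Each context vector is a weighted average of every input embedding
context_vectors = attn_weights @ inputs   # shape (6, 3)

Because the embeddings now encode rough entity/action/function roles, the score matrix is interpretable: tokens with overlapping roles (like “journey” and “starts”) produce larger dot products, which the softmax turns into the weights visualized in the next section.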

Add Visualizations for the Weights

The visualizations were an interesting addition. While not critical to my understanding in this instance, they gave a quick visual read on the attention weights.

Attention Weight Visualization:
(Higher weights shown with more filled blocks)

         Your     journey  starts   with     one      step
----------------------------------------------------------------
Your     ▓▓▓      ▓▓▓      ░        ░        ▒▒       ▒▒
journey  ▒▒       ▓▓▓      ▒▒       ·        ▒▒       ▒▒
starts   ░        ▓▓▓      ▓▓▓      ░        ░        ▒▒
with     ░        ░        ░        ▓▓▓      ▒▒       ░
one      ▒▒       ▒▒       ░        ░        ▒▒       ▒▒
step     ▒▒       ▓▓▓      ▒▒       ░        ▒▒       ▒▒

Legend: · (0-0.1) ░ (0.1-0.15) ▒▒ (0.15-0.2) ▓▓▓ (0.2-0.3) ████ (>0.3)
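
A chart like this only takes a small helper. The version below is my reconstruction from the legend (the cutoff values and the blocks function are assumptions, not necessarily what Claude Code generated), and it reuses the attn_weights tensor from the sketch above.

tokens = ["Your", "journey", "starts", "with", "one", "step"]

def blocks(weight):
    # Bucket a weight into the legend's symbols (cutoffs taken from the legend above)
    if weight < 0.10: return "·"
    if weight < 0.15: return "░"
    if weight < 0.20: return "▒▒"
    if weight < 0.30: return "▓▓▓"
    return "████"

print("         " + "".join(f"{t:<9}" for t in tokens))
print("-" * 64)
for token, row in zip(tokens, attn_weights):
    print(f"{token:<9}" + "".join(f"{blocks(w.item()):<9}" for w in row))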

Step-by-Step Attention Computation

Going over the step-by-step computation, I started to get a better feel for everything, but I couldn't grasp why most words paid the most attention to themselves while “one” and “step” did not; they paid more attention to “journey.”

=== Analyzing Attention Patterns ===

Highest attention weights for each token:
  Your     pays most attention to itself (weight=0.2108)
  journey  pays most attention to itself (weight=0.2349)
  starts   pays most attention to itself (weight=0.2279)
  with     pays most attention to itself (weight=0.2550)
  one      pays most attention to 'journey' (weight=0.1948)
           (self-attention weight: 0.1746)
  step     pays most attention to 'journey' (weight=0.2109)
           (self-attention weight: 0.1726)

Why do 'one' and 'step' attend more to 'journey' than themselves?
Let's examine the self-similarity scores (diagonal of attention score matrix):
  Your     self-similarity: 0.8600
  journey  self-similarity: 1.1400
  starts   self-similarity: 0.8900
  with     self-similarity: 0.8300
  one      self-similarity: 0.6200
  step     self-similarity: 0.6500

Key insights:
- 'journey' has the highest self-similarity (1.14) due to strong entity+action values
- 'one' (0.62) and 'step' (0.65) have weaker self-similarity
- After softmax normalization, 'journey' becomes an attention magnet
- This demonstrates how attention helps tokens 'borrow' context from semantically rich words
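
Those numbers are easy to sanity-check. The sketch below (again reusing the inputs and tokens defined earlier; the printout format is my own) reproduces both the self-similarity diagonal and the per-token winner:

attn_scores = inputs @ inputs.T
attn_weights = torch.softmax(attn_scores, dim=-1)

# Diagonal = each token's dot product with itself (0.86, 1.14, 0.89, 0.83, 0.62, 0.65)
print(torch.diag(attn_scores))

# Which token each row attends to most after softmax
for token, row in zip(tokens, attn_weights):
    top = row.argmax().item()
    print(f"{token:<8} -> {tokens[top]} (weight={row[top].item():.4f})")

# e.g. 'step' scores 0.6*0.8 + 0.5*0.7 + 0.2*0.1 = 0.85 against 'journey'
# but only 0.65 against itself, so softmax gives 'journey' the larger weight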

Key Insights

Working through these improvements with Claude Code helped me understand several critical aspects of self-attention:

  1. Meaningful embeddings matter: Using semantically meaningful values instead of random numbers made the attention patterns much clearer and more intuitive.

  2. Attention as context borrowing: The most enlightening moment was discovering why “one” and “step” attended more to “journey” than themselves. This demonstrated how attention allows tokens to “borrow” context from other semantically rich words in the sequence.

  3. Self-similarity drives attention: The self-similarity scores (diagonal of the attention matrix) revealed why certain words become “attention magnets” after softmax normalization.

Conclusion

This exploration reinforced my belief in the value of hands-on learning with AI assistance. What could have taken hours of solo experimentation was condensed into a productive hour of guided exploration. The combination of having a knowledgeable coding partner and the ability to quickly test ideas allowed me to move past surface-level understanding to genuine comprehension of how self-attention works.

The code from this exploration is available at https://github.com/jjshanks/llm_from_scratch/blob/main/book/ch3/simple_self_attention.py.