LLM Deep Dive • Part I of III

Setting the Stage

Understanding why Transformers were invented by looking at what they replaced. From vanishing gradients to the attention revolution.

3 chapters · ~15 min read · 7 interactive visualizations
Chapter 1

The Problem That Needed Solving

I'd been implementing neural networks for years before the Transformer came along. I remember the excitement when LSTMs first clicked for me—finally, a way to model sequences that didn't immediately forget everything! But I also remember the frustration: training was slow, sequences longer than a few hundred tokens were problematic, and there was always this nagging sense that we were fighting the architecture rather than working with it.

Then in June 2017, a paper with possibly the most provocative title in machine learning history dropped: "Attention Is All You Need." The claim was audacious—throw out recurrence entirely and rely solely on attention mechanisms. I was skeptical. How could you model sequences without... sequence processing?

To understand why Transformers were such a breakthrough, we need to understand what they replaced and why that replacement was necessary.

The Sequential Processing Dilemma

Language has order. "Dog bites man" means something very different from "Man bites dog." Any architecture that processes language needs to respect this ordering. The natural approach, which dominated for years, was to process sequences one element at a time.

Recurrent Neural Networks (RNNs) embodied this intuition perfectly. At each timestep t, the network computes a hidden state h_t as a function of the previous hidden state h_{t-1} and the current input x_t:

h_t = f(h_{t-1}, x_t)

This is elegant. The hidden state acts as a "memory" that carries context forward through the sequence. Process "The cat sat on the" token by token, and by the time you reach "the," your hidden state hopefully encodes something useful about cats and sitting.

But there's a fundamental problem: sequential computation can't be parallelized. You can't compute h_{10} until you've computed h_9, which requires h_8, and so on. Your expensive GPU with thousands of cores? Most of them sit idle, waiting for the sequential chain to complete.
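To make the bottleneck concrete, here's a minimal NumPy sketch of a vanilla RNN forward pass (toy dimensions and function names are mine, not from any particular framework). The loop is the problem: each iteration needs the hidden state produced by the one before it.

  import numpy as np

  def rnn_forward(x, W_h, W_x, b):
      """Vanilla RNN: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b)."""
      h = np.zeros(W_h.shape[0])
      states = []
      for x_t in x:                      # strictly sequential: h_t needs h_{t-1}
          h = np.tanh(W_h @ h + W_x @ x_t + b)
          states.append(h)
      return np.stack(states)

  rng = np.random.default_rng(0)
  seq_len, d_in, d_h = 6, 4, 8
  x = rng.normal(size=(seq_len, d_in))
  W_h = 0.1 * rng.normal(size=(d_h, d_h))
  W_x = 0.1 * rng.normal(size=(d_h, d_in))
  states = rnn_forward(x, W_h, W_x, np.zeros(d_h))
  print(states.shape)  # (6, 8): one hidden state per timestep, computed one at a time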

The Vanishing Gradient Problem

Even worse than the parallelization issue is the vanishing gradient problem. When we train neural networks with backpropagation, gradients flow backward through the network. For RNNs, this means gradients must flow backward through time—through every single timestep.

Here's the mathematical reality: when you backpropagate through a recurrent connection, you multiply by the Jacobian matrix of the hidden state transition. If the eigenvalues of this matrix are less than 1, gradients shrink exponentially. If greater than 1, they explode exponentially.

In practice, this meant RNNs could only effectively learn dependencies spanning maybe 10-20 timesteps. Need to connect a pronoun to its antecedent 50 words back? Good luck.
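A quick back-of-the-envelope check (treating the Jacobian as a single scalar factor, which is exactly what the visualizer below does) shows how fast this bites:

  # Scalar stand-in for the recurrent Jacobian: gradient after t steps = factor ** t
  for factor in (0.9, 1.0, 1.1):
      print(factor, [round(factor ** t, 4) for t in (10, 20, 50)])
  # 0.9 -> [0.3487, 0.1216, 0.0052]   (vanishing)
  # 1.0 -> [1.0, 1.0, 1.0]            (stable)
  # 1.1 -> [2.5937, 6.7275, 117.3909] (exploding)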

Interactive: Vanishing Gradient Visualizer

Adjust the multiplicative factor to see how gradients behave as they propagate through time. Factor < 1 causes vanishing, > 1 causes explosion.

[Interactive widget: gradient magnitude at each timestep, from step 0 to step 19, for factors of 0.5 (vanishing), 1.0 (stable), and 1.5 (exploding). Gradients flow from the output back to the input, so earlier steps are harder to train.]

The math: gradient(t) = gradient(0) × factor^t. With a factor of 0.9, about 35% of the gradient survives after 10 steps and only about 12% after 20.

Why This Matters: When gradients vanish, early layers receive near-zero learning signals. The network can't learn long-range dependencies because information from distant timesteps can't influence weight updates. LSTMs mitigate this with additive cell state updates.

LSTMs and GRUs: A Partial Solution

Long Short-Term Memory networks were designed specifically to address vanishing gradients. The key innovation was the "cell state"—a highway that allows information to flow through time with minimal transformation. Instead of a single hidden state being multiplied at each step, LSTMs use gating mechanisms:

  • Forget gate: decides what to discard from the cell state
  • Input gate: decides what new information to store
  • Output gate: decides what to output based on the cell state

The cell state updates through addition rather than multiplication, which helps gradients flow more easily. LSTMs could handle dependencies of maybe 100-200 tokens—much better than vanilla RNNs, but still fundamentally limited.
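As a rough sketch of one LSTM step (the standard equations; the weight packing and names here are illustrative, not from a specific library), note that the cell state c is updated by elementwise addition, which is what lets gradients flow:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def lstm_step(x_t, h_prev, c_prev, W, U, b):
      """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,), packing the i, f, o, g blocks."""
      H = h_prev.shape[0]
      z = W @ x_t + U @ h_prev + b
      i = sigmoid(z[0*H:1*H])      # input gate: what new information to store
      f = sigmoid(z[1*H:2*H])      # forget gate: what to discard from the cell state
      o = sigmoid(z[2*H:3*H])      # output gate: what to expose as the hidden state
      g = np.tanh(z[3*H:4*H])      # candidate cell update
      c = f * c_prev + i * g       # additive update: the gradient "highway"
      h = o * np.tanh(c)
      return h, c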

And they were still sequential. Still couldn't parallelize. Still left most of your GPU idle.

The Attention Insight

The breakthrough came from an unexpected direction: machine translation. In 2014, Bahdanau and colleagues introduced attention for sequence-to-sequence models. The idea was simple but profound: instead of forcing all information through a fixed-size hidden state bottleneck, let the decoder "look back" at all encoder states and focus on the relevant ones.

This was attention as an add-on to RNNs. The Transformer paper asked: what if attention was the only mechanism? What if we threw out recurrence entirely?

Interactive: RNN vs Attention Path Length

Compare how information flows in RNNs (sequential, O(n) path length) versus attention (direct connections, O(1) path length). Adjust sequence length to see the difference.

[Interactive widget: two animations over an 8-token sequence (t0 through t7). In the RNN panel, information propagates sequentially, one hidden state per step, for an O(n) path length of 7 hops. In the attention panel, every position connects directly to every other, for an O(1) path length of 1, a 7× reduction in the number of steps information must traverse.]

Why Path Length Matters: In RNNs, information from position 0 must pass through 7 hidden states to reach position 7. Each step is a chance for information loss and gradient decay. Attention provides direct O(1) connections between any positions, enabling learning of long-range dependencies.

The visualization above captures the key insight. In an RNN, if position 1 needs to influence position 20, the signal must pass through 19 hidden states—19 chances for information to be corrupted or lost. With attention, position 1 can directly attend to position 20. The path length is constant, regardless of sequence length.

This isn't just theoretically elegant—it's practically transformative. Shorter paths mean better gradient flow. Direct connections mean the model can learn arbitrary dependencies without fighting the architecture.

And crucially: all attention computations for all positions can happen in parallel. Your GPU is finally earning its keep.

Chapter 2

"Attention Is All You Need" — The 2017 Revolution

Let's build up the Transformer architecture piece by piece. I find that the individual components make intuitive sense, but it took me a while to see how they fit together into something greater than the sum of its parts.

Self-Attention: The Core Innovation

Self-attention is the mechanism that lets each position in a sequence attend to all other positions. Consider the sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal or the street?

For humans, this is easy—"tired" makes it clear we're talking about the animal. For a model, this requires connecting "it" to "animal" across multiple intervening words. Self-attention provides exactly this capability: when processing "it," the model can directly attend to "animal" and learn to recognize this pattern.

Query, Key, Value: The Information Retrieval Analogy

The Query-Key-Value framework is perhaps the cleverest abstraction in the Transformer. Think of it like a database lookup:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What do I return if you match me?"

Each position generates all three vectors by projecting its embedding through learned weight matrices W_Q, W_K, and W_V. Then:

  1. The query from position i is compared against keys from all positions
  2. This comparison produces attention scores (how relevant is each position?)
  3. Scores are normalized via softmax to get attention weights
  4. Values are combined using these weights to produce the output

Mathematically, for an input X with n tokens and d dimensions:

Q = X · W_Q, K = X · W_K, V = X · W_V
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

That √d_k scaling factor isn't arbitrary. Without it, when d_k is large, the dot products grow large, pushing softmax into regions where it has extremely small gradients. If the components of the queries and keys have roughly unit variance, their dot product has variance d_k; dividing by √d_k brings that variance back to about 1, keeping softmax in a well-behaved regime.
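Here's a minimal NumPy sketch of that formula on a toy sequence (an illustration of the math, not an optimized implementation; the matrix shapes are arbitrary):

  import numpy as np

  def softmax(z, axis=-1):
      z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
      e = np.exp(z)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(X, W_q, W_k, W_v):
      """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
      Q, K, V = X @ W_q, X @ W_k, X @ W_v
      d_k = K.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)       # (n, n): how well each query matches each key
      weights = softmax(scores, axis=-1)    # each row sums to 1
      return weights @ V                    # weighted sum of values per position

  rng = np.random.default_rng(0)
  n, d = 4, 8
  X = rng.normal(size=(n, d))
  W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
  print(attention(X, W_q, W_k, W_v).shape)  # (4, 8)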

Interactive: Attention Score Calculator

Step through the attention computation on a small example. See how Q, K, V matrices are formed and how attention weights emerge.

[Interactive widget: steps through the attention computation for the four-token input "The cat sat down" with 4-dimensional embeddings, X ∈ ℝ^(4×4), from the Q, K, V projections through the softmax to the final weighted sum, following Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V.]

Multi-Head Attention: Multiple Perspectives

A single attention head learns one notion of "relevance"—maybe syntactic structure, maybe semantic similarity, maybe something else entirely. Multi-head attention runs multiple attention operations in parallel, each with its own learned projections.

In practice, heads specialize. Research has shown that different heads learn to track different types of relationships: some focus on adjacent tokens, some on syntactic dependencies, some on semantic similarity across long distances.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where head_i = Attention(Q · W_Q^i, K · W_K^i, V · W_V^i)
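A compact sketch of the same idea, reusing the attention function and toy inputs from the previous snippet (the head count and head dimension here are arbitrary):

  def multi_head_attention(X, heads, W_o):
      """heads: one (W_q, W_k, W_v) triple per head; W_o: output projection."""
      head_outputs = [attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
      return np.concatenate(head_outputs, axis=-1) @ W_o   # concat along features, then project

  h, d_head = 2, 4
  heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(h)]
  W_o = rng.normal(size=(h * d_head, d))
  print(multi_head_attention(X, heads, W_o).shape)  # (4, 8): same shape as the input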

Interactive: Multi-Head Attention Patterns

Explore how different attention heads specialize. Each heatmap shows which tokens attend to which other tokens for a given head.

[Interactive widget: attention heatmaps for the sentence "The cat sat on the mat because it was tired." The selected head (Head 3: Coreference) shows the pronoun "it" attending back to "cat". Rows are query tokens, columns are key tokens; darker cells mean higher attention.]

Head Specialization: Different attention heads learn different patterns. Some track local context, others handle coreference ("it" → "cat"), syntax, or global relationships. This diversity is why multi-head attention outperforms single-head.

Positional Encoding: Where Am I?

Here's a subtle but critical point: attention is permutation equivariant. If you shuffle the input tokens, attention doesn't care—it produces shuffled outputs. But "Dog bites man" ≠ "Man bites dog"!

Positional encodings inject position information into the model. The original Transformer used sinusoidal functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Why sinusoids? They have a beautiful property: the encoding for position pos+k can be expressed as a linear function of the encoding for position pos. This means the model can potentially learn to attend to relative positions.
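A short sketch that builds the full encoding matrix from those two formulas (the function name is mine):

  import numpy as np

  def sinusoidal_positional_encoding(max_len, d_model):
      """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
      pos = np.arange(max_len)[:, None]                   # (max_len, 1)
      i = np.arange(d_model // 2)[None, :]                # (1, d_model // 2)
      angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model // 2)
      pe = np.zeros((max_len, d_model))
      pe[:, 0::2] = np.sin(angles)   # even dimensions
      pe[:, 1::2] = np.cos(angles)   # odd dimensions
      return pe

  pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
  print(pe.shape)  # (50, 16): one d_model-dimensional encoding per position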

Interactive: Positional Encoding Visualizer

Visualize how sinusoidal positional encodings create unique patterns for each position. Different dimensions oscillate at different frequencies.

[Interactive widget: wave patterns across dimensions (lower dimensions oscillate faster, higher dimensions slower), plus the full encoding vectors at position 10 and position 15 shown side by side.]

Key properties:

  • Unique patterns: each position has a unique encoding pattern
  • Relative position: PE(pos+k) can be expressed as a linear function of PE(pos)
  • Multi-scale: low dimensions capture local patterns, high dimensions capture global patterns

Feed-Forward Networks and Layer Norm

After attention, each position passes through a position-wise feed-forward network. This is just a two-layer MLP, but it's applied independently to each position:

FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2

The inner dimension is typically 4× the model dimension—this expansion provides representational capacity that attention alone lacks. Modern variants use GELU or SwiGLU activations instead of ReLU.

Residual connections wrap both the attention and FFN sublayers, and layer normalization ensures training stability. The "pre-norm" variant (norm before the sublayer) has become standard as it provides better gradient flow.
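Putting the pieces together, a pre-norm block reads roughly like the sketch below (it reuses the attention function from the earlier snippet, assumes the value projection keeps the model dimension, and omits layer norm's learnable scale and shift for brevity):

  def layer_norm(x, eps=1e-5):
      """Normalize each position's features to zero mean and unit variance."""
      mu = x.mean(axis=-1, keepdims=True)
      var = x.var(axis=-1, keepdims=True)
      return (x - mu) / np.sqrt(var + eps)

  def ffn(x, W1, b1, W2, b2):
      """Position-wise feed-forward: expand (typically to 4x width), ReLU, project back."""
      return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

  def transformer_block(x, attn_params, ffn_params):
      """Pre-norm block: x + Attention(LN(x)), then x + FFN(LN(x))."""
      x = x + attention(layer_norm(x), *attn_params)   # attention sublayer with residual
      x = x + ffn(layer_norm(x), *ffn_params)          # feed-forward sublayer with residual
      return x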

The Complete Picture

Stack N of these layers (N=6 in the original paper, N=32-80+ in modern LLMs), and you have a Transformer. Each layer refines the representations: early layers might capture syntactic structure, later layers might encode more abstract semantics.

What struck me when I first implemented this was how simple the individual pieces are. Matrix multiplications, softmax, layer norm, ReLU. Nothing exotic. The magic is in the combination and scale.

Chapter 3

The Training Objective and What Models Actually Learn

Here's what continues to amaze me about modern LLMs: the training objective is almost laughably simple. Next-token prediction. Given the tokens you've seen so far, predict the probability distribution over what comes next.

That's it. No explicit teaching of grammar, no labeled examples of reasoning, no curriculum of skills. Just: predict the next token, billions of times, on trillions of tokens of text.

Why This Works: A Deep Connection

The mathematical justification is elegant. If you model P(x_t | x_1, ..., x_{t-1}) well for all positions in all texts, you're implicitly modeling the entire distribution of human text. Grammar rules? They're just patterns that make certain next tokens more likely. Factual knowledge? It's encoded in which continuations are probable. Reasoning? It's the patterns that connect premises to conclusions.

The loss function is cross-entropy:

L = -Σ log P(x_t | x_1, ..., x_{t-1})

Perplexity, the standard metric, is the exponential of the average per-token loss. A perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 options. Modern LLMs achieve perplexities of roughly 8-15 on standard benchmarks.
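In code, the loss and perplexity for one short sequence come down to a few lines (the probabilities below are made up for illustration):

  import numpy as np

  # P(x_t | x_1, ..., x_{t-1}) that a hypothetical model assigned to each actual next token
  token_probs = np.array([0.40, 0.05, 0.60, 0.25, 0.10])

  loss = -np.mean(np.log(token_probs))   # average cross-entropy per token
  perplexity = np.exp(loss)              # exp of the mean per-token loss
  print(f"loss={loss:.3f}  perplexity={perplexity:.2f}")  # e.g. loss≈1.62, perplexity≈5.06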

Interactive: Next-Token Prediction

Type a prompt and see the probability distribution over possible next tokens. Adjust temperature to see how it affects the distribution.

[Interactive widget: for the context "The capital of France is", the model's top-10 predictions are shown, led by "Paris" at 87.7%, with "the" (4.0%), "a" (2.2%), "known" (1.6%), and other continuations splitting the remainder. A temperature slider runs from greedy (0) through default (1) to creative (2).]

Temperature Effect: Low temperature (→0) makes the model more deterministic, always picking the highest probability token. High temperature (→2) flattens the distribution, making unlikely tokens more likely to be sampled.
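Under the hood, temperature simply rescales the logits before the softmax. A minimal sampling sketch (with made-up logits for four candidate tokens):

  import numpy as np

  def sample_next_token(logits, temperature, rng):
      """Softmax over temperature-scaled logits, then sample one token index."""
      scaled = logits / max(temperature, 1e-8)   # low T sharpens, high T flattens
      probs = np.exp(scaled - scaled.max())
      probs /= probs.sum()
      return rng.choice(len(probs), p=probs)

  rng = np.random.default_rng(0)
  logits = np.array([5.0, 2.0, 1.0, 0.5])   # hypothetical scores for 4 candidate tokens
  for T in (0.2, 1.0, 2.0):
      counts = np.bincount([sample_next_token(logits, T, rng) for _ in range(1000)], minlength=4)
      print(T, counts / 1000)   # higher T spreads samples over more tokens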

Emergent Capabilities: Scale Changes Everything

Emergent capabilities are perhaps the most surprising aspect of LLM scaling. These are abilities that appear suddenly at certain scales but are absent in smaller models.

GPT-2 (1.5B parameters) could generate coherent paragraphs but struggled with few-shot learning. GPT-3 (175B parameters) could solve new tasks from just a few examples in the prompt—a capability that wasn't explicitly trained and that researchers didn't fully anticipate.

Chain-of-thought reasoning emerged similarly. Tell a large enough model to "think step by step," and it produces intermediate reasoning that improves final answers. Smaller models just produce nonsense when prompted the same way.

These aren't linear improvements—they're phase transitions. The capability is essentially zero, then suddenly it's there. We still don't fully understand why.

Interactive: Emergent Capability Timeline

Explore how capabilities emerged as models scaled from GPT-1 to GPT-4 and beyond.

[Interactive widget: a timeline of models; clicking one reveals its capabilities. GPT-3 (2020, 175B parameters) is marked as a major milestone, with several capabilities listed as new: few-shot learning (learning new tasks from just a few examples in the prompt), basic arithmetic, simple code generation, and translation between languages without specific training.]

Phase Transitions: Emergent capabilities don't appear gradually—they emerge suddenly at certain scales. Few-shot learning was nearly absent in GPT-2 but transformative in GPT-3, despite only a ~100× parameter increase.