Deep Dive Series
How Transformer LLMs Actually Work
A comprehensive, interactive journey from attention mechanisms to production deployment. Built by someone who's been writing code since the Commodore 64 era.
19 chapters across 3 parts
~45 min total read
23 interactive visualizations
Choose where to start
Part I • Start here
Setting the Stage
3 chapters: The Problem That Needed Solving, Attention Is All You Need, Training Objective
~15 min read • 7 interactive visualizations
Part II
From Transformer to Modern LLMs
10 chapters: Decoder-Only Revolution, Tokenization, Modern Positional Encoding + 7 more
~25 min read • 16 interactive visualizations
Part III
What's Emerging
6 chapters: The Reasoning Revolution, State Space Models, The Efficiency Frontier + 3 more
~5 min read • Series finale
What You'll Learn
Part I: Foundations
- Why RNNs failed and how attention solved it
- Self-attention, multi-head attention, positional encoding (see the sketch below)
- The surprisingly simple training objective
- Emergent capabilities at scale
Part II: Modern Architecture
- Decoder-only vs encoder-decoder
- Tokenization (BPE, WordPiece)
- RoPE, ALiBi, and modern positional encoding
- GQA, MQA, and attention variants
Part II (continued): Efficiency & Scale
- Mixture of Experts (MoE) architecture
- Flash Attention and IO-aware algorithms
- Distributed training (DP, TP, PP, ZeRO)
- Inference optimization and speculative decoding
Part III: The Future
- Long context handling strategies
- State Space Models (Mamba, Jamba)
- The agent paradigm and tool use
- Key formulas and reference materials
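
Want a taste before you start? The core idea Part I builds toward, scaled dot-product attention, fits in a few lines. The snippet below is a minimal NumPy sketch written for this overview page; the function name, shapes, and toy inputs are illustrative and not taken from any chapter's code.

```python
# Minimal sketch of scaled dot-product attention (the core idea of Part I).
# Illustrative only: names, shapes, and toy data are made up for this overview.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how strongly each token attends to every other
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mixture of value vectors

# Toy example: 4 tokens with 8-dimensional query/key/value vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # -> (4, 8)
```

Everything in Part I, from multi-head attention to the training objective, builds on this one weighted sum.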
Ready to dive in?
Start with Part I to understand the foundations, or jump straight to whichever section interests you. Each part is written to stand on its own, though later parts build on concepts introduced earlier.
Start with Part I