Deep Dive Series

How Transformer LLMs Actually Work

A comprehensive, interactive journey from attention mechanisms to production deployment. Built by someone who's been writing code since the Commodore 64 era.

14 chapters across 4 parts
~45 min total read
23 interactive visualizations

What You'll Learn

Part I: Foundations

  • Why RNNs failed and how attention solved it
  • Self-attention, multi-head attention, positional encoding (see the sketch after this list)
  • The surprisingly simple training objective
  • Emergent capabilities at scale
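As a taste of Part I, here is a minimal sketch of single-head scaled dot-product self-attention in plain NumPy. The shapes, weight names, and random inputs are illustrative assumptions, not code from the series.

```python
# Minimal sketch of single-head scaled dot-product self-attention (no mask),
# in plain NumPy. Shapes and weight names are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv            # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # every token scored against every other
    weights = softmax(scores, axis=-1)          # each row sums to 1: where to attend
    return weights @ V                          # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)      # -> (4, 8)
```

Multi-head attention runs several of these in parallel with separate projections and concatenates the results, which Part I walks through step by step.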

Part II: Modern Architecture

  • Decoder-only vs. encoder-decoder
  • Tokenization with BPE and WordPiece (see the sketch after this list)
  • RoPE, ALiBi, and modern positional encoding
  • GQA, MQA, and attention variants
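To preview the tokenization chapter, here is a toy sketch of one BPE training step: count adjacent symbol pairs across a tiny corpus and merge the most frequent pair. The corpus and the "</w>" end-of-word marker are illustrative assumptions.

```python
# Toy sketch of one BPE training step: count adjacent symbol pairs and merge
# the most frequent one. Corpus and "</w>" end-of-word marker are illustrative.
from collections import Counter

vocab = {                                   # word (split into symbols) -> frequency
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
}

def count_pairs(vocab):
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # fuse the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

best = count_pairs(vocab).most_common(1)[0][0]   # ('w', 'e') in this toy corpus
vocab = merge_pair(vocab, best)
print(best, vocab)
```

Real tokenizers repeat this merge loop tens of thousands of times and record the learned merges as the vocabulary.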

Part III: Efficiency & Scale

  • Mixture of Experts (MoE) architecture (see the routing sketch after this list)
  • Flash Attention and IO-aware algorithms
  • Distributed training (DP, TP, PP, ZeRO)
  • Inference optimization and speculative decoding
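As a preview of the MoE chapter, here is a toy top-k routing sketch for a single token in NumPy. The expert count, dimensions, and top_k value are illustrative assumptions, not the series' code.

```python
# Toy sketch of top-k mixture-of-experts routing for one token, in NumPy.
# Expert count, dimensions, and top_k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

x = rng.normal(size=(d_model,))                        # one token's hidden state
W_gate = rng.normal(size=(d_model, n_experts))         # router ("gate") weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # toy expert layers

logits = x @ W_gate                                    # one router score per expert
top = np.argsort(logits)[-top_k:]                      # pick the top-k experts
z = logits[top] - logits[top].max()                    # stable softmax over the chosen experts
gates = np.exp(z) / np.exp(z).sum()

# Only the selected experts run, so most parameters stay idle for this token.
y = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
print(y.shape)                                         # -> (8,)
```

The point of this routing is that compute per token stays roughly constant while the total parameter count grows with the number of experts.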

Part IV: The Future

  • Long context handling strategies
  • State Space Models (Mamba, Jamba)
  • The agent paradigm and tool use
  • Key formulas and reference materials

Ready to dive in?

Start with Part I to understand the foundations, or jump to any section that interests you. Each part can be read on its own, though later parts build on concepts introduced earlier.

Start with Part I