LLM Deep Dive • Part II of III
From Transformer to Modern LLMs
The architectural refinements and engineering innovations that enable today's systems. From decoder-only to Flash Attention and beyond.
The Decoder-Only Revolution
The original Transformer was an encoder-decoder architecture designed for translation. The encoder processed the source sentence with bidirectional attention (each position attends to all others). The decoder generated the target sentence autoregressively, with causal attention (each position only attends to previous positions) plus cross-attention to the encoder outputs.
Modern LLMs almost universally use decoder-only architectures. Why did encoder-decoder lose?
The Case for Decoder-Only
Unified Training: With decoder-only, everything is next-token prediction. You don't need paired data (source-target). Any text works.
Task Flexibility: Encoder-decoder assumes a clear input/output split. Decoder-only treats everything as a sequence to continue. Want translation? Prompt with "Translate to French: [text]". Want summarization? "Summarize: [text]". The task is implicit in the prompt.
Compute Efficiency: No separate encoder phase. During inference, you're running one forward pass per generated token, not encoder + decoder.
Interactive: Architecture Comparison
Compare how encoder-decoder and decoder-only architectures process the same task.
| Aspect | Encoder-Only | Decoder-Only | Enc-Dec |
|---|---|---|---|
| Context | Full bidirectional | Left-to-right only | Both |
| Generation | Limited | Native | Native |
| Understanding | Excellent | Good | Excellent |
| 2024 Trend | Embeddings/RAG | Dominant | Seq2seq niche |
The GPT Progression
GPT-1 (2018, 117M parameters) demonstrated that pre-training on next-token prediction followed by fine-tuning could achieve strong results. The insight: unsupervised pre-training provides a good initialization.
GPT-2 (2019, 1.5B parameters) showed zero-shot capabilities. Without any task-specific training, it could perform basic question answering, summarization, and translation. OpenAI initially declined to release the full model due to concerns about misuse.
GPT-3 (2020, 175B parameters) was the breakthrough that launched the current era. Few-shot learning worked: give the model a few examples in the prompt, and it could perform tasks it was never explicitly trained for. This was genuinely surprising.
GPT-4 (2023) introduced multimodality (images) and likely uses a Mixture of Experts architecture (though OpenAI hasn't confirmed details). Estimated at 1.7T parameters.
Tokenization — The Often-Overlooked Foundation
Tokenization might be the most underappreciated component of LLMs. It determines how text is chunked into the discrete units the model processes, and it has profound implications for model behavior.
The challenge: characters are too granular (long sequences, a sparse learning signal), while whole words are too coarse (huge vocabularies, out-of-vocabulary problems). Subword tokenization finds the middle ground.
Byte Pair Encoding (BPE)
BPE is elegantly simple. Start with a vocabulary of individual characters (or bytes). Iteratively find the most frequent adjacent pair of tokens and merge them into a new token. Repeat until you reach your target vocabulary size.
The result: common words become single tokens ("the" → "the"), rare words get split into subwords ("unconstitutional" → "un" + "constitu" + "tional"), and you can always fall back to individual characters for anything unknown.
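Here is a minimal sketch of that training loop on a toy corpus. The `learn_bpe` helper and the whitespace pre-splitting are simplifications of my own; production tokenizers operate on bytes and handle punctuation, capitalization, and pre-tokenization rules.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
    Words are tuples of symbols, starting from individual characters."""
    vocab = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe("low lower lowest low low", num_merges=5))
```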
Interactive: Tokenization Playground
Type any text to see how it gets tokenized. Compare different tokenizers and see the "token tax" for different languages.
Interactive: BPE Merge Visualization
Watch BPE build a vocabulary by iteratively merging the most frequent pairs.
The Real-World Impact
Tokenization quirks have real consequences:
- Token fertility: English averages ~1.3 tokens per word. Some languages require 2-3x more tokens for the same content, making them more expensive to process.
- Arithmetic failures: Numbers often get tokenized inconsistently. "1234" might be one token but "12345" might be "123" + "45". This makes arithmetic unreliable.
- Code handling: Whitespace-sensitive languages (Python) need careful tokenization to preserve indentation semantics.
Modern Positional Encoding — RoPE, ALiBi, and Beyond
Sinusoidal positional encodings worked, but they had limitations. The biggest: extrapolation. Train on sequences up to 2048 tokens, and performance degrades on longer sequences. The model hasn't seen those position values before.
RoPE: Rotary Position Embeddings
RoPE, used by LLaMA and many others, encodes position through rotation. Instead of adding positional information to embeddings, RoPE applies rotation matrices to query and key vectors.
The key insight: when you take the dot product of two rotated vectors, the result depends only on their angle difference—which is proportional to the position difference. Position becomes relative, encoded in the geometry of the operation itself.
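A small numeric sketch of that property, reduced to a single 2-D pair with one frequency; real RoPE rotates many such pairs across the head dimension, each at its own frequency. The `rotate` helper is illustrative, not a library API.

```python
import numpy as np

def rotate(vec, pos, theta=1.0):
    """Apply a RoPE-style rotation to one 2-D (query or key) pair at a position."""
    angle = pos * theta
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ vec

q = np.array([1.0, 0.5])
k = np.array([0.3, 0.8])

# The score depends only on the relative offset, not the absolute positions
score_a = rotate(q, pos=5) @ rotate(k, pos=2)   # offset 3
score_b = rotate(q, pos=9) @ rotate(k, pos=6)   # offset 3, shifted by 4
print(np.isclose(score_a, score_b))             # True
```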
Interactive: RoPE Rotation Visualizer
Visualize how RoPE encodes position through rotation. The dot product between Q and K depends only on their relative position.
ALiBi: Attention with Linear Biases
ALiBi takes a radically different approach: no learned positional encoding at all. Instead, it adds a linear penalty to attention scores based on distance: positions farther apart get lower attention scores.
Different heads use different slopes, allowing some to focus locally and others to attend more globally. ALiBi extrapolates to longer sequences essentially for free (under the assumption that recency matters), though it can't learn arbitrary position-dependent patterns.
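A sketch of the bias construction, assuming the geometric slope schedule from the ALiBi paper and a causal mask; the shapes and helper name are my own.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Additive ALiBi bias for causal attention: each head penalizes scores
    linearly with distance to the past, using a head-specific slope."""
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]           # i - j: how far back j is
    bias = -slopes[:, None, None] * distance          # shape: (heads, seq, seq)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    bias[:, future] = -np.inf                         # mask out future positions
    return bias                                       # added to scores before softmax

print(alibi_bias(seq_len=4, num_heads=2)[0])
```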
Attention Variants for Efficiency
The KV-cache is both a blessing and a curse. During autoregressive generation, we cache the key and value tensors from previous tokens to avoid recomputation. Essential for reasonable inference speed. But memory consumption scales as 2 (K and V) × layers × kv_heads × head_dim × sequence_length × batch_size × bytes per element.
For a 70B model serving a 128K context, the KV-cache alone can consume 80+ GB—often more than the model weights themselves.
Multi-Query and Grouped-Query Attention
Multi-Query Attention (MQA) shares a single key and value head across all query heads. KV-cache shrinks by a factor of the number of heads. Quality degrades slightly (5-10%).
Grouped-Query Attention (GQA) is the compromise. Groups of query heads share KV heads. LLaMA 2 70B uses 64 query heads with 8 KV groups—an 8× reduction in KV-cache with quality within 1% of full attention after uptraining.
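Plugging in numbers shaped like LLaMA 2 70B (80 layers, 128-dim heads, FP16 cache) makes the 8× reduction concrete. This is a back-of-envelope estimate, not measured memory.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x seq x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# LLaMA-2-70B-like shapes, 4K context, batch of 8, FP16 cache
full_mha = kv_cache_gb(layers=80, kv_heads=64, head_dim=128, seq_len=4096, batch=8)
gqa      = kv_cache_gb(layers=80, kv_heads=8,  head_dim=128, seq_len=4096, batch=8)
print(f"MHA: {full_mha:.1f} GB   GQA: {gqa:.1f} GB")   # ~86 GB vs ~11 GB
```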
Interactive: KV-Cache Memory Calculator
Calculate KV-cache memory requirements for different model configurations and attention variants.
KV-Cache = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element
Sliding Window Attention
Sliding window attention restricts each token to only attend within a fixed window. O(n × w) complexity instead of O(n²), and the KV-cache has a fixed maximum size regardless of sequence length.
The trick: information can still propagate across the full sequence through layer stacking. With window w and L layers, the effective receptive field is L × w. Mistral popularized this design in open models; a related technique, "attention sinks" (from StreamingLLM), always keeps the first few tokens visible to anchor the representation.
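A quick way to convince yourself of the L × w claim is to compose the window mask with itself once per layer. This toy reachability check is illustrative only.

```python
import numpy as np

def receptive_field(seq_len, window, layers):
    """Propagate a boolean 'depends-on' relation through stacked causal
    sliding-window layers; the reach grows by roughly `window` per layer."""
    pos = np.arange(seq_len)
    # Causal sliding window: token i attends to tokens in [i - window + 1, i]
    mask = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < window)
    reach = np.eye(seq_len, dtype=bool)
    for _ in range(layers):
        reach = (reach.astype(int) @ mask.astype(int)) > 0
    return reach

r = receptive_field(seq_len=64, window=8, layers=4)
print(r[63].sum())   # the last token depends on roughly L*w earlier positions
```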
Interactive: Sliding Window Receptive Field
See how information propagates through layers with sliding window attention. The effective receptive field grows with depth.
Mixture of Experts — Scaling Parameters Without Scaling Compute
Here's the scaling dilemma: more parameters generally mean better performance, but more parameters also mean more compute per token. Mixture of Experts (MoE) breaks this link.
Replace the dense FFN with N expert FFNs. A router network selects the top-k experts for each token. Most parameters are inactive for any given token—you get the capacity of a large model with the compute cost of a smaller one.
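A minimal sketch of top-k routing, with plain functions standing in for the expert FFNs and a random matrix for the learned router; real implementations batch tokens per expert rather than looping.

```python
import numpy as np

def moe_forward(x, experts, router_w, top_k=2):
    """Route each token to its top-k experts and mix their outputs by the
    renormalized router probabilities."""
    logits = x @ router_w                             # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    top = np.argsort(-probs, axis=-1)[:, :top_k]      # chosen expert indices
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        weights = probs[t, top[t]]
        weights /= weights.sum()                      # renormalize over the chosen k
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](token)
    return out

# Toy setup: 4 "experts" that just scale their input
experts = [lambda v, s=s: s * v for s in (0.5, 1.0, 1.5, 2.0)]
x = np.random.randn(6, 8)                             # 6 tokens, hidden dim 8
router_w = np.random.randn(8, 4)
print(moe_forward(x, experts, router_w).shape)        # (6, 8)
```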
Mixtral: A Concrete Example
Mixtral 8x7B has 8 experts with top-2 routing. Total parameters: 46.7B. Active parameters per token: ~12.9B. It matches or exceeds LLaMA 2 70B on most benchmarks while using similar compute to a 13B dense model.
Interactive: MoE Routing Visualization
Watch how tokens get routed to different experts. See load balancing in action.
The Load Balancing Challenge
Without careful design, the router might learn to always use the same few experts ("expert collapse"). An auxiliary load balancing loss encourages even distribution:
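One widely used formulation (from the Switch Transformer paper) multiplies, for each expert, the fraction of tokens routed to it by its mean router probability; the coefficient `alpha` is a hyperparameter. A sketch, assuming top-1 assignments:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, num_experts, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, where f_i is the fraction of tokens whose
    top choice is expert i and P_i is the mean router probability for expert i.
    The product is minimized when both distributions are uniform at 1/N."""
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    P = router_probs.mean(axis=0)                     # (num_experts,)
    return alpha * num_experts * np.sum(f * P)

probs = np.random.dirichlet(np.ones(8), size=128)     # 128 tokens, 8 experts
assignment = probs.argmax(axis=1)
print(load_balancing_loss(probs, assignment, num_experts=8))
```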
The infrastructure implications are significant. All expert parameters must fit in memory (or be distributed with expert parallelism), but compute scales only with active parameters. This creates an unusual memory-compute trade-off.
Flash Attention — IO-Aware Algorithm Design
Flash Attention is perhaps my favorite example of infrastructure-aware algorithm design. The insight: on modern GPUs, compute is cheap but memory bandwidth is expensive.
An H100 SXM offers roughly 1,000 TFLOPS of dense FP16 tensor-core compute (about double with sparsity) but only 3.35 TB/s of HBM bandwidth. Standard attention computes the n×n attention matrix, writes it to HBM, reads it back for softmax, writes again, reads for the final matmul. All that memory movement dominates runtime.
The Tiling Solution
Flash Attention never materializes the full n×n matrix. It processes Q, K, V in tiles that fit in SRAM (the GPU's fast on-chip memory). The key enabler is online softmax—computing softmax incrementally as new blocks arrive, tracking running statistics rather than needing all values upfront.
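Here is the online-softmax trick in isolation, applied to one query's scores against blocked keys and values; Flash Attention does the same bookkeeping per tile, fused into the attention kernel on-chip.

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block=4):
    """Streaming softmax(scores) @ values: process blocks one at a time,
    keeping only a running max, running normalizer, and running output."""
    m = -np.inf                          # running max of scores seen so far
    l = 0.0                              # running sum of exp(score - m)
    out = np.zeros(values.shape[1])
    for i in range(0, len(scores), block):
        s, v = scores[i:i + block], values[i:i + block]
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)   # rescale old accumulators to the new max
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        out = out * correction + p @ v
        m = m_new
    return out / l

scores = np.random.randn(16)
values = np.random.randn(16, 8)
ref_weights = np.exp(scores - scores.max())
reference = (ref_weights / ref_weights.sum()) @ values
print(np.allclose(online_softmax_weighted_sum(scores, values), reference))  # True
```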
Interactive: Memory Access Patterns
Compare standard vs Flash Attention memory access patterns. See how tiling dramatically reduces HBM traffic.
The results are dramatic:
- Memory: O(n) instead of O(n²)
- Speed: 2-9× faster depending on sequence length
- Efficiency: Flash Attention 2 achieves 50-73% of theoretical FLOPS
This is the kind of algorithm you get when systems engineers and ML researchers collaborate closely. The math is the same—the implementation just respects the hardware reality.
Training at Scale
Training a 70B+ parameter model requires careful orchestration of parallelism strategies. No single GPU can hold the model, and naive approaches waste compute or run out of memory.
The Parallelism Strategies
Data Parallelism: Each GPU holds a full model copy and processes different batches. Gradients are synchronized via all-reduce. Scales compute but not memory.
Tensor Parallelism: Split individual layers horizontally across GPUs. The attention and FFN weight matrices are sharded. Requires communication every layer—needs fast NVLink (400-900 GB/s).
Pipeline Parallelism: Split layers vertically—different GPUs own different layers. Communication only at stage boundaries. InfiniBand (200-400 Gb/s) is sufficient. Uses micro-batching to hide pipeline bubbles.
Interactive: Parallelism Strategies
Visualize how different parallelism strategies distribute model and data across GPUs.
ZeRO: Memory Efficiency for Data Parallelism
ZeRO (Zero Redundancy Optimizer) eliminates memory redundancy in data parallelism. Standard DDP stores optimizer states, gradients, and parameters on every GPU. ZeRO shards them:
- Stage 1: Shard optimizer states (4× memory reduction for Adam)
- Stage 2: Also shard gradients
- Stage 3: Also shard parameters (requires gather before forward/backward)
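A rough per-GPU accounting, assuming mixed-precision Adam at 16 bytes per parameter (2 FP16 weights + 2 FP16 gradients + 12 bytes of optimizer state), following the ZeRO paper's bookkeeping and ignoring activations and buffers.

```python
def zero_memory_per_gpu_gb(params_b, num_gpus, stage):
    """Approximate per-GPU training memory (GB) for a params_b-billion-parameter
    model under ZeRO stages 0-3, assuming mixed-precision Adam."""
    weights, grads, opt_state = 2.0, 2.0, 12.0       # bytes per parameter
    if stage >= 1:
        opt_state /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return params_b * (weights + grads + opt_state)  # billions of params -> GB

for stage in (0, 1, 2, 3):
    gb = zero_memory_per_gpu_gb(70, num_gpus=64, stage=stage)
    print(f"ZeRO-{stage}: {gb:.0f} GB/GPU")
```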
Scaling Laws: The Chinchilla Revolution
In 2022, DeepMind's Chinchilla paper upended conventional wisdom. GPT-3 (175B params, 300B tokens) was undertrained. The compute-optimal ratio is roughly 20 tokens per parameter.
Chinchilla (70B params, 1.4T tokens) outperformed the 4× larger Gopher (280B), which was trained with the same compute budget allocated differently, and it matched or exceeded GPT-3 while being far smaller and cheaper to run.
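Using the standard C ≈ 6·N·D approximation for training FLOPs together with the 20-tokens-per-parameter rule of thumb, you can back out the compute-optimal split for any budget. The GPT-3 figure below is approximate.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Given a compute budget C ~ 6 * N * D FLOPs and the Chinchilla rule of
    thumb D ~ 20 * N, solve for parameters N and training tokens D."""
    n = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

# Roughly GPT-3's training compute: 6 * 175e9 params * 300e9 tokens ~ 3.15e23 FLOPs
n_opt, d_opt = chinchilla_optimal(3.15e23)
print(f"~{n_opt/1e9:.0f}B params trained on ~{d_opt/1e12:.1f}T tokens")
```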
Interactive: Scaling Law Explorer
Explore the compute-optimal frontier. See where different models fall relative to the Chinchilla-optimal line.
Inference Optimization
Training costs are one-time. Inference costs are forever. For widely-used models, inference optimization matters enormously for both cost and user experience.
The KV-Cache Lifecycle
During autoregressive generation, each new token requires attending to all previous tokens. Without caching, you'd recompute K and V for the entire context on every step—O(n²) total compute.
With KV-caching, you compute K and V once per token and store them. Each generation step then only needs O(n) compute. The trade-off: memory consumption grows linearly with context length.
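A toy single-head decode loop makes the trade-off visible: each step projects only the new token and appends one K and one V entry, while the cache (and the per-step attention cost) grows with the number of tokens generated so far. The weights and inputs here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """One generation step with a KV-cache: project only the new token,
    append its K/V to the cache, then attend over everything cached."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)        # O(1) new projection work per step...
    v_cache.append(x_new @ Wv)        # ...instead of re-projecting the whole context
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)       # attention over t cached positions: O(t)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

k_cache, v_cache = [], []
for step in range(5):                 # pretend each step emitted a new token
    out = decode_step(rng.standard_normal(d), k_cache, v_cache)
print(len(k_cache), out.shape)        # cache grows by one entry per generated token
```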
Interactive: Token Generation Animation
Watch the autoregressive generation process step by step. See the KV-cache grow as tokens are generated.
PagedAttention: Virtual Memory for KV-Cache
PagedAttention (introduced in vLLM) applies virtual memory concepts to KV-cache management. Instead of pre-allocating the maximum possible sequence length, allocate memory in small blocks on demand.
A page table maps logical KV positions to physical memory blocks. Benefits:
- Near-zero memory fragmentation
- Memory allocated only as needed
- Prefix sharing: requests with common prefixes share KV blocks
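A toy allocator in that spirit, tracking only the page table and a free-block pool; prefix sharing and eviction are omitted, and this is not vLLM's actual implementation.

```python
class PagedKVCache:
    """Logical KV positions map to fixed-size physical blocks allocated on
    demand, so nothing is reserved for sequence length that never materializes."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.page_tables = {}                        # seq_id -> list of physical blocks
        self.lengths = {}                            # seq_id -> tokens written

    def append_token(self, seq_id):
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full
            table.append(self.free_blocks.pop())     # grab a new physical block
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(40):
    cache.append_token(seq_id="req-1")               # 40 tokens -> 3 blocks
print(cache.page_tables["req-1"], len(cache.free_blocks))
```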
Interactive: PagedAttention Visualization
See how PagedAttention manages memory dynamically, allocating and recycling blocks as sequences progress.
Speculative Decoding
Speculative decoding uses a small, fast "draft" model to propose multiple tokens, then the large "target" model verifies them in parallel. If the draft matches, you've generated multiple tokens in essentially one target forward pass.
Typical speedups: 2-3× for latency. Works best when the draft model is a good approximation— you can use a quantized version of the target, a distilled variant, or a specialized small model.
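A greedy sketch of the control flow, with trivial stand-in "models"; the real algorithm verifies draft tokens via rejection sampling over probabilities so the output distribution exactly matches the target model.

```python
def speculative_generate(target_next, draft_next, prompt, n_tokens, k=4):
    """Draft proposes k tokens; the target checks them position by position and
    keeps the longest matching prefix plus one corrected token of its own."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        # Cheap draft model proposes k tokens autoregressively
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # Target checks each drafted position (one batched forward pass in practice)
        accepted = []
        for i in range(k):
            expected = target_next(seq + accepted)
            if draft[i] == expected:
                accepted.append(draft[i])        # draft was right: token is "free"
            else:
                accepted.append(expected)        # target overrides; stop this round
                break
        seq.extend(accepted)
    return seq[:len(prompt) + n_tokens]

# Stand-in "models": the target counts upward, the draft mostly agrees
target_next = lambda seq: (seq[-1] + 1) % 10
draft_next = lambda seq: (seq[-1] + 1) % 10 if len(seq) % 6 else 0
print(speculative_generate(target_next, draft_next, prompt=[1, 2], n_tokens=10))
```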
Interactive: Speculative Decoding Demo
Watch how draft and target models work together to accelerate generation.
Long Context — The Frontier
Standard attention is O(n²) in both memory and compute. At 128K tokens, that's 16 billion attention elements per layer. The quadratic wall is the fundamental bottleneck for long-context processing.
Interactive: Quadratic Complexity Visualizer
See how attention complexity scales with sequence length. The quadratic growth quickly becomes prohibitive.
| Sequence Length | O(n²) Ops | O(n) Ops | Memory for n² (FP16) |
|---|---|---|---|
| 1.0K | 1.0M | 1.0K | 2.0 MB |
| 4.1K | 16.8M | 4.1K | 32.0 MB |
| 16.4K | 268.4M | 16.4K | 512.0 MB |
| 65.5K | 4.3B | 65.5K | 8.0 GB |
| 262.1K | 68.7B | 262.1K | 128.0 GB |
| 1.0M | 1.1T | 1.0M | 2048.0 GB |
Sparse Attention Patterns
Sparse attention reduces complexity by only computing a subset of attention pairs:
- Longformer: Local sliding window + global tokens
- BigBird: Random + window + global patterns
- Theoretical result: sparse patterns can be universal approximators
Interactive: Sparse Attention Patterns
Compare different sparse attention patterns and their coverage of the full attention matrix.
| Pattern | Complexity | Long Context | Trade-off |
|---|---|---|---|
| Full Attention | O(n²) | ❌ Limited | Quality baseline |
| Sliding Window | O(n × w) | ⚠️ Local only | Fast but may miss long-range |
| Dilated/Strided | O(n × w) | ✓ Expanded | Better coverage, fixed pattern |
| Longformer | O(n × w + g × n) | ✓ Global + Local | Best of both worlds |
| BigBird | O(n × (w + g + r)) | ✓ Global + Local | Random helps coverage |
State Space Models and Alternatives
State Space Models like Mamba offer a different paradigm: O(n) training and inference compute with O(1) memory during inference. They replace attention with a structured recurrence that can be parallelized efficiently during training (as a convolution in earlier LTI variants like S4, and via a parallel scan in Mamba's selective version).
Hybrid approaches like Jamba combine SSM layers with Transformer layers, getting benefits of both architectures.
RAG: Sidestepping the Problem
Retrieval-Augmented Generation avoids the long-context problem entirely. Instead of fitting everything into context, retrieve relevant chunks from an external knowledge base and include only those.
Trade-offs: latency (retrieval adds time), retrieval quality (wrong chunks = wrong answers), and complexity (another system to maintain).
Practical Infrastructure Decisions
Let's translate all this technical understanding into practical deployment decisions. After years of deploying ML models, I've learned that the gap between "works in notebook" and "works in production" is vast.
Memory Planning
Start with back-of-envelope calculations:
- Model weights: params × bytes_per_param (2 for FP16, 1 for INT8)
- KV-cache: 2 × layers × kv_heads × head_dim × seq_len × batch × bytes
- Activations: varies, but budget 10-20% overhead
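Putting those three lines together as a single estimate: FP16 weights, a GQA-style KV-cache, and ~15% overhead as a rough activation/buffer allowance. The LLaMA-2-70B-like shapes are placeholder inputs, not a sizing guarantee.

```python
def serving_memory_gb(params_b, layers, kv_heads, head_dim,
                      seq_len, batch, bytes_weights=2, bytes_kv=2):
    """Back-of-envelope serving memory: weights + KV-cache + ~15% overhead."""
    weights = params_b * bytes_weights                                  # GB
    kv = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_kv / 1e9
    return (weights + kv) * 1.15

# 70B model with GQA (8 KV heads), 8K context, batch of 16, FP16 weights
mem = serving_memory_gb(70, layers=80, kv_heads=8, head_dim=128,
                        seq_len=8192, batch=16)
print(f"~{mem:.0f} GB")
```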
Parallelism Strategy Selection
Rules of thumb:
- < 15B: Single GPU or simple data parallelism
- 15-70B: Tensor parallelism within a node (NVLink required)
- 70-200B: Add pipeline parallelism across nodes
- > 200B: Full 3D parallelism (DP + TP + PP)
Interactive: Deployment Calculator
Plan your deployment: enter model specs and requirements, get hardware recommendations.
Key Metrics to Track
Time to First Token (TTFT): Interactive latency. Users notice delays beyond 500ms. Critical for chat applications.
Inter-Token Latency (ITL): Streaming smoothness. Should stay under 50ms for natural-feeling output.
Throughput: Tokens per second across all requests. The key metric for cost optimization. Continuous batching and speculative decoding can dramatically improve this.