LLM Deep Dive • Part III of III

What's Emerging

The cutting edge of LLM development: reasoning models, state space architectures, extreme efficiency, and the rise of agentic AI. Plus a comprehensive reference guide.

6 chapters + reference • ~20 min read • 6 interactive visualizations
Chapter 14

The Reasoning Revolution

In September 2024, something shifted. OpenAI released o1, and suddenly we had a "whole new knob" to turn—one that traded inference compute for capability. The field hasn't been the same since.

The core insight behind test-time compute scaling is deceptively simple: let the model think longer before answering. Instead of immediately producing a response, the model generates internal reasoning traces—sometimes thousands of tokens of "thinking" before producing a final answer.

Chain-of-Thought and Beyond

Chain-of-thought prompting showed us that intermediate reasoning improves results. Tell a model to "think step by step," and accuracy on math problems jumps significantly. But that was prompting—we were asking the model to show its work.

What o1 and DeepSeek-R1 demonstrated is that you can train models specifically for extended reasoning using reinforcement learning from verifiable rewards (RLVR). The model learns when to think more, when to backtrack, and when it has enough evidence to commit to an answer.

Tree-of-thought takes this further: instead of a linear chain, the model explores multiple reasoning paths in parallel. It's like the difference between following a single trail through the woods versus exploring several forks and choosing the best route.
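To make the contrast concrete, here is a minimal Python sketch of a tree-of-thought-style search. The propose_steps and score_state functions are stand-ins for model calls (no real API is assumed); the point is the control flow—keep several partial reasoning paths alive, score them, and expand only the most promising forks—with chain-of-thought as the special case of a single path.

```python
import random
from dataclasses import dataclass

@dataclass
class Path:
    steps: list[str]   # partial reasoning trace so far
    score: float       # estimated quality of this partial path

def propose_steps(path: Path, k: int) -> list[str]:
    # Stand-in for a model call that proposes k candidate next reasoning steps.
    return [f"candidate step {i} after {len(path.steps)} steps" for i in range(k)]

def score_state(path: Path) -> float:
    # Stand-in for a verifier or model self-evaluation of the partial path.
    return random.random()

def tree_of_thought(question: str, depth: int = 4, branch: int = 3, beam: int = 2) -> Path:
    """Breadth-limited search over reasoning paths.
    Chain-of-thought is the special case branch = beam = 1."""
    frontier = [Path(steps=[question], score=0.0)]
    for _ in range(depth):
        candidates = []
        for path in frontier:
            for step in propose_steps(path, k=branch):
                new_path = Path(steps=path.steps + [step], score=0.0)
                new_path.score = score_state(new_path)
                candidates.append(new_path)
        # Keep only the `beam` most promising partial paths (the best forks in the trail).
        frontier = sorted(candidates, key=lambda p: p.score, reverse=True)[:beam]
    return max(frontier, key=lambda p: p.score)

best = tree_of_thought("Average speed of the train over the whole journey?")
print(best.steps)
```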

Interactive: Reasoning Strategies Compared

Compare how different reasoning approaches affect accuracy and token usage. Animate the thinking process to see each strategy in action.

Problem: If a train travels 120 miles in 2 hours, then stops for 30 minutes, then travels 90 miles in 1.5 hours, what is its average speed for the entire journey?

Example reasoning trace:

1. Observe: Let me break this down step by step.
2. Observe: First segment: 120 miles in 2 hours.
3. Observe: Second segment: 90 miles in 1.5 hours.
4. Observe: Rest period: 30 minutes = 0.5 hours.
5. Calculate: Total distance = 120 + 90 = 210 miles.
6. Calculate: Total time = 2 + 0.5 + 1.5 = 4 hours.
7. Calculate: Average speed = 210 ÷ 4 = 52.5 mph.
8. Conclude: The average speed for the entire journey is 52.5 mph.

In the visualization, this trace scores 78% accuracy using 156 tokens across 8 reasoning steps (about 0.50 accuracy per token).
Key Insight: More reasoning tokens generally improve accuracy, but there are diminishing returns. Chain-of-Thought with self-verification offers the best accuracy-per-token ratio for most problems. Tree-of-Thought excels on ambiguous problems where multiple interpretations exist.

The Test-Time Compute Trade-off

Here's what I find fascinating: we now have two scaling laws to play with. The original Chinchilla-era scaling law told us how much data to use for a given amount of training compute. The new inference-time scaling law tells us how much thinking to do for a given problem difficulty.

The trade-offs are real. A reasoning model might take 10-30 seconds to answer a complex question that a standard model handles in 1 second. But on difficult problems—competition math, formal proofs, complex code—the accuracy gains are dramatic. DeepSeek-R1 achieves ~80% on AIME (American Invitational Mathematics Examination) problems. That would put it in the top tier of human competitors.

Interactive: Test-Time Compute Scaling

Adjust the "thinking budget" to see how accuracy improves with more inference compute. Notice the diminishing returns at high token counts.

In the visualization, accuracy rises with the thinking-token budget (0 to 20,000 tokens) but flattens quickly: at roughly 1,000 thinking tokens the example already reaches ~84% accuracy, at about 1.5× the cost and ~7 s of latency.
The Test-Time Compute Trade-off: Reasoning models like o1 and DeepSeek-R1 can "think longer" by generating more internal reasoning tokens before answering. This creates a new scaling dimension: instead of just training longer, you can also think longer at inference time. The curve shows diminishing returns—the first 1,000 thinking tokens provide most of the benefit.

Open-Source Reasoning: DeepSeek-R1

DeepSeek-R1 (January 2025) was a milestone for open-source AI. Its companion model, R1-Zero, demonstrated that reasoning capabilities can emerge from pure reinforcement learning—no supervised fine-tuning step required—while R1 itself adds a small cold-start fine-tuning stage for readability. The model learns to "think" through trial and error on verifiable tasks.

The architecture is elegant: 671B total parameters using Mixture of Experts, but only 37B active per token. Training cost? Around $5.5M—a fraction of GPT-4's estimated budget. The distilled variants (ranging from 1.5B to 70B parameters) make reasoning accessible on consumer hardware.

What excites me most is that this is now open. Researchers can study how reasoning emerges, what the thinking traces look like, and how to improve them. The proprietary advantage of reasoning models just narrowed significantly.

Chapter 15

State Space Models — Beyond the Transformer

For over six years, Transformers were the only game in town for serious language modeling. Mamba, introduced in late 2023, changed that. For the first time, we have a fundamentally different architecture that can match Transformer quality on language tasks.

The Efficiency Promise

Remember the quadratic attention problem from Part II? Standard attention is O(n²) in both compute and memory. At 1M tokens, that's a trillion attention elements per layer. Even with Flash Attention and sparse patterns, we're fighting the architecture.

State Space Models take a different approach. Instead of computing attention over all positions, they maintain a fixed-size hidden state that gets updated as tokens are processed. Training complexity: O(n). Inference memory: O(1) per token. The state acts as a "compressed summary" of all previous tokens.
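The recurrence itself is tiny. Here is a minimal NumPy sketch of a classic (non-selective) state space layer—shapes and values are illustrative, not taken from any published model—showing why memory stays constant: each token only updates a fixed-size state vector.

```python
import numpy as np

def ssm_scan(x, A, B, C, D):
    """Run a linear state space recurrence over a sequence.
    x: (seq_len, d_in) inputs; h: (d_state,) fixed-size hidden state.
    h_t = A @ h_{t-1} + B @ x_t,   y_t = C @ h_t + D @ x_t
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                      # O(n) in sequence length
        h = A @ h + B @ x_t            # state update: constant-size memory
        y_t = C @ h + D @ x_t          # readout
        ys.append(y_t)
    return np.stack(ys)

# Illustrative shapes: 1,000 tokens, 4 input channels, 16-dimensional state.
seq_len, d_in, d_state = 1000, 4, 16
rng = np.random.default_rng(0)
A = np.eye(d_state) * 0.95            # fixed decay: every token is compressed the same way
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_in, d_state)) * 0.1
D = np.eye(d_in)
y = ssm_scan(rng.normal(size=(seq_len, d_in)), A, B, C, D)
print(y.shape)  # (1000, 4) — produced with a 16-number state, not a growing KV-cache
```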

Selective State Spaces: The Mamba Innovation

Classic SSMs have a problem: the state transition is fixed (input-independent). This means they compress everything the same way, regardless of content. Mamba introduces selective state spaces: the state transition depends on the input, allowing the model to selectively remember relevant information and forget irrelevant details.

The results speak for themselves: Mamba matches equivalently-sized Transformers on language modeling benchmarks while being significantly faster. On byte-level sequences (where sequences are ~4× longer than BPE), Mamba dramatically outperforms Transformers—the efficiency advantage compounds with sequence length.
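And a rough sketch of where the selectivity enters, continuing the example above. This is a deliberate simplification of Mamba's actual parameterization (single head, scalar step size, no hardware-aware scan): the step size delta is computed from the input, so each token decides how strongly it overwrites the state.

```python
import numpy as np

def selective_scan(x, W_delta, B, C):
    """Input-dependent state update (simplified, single head).
    delta_t = softplus(W_delta @ x_t) controls how strongly token t
    overwrites the state: large delta -> remember x_t, small -> mostly keep h."""
    d_state = B.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        delta = np.log1p(np.exp(W_delta @ x_t))    # softplus, one step size per token
        a_t = np.exp(-delta)                       # input-dependent decay of the old state
        h = a_t * h + (1.0 - a_t) * (B @ x_t)      # selective write into the state
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(1)
d_in, d_state = 4, 16
x = rng.normal(size=(1000, d_in))
W_delta = rng.normal(size=(d_in,))                 # maps each token to its own step size
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_in, d_state)) * 0.1
print(selective_scan(x, W_delta, B, C).shape)      # (1000, 4)
```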

Interactive: SSM vs Transformer Architecture

Compare how Transformers and SSMs scale with sequence length. Adjust the sequence length to see relative compute and memory requirements.

In the visualization: with a Transformer, every token attends to every other token; with an SSM, information is compressed into a fixed-size state that is carried forward. Capability ratings compare Transformer, Mamba (SSM), and hybrid (Jamba) architectures on long-context versus local-context tasks.
Key Insight: Transformers excel at precise local attention but struggle with long sequences due to O(n²) scaling. SSMs like Mamba use O(1) memory during inference via compressed state, enabling million-token contexts. Hybrid architectures (Jamba, etc.) combine both: SSM layers for efficiency, attention layers for precise retrieval.

The Hybrid Future

Transformers aren't going anywhere. They're exceptionally good at precise local attention—knowing exactly which token 50 positions back is relevant to the current token. SSMs excel at efficient long-range context but can lose precision on specific lookups.

Hybrid architectures like Jamba combine both: SSM layers handle the bulk of the sequence efficiently, while periodic attention layers provide the precision needed for specific retrieval. IBM's Granite 4.0 and AI21's Jamba series use this approach.

Mamba-2 (2024) made an even more interesting discovery: there's a deep mathematical connection between SSMs and attention. They're not as different as they appear—just different parameterizations of similar operations. This "state space duality" suggests we're just beginning to understand the design space.

My take: we're not in the post-Transformer era yet. But for the first time, we're in the post-only-Transformers era. That's significant.

Chapter 16

The Efficiency Frontier

Here's an economic reality: training costs are one-time, but inference costs are forever. Every query to ChatGPT, every Claude conversation, every GitHub Copilot suggestion—they all cost compute. As LLMs become ubiquitous, quantization and efficiency matter more than ever.

The Quantization Revolution

When I started working with neural networks, FP32 was standard. Then FP16 halved memory requirements with negligible quality loss. Now we're pushing further: FP8 is becoming the standard for inference (DeepSeek-V3 trains in FP8), INT4 enables running 70B models on consumer GPUs, and researchers are exploring 2-bit and even 1-bit precision.

FP4 training crossed a milestone in 2025: a 7B model trained entirely in FP4 precision matched the quality of a BF16 baseline. The key innovations were a differentiable quantization estimator and careful handling of activation outliers. This matters because FP4 training means 4× less memory pressure during training—larger models on the same hardware.
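A back-of-the-envelope calculator makes the stakes concrete. It matches the weight-memory formula in the Quick Reference (params × bytes_per_param) and ignores activations, KV-cache, and runtime overhead, so treat the results as lower bounds:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0,
                   "int8": 1.0, "int4": 0.5, "fp4": 0.5, "ternary": 0.2}  # ~1.58 bits

def weight_memory_gb(n_params_billion: float, dtype: str) -> float:
    """Approximate memory just to hold the weights, in GB."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ["fp16", "fp8", "int4", "ternary"]:
    gb = weight_memory_gb(70, dtype)
    print(f"70B @ {dtype:>7}: {gb:6.1f} GB  (~{gb / 24:.1f}x RTX 4090 24GB)")
# 70B weights: fp16 ≈ 140 GB, fp8/int8 ≈ 70 GB, int4 ≈ 35 GB, ~1.58-bit ≈ 14 GB
```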

Interactive: Quantization Impact Calculator

See how different quantization levels affect memory requirements, speed, and quality. Compare what hardware you need for different model sizes.

For reference, a 70B model at FP16/BF16 (16 bits per weight) needs about 140 GB just for the weights—two 80 GB H100s, or six 24 GB RTX 4090s—at baseline 1.0× speed and 100% quality retained.
For Production: INT8 or FP8 offers the best balance—2× memory reduction with negligible quality loss. Most inference frameworks (vLLM, TensorRT-LLM) support these out of the box.
For Local/Edge: INT4 (GPTQ, AWQ, GGUF) enables running 70B models on consumer hardware. Quality loss is noticeable but acceptable for many use cases.
Emerging (2025): FP4 training is now viable—7B models trained entirely in FP4 match BF16 quality. 1-bit models (BitNet) require training from scratch but enable CPU inference.

The 1-Bit Frontier: BitNet

BitNet research asks: how far can we push? The "Era of 1-bit LLMs" paper showed that models with 1.58-bit weights (ternary: -1, 0, +1) can match full-precision quality—but only when trained from scratch in that format. You can't just quantize an existing model to 1-bit and expect it to work.

Why does this matter? 1-bit weights mean matrix multiplication becomes addition. No need for expensive floating-point units. In theory, 1-bit models could run efficiently on CPUs, even on edge devices. The catch: you need to train the model from scratch in this format, which few have the resources to do.
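A small NumPy sketch shows why. With weights restricted to {-1, 0, +1}, a dot product reduces to adding some activations, subtracting others, and skipping the rest—purely illustrative, not the actual BitNet kernel:

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """y = W @ x where W contains only -1, 0, +1: pure adds and subtracts."""
    y = np.zeros(W_ternary.shape[0])
    for i, row in enumerate(W_ternary):
        y[i] = x[row == 1].sum() - x[row == -1].sum()   # zeros are simply skipped
    return y

rng = np.random.default_rng(0)
W = rng.choice([-1, 0, 1], size=(8, 16))   # "1.58-bit" weights: log2(3) ≈ 1.58
x = rng.normal(size=16)
assert np.allclose(ternary_matvec(W, x), W @ x)   # same result, no multiplications
```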

Distillation at Scale

Knowledge distillation—training a small model to mimic a large one—has become a standard technique. DeepSeek-R1's distilled variants are a perfect example: they took a 671B reasoning model and created 1.5B, 7B, 14B, and 70B versions that retain much of the reasoning capability.

The approach: generate 800,000 high-quality reasoning traces from the large model, then fine-tune smaller models (LLaMA 3.1, Qwen 2.5) on this data. The student doesn't just learn the answers—it learns the reasoning process. This is how we democratize capability: train one large model, distill to many small ones.
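A sketch of that recipe using the Hugging Face transformers API—the model names are placeholders and the hyperparameters are illustrative, since the exact R1 distillation pipeline isn't public at this level of detail: sample reasoning traces from the teacher, then fine-tune the student on them with the ordinary next-token loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Step 1: generate reasoning traces with a (placeholder) teacher model ---
teacher_name = "teacher-reasoning-model"          # placeholder model id
ttok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.bfloat16)

prompts = ["Prove that the sum of two even numbers is even."]   # in practice: ~800k problems
traces = []
for p in prompts:
    ids = ttok(p, return_tensors="pt").input_ids
    out = teacher.generate(ids, max_new_tokens=1024, do_sample=True, temperature=0.7)
    traces.append(ttok.decode(out[0], skip_special_tokens=True))  # prompt + thinking + answer

# --- Step 2: fine-tune a small student on the traces with the usual LM loss ---
student_name = "student-small-model"              # placeholder id (e.g. a 7B base model)
stok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

for text in traces:                               # real runs: batched, filtered, multi-epoch
    batch = stok(text, return_tensors="pt", truncation=True, max_length=4096)
    loss = student(**batch, labels=batch["input_ids"]).loss   # next-token prediction on the trace
    loss.backward()
    opt.step()
    opt.zero_grad()
```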

Combining techniques—distill a 70B to 13B, quantize to 4-bit, apply 20% sparsity—can yield 85% size reduction with less than 5% accuracy loss. The efficiency gains compound.

Chapter 17

Multimodal Convergence

LLMs started as text models. By 2025, that framing feels outdated. The frontier models—Gemini, GPT-4o, Claude 3—process text, images, audio, and video through unified architectures. We're witnessing the convergence of perception and language.

Native vs. Fusion Architectures

There are two schools of thought for building multimodal models. Encoder fusion (LLaVA, BLIP-2, MiniCPM-V) takes pretrained components—a vision encoder like ViT, a language model—and connects them with a projection layer. Quick to build, easy to iterate, and you can swap components.

Native multimodal training (Gemini, GPT-4o) trains a single model on all modalities from scratch. This is expensive—orders of magnitude more compute—but produces tighter integration. Cross-modal reasoning is more natural because the model learned these relationships during pretraining, not through a shallow projection layer.

The benchmark gap is narrowing. MiniCPM-V, an 8B encoder-fusion model, now outperforms GPT-4V on several benchmarks while running on mobile phones. But native models still lead on complex reasoning that requires deep cross-modal understanding.

Interactive: Multimodal Architecture Explorer

Compare different approaches to multimodal AI. See which modalities each architecture supports and their trade-offs.

Native multimodal architecture (examples: Gemini, GPT-4o, Claude 3): text, vision, audio, and video inputs flow through a unified tokenizer into a shared Transformer, which produces a unified output via a multi-head output layer.

Advantages

  • Deep cross-modal understanding
  • Unified representations
  • Better reasoning

Trade-offs

  • Expensive to train
  • Complex data pipeline
  • Harder to iterate
2025 Trend: Native multimodal models (Gemini, GPT-4o, Claude 3) dominate benchmarks by training jointly on all modalities from scratch. However, encoder fusion approaches (LLaVA-style) remain popular for fine-tuning since you can swap components independently. The gap is narrowing as techniques like visual tokenization improve.

The 2025 Landscape

The performance gaps between frontier models are shrinking. Gemini 3 Pro leads reasoning benchmarks (91.9% GPQA Diamond) and offers the largest context window (1M tokens). GPT-5.1 leads multimodal understanding (84.2% MMMU). Claude 4.5 excels at long-form analysis and alignment.

The practical takeaway: organizations increasingly deploy multiple models, routing queries to the optimal one per task. The era of "one model to rule them all" is giving way to intelligent orchestration.

Video Understanding: The Next Frontier

Video is the final frontier. Unlike images, video requires understanding temporal relationships—cause and effect, motion, narrative structure. The data requirements are immense (video is orders of magnitude larger than text), and the compute requirements scale accordingly.

Early video LLMs sample frames (treating video as a sequence of images), but true video understanding requires modeling motion and time. Gemini's ability to process hours of video hints at what's coming. The applications—video search, automated editing, security analysis—are enormous.

Chapter 18

The Agentic Era

The most significant shift in 2024-2025 wasn't a new architecture or training technique—it was a change in how we use models. LLMs evolved from chatbots that respond to prompts into agentic systems that can take actions.

From Chatbots to Agents

A chatbot takes input and produces output. An agent operates in a loop: observe the environment, think about what to do, take an action, observe the result, repeat. The difference seems subtle, but the implications are profound.

Function calling was the enabling technology. Modern LLMs can decide when to invoke external tools—web search, code execution, database queries, API calls—and incorporate the results into their reasoning. The model doesn't just talk about doing things; it actually does them.
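The whole loop fits in a screen of Python. In the sketch below, chat() is a scripted stand-in for a real model call (the message and tool-call shapes loosely mirror OpenAI- and Anthropic-style function calling, but nothing here is a real API); what matters is that the model, not the harness, decides when a tool is needed.

```python
import json

def get_weather(city: str) -> dict:
    # Toy tool; a real agent would hit a weather API or an MCP server here.
    return {"city": city, "temp_c": 18, "conditions": "cloudy"}

TOOLS = {"get_weather": get_weather}

def chat(messages: list[dict], tools: dict) -> dict:
    """Stand-in for a real model call. Scripted here: request the weather
    first, then compose a final answer from the tool result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "get_weather", "arguments": {"city": "Tokyo"}}
    obs = json.loads([m for m in messages if m["role"] == "tool"][-1]["content"])
    return {"type": "final",
            "content": f"It's {obs['conditions']} and {obs['temp_c']}°C in {obs['city']}."}

def run_agent(user_query: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):                                  # observe -> think -> act loop
        reply = chat(messages, TOOLS)                           # think
        if reply["type"] == "final":                            # model decides it has enough
            return reply["content"]
        result = TOOLS[reply["name"]](**reply["arguments"])     # act: invoke the chosen tool
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": json.dumps(result)})   # observe the result
    return "Stopped: step budget exhausted."

print(run_agent("What's the weather in Tokyo?"))
```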

Model Context Protocol (MCP)

Model Context Protocol emerged as the standardization layer. Introduced by Anthropic in November 2024, adopted by OpenAI and Google by March 2025, and donated to the Linux Foundation by December 2025, MCP provides a universal interface for AI-tool integration.

Before MCP, every tool integration was custom—specific prompts, specific output formats, specific error handling. MCP standardizes tool discovery (agents query what's available at runtime), invocation format, and security boundaries. Thousands of MCP servers now exist, from file systems to databases to cloud APIs.

Interactive: Agent Tool Calling Demo

Watch how an agent reasons through a multi-step task, deciding which tools to call and incorporating results into its reasoning.

Example tools available via MCP: get_weather, convert_temperature, search_database, execute_code.

Example user query: "What's the weather in Tokyo and convert the temperature to Fahrenheit?"

The agent loop: Observe → Think → Act → Observe again. Agents iterate until they can provide a final response.
The Agentic Paradigm: Modern LLMs aren't just chatbots—they're reasoning engines that can invoke tools, execute code, and orchestrate complex workflows. The key insight: the model decides when and what to call based on its reasoning. This enables AI systems that can actually do things, not just talk about them.

Code Execution: The Efficiency Breakthrough

An interesting development: LLMs are better at writing code to use tools than calling tools directly. Anthropic's research showed that having the model write Python code to orchestrate MCP calls reduced token usage by 98.7% compared to direct tool calling. The model writes a small program, the system executes it, and only the result comes back.

This has profound implications for context efficiency. Instead of maintaining thousands of tokens of tool schemas in the context window, the model loads tools on demand through code. The context stays focused on the actual problem.
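For contrast, here is the shape of the small program the model might emit under that pattern. The mcp.call(...) helper is hypothetical—it is not the real MCP client API—and a toy stub stands in for it so the sketch runs; the key property is that intermediate payloads stay inside the sandbox and only the final summary returns to the context.

```python
# Toy stand-in for an MCP client so the sketch runs; the real client API differs.
class ToyMCP:
    def call(self, server: str, tool: str, **args) -> dict:
        if tool == "get_weather":
            return {"temp_c": 18, "conditions": "cloudy"}
        if tool == "convert_temperature":
            return {"value": args["value"] * 9 / 5 + 32}
        raise ValueError(tool)

mcp = ToyMCP()

# --- the part the model writes; the harness executes it and returns only `result` ---
weather = mcp.call("weather-server", "get_weather", city="Tokyo")
temp = mcp.call("unit-server", "convert_temperature",
                value=weather["temp_c"], from_unit="C", to_unit="F")
result = f"Tokyo: {weather['conditions']}, {temp['value']:.0f}°F"
print(result)   # only this one-line summary goes back into the model's context
```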

The Reliability Challenge

Agentic systems face a reliability problem. If each step has 95% accuracy, a 20-step task has only 36% end-to-end success. This is why reasoning models matter for agents: better reasoning → fewer errors per step → higher compound reliability.
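The compounding is worth internalizing; a few lines make it visible:

```python
# End-to-end success when every step must succeed independently.
for per_step in (0.95, 0.99, 0.999):
    for steps in (5, 20, 100):
        print(f"per-step {per_step:.3f}, {steps:3d} steps -> "
              f"end-to-end {per_step ** steps:.1%}")
# e.g. 0.95^20 ≈ 36%, 0.99^100 ≈ 37%, 0.999^100 ≈ 90%
```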

The security implications are also serious. Prompt injection attacks can hijack an agent to call unintended tools. MCP includes security boundaries, but the standard advice remains: always keep a human in the loop for consequential actions.

Chapter 19

Looking Forward

The Open vs. Closed Divide

2025 was the year open-source closed the gap. DeepSeek-R1 matches o1 on reasoning benchmarks. LLaMA 3.1 405B rivals GPT-4. Qwen 2.5 models dominate various specialized tasks. The "GPT-4 moat" that seemed unassailable in 2023 looks much narrower now.

The dynamics have shifted. Proprietary labs still lead on infrastructure (scale, RLHF, multimodal training), but open models enable research that was previously impossible. We can actually study how reasoning emerges, how attention patterns develop, how knowledge is stored. Science needs reproducibility, and open models provide it.

What I'm Watching

Inference-time scaling: The test-time compute paradigm is still young. Better reasoning algorithms, more efficient tree search, smarter allocation of thinking budget—there's significant room to improve.

Architecture diversity: SSMs, Transformers, hybrids, and architectures we haven't invented yet. The design space is larger than we thought. I expect 2026 to bring new contenders.

Efficiency at the limit: 1-bit models running on CPUs, FP4 training becoming standard, aggressive pruning and sparsity. The goal: GPT-4 class capability on a laptop.

Agentic reliability: Right now, agents are impressive demos but unreliable in production. The companies that solve compound reliability—making 100-step tasks work consistently—will define the next wave of applications.

Final Thoughts

I started programming on a Commodore 64, typing in BASIC listings from magazines. The idea that I'd one day be explaining how artificial systems can reason, use tools, and generate coherent thought would have seemed like science fiction.

Yet here we are. The core techniques—matrix multiplications, softmax, layer norm—would be familiar to any linear algebra student. The magic is in the combination, the scale, and increasingly, the inference-time algorithms that turn raw capability into reliable reasoning.

What excites me most is that we're still in early days. The "Attention Is All You Need" paper is less than a decade old. Mamba is from 2023. Reasoning models emerged in 2024. The techniques I've described here will likely seem primitive in another decade.

The fundamentals, though—understanding the memory hierarchy, the parallelism trade-offs, the mathematical foundations—those will remain useful. Learn the principles, not just the current instantiations.

And keep building. That's still the best way to really understand.

Quick Reference

Key Formulas

Attention:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
KV-Cache Memory:
2 × layers × kv_heads × head_dim × seq_len × batch × bytes
Model Weights:
params × bytes_per_param (FP16=2, INT8=1, INT4=0.5)
SSM State Update:
h(t) = Ah(t-1) + Bx(t), y(t) = Ch(t) + Dx(t)
Chinchilla Optimal:
~20 tokens per parameter for compute-optimal training
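A worked example of the KV-cache formula, using LLaMA-3.1-8B-like dimensions (32 layers, 8 KV heads via GQA, head dimension 128—assumed here for illustration) and an FP16 cache:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2x for keys and values, stored at every layer for every cached token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed LLaMA-3.1-8B-like config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
for seq_len in (8_192, 131_072):
    gb = kv_cache_bytes(32, 8, 128, seq_len) / 1e9
    print(f"{seq_len:>7} tokens -> {gb:5.1f} GB of KV-cache per sequence")
# ~1.1 GB at 8K tokens, ~17 GB at the full 128K context
```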

2025 Model Architectures

Model              Total Params   Active   Architecture            Context
LLaMA 3.1 8B       8B             8B       Dense Transformer       128K
LLaMA 3.1 405B     405B           405B     Dense Transformer       128K
Mixtral 8x22B      141B           39B      MoE (8 experts)         64K
DeepSeek-V3        671B           37B      MoE (256 experts)       128K
DeepSeek-R1        671B           37B      MoE + Reasoning         128K
Jamba 1.5          398B           94B      Hybrid SSM+Attention    256K
Mamba-2 7B         7B             7B       Pure SSM                ∞*

* SSMs have O(1) memory during inference, theoretically unlimited context

Architecture Complexity

Architecture           Training                      Inference (per token)   Memory
Standard Transformer   O(n²)                         O(n)                    O(n) KV-cache
Flash Attention        O(n²) compute, O(n) memory    O(n)                    O(n) KV-cache
Sliding Window         O(n × w)                      O(w)                    O(w) fixed cache
SSM (Mamba)            O(n)                          O(1)                    O(1) state
Hybrid (Jamba)         O(n)                          O(n)                    O(n) reduced

Series Summary

Part I: Setting the Stage

RNNs, vanishing gradients, the attention mechanism, self-attention, multi-head attention, positional encoding, and the training objective.

Part II: From Transformer to Modern LLMs

Decoder-only architectures, tokenization, RoPE/ALiBi, GQA/MQA, Mixture of Experts, Flash Attention, distributed training, inference optimization, and long context handling.

Part III: What's Emerging

Reasoning models and test-time compute, State Space Models (Mamba), efficiency innovations (FP4, BitNet, distillation), multimodal convergence, agentic AI and MCP, and the future of LLM architectures.