LLM Deep Dive • Part III of III
What's Emerging
The cutting edge of LLM development: reasoning models, state space architectures, extreme efficiency, and the rise of agentic AI. Plus a comprehensive reference guide.
The Reasoning Revolution
In September 2024, something shifted. OpenAI released o1, and suddenly we had a "whole new knob" to turn—one that traded inference compute for capability. The field hasn't been the same since.
The core insight behind test-time compute scaling is deceptively simple: let the model think longer before answering. Instead of immediately producing a response, the model generates internal reasoning traces—sometimes thousands of tokens of "thinking" before producing a final answer.
Chain-of-Thought and Beyond
Chain-of-thought prompting showed us that intermediate reasoning improves results. Tell a model to "think step by step," and accuracy on math problems jumps significantly. But that was prompting—we were asking the model to show its work.
What o1 and DeepSeek-R1 demonstrated is that you can train models specifically for extended reasoning using reinforcement learning from verifiable rewards (RLVR). The model learns when to think more, when to backtrack, and when it has enough evidence to commit to an answer.
Tree-of-thought takes this further: instead of a linear chain, the model explores multiple reasoning paths in parallel. It's like the difference between following a single trail through the woods versus exploring several forks and choosing the best route.
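To make the contrast concrete, here's a toy sketch of both strategies. The `propose_steps` and `score_state` functions are stand-ins for model calls (a generator proposing candidate next steps, a verifier scoring partial solutions); the stubs exist only so the control flow runs.

```python
# Toy sketch of the two strategies. `propose_steps` and `score_state` stand in
# for model calls; the stubs below just make the control flow runnable.

def propose_steps(state, k):
    """Stub for 'ask the model for k candidate next reasoning steps'."""
    return [f"{state} -> option{i}" for i in range(k)]

def score_state(state):
    """Stub for 'ask a verifier how promising this partial solution looks'."""
    return len(state) % 7  # placeholder score

def chain_of_thought(problem, depth=4):
    """Follow a single linear trail of reasoning steps."""
    state = problem
    for _ in range(depth):
        state = propose_steps(state, k=1)[0]
    return state

def tree_of_thought(problem, depth=4, branch=3, beam=2):
    """At each step, explore several forks and keep only the most promising few."""
    frontier = [problem]
    for _ in range(depth):
        candidates = [s for state in frontier for s in propose_steps(state, k=branch)]
        frontier = sorted(candidates, key=score_state, reverse=True)[:beam]
    return max(frontier, key=score_state)

print(chain_of_thought("train average-speed problem"))
print(tree_of_thought("train average-speed problem"))
```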
Interactive: Reasoning Strategies Compared
Compare how different reasoning approaches affect accuracy and token usage. Animate the thinking process to see each strategy in action.
Example problem: If a train travels 120 miles in 2 hours, then stops for 30 minutes, then travels 90 miles in 1.5 hours, what is its average speed for the entire journey?

Chain-of-thought trace: Let me break this down step by step.
- First segment: 120 miles in 2 hours
- Second segment: 90 miles in 1.5 hours
- Rest period: 30 minutes = 0.5 hours
- Total distance = 120 + 90 = 210 miles
- Total time = 2 + 0.5 + 1.5 = 4 hours
- Average speed = 210 ÷ 4 = 52.5 mph

The average speed for the entire journey is 52.5 mph.
The Test-Time Compute Trade-off
Here's what I find fascinating: we now have two scaling laws to play with. The original Chinchilla-era scaling law told us how much data to use for a given amount of training compute. The new inference-time scaling law tells us how much thinking to do for a given problem difficulty.
The trade-offs are real. A reasoning model might take 10-30 seconds to answer a complex question that a standard model handles in 1 second. But on difficult problems—competition math, formal proofs, complex code—the accuracy gains are dramatic. DeepSeek-R1 achieves ~80% on AIME (American Invitational Mathematics Examination) problems. That would put it in the top tier of human competitors.
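One simple way to spend a larger thinking budget, short of training a dedicated reasoning model, is self-consistency: sample several complete solutions and majority-vote their final answers. A minimal sketch, with `solve_once` stubbed in place of a real model call:

```python
import random
from collections import Counter

def solve_once(problem):
    """Stand-in for one sampled model solution; this stub is right 60% of the time."""
    return "52.5 mph" if random.random() < 0.6 else "60 mph"

def solve_with_budget(problem, n_samples=16):
    """Self-consistency: spend extra inference compute by sampling several full
    solutions and majority-voting their final answers."""
    answers = [solve_once(problem) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples

# More samples raise the chance the majority answer is correct, at higher cost
# and with diminishing returns.
print(solve_with_budget("train average-speed problem", n_samples=1))
print(solve_with_budget("train average-speed problem", n_samples=32))
```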
Interactive: Test-Time Compute Scaling
Adjust the "thinking budget" to see how accuracy improves with more inference compute. Notice the diminishing returns at high token counts.
Open-Source Reasoning: DeepSeek-R1
DeepSeek-R1 (January 2025) was a milestone for open-source AI. Its precursor, R1-Zero, demonstrated that reasoning capabilities can emerge from pure reinforcement learning—no supervised fine-tuning step required—while R1 itself adds a small "cold-start" fine-tuning stage to make the thinking traces readable. The model learns to "think" through trial and error on verifiable tasks.
The architecture is elegant: 671B total parameters using Mixture of Experts, but only 37B active per token. Training cost? The DeepSeek-V3 base model it builds on reportedly cost around $5.5M to train—a fraction of GPT-4's estimated budget. The distilled variants (ranging from 1.5B to 70B parameters) make reasoning accessible on consumer hardware.
What excites me most is that this is now open. Researchers can study how reasoning emerges, what the thinking traces look like, and how to improve them. The proprietary advantage of reasoning models just narrowed significantly.
State Space Models — Beyond the Transformer
For seven years, Transformers have been the only game in town for serious language modeling. Mamba, introduced in late 2023, changed that. For the first time, we have a fundamentally different architecture that can match Transformer quality on language tasks.
The Efficiency Promise
Remember the quadratic attention problem from Part II? Standard attention is O(n²) in both compute and memory. At 1M tokens, that's a trillion attention elements per layer. Even with Flash Attention and sparse patterns, we're fighting the architecture.
State Space Models take a different approach. Instead of computing attention over all positions, they maintain a fixed-size hidden state that gets updated as tokens are processed. Training complexity: O(n). Inference memory: O(1) per token. The state acts as a "compressed summary" of all previous tokens.
Selective State Spaces: The Mamba Innovation
Classic SSMs have a problem: the state transition is fixed (input-independent). This means they compress everything the same way, regardless of content. Mamba introduces selective state spaces: the state transition depends on the input, allowing the model to selectively remember relevant information and forget irrelevant details.
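Here's a toy version of that recurrence, simplified far beyond real Mamba (which uses structured state matrices and a hardware-aware parallel scan). The point is the shape of the computation: a fixed-size state updated once per token, with an update gate that depends on the input.

```python
import numpy as np

def selective_ssm(tokens, d_state=16):
    """Toy selective state-space recurrence (heavily simplified vs. real Mamba).
    The hidden state has a fixed size, so memory per token is O(1) and a full
    pass over n tokens is O(n)."""
    d_model = tokens.shape[1]
    rng = np.random.default_rng(0)
    W_delta = rng.normal(size=(d_model, 1)) * 0.1   # how strongly to update, per token
    B = rng.normal(size=(d_model, d_state)) * 0.1   # how the input writes into the state
    C = rng.normal(size=(d_state, d_model)) * 0.1   # how the state is read out

    h = np.zeros(d_state)
    outputs = []
    for x in tokens:                                  # single pass: O(n)
        delta = 1.0 / (1.0 + np.exp(-(x @ W_delta)))  # selectivity: gate depends on the input
        h = (1.0 - delta) * h + delta * (x @ B)       # forget a little, write a little
        outputs.append(h @ C)
    return np.stack(outputs)

seq = np.random.default_rng(1).normal(size=(128, 32))  # 128 tokens, d_model = 32
print(selective_ssm(seq).shape)                        # (128, 32), with only 16 numbers of state
```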
The results speak for themselves: Mamba matches equivalently-sized Transformers on language modeling benchmarks while being significantly faster. On byte-level sequences (where sequences are ~4× longer than BPE), Mamba dramatically outperforms Transformers—the efficiency advantage compounds with sequence length.
Interactive: SSM vs Transformer Architecture
Compare how Transformers and SSMs scale with sequence length. Adjust the sequence length to see relative compute and memory requirements.
The Hybrid Future
Transformers aren't going anywhere. They're exceptionally good at precise local attention—knowing exactly which token 50 positions back is relevant to the current token. SSMs excel at efficient long-range context but can lose precision on specific lookups.
Hybrid architectures like Jamba combine both: SSM layers handle the bulk of the sequence efficiently, while periodic attention layers provide the precision needed for specific retrieval. IBM's Granite 4.0 and AI21's Jamba series use this approach.
Mamba-2 (2024) made an even more interesting discovery: there's a deep mathematical connection between SSMs and attention. They're not as different as they appear—just different parameterizations of similar operations. This "state space duality" suggests we're just beginning to understand the design space.
My take: we're not in the post-Transformer era yet. But for the first time, we're in the post-only-Transformers era. That's significant.
The Efficiency Frontier
Here's an economic reality: training costs are one-time, but inference costs are forever. Every query to ChatGPT, every Claude conversation, every GitHub Copilot suggestion—they all cost compute. As LLMs become ubiquitous, quantization and efficiency matter more than ever.
The Quantization Revolution
When I started working with neural networks, FP32 was standard. Then FP16 halved memory requirements with negligible quality loss. Now we're pushing further: FP8 is becoming the standard for inference (DeepSeek-V3 trains in FP8), INT4 enables running 70B models on consumer GPUs, and researchers are exploring 2-bit and even 1-bit precision.
FP4 training crossed a milestone in 2025: a 7B model trained entirely in FP4 precision matched the quality of a BF16 baseline. The key innovations were a differentiable quantization estimator and careful handling of activation outliers. This matters because FP4 training means 4× less memory pressure during training—larger models on the same hardware.
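The memory arithmetic is easy to sanity-check. This back-of-envelope calculator counts weight storage only, ignoring activations, KV-cache, and framework overhead:

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Back-of-envelope weight storage only (ignores activations, KV-cache, overhead)."""
    return params_billions * bits_per_weight / 8  # 1e9 params and 1e9 bytes/GB cancel out

for label, bits in [("FP16/BF16", 16), ("FP8", 8), ("INT4", 4), ("ternary ~1.58-bit", 1.58)]:
    print(f"70B model @ {label:>17}: {weight_memory_gb(70, bits):6.1f} GB")

# FP16: ~140 GB (multi-GPU). INT4: ~35 GB (a single 48 GB card). Ternary: ~14 GB.
```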
Interactive: Quantization Impact Calculator
See how different quantization levels affect memory requirements, speed, and quality. Compare what hardware you need for different model sizes.
The 1-Bit Frontier: BitNet
BitNet research asks: how far can we push? The "Era of 1-bit LLMs" paper showed that models with 1.58-bit weights (ternary: -1, 0, +1) can match full-precision quality—but only when trained from scratch in that format. You can't just quantize an existing model to 1-bit and expect it to work.
Why does this matter? Ternary weights mean matrix multiplication reduces to additions and subtractions. No need for expensive floating-point multiplies. In theory, 1-bit models could run efficiently on CPUs, even on edge devices. The catch, as noted above, is the from-scratch training requirement—few organizations have the resources to do it.
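A toy example shows why the multiplications disappear. This is not the actual BitNet kernel, just the core idea: with weights restricted to -1, 0, and +1, each output element is a sum and difference of inputs.

```python
import numpy as np

def ternary_matvec(W, x):
    """Toy BitNet-style matrix-vector product: with weights in {-1, 0, +1},
    every output element is just a sum and a difference of inputs."""
    out = np.zeros(W.shape[0])
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # add, subtract, skip zeros
    return out

rng = np.random.default_rng(0)
W = rng.choice([-1, 0, 1], size=(4, 8))
x = rng.normal(size=8)
print(np.allclose(ternary_matvec(W, x), W @ x))   # True: same result, no multiplications
```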
Distillation at Scale
Knowledge distillation—training a small model to mimic a large one—has become a standard technique. DeepSeek-R1's distilled variants are a perfect example: they took a 671B reasoning model and created 1.5B, 7B, 14B, and 70B versions that retain much of the reasoning capability.
The approach: generate 800,000 high-quality reasoning traces from the large model, then fine-tune smaller models (LLaMA 3.1, Qwen 2.5) on this data. The student doesn't just learn the answers—it learns the reasoning process. This is how we democratize capability: train one large model, distill to many small ones.
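In outline, the recipe looks something like the sketch below. The names and interfaces are illustrative, not DeepSeek's code; the point is that the student's training data is ordinary (prompt, reasoning + answer) text, so plain supervised fine-tuning transfers the reasoning style.

```python
# Illustrative outline of trace-based distillation (names are hypothetical,
# not DeepSeek's code). A verifier keeps only traces whose final answer checks out.

def build_distillation_set(teacher, prompts, verifier):
    dataset = []
    for prompt in prompts:
        trace = teacher.generate(prompt, include_reasoning=True)   # full thinking + answer
        if verifier(prompt, trace.final_answer):                   # verifiable tasks only
            dataset.append({"prompt": prompt,
                            "completion": trace.reasoning + trace.final_answer})
    return dataset

# The student (e.g. an off-the-shelf 7B base model) is then fine-tuned with
# ordinary next-token prediction on these pairs, so it imitates the teacher's
# reasoning process rather than just memorizing answers.
```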
Combining techniques—distill a 70B to 13B, quantize to 4-bit, apply 20% sparsity—can yield 85% size reduction with less than 5% accuracy loss. The efficiency gains compound.
Multimodal Convergence
LLMs started as text models. By 2025, that framing feels outdated. The frontier models—Gemini, GPT-4o, Claude 3—process text, images, audio, and video through unified architectures. We're witnessing the convergence of perception and language.
Native vs. Fusion Architectures
There are two schools of thought for building multimodal models. Encoder fusion (LLaVA, BLIP-2, MiniCPM-V) takes pretrained components—a vision encoder like ViT, a language model—and connects them with a projection layer. Quick to build, easy to iterate, and you can swap components.
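A minimal sketch of that wiring, assuming a Hugging Face-style language model that accepts precomputed embeddings via `inputs_embeds` (simplified well beyond LLaVA itself):

```python
import torch
import torch.nn as nn

class EncoderFusionVLM(nn.Module):
    """Toy encoder-fusion wiring: project frozen vision features into the LLM's
    embedding space and prepend them to the text embeddings."""

    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder                 # e.g. a pretrained ViT, frozen
        self.language_model = language_model                 # a pretrained decoder-only LLM
        self.projection = nn.Linear(vision_dim, text_dim)    # the only newly trained piece

    def forward(self, image, text_embeddings):
        patches = self.vision_encoder(image)                 # (batch, n_patches, vision_dim)
        image_tokens = self.projection(patches)              # (batch, n_patches, text_dim)
        fused = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.language_model(inputs_embeds=fused)      # image patches act like a prefix
```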
Native multimodal training (Gemini, GPT-4o) trains a single model on all modalities from scratch. This is expensive—orders of magnitude more compute—but produces tighter integration. Cross-modal reasoning is more natural because the model learned these relationships during pretraining, not through a shallow projection layer.
The benchmark gap is narrowing. MiniCPM-V, an 8B encoder-fusion model, now outperforms GPT-4V on several benchmarks while running on mobile phones. But native models still lead on complex reasoning that requires deep cross-modal understanding.
Interactive: Multimodal Architecture Explorer
Compare different approaches to multimodal AI. See which modalities each architecture supports and their trade-offs.
Advantages of native multimodal training
- Deep cross-modal understanding
- Unified representations
- Better reasoning
Trade-offs
- Expensive to train
- Complex data pipeline
- Harder to iterate
The 2025 Landscape
The performance gaps between frontier models are shrinking. Gemini 3 Pro leads reasoning benchmarks (91.9% GPQA Diamond) and offers the largest context window (1M tokens). GPT-5.1 leads multimodal understanding (84.2% MMMU). Claude 4.5 excels at long-form analysis and alignment.
The practical takeaway: organizations increasingly deploy multiple models, routing queries to the optimal one per task. The era of "one model to rule them all" is giving way to intelligent orchestration.
Video Understanding: The Next Frontier
Video is the final frontier. Unlike images, video requires understanding temporal relationships—cause and effect, motion, narrative structure. The data requirements are immense (video is orders of magnitude larger than text), and the compute requirements scale accordingly.
Early video LLMs sample frames (treating video as a sequence of images), but true video understanding requires modeling motion and time. Gemini's ability to process hours of video hints at what's coming. The applications—video search, automated editing, security analysis—are enormous.
The Agentic Era
The most significant shift in 2024-2025 wasn't a new architecture or training technique—it was a change in how we use models. LLMs evolved from chatbots that respond to prompts into agentic systems that can take actions.
From Chatbots to Agents
A chatbot takes input and produces output. An agent operates in a loop: observe the environment, think about what to do, take an action, observe the result, repeat. The difference seems subtle, but the implications are profound.
Function calling was the enabling technology. Modern LLMs can decide when to invoke external tools—web search, code execution, database queries, API calls—and incorporate the results into their reasoning. The model doesn't just talk about doing things; it actually does them.
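The loop itself is small. Here's a sketch with an illustrative model interface and two toy tools; real systems add schemas, retries, and sandboxing:

```python
# Sketch of the agent loop. `model.decide` and the `step` object are an
# illustrative interface, not a specific vendor API; the tools are toys.

TOOLS = {
    "web_search": lambda query: f"(search results for {query!r})",
    "calculator": lambda expression: eval(expression),  # real systems sandbox this
}

def run_agent(model, task, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model.decide(history, tools=list(TOOLS))   # think: answer now, or use a tool?
        if step.kind == "final_answer":
            return step.content                           # done
        result = TOOLS[step.tool](step.arguments)         # act
        history.append({"role": "tool", "name": step.tool,
                        "content": str(result)})          # observe, then loop again
    return "Step budget exhausted without a final answer."
```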
Model Context Protocol (MCP)
Model Context Protocol emerged as the standardization layer. Introduced by Anthropic in November 2024, adopted by OpenAI and Google by March 2025, and donated to the Linux Foundation by December 2025, MCP provides a universal interface for AI-tool integration.
Before MCP, every tool integration was custom—specific prompts, specific output formats, specific error handling. MCP standardizes tool discovery (agents query what's available at runtime), invocation format, and security boundaries. Thousands of MCP servers now exist, from file systems to databases to cloud APIs.
Interactive: Agent Tool Calling Demo
Watch how an agent reasons through a multi-step task, deciding which tools to call and incorporating results into its reasoning.
Example task: "What's the weather in Tokyo and convert the temperature to Fahrenheit?" The agent iterates, calling tools and folding their results back into its reasoning, until it can provide a final response.
Code Execution: The Efficiency Breakthrough
An interesting development: LLMs are better at writing code to use tools than calling tools directly. Anthropic's research showed that having the model write Python code to orchestrate MCP calls reduced token usage by 98.7% compared to direct tool calling. The model writes a small program, the system executes it, and only the result comes back.
This has profound implications for context efficiency. Instead of maintaining thousands of tokens of tool schemas in the context window, the model loads tools on demand through code. The context stays focused on the actual problem.
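In outline, the pattern looks like this sketch (interfaces are illustrative, not Anthropic's API): the model emits a short program, a sandbox runs it, and only the printed result re-enters the context.

```python
# Sketch of "code mode" tool use (interfaces illustrative). Rather than keeping
# every tool schema and intermediate result in the context window, the model
# writes a small program; only its final output re-enters the context.

def code_mode_step(model, sandbox, task):
    program = model.write_program(
        task,
        instructions="Import tool wrappers on demand, e.g. `from tools import weather, units`.",
    )
    output = sandbox.run(program)          # tool calls happen here, outside the context
    return model.answer(task, output)      # only the small result goes back to the model
```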
The Reliability Challenge
Agentic systems face a reliability problem. If each step has 95% accuracy, a 20-step task has only 36% end-to-end success. This is why reasoning models matter for agents: better reasoning → fewer errors per step → higher compound reliability.
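The arithmetic is worth internalizing, and it also shows why going from 95% to 99% per-step accuracy matters so much:

```python
def end_to_end_success(per_step_accuracy, n_steps):
    """Assuming independent steps, reliability compounds multiplicatively."""
    return per_step_accuracy ** n_steps

print(f"{end_to_end_success(0.95, 20):.2f}")    # ~0.36: the 36% figure above
print(f"{end_to_end_success(0.99, 20):.2f}")    # ~0.82: fewer errors per step pays off fast
print(f"{end_to_end_success(0.99, 100):.2f}")   # ~0.37: 100-step tasks need near-perfect steps
```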
The security implications are also serious. Prompt injection attacks can hijack an agent to call unintended tools. MCP includes security boundaries, but the standard advice remains: always keep a human in the loop for consequential actions.
Looking Forward
The Open vs. Closed Divide
2025 was the year open-source closed the gap. DeepSeek-R1 matches o1 on reasoning benchmarks. LLaMA 3.1 405B rivals GPT-4. Qwen 2.5 models dominate various specialized tasks. The "GPT-4 moat" that seemed unassailable in 2023 looks much narrower now.
The dynamics have shifted. Proprietary labs still lead on infrastructure (scale, RLHF, multimodal training), but open models enable research that was previously impossible. We can actually study how reasoning emerges, how attention patterns develop, how knowledge is stored. Science needs reproducibility, and open models provide it.
What I'm Watching
Inference-time scaling: The test-time compute paradigm is still young. Better reasoning algorithms, more efficient tree search, smarter allocation of thinking budget—there's significant room to improve.
Architecture diversity: SSMs, Transformers, hybrids, and architectures we haven't invented yet. The design space is larger than we thought. I expect 2026 to bring new contenders.
Efficiency at the limit: 1-bit models running on CPUs, FP4 training becoming standard, aggressive pruning and sparsity. The goal: GPT-4 class capability on a laptop.
Agentic reliability: Right now, agents are impressive demos but unreliable in production. The companies that solve compound reliability—making 100-step tasks work consistently—will define the next wave of applications.
Final Thoughts
I started programming on a Commodore 64, typing in BASIC listings from magazines. The idea that I'd one day be explaining how artificial systems can reason, use tools, and generate coherent thought would have seemed like science fiction.
Yet here we are. The core techniques—matrix multiplications, softmax, layer norm—would be familiar to any linear algebra student. The magic is in the combination, the scale, and increasingly, the inference-time algorithms that turn raw capability into reliable reasoning.
What excites me most is that we're still in early days. The "Attention Is All You Need" paper is less than a decade old. Mamba is from 2023. Reasoning models emerged in 2024. The techniques I've described here will likely seem primitive in another decade.
The fundamentals, though—understanding the memory hierarchy, the parallelism trade-offs, the mathematical foundations—those will remain useful. Learn the principles, not just the current instantiations.
And keep building. That's still the best way to really understand.
Quick Reference
2025 Model Architectures
| Model | Total Params | Active | Architecture | Context |
|---|---|---|---|---|
| LLaMA 3.1 8B | 8B | 8B | Dense Transformer | 128K |
| LLaMA 3.1 405B | 405B | 405B | Dense Transformer | 128K |
| Mixtral 8x22B | 141B | 39B | MoE (8 experts) | 64K |
| DeepSeek-V3 | 671B | 37B | MoE (256 experts) | 128K |
| DeepSeek-R1 | 671B | 37B | MoE + Reasoning | 128K |
| Jamba 1.5 | 398B | 94B | Hybrid SSM+Attention | 256K |
| Mamba-2 7B | 7B | 7B | Pure SSM | ∞* |
* SSMs have O(1) memory during inference, theoretically unlimited context
Architecture Complexity
| Architecture | Training | Inference (per token) | Memory |
|---|---|---|---|
| Standard Transformer | O(n²) | O(n) | O(n) KV-cache |
| Flash Attention | O(n²) compute, O(n) memory | O(n) | O(n) KV-cache |
| Sliding Window | O(n × w) | O(w) | O(w) fixed cache |
| SSM (Mamba) | O(n) | O(1) | O(1) state |
| Hybrid (Jamba) | O(n) | O(n) | O(n) reduced |
Series Summary
Part I: RNNs, vanishing gradients, the attention mechanism, self-attention, multi-head attention, positional encoding, and the training objective.
Part II: Decoder-only architectures, tokenization, RoPE/ALiBi, GQA/MQA, Mixture of Experts, Flash Attention, distributed training, inference optimization, and long context handling.
Part III: Reasoning models and test-time compute, State Space Models (Mamba), efficiency innovations (FP4, BitNet, distillation), multimodal convergence, agentic AI and MCP, and the future of LLM architectures.