Beyond Classical ML: A Field Guide to Frontier Models, New Terminology, and the Modern Fine-Tuning Stack
A practitioner's map to the vocabulary and techniques that emerged once "machine learning" became "the LLM era."
TL;DR
- "Frontier model" no longer means one thing. By 2026 the term has fractured into a regulatory frontier (compute-threshold triggers like 10²⁶ FLOPs), an efficiency frontier (best capability-per-parameter), and a cost frontier (best capability-per-dollar at inference).
- Pretraining scale and frontier capability have decoupled. Post-training (SFT → preference optimization → RL) and test-time compute now contribute as much practical capability as raw parameter count did in the 2020–2023 era.
- Reasoning models (LRMs) are a distinct architecture-of-use, not just bigger LLMs — they trade inference latency for accuracy via chain-of-thought and verifiable rewards (RLVR).
- Fine-tuning is a stack, not a single method. You choose: how many parameters to touch (full FT vs. PEFT), the supervision shape (SFT vs. preference-based vs. RL), and how you'll evaluate it. LoRA + SFT is the 2026 default; QLoRA when memory-constrained; DPO is the default preference-tuning step; GRPO/RLVR is reserved for reasoning-heavy domains.
- Fine-tuning shapes behavior, not knowledge. If the gap is facts, reach for RAG first. If the gap is form (tone, structure, refusal behavior, reasoning style), fine-tune.
1. Why the Vocabulary Changed
Classical ML had a fairly stable vocabulary: train/val/test splits, bias-variance tradeoff, feature engineering, cross-validation, regularization. That vocabulary still applies underneath everything — but a second layer of terminology has been bolted on top of it since 2022, built around three structural shifts in how capable models actually get built:
- Pretraining became commoditized. Scaling a transformer on internet text is now closer to "systems engineering" (data cleaning, MoE routing, FP8 precision, GPU scheduling) than to research novelty. The competitive edge in 2026 plays out on three axes: data cleaning, systems engineering, and modular post-training, with architecture and paper-friendly new ideas mattering less than before.
- Post-training became the real product surface. Post-training optimization — through reinforcement learning from human feedback, instruction fine-tuning, and preference learning — now delivers substantial practical capability improvements at a fraction of pretraining cost, meaning much of what users experience as "model capability" reflects post-training work rather than raw scale.
- Inference became a second training budget. Test-time compute scaling — the architectural insight underlying both DeepSeek-R1 and OpenAI's o1 — relocates compute from training time to inference time, letting a model reason step-by-step before answering, so a smaller, cheaper model with enough inference budget can outperform a larger, costlier one on reasoning tasks.
Everything that follows is downstream of these three shifts.
2. "Frontier Model" — One Term, Three Frontiers
The phrase "frontier model" used to just mean "biggest, newest, best." That single definition has split.
Figure 1 — The Three Frontiers of 2026
"Frontier" (2026) splits into three:
- Regulatory frontier: crosses compute thresholds (e.g. 10^26 FLOPs) → triggers reporting obligations
- Efficiency frontier: flagship reasoning with streamlined architecture → Pareto-optimal design
- Cost frontier: best price-performance at inference time → makes high-volume deployment viable
In 2026, the idea of a single frontier has fractured into several overlapping frontiers: a regulatory frontier of models that cross formal compute thresholds and trigger reporting obligations under emerging regulation; an efficiency frontier of models achieving flagship-level reasoning with streamlined architectures; and a cost frontier of models that push inference costs down enough to make high-volume deployment economical. A model can lead on one frontier and trail on another — which is precisely why "is X the frontier model?" is increasingly the wrong question to ask.
This matters practically: if you need the absolute best reasoning available today, closed models still tend to lead; if you need cost efficiency, control, and customization, open-weight models increasingly dominate; if you need both, a hybrid approach is often the more resilient long-term strategy.
Table 1 — Frontier Model Landscape (Illustrative, as of early 2026)
| Frontier dimension | Defining trait | Representative approach | Practical implication for teams |
|---|---|---|---|
| Regulatory | Crosses a formal compute/FLOP threshold | Flagship closed-lab training runs | Triggers disclosure/compliance obligations; relevant to procurement & legal review |
| Efficiency | Best reasoning-per-parameter | Streamlined MoE / hybrid architectures | Good fit when you need flagship behavior on constrained hardware |
| Cost | Best reasoning-per-dollar at inference | Distilled or quantized open-weight models | Good fit for high-volume production traffic |
Note: specific model names and benchmark scores move on a roughly monthly cadence — treat any leaderboard snapshot as perishable and re-verify before publishing.
3. New Terminology Glossary: Post-ML, Pre-AGI
This is the vocabulary that didn't really exist (or meant something narrower) in the classical-ML era. Each term is tagged with where in the lifecycle it lives.
Table 2 — Term-by-Term Glossary
| Term | Plain-English definition | Where it sits in the pipeline |
|---|---|---|
| Foundation model | A model trained on broad data, general-purpose enough to be adapted to many downstream tasks | Pretraining output |
| Frontier model | The most capable model(s) along a given dimension (regulatory/efficiency/cost) at a point in time | Cross-cutting, time-relative |
| LRM (Large Reasoning Model) | A model architecture-of-use that generates extended chain-of-thought before answering, trading latency for accuracy | Inference-time behavior, enabled by post-training |
| RLVR (RL with Verifiable Rewards) | Reinforcement learning where the reward is a binary correctness signal (e.g., "did the code pass the test," "does the math check out") rather than a learned human-preference model | Post-training |
| GRPO (Group Relative Policy Optimization) | An RL algorithm that scores each sampled response relative to the other responses in its group (reward normalized by the group's mean and standard deviation) instead of needing a separate value/critic network; introduced in DeepSeekMath and used to train DeepSeek-R1 | Post-training (RL stage) |
| Test-time compute scaling | Spending more inference compute (longer chains-of-thought, search, voting) instead of more training compute to improve answer quality | Inference |
| Self-consistency / majority voting | Sampling multiple reasoning paths and taking the most common final answer | Inference-time decoding strategy |
| Hallucination paradox | Forcing SFT on facts the model never saw in pretraining teaches it the shape of a confident answer before it has the underlying truth | Post-training failure mode |
| Context engineering | Deliberately designing what information is placed in the context window (vs. relying on parametric memory) to compensate for what post-training cannot create from nothing | Inference-time / system design |
| Model merging | Combining the parameters of two or more fine-tuned models (rather than retraining jointly) to combine their specialized strengths | Post-training / deployment |
| Model distillation | Training a smaller "student" model to imitate a larger "teacher" model's outputs (or reasoning traces) | Post-training |
| Reasoning distillation | Distillation specifically of chains-of-thought from a reasoning model into a smaller or non-reasoning model | Post-training |
| Sovereign AI / inference-as-infrastructure | The trend of treating model inference (rather than the one-time training run) as the durable, strategically important infrastructure layer | Deployment / strategy |
| MoE (Mixture of Experts) | An architecture that routes each token to a subset of specialized "expert" sub-networks instead of activating the whole model | Architecture |
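The self-consistency / majority voting row above is simple enough to sketch directly. This is a minimal illustration, not a reference implementation: `sample_answer` is a hypothetical stand-in for one sampled model call that returns only the final answer, and the canned answers are invented for the example.

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[], str], n_samples: int = 5) -> str:
    """Sample several reasoning paths and return the most common final answer."""
    answers = [sample_answer() for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a model: five canned final answers, one per sample.
canned = iter(["42", "41", "42", "43", "42"])
voted = self_consistency(lambda: next(canned), n_samples=5)
print(voted)
```

The cost is linear in `n_samples`, which is the trade the glossary describes: more inference compute in exchange for a more reliable answer.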
Figure 2 — Where New Terminology Sits in the Model Lifecycle
- Pretraining: foundation model; MoE architecture
- Post-training: SFT → preference optimization (DPO/IPO/KTO) → RLVR (GRPO). The hallucination paradox branches off SFT (a risk if SFT outpaces pretraining knowledge)
- Inference: test-time compute scaling; self-consistency; context engineering
- Model merging / distillation (compress, combine, specialize) → LRM behavior at inference
4. Why "Reasoning Model" Became Its Own Category
Until 2024, "bigger model = better at everything" was the rough heuristic. Test-time reasoning was the key innovation that enabled LLM-based systems to begin exhibiting non-zero fluid intelligence, giving rise to the LRM (Large Reasoning Model) paradigm, first demonstrated by OpenAI's o1 and o3 systems. The mechanism: instead of mapping prompt → answer in one forward pass, the model is trained (via RLVR) to generate a long internal chain-of-thought, and that extra inference compute is what buys the accuracy gain — not more parameters.
DeepSeek-R1 demonstrated that rule-based, verifiable rewards via GRPO can elicit self-correction and reflection in a model without dense human supervision, and this paradigm has since been extended across the Qwen and Kimi model families, and integrated into Gemini and Llama-Nemotron systems.
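The core of the "verifiable reward plus group-relative scoring" idea fits in a few lines. The sketch below assumes exact string matching as the verifier and population standard deviation for normalization; both are illustrative choices, and it deliberately omits the clipped policy-ratio objective and KL penalty that a full GRPO training step also includes.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(response: str, expected: str) -> float:
    """Binary rule-based reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if response.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group's mean and std (no critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose verified answer is "12" (made-up data).
group = ["12", "15", "12 ", "7"]
rewards = [verifiable_reward(g, "12") for g in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Correct samples end up with positive advantages and incorrect ones with negative advantages. If every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.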
This is also why "RLHF" as a label has gotten slippery. The term RLHF is still used loosely to refer to any kind of preference-based alignment, but by 2026 its precise meaning has shifted toward more specific successor techniques — DPO and its relatives have mostly replaced the classic PPO-based RLHF pipeline for preference tuning, while GRPO/RLVR have taken over the "make it reason better on verifiable tasks" job.
The Hallucination Paradox (worth its own callout)
If, during SFT, a model is forced to answer questions about facts it never encountered in pretraining, it is effectively being trained to "lie with confidence": it learns the shape of a correct answer before it possesses the underlying fact. Alignment, in the technical sense, does not teach new information; it only selects which of the personas latent in pretraining the model should impersonate. This is the single best argument for routing factual gaps to retrieval (RAG) rather than fine-tuning, a point that resurfaces constantly in the fine-tuning section below.
5. The Fine-Tuning Stack: A Taxonomy
The single biggest mental model shift: fine-tuning in 2026 isn't a method, it's three independent choices stacked on top of each other.
Figure 3 — The Three Independent Axes of a Fine-Tuning Decision
- Axis 1, how many parameters? Feature-based / head-only; full fine-tuning (all weights); PEFT: LoRA / QLoRA / DoRA / GaLore
- Axis 2, what supervision shape? SFT (imitate examples); preference optimization (DPO/IPO/KTO/ORPO); RL with verifiable reward (GRPO/RLVR)
- Axis 3, how do you evaluate? Held-out eval set; pairwise human/model judge; task-specific benchmark; A/B in production
Read as a sequence: pick a base model, pick how many parameters you will touch, pick the supervision shape, and pick how you will evaluate the result.
5.1 Axis 1 — How Many Parameters Do You Touch?
| Method | What it does | Trainable params | Quality vs. full FT | Best for |
|---|---|---|---|---|
| Full fine-tuning | Updates every weight in the model | 100% | Baseline (100%) | Maximum quality when budget is no constraint; risks catastrophic forgetting |
| Feature-based / head-only | Freezes the backbone, trains only a final layer/head | <0.1% | Lower; limited expressiveness | Lightweight classification on top of a frozen encoder |
| LoRA | Injects trainable low-rank matrices A, B into attention/MLP projections; base frozen | ~0.1–1% of base model size | 90–95% | Default starting point for 2026 fine-tuning workflows |
| QLoRA | LoRA on top of a 4-bit (NF4) quantized frozen base, with paged optimizers | Same as LoRA | 80–90% | Memory-constrained training (consumer GPUs, 70B-class models on one card) |
| DoRA | Decomposes the weight update into magnitude + direction components | Similar to LoRA | Marginal gain over LoRA | When LoRA underperforms on complex domain shift |
| GaLore | Gradient low-rank projection; reduces optimizer memory without adapters | Varies | Close to full FT on some tasks | Memory-constrained full fine-tuning (not adapter-based) |
Figure 4 — LoRA Mechanics
- Frozen base weight W (d × d); the forward pass uses W + ΔW
- ΔW = B · A, where B is d × r, A is r × d, and r ≪ d
- Only A and B are trained
- Rank r is the key hyperparameter: r=8 for small tasks, r=16 as a sensible default, r=32–64 for complex or multi-task work
In practice, three settings drive most of the variation: r=8 is enough capacity for simple style or format tasks; r=16 is the recommended default for instruction fine-tuning and most domain specialization; r=32–64 is reserved for complex domain shift, multi-task training, or datasets above ~10,000 examples. A common heuristic is to start at r=16, establish a working baseline, and only raise the rank if validation loss stagnates, setting alpha equal to the rank or double it.
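To make the rank and alpha discussion concrete, here is a dependency-free sketch of the LoRA update and its parameter count. It assumes the common conventions of scaling the update by alpha / r and initializing B to zero so training starts from the unmodified base model; the tiny dimensions and the helper names are illustrative only.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy-sized example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(B: Matrix, A: Matrix, alpha: float, r: int) -> Matrix:
    """Delta W = (alpha / r) * B @ A; the base weight W itself stays frozen."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)


d, r, alpha = 8, 2, 4  # toy sizes; alpha = 2 * r follows the heuristic above
rng = random.Random(0)
A = [[rng.gauss(0.0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init: the update is zero before training
delta = lora_delta(B, A, alpha, r)

# Fraction of one 4096 x 4096 projection that is trainable at r=16.
fraction = lora_param_count(4096, 4096, 16) / (4096 * 4096)
print(f"{fraction:.2%}")
```

For a single 4096 × 4096 projection at r=16 the adapter holds 131,072 trainable values against roughly 16.8 million frozen ones, which is where sub-1% trainable-parameter figures come from.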
5.2 Axis 2 — What Supervision Shape?
This is the axis classical ML had almost nothing equivalent to — it's specific to instruction-following and alignment.
| Method | Signal used | What it optimizes for | Notes |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Labeled input→output pairs (imitation) | Mimicking demonstrated behavior | "You gave the model input-output pairs, it learned to mimic your data, done" — the simplest and still most common first step |
| DPO (Direct Preference Optimization) | Pairs of (chosen, rejected) responses | Preferring better over worse responses directly, no reward model needed | Replaces the reward-model-plus-PPO stage of the older pipeline for most teams (it still typically follows SFT); simpler to implement, competitive quality |
| IPO / KTO / ORPO | Variants of preference data (KTO uses unpaired binary feedback) | Same family goal as DPO with different stability/data assumptions | "Cousins" of DPO that address specific failure modes of the original formulation |
| RLHF (classic, PPO-based) | Learned reward model trained on human preferences, optimized via PPO | Preference alignment via full RL loop | Now an escalation path rather than a default: reach for it only when DPO-style preference tuning falls short |
| RLVR / GRPO | Binary, verifiable correctness signal (unit test passed, math checked) | Improving accuracy on verifiable reasoning tasks | A simplified, scalable form of bootstrapped reasoning that uses answer-matching or constraint verification as a training signal, letting an LLM learn to reason better, especially in math and code |
Figure 5 — Supervision Shape, Visualized
- SFT: "Do this." Imitation learning (cheap, fast, but ceiling'd by demo quality)
- DPO/IPO/KTO/ORPO: "Prefer this over that." Pairwise preference optimization (no separate reward model needed, stable)
- RLVR / GRPO: "Did this check out? +1 / -1" Verifiable, rule-based reward with no human in the loop (best for math/code/structured reasoning)
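The "prefer this over that" signal has a compact form in DPO. As a sketch for a single (chosen, rejected) pair, with the sequence log-probabilities and `beta` supplied as plain numbers (the values below are invented for illustration):

```python
import math


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) on sequence log-probs."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(x)) equals log(1 + exp(-x)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the frozen reference: zero margin, loss is log(2).
at_reference = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response: loss falls.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(at_reference, improved)
```

The loss only looks at how far the policy has moved relative to a frozen reference model on each response, which is why no separate reward model is needed.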
5.3 Knowledge vs. Behavior: The Gate Before You Start
The most important practical filter, and one that gets skipped constantly:
The 3-question gate before committing budget: (1) is the gap behavior rather than knowledge? (2) have you exhausted prompting and retrieval? (3) can you measure success with an eval set? If any answer is "no," stop before you train.
Fine-tuning is for form, not facts. It shapes behavior, style, structured output, and refusal patterns, but it is not the right tool for injecting knowledge that changes weekly. The right sequence in 2026 is prompt → RAG → fine-tune → distill. The highest-ROI fine-tuning is typically a thin LoRA/QLoRA adapter layered on top of a strong base model, paired with retrieval rather than replacing it.
Figure 6 — The Decision Funnel
- Is performance bad? Fix the prompt first.
- Still bad? Build or improve RAG.
- Still bad, and the gap is behavior rather than facts? Fine-tune: LoRA/QLoRA + SFT, then DPO if preference-shaping is needed, then GRPO if it's a verifiable-reasoning task.
- Once the recipe is proven, distill into a smaller model for cheaper deployment.
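The gate and the funnel can be folded into one small helper, which some teams find useful as a checklist in code review. This is a sketch under assumptions: the function name, flag names, and returned strings are all hypothetical, and placing the eval-set check immediately before training is one reasonable reading of the gate rather than a rule stated above.

```python
def next_step(prompt_tuned: bool, rag_tried: bool, gap_is_behavior: bool,
              has_eval_set: bool, needs_preferences: bool = False,
              verifiable_task: bool = False) -> str:
    """Return the next action in the prompt -> RAG -> fine-tune funnel."""
    if not prompt_tuned:
        return "fix the prompt"
    if not rag_tried:
        return "build or improve RAG"
    if not gap_is_behavior:
        return "stay with retrieval: the gap is facts, not form"
    if not has_eval_set:
        return "build an eval set before training"
    stages = ["LoRA/QLoRA + SFT"]
    if needs_preferences:
        stages.append("DPO")
    if verifiable_task:
        stages.append("GRPO")
    return "fine-tune: " + " -> ".join(stages)


print(next_step(True, True, True, True, needs_preferences=True))
```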
5.4 Putting It Together — Common Recipes
| Goal | Recommended stack | Why |
|---|---|---|
| Adapt tone/format/domain style on a budget | QLoRA + SFT | A 7B model fine-tunes for under $5 on rented GPU infrastructure; this is the cheapest path to a working baseline |
| General instruction-following + safety alignment | LoRA + SFT, then DPO | "The default starting point is LoRA + SFT, with QLoRA when memory is tight and DPO on top for preference alignment" |
| Better math/code reasoning | SFT (cold start) → GRPO/RLVR | "Layer in GRPO if you care about reasoning" on top of a working SFT baseline |
| Cheap deployment of a proven reasoning recipe | Reasoning distillation from a larger teacher | Transfers a larger model's reasoning ability into a smaller student, often combined with chain-of-thought compression so the student is equipped for efficient reasoning |
| Combine multiple specialized fine-tunes without joint retraining | Model merging | Preserves overall performance while integrating contributions from independent workstreams, avoiding the coordination cost of a joint training run |
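For the model-merging row, the simplest variant is a weighted average of matching parameters. The sketch below assumes all models share one architecture and parameter names, and it represents weights as plain lists so it runs without any framework; more elaborate merging schemes exist and are not shown here.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_linear(models: List[Weights], coeffs: List[float]) -> Weights:
    """Weighted average of same-shaped parameters across fine-tuned models."""
    if len(models) != len(coeffs):
        raise ValueError("need exactly one coefficient per model")
    if any(m.keys() != models[0].keys() for m in models):
        raise ValueError("models must share the same parameter names")
    total = sum(coeffs)
    merged: Weights = {}
    for name, base in models[0].items():
        merged[name] = [
            sum(c * m[name][i] for c, m in zip(coeffs, models)) / total
            for i in range(len(base))
        ]
    return merged


# Two hypothetical fine-tunes of the same base, merged with equal weight.
code_ft = {"layer.weight": [1.0, 3.0]}
chat_ft = {"layer.weight": [3.0, 5.0]}
merged = merge_linear([code_ft, chat_ft], [1.0, 1.0])
print(merged)
```

Whether a merged model actually keeps both sets of strengths is an empirical question, so the result should go through the same eval set as any other candidate.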
6. Sample Sizes and Cost — A Sanity-Check Table
A common question after the method is chosen: how much data, and how much will it cost?
Table 3 — Rough Data & Cost Anchors (treat as ballpark, not spec)
| Scenario | Data needed | Hardware | Rough cost |
|---|---|---|---|
| Noticeable task-specific improvement (LoRA/QLoRA) | 500–2,000 high-quality examples | Single consumer/cloud GPU | Low — often single-digit dollars to low hundreds |
| Meaningful behavioral change | 5,000–10,000 examples | Single GPU, longer run | Tens of dollars |
| QLoRA fine-tune of a 7B model on a consumer card | Task-dependent | A single RTX 4070 Ti-class GPU, in an afternoon | Hardware you may already own |
| Full fine-tuning of a 7B model | Task-dependent | 100–120 GB VRAM, i.e. roughly $50,000 worth of H100 GPUs | High |
| From-scratch pretraining run | N/A | 64–256+ H100-class GPUs | €200,000 to €2 million+ for a single run, even at 7–13B parameter scale |
The takeaway embedded in this table: the gap between "can't afford this" and "can afford this" used to be a research-team problem; it is now almost entirely a method-selection problem.
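The VRAM gap between the full fine-tuning and QLoRA rows follows from simple arithmetic. The sketch below uses a common rule of thumb that is an assumption here, not a figure from the table's sources: about 16 bytes per parameter for mixed-precision Adam (fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments), with activations and framework overhead excluded.

```python
GB = 1e9  # decimal gigabytes, to keep the arithmetic easy to follow


def full_ft_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Weights + gradients + Adam state for full fine-tuning, excluding activations."""
    return n_params * bytes_per_param / GB


def quantized_base_memory_gb(n_params: float, bits: int = 4) -> float:
    """Memory for a frozen quantized base model, as used under QLoRA."""
    return n_params * bits / 8 / GB


full_7b = full_ft_memory_gb(7e9)
qlora_base_7b = quantized_base_memory_gb(7e9)
print(full_7b, qlora_base_7b)
```

Under these assumptions a 7B model needs about 112 GB for full fine-tuning, consistent with the 100–120 GB row, while its 4-bit frozen base occupies about 3.5 GB before adapters and activations, which is why the same model fits on a consumer card.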
7. Putting the Whole Picture Together
Figure 7 — End-to-End Lifecycle Map
- Pretrain: foundation model (MoE, FP8, data pipelines)
- Post-train: SFT (imitation) → preference optimization (DPO/IPO/KTO/ORPO) → RLVR / GRPO (verifiable reasoning)
- Serve / infer: test-time compute scaling; self-consistency / majority voting; context engineering → LRM behavior at inference, agentic tool use
- Distillation / model merging (compress and combine recipes) → cheaper, specialized deployable models
This is the throughline worth taking away: the center of gravity in the field has moved from pretraining to post-training to inference-time behavior, and the new vocabulary above is essentially a map of that migration. Knowing where a term sits on this lifecycle (pretrain / post-train / inference) is usually enough to know what it's for, even before you know the exact mechanics.
Further Reading / Source Threads to Track
- DeepSeek-R1 and the GRPO mechanism (the canonical RLVR reference point)
- Hugging Face's PEFT and TRL libraries (LoRA/QLoRA/DPO/GRPO reference implementations)
- ARC-AGI-2/3 as the benchmark series that tracks "fluid reasoning" progress specifically, separate from raw knowledge benchmarks
- "SFT memorizes, RL generalizes" (Chu et al.) — the empirical case for RLVR over pure SFT on reasoning tasks
This post reflects publicly reported terminology and techniques as of mid-2026. Model names, benchmark scores, and specific cost figures move quickly — verify against current sources before citing numbers in anything load-bearing.