Beyond Classical ML: A Field Guide to Frontier Models, New Terminology, and the Modern Fine-Tuning Stack

June 25, 2026 at 04:07 PM

A practitioner's map to the vocabulary and techniques that emerged once "machine learning" became "the LLM era."

TL;DR

"Frontier model" no longer means one thing. By 2026 the term has fractured into a regulatory frontier (compute-threshold triggers like 10²⁶ FLOPs), an efficiency frontier (best capability-per-parameter), and a cost frontier (best capability-per-dollar at inference).
Pretraining scale and frontier capability have decoupled. Post-training (SFT → preference optimization → RL) and test-time compute now contribute as much practical capability as raw parameter count did in the 2020–2023 era.
Reasoning models (LRMs) are a distinct architecture-of-use, not just bigger LLMs — they trade inference latency for accuracy via chain-of-thought and verifiable rewards (RLVR).
Fine-tuning is a stack, not a single method. You choose: how many parameters to touch (full FT vs. PEFT), the supervision shape (SFT vs. preference-based vs. RL), and how you'll evaluate it. LoRA + SFT is the 2026 default; QLoRA when memory-constrained; DPO is the default preference-tuning step; GRPO/RLVR is reserved for reasoning-heavy domains.
Fine-tuning shapes behavior, not knowledge. If the gap is facts, reach for RAG first. If the gap is form (tone, structure, refusal behavior, reasoning style), fine-tune.

1. Why the Vocabulary Changed

Classical ML had a fairly stable vocabulary: train/val/test splits, bias-variance tradeoff, feature engineering, cross-validation, regularization. That vocabulary still applies underneath everything — but a second layer of terminology has been bolted on top of it since 2022, built around three structural shifts in how capable models actually get built:

Pretraining became commoditized. Scaling a transformer on internet text is now closer to "systems engineering" (data cleaning, MoE routing, FP8 precision, GPU scheduling) than to research novelty. The competitive edge in 2026 plays out on three axes: data cleaning, systems engineering, and modular post-training. Novel architectures and publishable new ideas matter less than before.
Post-training became the real product surface. Post-training optimization — through reinforcement learning from human feedback, instruction fine-tuning, and preference learning — now delivers substantial practical capability improvements at a fraction of pretraining cost, meaning much of what users experience as "model capability" reflects post-training work rather than raw scale.
Inference became a second training budget. Test-time compute scaling, the core insight underlying both DeepSeek-R1 and OpenAI's o1, relocates compute from training time to inference time: the model reasons step by step before answering, so a smaller, cheaper model with enough inference budget can outperform a larger, costlier one on reasoning tasks.

Everything that follows is downstream of these three shifts.

2. "Frontier Model" — One Term, Three Frontiers

The phrase "frontier model" used to just mean "biggest, newest, best." That single definition has split.

Figure 1 — The Three Frontiers of 2026

                       ┌─────────────────────────┐
                       │    "FRONTIER" (2026)    │
                       └────────────┬────────────┘
           ┌────────────────────────┼────────────────────────┐
           ▼                        ▼                        ▼
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│ REGULATORY           │ │ EFFICIENCY           │ │ COST                 │
│ FRONTIER             │ │ FRONTIER             │ │ FRONTIER             │
├──────────────────────┤ ├──────────────────────┤ ├──────────────────────┤
│ Crosses compute      │ │ Flagship reasoning   │ │ Best price-          │
│ thresholds           │ │ with streamlined     │ │ performance at       │
│ (e.g. 10^26 FLOPs)   │ │ architecture         │ │ inference time       │
│ → triggers reporting │ │ → Pareto-optimal     │ │ → makes high-volume  │
│ obligations          │ │ design               │ │ deployment viable    │
└──────────────────────┘ └──────────────────────┘ └──────────────────────┘

In 2026, the idea of a single frontier has fractured into several overlapping frontiers: a regulatory frontier of models that cross formal compute thresholds and trigger reporting obligations under emerging regulation; an efficiency frontier of models achieving flagship-level reasoning with streamlined architectures; and a cost frontier of models that push inference costs down enough to make high-volume deployment economical. A model can lead on one frontier and trail on another — which is precisely why "is X the frontier model?" is increasingly the wrong question to ask.

This matters practically: if you need the absolute best reasoning available today, closed models still tend to lead; if you need cost efficiency, control, and customization, open-weight models increasingly dominate; if you need both, a hybrid approach is often the more resilient long-term strategy.

Table 1 — Frontier Model Landscape (Illustrative, as of early 2026)

Frontier dimension	Defining trait	Representative approach	Practical implication for teams
Regulatory	Crosses a formal compute/FLOP threshold	Flagship closed-lab training runs	Triggers disclosure/compliance obligations; relevant to procurement & legal review
Efficiency	Best reasoning-per-parameter	Streamlined MoE / hybrid architectures	Good fit when you need flagship behavior on constrained hardware
Cost	Best reasoning-per-dollar at inference	Distilled or quantized open-weight models	Good fit for high-volume production traffic

Note: specific model names and benchmark scores move on a roughly monthly cadence — treat any leaderboard snapshot as perishable and re-verify before publishing.
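
To make the regulatory frontier concrete, the sketch below estimates training compute with the widely used rule of thumb of roughly 6 FLOPs per parameter per training token and compares the result with the 10^26 figure quoted above. The function names, model sizes, and token counts are hypothetical, and actual reporting rules may count compute differently.

```python
# Rough training-compute estimate using the common approximation
# FLOPs ≈ 6 * parameters * training tokens (a rule of thumb, not a legal definition).

REGULATORY_THRESHOLD_FLOPS = 1e26  # threshold figure quoted in this article


def estimate_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens


def crosses_threshold(n_params: float, n_tokens: float,
                      threshold: float = REGULATORY_THRESHOLD_FLOPS) -> bool:
    """True if the estimated training compute meets or exceeds the threshold."""
    return estimate_training_flops(n_params, n_tokens) >= threshold


if __name__ == "__main__":
    # Hypothetical runs: (label, parameters, training tokens)
    runs = [
        ("70B params, 15T tokens", 70e9, 15e12),
        ("1T params, 20T tokens", 1e12, 20e12),
    ]
    for label, params, tokens in runs:
        flops = estimate_training_flops(params, tokens)
        print(f"{label}: {flops:.2e} FLOPs, "
              f"crosses threshold: {crosses_threshold(params, tokens)}")
```

Under these assumptions the first run lands well below the threshold and the second just above it, which is why the regulatory frontier only concerns the very largest training runs.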

3. New Terminology Glossary: Post-ML, Pre-AGI

This is the vocabulary that didn't really exist (or meant something narrower) in the classical-ML era. The third column of the table shows where in the lifecycle each term lives.

Table 2 — Term-by-Term Glossary

Term	Plain-English definition	Where it sits in the pipeline
Foundation model	A model trained on broad data, general-purpose enough to be adapted to many downstream tasks	Pretraining output
Frontier model	The most capable model(s) along a given dimension (regulatory/efficiency/cost) at a point in time	Cross-cutting, time-relative
LRM (Large Reasoning Model)	A model architecture-of-use that generates extended chain-of-thought before answering, trading latency for accuracy	Inference-time behavior, enabled by post-training
RLVR (RL with Verifiable Rewards)	Reinforcement learning where the reward is a binary correctness signal (e.g., "did the code pass the test," "does the math check out") rather than a learned human-preference model	Post-training
GRPO (Group Relative Policy Optimization)	An RL algorithm that scores each sampled response relative to the others in its group (against the group's average reward) instead of needing a separate value/critic network; the technique behind DeepSeek-R1	Post-training (RL stage)
Test-time compute scaling	Spending more inference compute (longer chains-of-thought, search, voting) instead of more training compute to improve answer quality	Inference
Self-consistency / majority voting	Sampling multiple reasoning paths and taking the most common final answer	Inference-time decoding strategy
Hallucination paradox	Forcing SFT on facts the model never saw in pretraining teaches it the shape of a confident answer before it has the underlying truth	Post-training failure mode
Context engineering	Deliberately designing what information is placed in the context window (vs. relying on parametric memory) to compensate for what post-training cannot create from nothing	Inference-time / system design
Model merging	Combining the parameters of two or more fine-tuned models (rather than retraining jointly) to combine their specialized strengths	Post-training / deployment
Model distillation	Training a smaller "student" model to imitate a larger "teacher" model's outputs (or reasoning traces)	Post-training
Reasoning distillation	Distillation specifically of chains-of-thought from a reasoning model into a smaller or non-reasoning model	Post-training
Sovereign AI / inference-as-infrastructure	The trend of treating model inference (rather than the one-time training run) as the durable, strategically important infrastructure layer	Deployment / strategy
MoE (Mixture of Experts)	An architecture that routes each token to a subset of specialized "expert" sub-networks instead of activating the whole model	Architecture
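
The MoE entry above describes routing each token to a subset of experts. As a minimal sketch of that idea, assuming a top-k gate whose weights are a softmax over the selected experts only (one common design, not necessarily what any particular model uses), the routing step for a single token might look like this. The function name `top_k_route` and the example logits are hypothetical.

```python
import math
from typing import List, Tuple


def top_k_route(gate_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) for the k highest-scoring experts.

    Weights are a softmax over the selected experts' logits only, so they sum to 1.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest gate logits, highest first.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    max_logit = max(gate_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(gate_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    # Hypothetical gate scores for one token over four experts.
    for expert, weight in top_k_route([0.1, 2.0, -1.0, 1.0], k=2):
        print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts run for that token, which is how an MoE model can hold many parameters while activating a fraction of them per token.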

Figure 2 — Where New Terminology Sits in the Model Lifecycle

PRETRAINING          POST-TRAINING                                 INFERENCE
───────────          ──────────────                                ─────────
Foundation model  →  SFT  →  Preference opt.  →  RLVR      →       Test-time compute
MoE architecture      │      (DPO/IPO/KTO)       (GRPO)            scaling
                      │                                            Self-consistency
                      ├── Hallucination paradox                    Context engineering
                      │   (risk if SFT outpaces
                      │   pretraining knowledge)
                      ▼
                     Model merging / distillation     ──────►      LRM behavior
                     (compress, combine, specialize)               at inference
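
Of the inference-time entries in this lifecycle, self-consistency is the easiest to show in code. The sketch below assumes a hypothetical `sample_answer` callable that stands in for one sampled model completion already reduced to its final answer; a real setup would sample an LLM at non-zero temperature and parse each chain-of-thought.

```python
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(sample_answer: Callable[[], str],
                     n_samples: int = 5) -> Tuple[str, float]:
    """Sample n final answers and return the most common one with its vote share."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers: List[str] = [sample_answer().strip() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


if __name__ == "__main__":
    # Stand-in for a model call: replays canned answers instead of sampling an LLM.
    canned = iter(["42", "41", "42", "42", "17"])
    answer, share = self_consistency(lambda: next(canned), n_samples=5)
    print(answer, share)
```

The vote share doubles as a rough confidence signal: a low share suggests the sampled reasoning paths disagree and the answer deserves more compute or a fallback.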

4. Why "Reasoning Model" Became Its Own Category

Until 2024, "bigger model = better at everything" was the rough heuristic. Test-time reasoning was the key innovation that enabled LLM-based systems to begin exhibiting non-zero fluid intelligence, giving rise to the LRM (Large Reasoning Model) paradigm, first demonstrated by OpenAI's o1 and o3 systems. The mechanism: instead of mapping prompt → answer in one forward pass, the model is trained (via RLVR) to generate a long internal chain-of-thought, and that extra inference compute is what buys the accuracy gain — not more parameters.

DeepSeek-R1 demonstrated that rule-based, verifiable rewards via GRPO can elicit self-correction and reflection in a model without dense human supervision, and this paradigm has since been extended across the Qwen and Kimi model families, and integrated into Gemini and Llama-Nemotron systems.
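
A minimal sketch of the two ingredients named here, assuming an exact-match reward and mean/standard-deviation normalization within each group of samples. The function names and sample answers are hypothetical, and a full GRPO trainer adds a clipped policy-gradient objective and a KL penalty that are not shown.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own group's mean and standard deviation.

    Population std is used here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical sampled answers to one prompt whose reference answer is "12".
    samples = ["12", "13", "12 ", "7"]
    rewards = [verifiable_reward(s, "12") for s in samples]
    print(rewards)  # [1.0, 0.0, 1.0, 0.0]
    print(group_relative_advantages(rewards))
```

Because each response is compared only with its siblings, no separate value network is needed. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.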

This is also why "RLHF" as a label has gotten slippery. The term RLHF is still used loosely to refer to any kind of preference-based alignment, but by 2026 its precise meaning has shifted toward more specific successor techniques — DPO and its relatives have mostly replaced the classic PPO-based RLHF pipeline for preference tuning, while GRPO/RLVR have taken over the "make it reason better on verifiable tasks" job.

The Hallucination Paradox (worth its own callout)

If, during SFT, a model is forced to answer questions about facts it never encountered in pretraining, it is effectively being trained to "lie with confidence": it learns the shape of a correct answer before it possesses the underlying fact. Alignment, in a technical sense, does not teach new information; it only selects which of the personas latent in pretraining the model should impersonate. This is the single best argument for routing factual gaps to retrieval (RAG) rather than fine-tuning, a point that resurfaces in the fine-tuning section below.

5. The Fine-Tuning Stack: A Taxonomy

The single biggest mental model shift: fine-tuning in 2026 isn't a method, it's three independent choices stacked on top of each other.

Figure 3 — The Three Independent Axes of a Fine-Tuning Decision

AXIS 1: HOW MANY PARAMETERS?          AXIS 2: WHAT SUPERVISION SHAPE?         AXIS 3: HOW DO YOU EVALUATE?
───────────────────────────           ──────────────────────────────          ─────────────────────────────
• Feature-based / head-only           • SFT (imitate examples)                • Held-out eval set
• Full fine-tuning (all weights)      • Preference opt. (DPO/IPO/KTO/ORPO)    • Pairwise human/model judge
• PEFT: LoRA / QLoRA / DoRA / GaLore  • RL w/ verifiable reward (GRPO/RLVR)   • Task-specific benchmark
                                                                              • A/B in production

Fine-tuning in 2026 is no longer a single decision — it is a stack: pick a base model, pick how many parameters you will touch, pick the supervision shape, and pick how you will evaluate the result.

5.1 Axis 1 — How Many Parameters Do You Touch?

Method	What it does	Trainable params	Quality vs. full FT	Best for
Full fine-tuning	Updates every weight in the model	100%	Baseline (100%)	Maximum quality when budget is no constraint; risks catastrophic forgetting
Feature-based / head-only	Freezes the backbone, trains only a final layer/head	<0.1%	Lower; limited expressiveness	Lightweight classification on top of a frozen encoder
LoRA	Injects trainable low-rank matrices A, B into attention/MLP projections; base frozen	~0.1–1% of base model size	90–95%	Default starting point for 2026 fine-tuning workflows
QLoRA	LoRA on top of a 4-bit (NF4) quantized frozen base, with paged optimizers	Same as LoRA	80–90%	Memory-constrained training (consumer GPUs, 70B-class models on one card)
DoRA	Decomposes the weight update into magnitude + direction components	Similar to LoRA	Marginal gain over LoRA	When LoRA underperforms on complex domain shift
GaLore	Gradient low-rank projection; reduces optimizer memory without adapters	Varies	Close to full FT on some tasks	Memory-constrained full fine-tuning (not adapter-based)

Figure 4 — LoRA Mechanics

Frozen base weight W (d × d)
        │
        │  (forward pass uses W + ΔW)
        ▼
ΔW = B · A     where B: d×r, A: r×d, r ≪ d
        ▲
        │
Only A and B are trained
(rank r is the key hyperparameter: r=8 small tasks,
 r=16 sensible default, r=32–64 complex/multi-task)

In practice, three settings drive most of the variation: r=8 is enough capacity for simple style or format tasks; r=16 is the recommended default for instruction fine-tuning and most domain specialization; r=32–64 is reserved for complex domain shift, multi-task training, or datasets above ~10,000 examples. A common heuristic is to start at r=16 with alpha set to the rank or to twice the rank, establish a working baseline, and raise the rank only if validation loss stagnates.
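
The arithmetic behind these rank choices is easy to check. The sketch below counts the trainable parameters LoRA adds to a single weight matrix and computes the alpha/rank scaling factor used in the standard LoRA formulation. The hidden size of 4096 is a hypothetical example, and the share of the whole model depends on which projection matrices receive adapters.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters as a fraction of the frozen d_out x d_in matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Scaling applied to B @ A in the standard LoRA formulation: alpha / rank."""
    return alpha / rank


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size of a square projection matrix
    for r in (8, 16, 32, 64):
        print(f"r={r:>2}: {lora_trainable_params(d, d, r):>7,} trainable params "
              f"({lora_fraction(d, d, r):.2%} of the matrix), "
              f"scale at alpha=2r: {lora_scale(2 * r, r)}")
```

At r=16 the adapter for a 4096 × 4096 matrix holds 131,072 parameters, about 0.78% of the frozen matrix. Doubling the rank doubles the adapter size, which is why raising r is cheap relative to full fine-tuning.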

5.2 Axis 2 — What Supervision Shape?

This is the axis classical ML had almost nothing equivalent to — it's specific to instruction-following and alignment.

Method	Signal used	What it optimizes for	Notes
SFT (Supervised Fine-Tuning)	Labeled input→output pairs (imitation)	Mimicking demonstrated behavior	"You gave the model input-output pairs, it learned to mimic your data, done" — the simplest and still most common first step
DPO (Direct Preference Optimization)	Pairs of (chosen, rejected) responses	Preferring better over worse responses directly, no reward model needed	Replaces the older SFT-then-PPO pipeline for most teams; simpler to implement, competitive quality
IPO / KTO / ORPO	Variants of preference data (KTO uses unpaired binary feedback)	Same family goal as DPO with different stability/data assumptions	"Cousins" of DPO that address specific failure modes of the original formulation
RLHF (classic, PPO-based)	Learned reward model trained on human preferences, optimized via PPO	Preference alignment via full RL loop	Now an escalation path rather than a default; most teams start with DPO instead
RLVR / GRPO	Binary, verifiable correctness signal (unit test passed, math checked)	Improving accuracy on verifiable reasoning tasks	A simplified, scalable form of bootstrapped reasoning that uses answer-matching or constraint verification as a training signal, letting an LLM learn to reason better, especially in math and code
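
To show why DPO needs no reward model, here is a minimal sketch of its loss for a single (chosen, rejected) pair. It assumes the summed sequence log-probabilities under the trainable policy and a frozen reference model have already been computed; the function name, argument names, and example numbers are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: the loss is lower.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model, so the preference data itself acts as the training signal.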

Figure 5 — Supervision Shape, Visualized

SFT                    DPO/IPO/KTO/ORPO             RLVR / GRPO
────                   ──────────────────           ───────────
"Do this."             "Prefer this over that."     "Did this check out? +1 / -1"
    │                          │                            │
    ▼                          ▼                            ▼
Imitation              Pairwise preference          Verifiable, rule-based
learning               optimization                 reward (no human in loop)
(cheap, fast,          (no separate reward          (best for math/code/
but ceiling'd          model needed, stable)        structured reasoning)
by demo quality)

5.3 Axis 3 — Knowledge vs. Behavior: The Gate Before You Start

The most important practical filter, and one that gets skipped constantly:

The 3-question gate before committing budget: (1) is the gap behavior rather than knowledge? (2) have you exhausted prompting and retrieval? (3) can you measure success with an eval set? If any answer is "no," stop before you train.

Fine-tuning is for form, not facts. It shapes behavior, style, structured output, and refusal patterns, but it is not the right tool for injecting knowledge that changes weekly. The right sequence in 2026 is prompt → RAG → fine-tune → distill. The highest-ROI fine-tuning is typically a thin LoRA/QLoRA adapter layered on top of a strong base model and paired with retrieval rather than replacing it.

Figure 6 — The Decision Funnel

Is performance bad?
    │
    ▼
Fix the prompt first ──── still bad? ──┐
                                       ▼
               Build/improve RAG ──── still bad, and gap is BEHAVIOR not facts? ──┐
                                                                                  ▼
Fine-tune (LoRA/QLoRA + SFT, then DPO if preference-shaping is needed,
           then GRPO if it's a verifiable-reasoning task)
    │
    ▼
Distill into a smaller model for cheaper deployment, once the recipe is proven
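
The funnel and the three gate questions can be written down as a small decision function. This is only a sketch of the logic described in this section: the `Situation` fields, the function name, and the returned strings are hypothetical, and real projects will have fuzzier inputs than booleans.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    prompt_tuned: bool       # prompting has been seriously iterated on
    retrieval_tried: bool    # RAG has been built or improved
    gap_is_behavior: bool    # the gap is form/behavior, not missing facts
    has_eval_set: bool       # success can be measured on an eval set
    needs_preference_shaping: bool = False
    verifiable_task: bool = False  # e.g. math or code with checkable answers


def next_step(s: Situation) -> str:
    """Walk the funnel top to bottom and return the first step not yet satisfied."""
    if not s.prompt_tuned:
        return "fix the prompt"
    if not s.retrieval_tried:
        return "build or improve RAG"
    if not s.gap_is_behavior:
        return "stay with retrieval: the gap is knowledge"
    if not s.has_eval_set:
        return "build an eval set before training"
    steps = ["LoRA/QLoRA + SFT"]
    if s.needs_preference_shaping:
        steps.append("DPO")
    if s.verifiable_task:
        steps.append("GRPO")
    return "fine-tune: " + " -> ".join(steps)


if __name__ == "__main__":
    print(next_step(Situation(True, True, True, True, verifiable_task=True)))
```

Distillation is left out on purpose: per the funnel, it comes after the fine-tuning recipe has been proven.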

5.4 Putting It Together — Common Recipes

Goal	Recommended stack	Why
Adapt tone/format/domain style on a budget	QLoRA + SFT	A 7B model fine-tunes for under $5 on rented GPU infrastructure; this is the cheapest path to a working baseline
General instruction-following + safety alignment	LoRA + SFT, then DPO	"The default starting point is LoRA + SFT, with QLoRA when memory is tight and DPO on top for preference alignment"
Better math/code reasoning	SFT (cold start) → GRPO/RLVR	"Layer in GRPO if you care about reasoning" on top of a working SFT baseline
Cheap deployment of a proven reasoning recipe	Reasoning distillation from a larger teacher	Transfers a larger model's reasoning ability into a smaller student, often combined with chain-of-thought compression so the student is equipped for efficient reasoning
Combine multiple specialized fine-tunes without joint retraining	Model merging	Preserves overall performance while integrating contributions from independent workstreams, avoiding the coordination cost of a joint training run
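
As an illustration of the last recipe, the sketch below shows the simplest form of model merging: linear interpolation of two checkpoints' weights. It assumes both fine-tunes share the same architecture and base model, and it represents a checkpoint as a plain dictionary of flattened parameter lists. That representation and the names are hypothetical, and more elaborate merging methods exist that this does not cover.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_linear(model_a: Weights, model_b: Weights,
                 weight_a: float = 0.5) -> Weights:
    """Interpolate two checkpoints that share the same architecture and base."""
    if model_a.keys() != model_b.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged: Weights = {}
    for name, values_a in model_a.items():
        values_b = model_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [weight_a * a + (1.0 - weight_a) * b
                        for a, b in zip(values_a, values_b)]
    return merged


if __name__ == "__main__":
    # Two toy "fine-tunes" of the same single-parameter-tensor model.
    code_tuned = {"layer.weight": [1.0, 3.0]}
    chat_tuned = {"layer.weight": [3.0, 5.0]}
    print(merge_linear(code_tuned, chat_tuned))  # {'layer.weight': [2.0, 4.0]}
```

No gradient steps are involved, which is what makes merging attractive for combining independent workstreams. The merged model still needs to be evaluated on each source task, since averaging does not guarantee that either specialization survives intact.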

6. Sample Sizes and Cost — A Sanity-Check Table

A common question after the method is chosen: how much data, and how much will it cost?

Table 3 — Rough Data & Cost Anchors (treat as ballpark, not spec)

Scenario	Data needed	Hardware	Rough cost
Noticeable task-specific improvement (LoRA/QLoRA)	500–2,000 high-quality examples	Single consumer/cloud GPU	Low — often single-digit dollars to low hundreds
Meaningful behavioral change	5,000–10,000 examples	Single GPU, longer run	Tens of dollars
QLoRA fine-tune of a 7B model on a consumer card	Task-dependent	A single RTX 4070 Ti-class GPU, in an afternoon	Hardware you may already own
Full fine-tuning of a 7B model	Task-dependent	100–120 GB VRAM (multiple H100-class GPUs, roughly $50,000 of hardware)	High
From-scratch pretraining run (even at 7–13B parameter scale)	N/A	64–256+ H100-class GPUs	€200,000 to €2 million+ for a single run

The takeaway embedded in this table: the gap between "can't afford this" and "can afford this" used to be a research-team problem; it is now almost entirely a method-selection problem.
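
The memory gap between the rows above follows from simple per-parameter arithmetic. The sketch below uses a common rule of thumb of 16 bytes per trainable parameter for mixed-precision training with Adam, and 0.5 bytes per parameter for a 4-bit frozen base. Both figures, the 1% adapter fraction, and the function names are assumptions for illustration; activations, quantization constants, and framework overhead are excluded.

```python
GB = 1e9  # decimal gigabytes


def full_ft_memory_gb(n_params: float) -> float:
    """Weights, gradients, and Adam state for mixed-precision full fine-tuning.

    Assumes 16 bytes per parameter: 2 (bf16 weights) + 2 (bf16 gradients)
    + 4 (fp32 master weights) + 8 (two fp32 Adam moments). Activations excluded.
    """
    return n_params * 16 / GB


def qlora_memory_gb(n_params: float, adapter_fraction: float = 0.01) -> float:
    """4-bit frozen base plus trainable adapters at 16 bytes per adapter parameter.

    Assumes 0.5 bytes per base parameter; quantization constants and
    activations are excluded.
    """
    base = n_params * 0.5
    adapters = n_params * adapter_fraction * 16
    return (base + adapters) / GB


if __name__ == "__main__":
    n = 7e9  # a 7B-parameter model
    print(f"full fine-tuning: ~{full_ft_memory_gb(n):.0f} GB")
    print(f"QLoRA:            ~{qlora_memory_gb(n):.1f} GB")
```

For a 7B model this gives about 112 GB for full fine-tuning, consistent with the 100–120 GB row, and under 5 GB of weight and optimizer state for QLoRA, which is why it fits on a single consumer card once activations are added.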

7. Putting the Whole Picture Together

Figure 7 — End-to-End Lifecycle Map

PRETRAIN          →   POST-TRAIN                     →   SERVE / INFER
────────              ───────────                        ─────────────
Foundation model      SFT (imitation)                    Test-time compute
(MoE, FP8,                 │                             scaling
data pipelines)            ▼                             Self-consistency /
     │                Preference opt.                    majority voting
     │                (DPO/IPO/KTO/ORPO)                 Context engineering
     │                     │                                  │
     │                     ▼                                  ▼
     │                RLVR / GRPO                        LRM behavior at
     │                (verifiable reasoning)             inference, agentic
     │                     │                             tool use
     │                     ▼
     └──────────►     Distillation / Model merging   ────────►   Cheaper, specialized
                      (compress & combine recipes)               deployable models

This is the throughline worth taking away: the center of gravity in the field has moved from pretraining to post-training to inference-time behavior, and the new vocabulary above is essentially a map of that migration. Knowing where a term sits on this lifecycle (pretrain / post-train / inference) is usually enough to know what it's for, even before you know the exact mechanics.