AI Platform Engineering & MLOps Series · Part 12 of 34

Fine-tuning, LoRA, QLoRA, RLHF/DPO — picking the adaptation that fits your budget

Four adaptation modes — full fine-tune, LoRA, QLoRA, RLHF/DPO — mapped to GPU memory, data shape, wall-clock, and the RAG-vs-fine-tune decision rule every ML team needs.

12 min read·2 interactive components·8 references

[Figure: the four adaptation modes: full fine-tune; LoRA (frozen base + adapters); QLoRA (4-bit base + adapters); RLHF / DPO (policy + reference). Legend: frozen weights.]

A base model is a general-purpose prior. Adaptation shapes that prior toward a target task: domain vocabulary, output format, tone, or alignment with human preferences. Platform engineers are not the ones choosing which adaptation technique to use, but they are the ones who must provision the GPU memory, schedule the job fairly against the rest of the cluster, and ensure the resulting artifact is trustworthy enough to promote to production. The choice of adaptation mode determines the resource envelope as much as model size does: a LoRA adapter for a 7B-parameter model can be trained on a single GPU in under two hours, while a full fine-tune of the same model across a 50M-token dataset can run for twelve hours.

This article covers the four adaptation modes that appear in practice — full fine-tuning, LoRA/PEFT, QLoRA, and RLHF/DPO — and closes with the RAG-vs-fine-tune decision rule. It assumes you have read the earlier workload taxonomy and training articles in this series; cross-references are provided where the scheduling primitives are defined.

The four adaptation modes

The modes differ on one axis above all others: how many weights change. That governs peak GPU memory, whether multi-GPU is required, how long the job runs, and what the output artifact looks like.

Mode 1 — Full fine-tuning

Every weight in the model is updated. The optimizer maintains gradient and momentum tensors for each parameter, so peak memory is approximately 4× the model’s fp16 footprint in an Adam run that keeps everything in 16-bit precision (model weights + gradients + two optimizer states). A 7B model in fp16 is roughly 14 GB, so full fine-tuning with Adam requires ~56–70 GB peak VRAM, depending on batch size and gradient checkpointing. Mixed-precision recipes that keep fp32 master weights and fp32 optimizer states need considerably more. Techniques such as ZeRO-2 and ZeRO-3 shard the optimizer state and gradients (and, in ZeRO-3, the parameters) across GPUs, making large full fine-tunes feasible on multi-GPU nodes.
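The 4× figure can be checked with per-parameter byte counts. The sketch below is a rough estimator, not a profiler: the function name and the byte-count defaults are illustrative assumptions, and activation memory is deliberately left out, which is why measured peaks sit above the result.

```python
def training_state_gb(params_billion, weight_bytes=2, grad_bytes=2, optimizer_bytes=4):
    """Weights + gradients + optimizer state in GB (1 GB = 1e9 bytes).

    Defaults assume everything is held in 16-bit: 2 bytes per weight,
    2 per gradient, and 2 for each of Adam's two moment tensors.
    Activation memory is NOT included.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


# All state in 16-bit: 8 bytes per parameter.
all_16bit = training_state_gb(7)

# Assumed mixed-precision layout: fp32 master weights plus two fp32 Adam
# moments add 12 bytes per parameter on top of fp16 weights and gradients.
mixed_precision = training_state_gb(7, optimizer_bytes=12)

print(f"7B, all 16-bit state:     {all_16bit} GB")
print(f"7B, fp32 optimizer state: {mixed_precision} GB")
```

The first figure is the floor of the 56–70 GB range quoted above; batch size and activation checkpointing account for the rest.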

Use full fine-tuning when: (a) the task diverges substantially from the base model’s pre-training distribution, (b) the team wants a single deployable artifact with no runtime adapter overhead, or (c) alignment-critical deployments require whitebox access to every layer. Avoid it when the dataset is small — a few thousand examples is insufficient to fine-tune all parameters of a billion-parameter model without severe overfitting.

Mode 2 — LoRA and parameter-efficient fine-tuning (PEFT)

LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects small trainable rank-decomposition matrices into the attention layers of the transformer. Only the adapter weights are updated during training. Hu et al. (ICLR 2022) showed that LoRA reduces trainable parameters by up to 10,000× compared to full fine-tuning of GPT-3 175B, with GPU memory requirements reduced by approximately 3×, while matching or exceeding full fine-tune quality on RoBERTa, DeBERTa, GPT-2, and GPT-3 benchmarks — and adding no additional inference latency [S1].

The output artifact is a small adapter file — typically 20–200 MB for a 7B model — that is loaded alongside the frozen base model at inference time. This makes per-tenant LoRA multiplexing practical: a single base model deployment can serve multiple fine-tuned adapters by swapping the adapter weights per request or per session. The Hugging Face PEFT library is the standard implementation [S7].
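To see why the adapter file is so small, it helps to count the trainable parameters directly. The sketch below assumes a hypothetical 7B-class shape (32 layers, hidden size 4096) with adapters on two square attention projections per layer; those dimensions and the choice of target modules are illustrative, not taken from any specific model card.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    The low-rank update is B @ A, with B of shape (d_out, rank)
    and A of shape (rank, d_in).
    """
    return rank * (d_in + d_out)


# Hypothetical 7B-class configuration (assumption for illustration only).
LAYERS = 32
HIDDEN = 4096
ADAPTED_MATRICES_PER_LAYER = 2
RANK = 16

trainable = LAYERS * ADAPTED_MATRICES_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)
adapter_mb_fp32 = trainable * 4 / 1e6  # 4 bytes per fp32 parameter

print(f"trainable parameters: {trainable:,}")
print(f"adapter size at fp32: {adapter_mb_fp32:.1f} MB")
```

Under these assumptions the adapter holds roughly 0.1% of the base model's parameters. Adapting more matrices per layer or raising the rank moves the file toward the upper end of the 20–200 MB range.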

Mode 3 — QLoRA (quantised LoRA)

QLoRA extends LoRA by loading the base model in 4-bit NF4 (NormalFloat4) quantisation, dramatically reducing peak VRAM while applying LoRA adapters in 16-bit precision. Dettmers et al. (NeurIPS 2023) demonstrated that a 65B-parameter model can be fine-tuned on a single 48 GB GPU without measurable degradation relative to 16-bit full fine-tuning; their Guanaco models reached 99.3% of ChatGPT’s performance on the Vicuna benchmark after 24 hours of single-GPU training [S2]. The three key mechanisms are 4-bit NF4 quantisation of base weights, double quantisation to reduce memory overhead of quantisation constants, and paged optimizers to handle memory spikes.

QLoRA is the practical choice when the GPU pool has mid-range cards (24 GB VRAM class) or when GPU memory is scarce. A 7B model in QLoRA requires roughly 12 GB peak VRAM — a single MIG partition of type 3g.40gb on an A100 80 GB can hold the job, leaving the rest of the physical card available for inference workloads.
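The effect of double quantisation on the base-weight footprint can be worked out from the block sizes reported in the QLoRA paper [S2]: 64 weights per block with a 32-bit constant, and 8-bit second-level constants grouped in blocks of 256. The function names below are illustrative; this is back-of-envelope arithmetic, not a measurement.

```python
BLOCK = 64          # weights sharing one quantisation constant
SECOND_BLOCK = 256  # first-level constants sharing one second-level constant


def bits_per_param(double_quant):
    """Storage cost per base weight: 4 NF4 bits plus constant overhead."""
    if double_quant:
        # 8-bit constant per block, plus a 32-bit constant per 256 blocks
        overhead = 8 / BLOCK + 32 / (BLOCK * SECOND_BLOCK)
    else:
        # one 32-bit constant per block
        overhead = 32 / BLOCK
    return 4 + overhead


def base_weights_gb(params_billion, double_quant=True):
    """Frozen base-weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param(double_quant) / 8


print(f"7B without double quantisation: {base_weights_gb(7, False):.2f} GB")
print(f"7B with double quantisation:    {base_weights_gb(7, True):.2f} GB")
```

The frozen base accounts for only a few gigabytes of the roughly 12 GB peak; the remainder is adapters, optimizer state for the adapters, and activations.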

Mode 4 — RLHF and DPO (alignment-style training)

Alignment training shapes the model’s output distribution using preference or reward signals rather than next-token prediction loss. The original RLHF pipeline (Ouyang et al., NeurIPS 2022) runs in three stages: supervised fine-tuning on demonstration data, reward model training on human-ranked output pairs, and PPO-based policy optimisation against the reward model [S3]. The memory requirement is substantial — at 7B policy scale, the policy, reward model, and reference policy must be resident simultaneously, requiring 150–200 GB aggregate VRAM.

Direct Preference Optimization (DPO), introduced by Rafailov et al. (NeurIPS 2023), eliminates the explicit reward model entirely. DPO derives a closed-form mapping from the reward function to the optimal policy, reducing alignment training to a binary classification loss over preference pairs. Rafailov et al. showed that DPO matches or exceeds PPO-based RLHF on summarisation and single-turn dialogue quality benchmarks while being substantially simpler to implement and train [S4]. The memory footprint at 7B scale drops to roughly 60–80 GB — the same as a full fine-tune of a single model.
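The classification loss is compact enough to write out for a single preference pair. The sketch below follows the objective in Rafailov et al. [S4], taking summed sequence log-probabilities as inputs; the function name and the default beta of 0.1 are assumptions for illustration, and a real training loop would batch this over tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss sits at log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss falls.
after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(at_init, after_update)
```

The loss needs forward passes through two models only (policy and reference), which is where the memory saving over PPO-based RLHF comes from.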

A third variant, RLAIF (Reinforcement Learning from AI Feedback), replaces human annotators with an LLM judge that generates preference pairs. The downstream training is the same as RLHF or DPO; the difference is the annotation pipeline. RLAIF reduces annotation cost at the expense of a second LLM dependency and the risk of systematic bias from the judge model.

Four modes × five axes: the decision table

The following table maps each adaptation mode to the five axes that govern infrastructure decisions. All wall-clock estimates assume a 50M-token dataset and a 7B-parameter base model unless noted.

Mode	Peak VRAM (7B)	Data shape	Eval pattern	Typical wall-clock	When to pick
Full fine-tune	60–70 GB	Task examples (10k–1M)	Held-out set; task metrics	6–12 h / epoch	Large distribution shift; single merged artifact needed
LoRA (r=16)	24–28 GB	Task examples (1k–500k)	Held-out set; adapter comparison	1.5–3 h / epoch	Budget-constrained; frequent re-training; per-tenant multiplexing
QLoRA (r=64, 4-bit)	10–12 GB	Same as LoRA	Same as LoRA	2–4 h / epoch	24 GB class GPUs; MIG partitioning; high-frequency adapt cycles
DPO	60–80 GB	Preference pairs (chosen/rejected)	Win-rate vs reference; LLM-as-judge	6–12 h / epoch	Alignment; no reward model to train; simpler than RLHF PPO
RLHF (PPO)	150–200 GB aggregate	Preference pairs + demonstrations	Win-rate; human eval	12–40 h / epoch	High-stakes alignment; custom reward signal; validated annotation pipeline

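For 70B models, multiply VRAM figures by roughly 10x and expect multi-node gang-scheduled jobs for full fine-tune and RLHF PPO. LoRA and QLoRA remain single-node for 70B: QLoRA fits on 2x A100 80 GB cards, while 16-bit LoRA, at roughly 240–280 GB by the same 10x rule, needs more GPUs within the node.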
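The table can be turned into a simple fit check. The sketch below uses the upper bound of each 7B peak-VRAM range and the GPU classes this article's calculator lists; the dictionary keys and function name are hypothetical, and the 160 GB entry assumes the job can be sharded across both cards.

```python
# GPU classes in ascending capacity (GB).
GPU_CLASSES_GB = [
    ("RTX 4090", 24),
    ("MIG 3g.40gb", 40),
    ("L40S", 48),
    ("A100/H100", 80),
    ("2x A100 80 GB", 160),  # aggregate; assumes sharding across two cards
]

# Upper bound of each mode's peak-VRAM range for a 7B base (GB).
PEAK_VRAM_7B_GB = {
    "full_fine_tune": 70,
    "lora": 28,
    "qlora": 12,
    "dpo": 80,
    "rlhf_ppo": 200,
}


def smallest_fit(peak_gb):
    """Return the smallest listed GPU class that holds peak_gb, or None."""
    for name, capacity_gb in GPU_CLASSES_GB:
        if peak_gb <= capacity_gb:
            return name
    return None


for mode, peak in PEAK_VRAM_7B_GB.items():
    print(f"{mode:>15}: {peak:>3} GB -> {smallest_fit(peak) or 'more than two 80 GB GPUs'}")
```

A result with no headroom, such as DPO's upper bound on an 80 GB card, is a signal to plan for the next class up rather than a guarantee of fit.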

The calculator below applies the table’s baselines interactively: choose a model size and an adaptation mode to see the peak-VRAM envelope, the GPU class it fits, and the scheduling guidance from this article.

VRAM Budget Calculator

[Interactive component: pick a base model size, then compare the peak-VRAM envelope of each adaptation mode against the GPU classes it must fit. GPU classes shown: RTX 4090 (24 GB), MIG 3g.40gb (40 GB), L40S (48 GB), A100/H100 (80 GB), 2× A100 80 GB (160 GB). All figures are the article’s 7B decision-table baselines (50M-token dataset, one epoch unless noted).]

Example selection: LoRA (r=16) on a 7B base, 24–28 GB peak VRAM

Fits: MIG 3g.40gb (40 GB)
Typical wall-clock (7B, 50M tokens): 1.5–3 h / epoch
Data shape: Task examples (1k–500k)
Eval pattern: Held-out set; adapter comparison
Output artifact: Adapter file, typically 20–200 MB, loaded beside the frozen base
Scheduling fit: Single pod, no gang scheduling. A MIG partition or time-slicing is appropriate: the footprint is small and a whole GPU would be underutilised.

The RAG-vs-fine-tune decision rule

The most common misallocation of GPU budget in applied ML teams is fine-tuning a model to improve its recall of factual domain knowledge when RAG (Retrieval-Augmented Generation) would accomplish the same goal more cheaply and with better currency. Before authorising a fine-tuning job, apply these four diagnostic questions:

1. Knowledge recall or behaviour/style? If the model gives wrong domain facts but would answer correctly given those facts in context, the problem is a retrieval problem, not a weight problem. Fine-tuning will not reliably fix factual knowledge gaps; RAG will.
2. Stable or frequently-changing knowledge? Fine-tuned weights are static between training runs. If domain knowledge updates daily or weekly, RAG with a refreshed index is the correct architecture, because re-fine-tuning on every knowledge update is operationally unsustainable.
3. Rare or low-frequency knowledge? Soudani et al. (2024) showed RAG outperforms fine-tuning by a large margin for the least-popular factual entities: entities that appear infrequently in pre-training data and are unlikely to be well-represented in a fine-tuning dataset either [S6].
4. Provenance required? RAG grounds responses in retrieved documents that can be inspected, cited, and audited. Fine-tuned knowledge is opaque: there is no source document to show an auditor.

A 2024 empirical study (Balaguer et al.) found that fine-tuning alone raised domain QA accuracy by over 6 percentage points, and RAG on top of the fine-tuned model added a further ~5 percentage points — the improvements are cumulative [S5]. The techniques are not mutually exclusive: fine-tune for behaviour and style; use RAG for knowledge and currency.

Decision rule: if a human with access to the same documents could answer the question correctly, the model’s failure is a retrieval failure. Fix retrieval first. Fine-tune only after confirming the failure persists with good retrieval in place.
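The four diagnostics and the decision rule can be encoded as a pre-authorisation check. This is a minimal sketch: the `FailureProfile` fields, function names, and reason strings are hypothetical, and in practice each flag would come from an error analysis rather than a boolean someone fills in by hand.

```python
from dataclasses import dataclass


@dataclass
class FailureProfile:
    correct_when_docs_in_context: bool  # Q1: recall problem, not behaviour/style
    knowledge_changes_often: bool       # Q2: daily or weekly updates
    involves_rare_entities: bool        # Q3: low-frequency facts
    provenance_required: bool           # Q4: answers must be auditable


def rag_reasons(profile):
    """Return the diagnostics that point toward RAG (empty if none do)."""
    checks = [
        (profile.correct_when_docs_in_context, "retrieval failure, not a weight problem"),
        (profile.knowledge_changes_often, "knowledge changes faster than retraining cadence"),
        (profile.involves_rare_entities, "rare entities are poorly covered by fine-tuning data"),
        (profile.provenance_required, "responses must cite inspectable sources"),
    ]
    return [reason for flag, reason in checks if flag]


def recommend(profile):
    """Apply the decision rule: fix retrieval first if any diagnostic fires."""
    return "fix retrieval first (RAG)" if rag_reasons(profile) else "fine-tuning candidate"


style_problem = FailureProfile(False, False, False, False)
stale_facts = FailureProfile(True, True, False, False)

print(recommend(style_problem))
print(recommend(stale_facts), rag_reasons(stale_facts))
```

A "fine-tuning candidate" result only clears the RAG gate; mode selection against the GPU budget still follows.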

The picker below walks the budget, data, and quality constraints from this article — including the RAG gate — and routes to the adaptation mode that fits.

Adaptation Mode Picker

[Interactive component: walks the article’s budget, data, and quality constraints in order to route the task to RAG or one of the four adaptation modes.]

Question 1: Is the model failing on factual domain knowledge it would answer correctly if the right documents were in its context? (Wrong domain facts, frequently-changing knowledge, or rare entities: the four RAG diagnostics from this article.)

Lifecycle: kick-off to production

Fine-tuning jobs are typically event-driven or schedule-driven: a data drift threshold crosses, a new dataset version lands in the feature store, or a weekly cadence fires. The lifecycle follows five stages:

1. Data preparation: dataset versioned and validated in the artifact store. Schema and data-quality checks run before the training job starts. A fine-tune job that begins with corrupt data wastes the entire GPU budget.
2. Job submission: submitted via a workflow orchestrator (e.g. Argo Workflows) as a PyTorchJob, admitted by the cluster’s quota system when GPU capacity is available.
3. Training with checkpointing: the job writes intermediate checkpoints to durable object storage every 30–60 minutes. A node failure at any point recovers from the last checkpoint. Writing to node-local storage is the most common cause of lost fine-tuning runs.
4. Evaluation gate: the artifact is evaluated on a held-out set before promotion. For alignment-style training, LLM-as-judge win-rate evaluation is common. A fine-tune that produces a valid checkpoint but fails the eval gate is not done; it is a failed run.
5. Registry promotion: the artifact is registered with a version in a model registry (e.g. MLflow), transitioning through staging to production. The registry version is what the serving infrastructure pulls; it must be immutable once tagged production.

The two stages platform engineers most often see skipped are the evaluation gate and durable checkpointing — both reappear in the pitfalls below, and both are enforced by the model registry workflow rather than by the training code itself.
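The five stages and their gates can be sketched as a single function that stops at the first failed check. Everything here is illustrative: the function name, the outcome strings, and the assumption that durable storage is identified by an `s3://` or `gs://` URI prefix are not taken from any particular orchestrator or registry.

```python
# Assumption: object-store URI schemes that count as durable on this platform.
DURABLE_PREFIXES = ("s3://", "gs://")


def run_lifecycle(dataset_valid, checkpoint_uri, eval_score, eval_threshold):
    """Walk the five stages in order; return (stage_reached, outcome)."""
    # 1. Data preparation: refuse to spend GPU time on unvalidated data.
    if not dataset_valid:
        return ("data_preparation", "rejected: dataset failed validation")
    # 2. Job submission: reject configs that checkpoint to node-local paths.
    if not checkpoint_uri.startswith(DURABLE_PREFIXES):
        return ("job_submission", "rejected: checkpoints must use durable object storage")
    # 3. Training is assumed to have completed and produced eval_score.
    # 4. Evaluation gate: a valid checkpoint below threshold is a failed run.
    if eval_score < eval_threshold:
        return ("evaluation_gate", "failed: below eval threshold")
    # 5. Registry promotion: only gated artifacts are registered.
    return ("registry_promotion", "registered: staging")


print(run_lifecycle(True, "s3://ckpt-bucket/run-1", eval_score=0.82, eval_threshold=0.80))
print(run_lifecycle(True, "/mnt/local-ssd/run-2", eval_score=0.90, eval_threshold=0.80))
print(run_lifecycle(True, "s3://ckpt-bucket/run-3", eval_score=0.71, eval_threshold=0.80))
```

Placing the checks in the submission and promotion path, rather than in the training script, means a team cannot skip them by editing its own code.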

Scheduling fit by mode

Fine-tuning’s scheduling requirements differ sharply by mode, which is why mode selection must happen before infrastructure provisioning:

Mode	Gang scheduling?	GPU sharing appropriate?
LoRA / QLoRA single-GPU	No — single pod	Yes — MIG partition (small footprint)
Full fine-tune single-node	No — topology hints recommended	No — whole GPU(s)
Full fine-tune multi-node	Yes — all workers must start atomically	No — whole GPU(s)
DPO	No — single-model job	No — whole GPU(s)
RLHF PPO	Yes — policy + reward model co-schedule	No — whole GPU(s)

Single-GPU LoRA and QLoRA jobs are quota-admitted as standard single-worker training jobs. For multi-node full fine-tunes and RLHF PPO, gang scheduling ensures all worker pods start atomically — a partial start where some workers acquire GPUs and others queue causes deadlock and wastes the allocated capacity. The batch scheduler (e.g. Volcano, or Kueue with gang semantics) coordinates this via a pod group that schedules only when all minimum members can be satisfied simultaneously.
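A toy simulation makes the deadlock concrete. The two schedulers below are simplified stand-ins, not the behaviour of Volcano or Kueue: the first places one pod at a time, the second admits a job only when all of its workers fit. Job names and sizes are made up for the example.

```python
def schedule_pod_by_pod(free_gpus, jobs):
    """Naive scheduler: place one worker pod at a time, round-robin across jobs."""
    running = {name: 0 for name in jobs}
    progress = True
    while progress:
        progress = False
        for name, (workers, gpus_each) in jobs.items():
            if running[name] < workers and free_gpus >= gpus_each:
                running[name] += 1
                free_gpus -= gpus_each
                progress = True
    return running, free_gpus


def schedule_gang(free_gpus, jobs):
    """Gang scheduler: start a job only if every worker can start at once."""
    running = {name: 0 for name in jobs}
    for name, (workers, gpus_each) in jobs.items():
        needed = workers * gpus_each
        if free_gpus >= needed:
            running[name] = workers
            free_gpus -= needed
    return running, free_gpus


# Two multi-node jobs, each needing 4 workers x 2 GPUs, on a 12-GPU pool.
jobs = {"full-ft-a": (4, 2), "full-ft-b": (4, 2)}

naive_running, naive_free = schedule_pod_by_pod(12, jobs)
gang_running, gang_free = schedule_gang(12, jobs)

print("pod-by-pod:", naive_running, "free GPUs:", naive_free)
print("gang:      ", gang_running, "free GPUs:", gang_free)
```

Under pod-by-pod placement both jobs hold three of their four workers and every GPU is taken, so neither can ever start. Under gang semantics one job runs to completion while the other waits without holding any GPUs.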

For high-frequency LoRA fine-tuning pipelines, GPU partitioning (MIG on capable hardware, or time-slicing on smaller cards) allows a single physical GPU to hold multiple concurrent small fine-tune jobs. The 12 GB peak VRAM of a 7B QLoRA job fits into a sub-partition of an 80 GB card, leaving the remainder available for other workloads. On-prem GPU pools are the natural home for this pattern — the cost is sunk and most fine-tuning datasets should not leave the network boundary.

Common pitfalls

No eval gate. Fine-tunes that skip evaluation and write directly to production are the most common source of silent model degradation. Gate every registration on a held-out evaluation; automated LLM-as-judge checks are acceptable for alignment-style training where human evaluation is too slow.

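Wrong learning rate for the mode. LoRA tolerates higher learning rates than full fine-tuning: typical peak LRs for LoRA are 1e-4 to 3e-4; full fine-tunes run at 1e-5 to 5e-5. Applying a pre-training LR schedule (cosine decay with a warmup sized for a run over billions of tokens) to a fine-tuning job of a few thousand steps means the job may spend most of its steps in warmup and never decay from its peak LR, which invites instability. Size warmup and decay to the fine-tuning job’s own step count.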
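The mismatch is easy to see by evaluating a warmup-plus-cosine schedule at the last step of a short job. The sketch below is a generic schedule written for illustration; the step counts (a 3,000-step fine-tune, a 500,000-step pre-training schedule with 2,000 warmup steps) are assumed values, not recommendations.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


PEAK = 2e-4        # within the LoRA range quoted above
JOB_STEPS = 3000   # assumed length of the fine-tuning job

# Schedule sized to the job: short warmup, decay completes at the last step.
sized_final = lr_at(JOB_STEPS, total_steps=JOB_STEPS, peak_lr=PEAK, warmup_steps=90)

# Schedule copied from a long pre-training run: two thirds of the job is
# warmup, and the job ends while the LR is still essentially at peak.
copied_midway = lr_at(1000, total_steps=500_000, peak_lr=PEAK, warmup_steps=2000)
copied_final = lr_at(JOB_STEPS, total_steps=500_000, peak_lr=PEAK, warmup_steps=2000)

print(f"sized schedule, final LR:   {sized_final:.2e}")
print(f"copied schedule, step 1000: {copied_midway:.2e}")
print(f"copied schedule, final LR:  {copied_final:.2e}")
```

With the copied schedule the run never anneals, so the final checkpoint is taken at close to the highest learning rate of the whole job.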

Catastrophic forgetting in full fine-tunes. The model loses capability on base tasks while specialising. Mitigate with data mixing (interleave general instruction data with task-specific data), or use LoRA instead of full fine-tuning when the task does not require updating every layer.

RLHF reward hacking. The policy finds high-scoring outputs that do not match human intent. Guard with KL divergence penalties and periodic human spot-checks. If the team lacks a vetted annotation pipeline, DPO is lower-risk.

Unmerged LoRA adapters in high-QPS serving. For high-throughput single-adapter serving, merge the adapter into the base weights once (using PEFT’s merge_and_unload()) and deploy the merged model. The unmerged path is correct for per-tenant multiplexing, not for single-adapter high-QPS workloads. Serving cost implications are in the inference article (series-order/11).
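Merging works because the adapter is a plain additive update to the weight matrix. The toy example below uses pure-Python matrices to show that folding the scaled product B·A into W gives the same output as running the base and adapter paths separately; the shapes and values are arbitrary, and the alpha/r scaling follows the LoRA paper [S1].

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    return [[x * s for x in row] for row in a]


# Toy shapes: W is 2x3, rank r = 1, so B is 2x1 and A is 1x3.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0, 4.0]]
ALPHA, R = 2.0, 1
x = [[1.0], [2.0], [3.0]]  # column-vector input

# Unmerged serving: base path plus a separate adapter path on every request.
adapter_out = scale(matmul(B, matmul(A, x)), ALPHA / R)
unmerged = add(matmul(W, x), adapter_out)

# Merged serving: fold the update into W once, then a single matmul.
merged_W = add(W, scale(matmul(B, A), ALPHA / R))
merged = matmul(merged_W, x)

print("unmerged:", unmerged)
print("merged:  ", merged)
```

The merged path trades the ability to swap adapters for one fewer set of matrix multiplications per layer, which is the trade that matters at high QPS.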

Checkpoint to node-local storage. A node preemption or hardware failure loses the entire run. Write checkpoints to durable object storage. Not optional on clusters with spot or preemptible nodes.

Fine-tuning when you need RAG. If the model gives wrong answers about frequently-changing or rarely-seen domain facts, adding a retrieval layer will outperform fine-tuning at a fraction of the GPU budget. Apply the decision rule above before authorising the job.

Serving implications

The adaptation mode chosen in training directly constrains the serving architecture. A merged full fine-tune artifact deploys as an ordinary model checkpoint, with no special runtime required. An unmerged LoRA adapter requires the serving runtime to load both base model and adapter, and to support adapter hot-swapping for multi-tenant deployments. For LoRA multiplexing at scale, a serving runtime that supports in-memory adapter management (e.g. vLLM’s LoRA support, or equivalent) is necessary to avoid cold-loading the adapter per request. The throughput, latency, and autoscaling implications of each serving mode are in the online and batch inference article (series-order/11).

References

[S1] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arxiv.org/abs/2106.09685
[S2] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arxiv.org/abs/2305.14314
[S3] Ouyang, L., Wu, J., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). NeurIPS 2022. arxiv.org/abs/2203.02155
[S4] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arxiv.org/abs/2305.18290
[S5] Balaguer, A., Benara, V., et al. (2024). RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. arxiv.org/abs/2401.08406
[S6] Soudani, H., Kanoulas, E., & Hasibi, F. (2024). Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. arxiv.org/abs/2403.01432
[S7] Hugging Face. PEFT: Parameter-Efficient Fine-Tuning documentation. huggingface.co/docs/peft
[S8] Hugging Face. TRL: Transformer Reinforcement Learning. github.com/huggingface/trl
