Fine-tuning, LoRA, QLoRA, RLHF/DPO — picking the adaptation that fits your budget

Adaptation mode selection drives GPU budget as much as model size does.
A base model is a general-purpose prior. Adaptation shapes that prior toward a target task — domain vocabulary, output format, tone, or alignment with human preferences. Platform engineers are not the ones choosing which adaptation technique to use, but they are the ones who must provision the GPU memory, schedule the job fairly against the rest of the cluster, and ensure the resulting artifact is trustworthy enough to promote to production. The choice of adaptation mode determines the resource envelope as much as model size does: a LoRA adapter for a 7B-parameter model can be trained on a single GPU in under two hours, while a full fine-tune of the same model across a 50M-token dataset can run for twelve.
This article covers the four adaptation modes that appear in practice — full fine-tuning, LoRA/PEFT, QLoRA, and RLHF/DPO — and closes with the RAG-vs-fine-tune decision rule. It assumes you have read the earlier workload taxonomy and training articles in this series; cross-references are provided where the scheduling primitives are defined.
The four adaptation modes
The modes differ on one axis above all others: how many weights change. That governs peak GPU memory, whether multi-GPU is required, how long the job runs, and what the output artifact looks like.
Mode 1 — Full fine-tuning
Every weight in the model is updated. Training holds a gradient and two Adam moment tensors for every parameter, so peak memory is approximately 4× the model's fp16 footprint when all of them are kept in 16-bit precision (model weights + gradients + two optimizer states). A 7B model in fp16 is roughly 14 GB, so full fine-tuning with Adam requires ~56–70 GB peak VRAM, depending on batch size and gradient checkpointing. Mixed-precision recipes that keep fp32 master weights and fp32 optimizer states need more. ZeRO shards this state across GPUs (stage 2 shards the optimizer state and gradients; stage 3 also shards the parameters), making large full fine-tunes feasible on multi-GPU nodes.
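The 4× rule of thumb is plain arithmetic, and a minimal sketch makes the assumptions explicit. The function name and the decimal-GB convention are illustrative choices, and the estimate covers static training state only; activations and framework overhead are left out.

```python
GB = 1e9  # decimal gigabytes, matching the round figures in the text


def full_finetune_static_gb(n_params: float, bytes_per_value: int = 2,
                            optimizer_states: int = 2) -> float:
    """Static training memory: weights + gradients + optimizer states.

    Assumes every tensor is stored at the same precision (fp16 = 2 bytes),
    which is the simplification behind the 4x rule of thumb.
    Activations and framework overhead are NOT included.
    """
    copies = 1 + 1 + optimizer_states  # weights, gradients, Adam m and v
    return n_params * bytes_per_value * copies / GB


weights_gb = 7e9 * 2 / GB                 # fp16 weights of a 7B model
static_gb = full_finetune_static_gb(7e9)  # weights + grads + Adam states
print(f"weights: {weights_gb:.0f} GB, static training state: {static_gb:.0f} GB")
```

The gap between this static figure and the upper end of the quoted range is where batch size and gradient checkpointing come in, so treat the result as a floor rather than a provisioning number.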
Use full fine-tuning when: (a) the task diverges substantially from the base model's pre-training distribution, (b) the team wants a single deployable artifact with no runtime adapter overhead, or (c) alignment-critical deployments require white-box access to every layer. Avoid it when the dataset is small: a few thousand examples are not enough to update all parameters of a billion-parameter model without severe overfitting.
Mode 2 — LoRA and parameter-efficient fine-tuning (PEFT)
LoRA (Low-Rank Adaptation) freezes the pre-trained model weights and injects small trainable rank-decomposition matrices into the attention layers of the transformer. Only the adapter weights are updated during training. Hu et al. (ICLR 2022) showed that LoRA reduces trainable parameters by up to 10,000× compared to full fine-tuning of GPT-3 175B, with GPU memory requirements reduced by approximately 3×, while matching or exceeding full fine-tune quality on RoBERTa, DeBERTa, GPT-2, and GPT-3 benchmarks — and adding no additional inference latency [S1].
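For reference, the update described in [S1] can be written compactly. The notation follows the paper; the dimensions d and k and the rank r are generic rather than tied to any particular model.

$$
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

Only A and B receive gradients, so each adapted matrix trains r(d + k) parameters instead of dk. Because the product BA has the same shape as the frozen weight, it can be added into that weight after training, which is why a merged adapter adds no inference latency.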
The output artifact is a small adapter file — typically 20–200 MB for a 7B model — that is loaded alongside the frozen base model at inference time. This makes per-tenant LoRA multiplexing practical: a single base model deployment can serve multiple fine-tuned adapters by swapping the adapter weights per request or per session. The Hugging Face PEFT library is the standard implementation [S7].
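A quick estimate shows where adapter sizes in that range come from. This is a sketch under assumed shapes: 32 layers and a hidden size of 4096 are typical of 7B-class decoders, and applying rank 16 to the four attention projections is one common choice, not a requirement of the method.

```python
def lora_adapter_params(hidden: int, layers: int, rank: int,
                        matrices_per_layer: int) -> int:
    """Trainable parameters when each adapted matrix is hidden x hidden.

    Each adapted weight gets A (rank x hidden) and B (hidden x rank),
    so rank * (hidden + hidden) parameters per matrix.
    """
    per_matrix = rank * (hidden + hidden)
    return per_matrix * matrices_per_layer * layers


# Assumed shape: 32 layers, hidden size 4096, rank 16 on q, k, v and o.
params = lora_adapter_params(hidden=4096, layers=32, rank=16,
                             matrices_per_layer=4)
size_mb = params * 2 / 1e6  # 2 bytes per parameter in fp16/bf16

print(f"{params:,} trainable parameters, about {size_mb:.1f} MB")
```

Under these assumptions the adapter is roughly 34 MB. Raising the rank or adapting the MLP projections as well moves the size toward the upper end of the range.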
Mode 3 — QLoRA (quantised LoRA)
QLoRA extends LoRA by loading the base model in 4-bit NF4 (NormalFloat4) quantisation, dramatically reducing peak VRAM while applying LoRA adapters in 16-bit precision. Dettmers et al. (NeurIPS 2023) demonstrated that a 65B-parameter model can be fine-tuned on a single 48 GB GPU without measurable degradation relative to 16-bit full fine-tuning; their Guanaco models reached 99.3% of ChatGPT's performance on the Vicuna benchmark after 24 hours of single-GPU training [S2]. The three key mechanisms are 4-bit NF4 quantisation of base weights, double quantisation to reduce memory overhead of quantisation constants, and paged optimizers to handle memory spikes.
QLoRA is the practical choice when the GPU pool has mid-range cards (24 GB VRAM class) or when GPU memory is scarce. A 7B model in QLoRA requires roughly 12 GB peak VRAM — a single MIG partition of type 3g.40gb on an A100 80 GB can hold the job, leaving the rest of the physical card available for inference workloads.
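The effect of double quantisation can be checked with a few lines of arithmetic. The sketch below uses the block sizes reported in the QLoRA paper [S2] (64 weights per first-level block, 256 constants per second-level block); the function name is illustrative.

```python
def quant_constant_bits(block_size: int = 64, double_quant: bool = False,
                        second_block_size: int = 256) -> float:
    """Average bits per parameter spent on quantisation constants.

    Without double quantisation: one fp32 constant per block of weights.
    With double quantisation: the constants are themselves quantised to
    8 bits, with one fp32 constant per block of first-level constants.
    """
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = quant_constant_bits()                    # overhead without DQ
double = quant_constant_bits(double_quant=True)  # overhead with DQ
saving = plain - double                          # bits saved per parameter

n_params = 7e9
base_gb = n_params * (4 + double) / 8 / 1e9      # 4-bit weights + constants
print(f"saving {saving:.3f} bits/param; 7B base weights about {base_gb:.2f} GB")
```

The frozen base weights of a 7B model come to well under 4 GB this way. The rest of the roughly 12 GB peak goes to activations, the 16-bit adapter and its optimizer state, and runtime overhead, all of which vary with batch size and sequence length.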
Mode 4 — RLHF and DPO (alignment-style training)
Alignment training shapes the model's output distribution using preference or reward signals rather than next-token prediction loss. The original RLHF pipeline (Ouyang et al., NeurIPS 2022) runs in three stages: supervised fine-tuning on demonstration data, reward model training on human-ranked output pairs, and PPO-based policy optimisation against the reward model [S3]. The memory requirement is substantial: at 7B policy scale, the policy, the reward model, the frozen reference policy, and (in the usual PPO setup) a value model must be resident simultaneously, requiring 150–200 GB aggregate VRAM.
Direct Preference Optimization (DPO), introduced by Rafailov et al. (NeurIPS 2023), eliminates the explicit reward model entirely. DPO derives a closed-form mapping from the reward function to the optimal policy, reducing alignment training to a binary classification loss over preference pairs. Rafailov et al. showed that DPO matches or exceeds PPO-based RLHF on summarisation and single-turn dialogue quality benchmarks while being substantially simpler to implement and train [S4]. The memory footprint at 7B scale drops to roughly 60–80 GB, comparable to a full fine-tune of a single model, because only the trainable policy and a frozen reference copy are held in memory.
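The classification loss is short enough to write out. The following is a minimal single-pair sketch in plain Python rather than a training-ready implementation: inputs are summed sequence log-probabilities, and `beta=0.1` is an assumed, commonly used default rather than a value from this article.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probs.

    loss = -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == softplus(-m); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(at_init, improved)
```

The reference log-probabilities are the reason a frozen copy of the model stays in memory during training, unless they are precomputed for the whole dataset ahead of time.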
A third variant, RLAIF (Reinforcement Learning from AI Feedback), replaces human annotators with an LLM judge that generates preference pairs. The downstream training is the same as RLHF or DPO; the difference is the annotation pipeline. RLAIF reduces annotation cost at the expense of a second LLM dependency and the risk of systematic bias from the judge model.
Four modes × five axes: the decision table
The following table maps each adaptation mode to the five axes that govern infrastructure decisions. All wall-clock estimates assume a 50M-token dataset and a 7B-parameter base model unless noted.
| Mode | Peak VRAM (7B) | Data shape | Eval pattern | Typical wall-clock | When to pick |
|-----------------------|-----------------------|------------------------------------|--------------------------------------|---------------------------|----------------------------------------------------------------------------|
| Full fine-tune | 60-70 GB | Task examples (10k-1M) | Held-out set; task metrics | 6-12 h / epoch | Large distribution shift; single merged artifact needed |
| LoRA (r=16) | 24-28 GB | Task examples (1k-500k) | Held-out set; adapter comparison | 1.5-3 h / epoch | Budget-constrained; frequent re-training; per-tenant multiplexing |
| QLoRA (r=64, 4-bit) | 10-12 GB | Same as LoRA | Same as LoRA | 2-4 h / epoch | 24 GB class GPUs; MIG partitioning; high-frequency adapt cycles |
| DPO | 60-80 GB | Preference pairs (chosen/rejected) | Win-rate vs reference; LLM-as-judge | 6-12 h / epoch | Alignment; no separate reward model; simpler than RLHF PPO |
| RLHF (PPO) | 150-200 GB aggregate | Preference pairs + demonstrations | Win-rate; human eval | 12-40 h / epoch | High-stakes alignment; custom reward signal; validated annotation pipeline |

For 70B models, multiply VRAM figures by roughly 10x and expect multi-node gang-scheduled jobs for full fine-tune and RLHF PPO. LoRA and QLoRA remain single-node for 70B: QLoRA fits on 2x A100 80 GB cards, while 16-bit LoRA, whose frozen base weights alone take about 140 GB, needs more of the node's GPUs.
The RAG-vs-fine-tune decision rule
The most common misallocation of GPU budget in applied ML teams is fine-tuning a model to improve its recall of factual domain knowledge when RAG (Retrieval-Augmented Generation) would accomplish the same goal more cheaply and with better currency. Before authorising a fine-tuning job, apply these four diagnostic questions:
- Knowledge recall or behaviour/style? If the model gives wrong domain facts but would answer correctly given those facts in context, the problem is a retrieval problem, not a weight problem. Fine-tuning will not reliably fix factual knowledge gaps — RAG will.
- Stable or frequently-changing knowledge? Fine-tuned weights are static between training runs. If domain knowledge updates daily or weekly, RAG with a refreshed index is the correct architecture — re-fine-tuning on every knowledge update is operationally unsustainable.
- Rare or low-frequency knowledge? Soudani et al. (2024) showed RAG outperforms fine-tuning by a large margin for least-popular factual entities — entities that appear infrequently in pre-training data and are unlikely to be well-represented in a fine-tuning dataset either [S6].
- Provenance required? RAG grounds responses in retrieved documents that can be inspected, cited, and audited. Fine-tuned knowledge is opaque — there is no source document to show an auditor.
A 2024 empirical study (Balaguer et al.) found that fine-tuning alone raised domain QA accuracy by over 6 percentage points, and RAG on top of the fine-tuned model added a further ~5 percentage points — the improvements are cumulative [S5]. The techniques are not mutually exclusive: fine-tune for behaviour and style; use RAG for knowledge and currency.
Decision rule: if a human with access to the same documents could answer the question correctly, the model's failure is a retrieval failure. Fix retrieval first. Fine-tune only after confirming the failure persists with good retrieval in place.
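The four diagnostic questions and the decision rule can be encoded as a small pre-authorisation check. This is a sketch only: the `AdaptationRequest` fields and the `recommend` function are hypothetical names for an intake form, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class AdaptationRequest:
    # Hypothetical intake fields mirroring the four diagnostic questions.
    failure_is_knowledge_recall: bool   # wrong facts, right behaviour
    knowledge_changes_often: bool       # daily or weekly updates
    knowledge_is_rare: bool             # long-tail entities
    provenance_required: bool           # answers must cite sources
    failure_persists_with_retrieval: bool = False


def recommend(req: AdaptationRequest) -> str:
    """Return 'rag', 'fine-tune', or 'fine-tune + rag'."""
    needs_retrieval = (req.failure_is_knowledge_recall
                       or req.knowledge_changes_often
                       or req.knowledge_is_rare
                       or req.provenance_required)
    if not needs_retrieval:
        return "fine-tune"          # behaviour or style problem
    if req.failure_persists_with_retrieval:
        return "fine-tune + rag"    # retrieval is in place and not enough
    return "rag"                    # fix retrieval first


style_only = recommend(AdaptationRequest(False, False, False, False))
stale_facts = recommend(AdaptationRequest(True, True, False, False))
print(style_only, stale_facts)
```

Keeping the "persists with retrieval" flag separate forces the requester to show that retrieval was tried before GPU time is committed.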
Lifecycle: kick-off to production
Fine-tuning jobs are typically event-driven or schedule-driven: a data-drift threshold is crossed, a new dataset version lands in the feature store, or a weekly cadence fires. The lifecycle follows five stages:
- Data preparation — dataset versioned and validated in the artifact store. Schema and data-quality checks run before the training job starts. A fine-tune job that begins with corrupt data wastes the entire GPU budget.
- Job submission — submitted via a workflow orchestrator (e.g. Argo Workflows) as a PyTorchJob, admitted by the cluster's quota system when GPU capacity is available.
- Training with checkpointing — the job writes intermediate checkpoints to durable object storage every 30–60 minutes. A node failure at any point recovers from the last checkpoint. Writing to node-local storage is the most common cause of lost fine-tuning runs.
- Evaluation gate — the artifact is evaluated on a held-out set before promotion. For alignment-style training, LLM-as-judge win-rate evaluation is common. A fine-tune that produces a valid checkpoint but fails the eval gate is not done — it is a failed run.
- Registry promotion — the artifact is registered with a version in a model registry (e.g. MLflow), transitioning through staging to production. The registry version is what the serving infrastructure pulls; it must be immutable once tagged production.
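The last two stages can be sketched as a gate function plus a one-step stage transition. The metric names, the `max_regression` tolerance, and the stage labels are assumptions for illustration; a real pipeline would call its model registry's own API for the transition.

```python
def passes_eval_gate(candidate: dict, baseline: dict,
                     max_regression: float = 0.01) -> bool:
    """True if no metric regresses by more than max_regression.

    Both arguments map metric name -> score, higher is better.
    A metric missing from the candidate counts as a failure.
    """
    for name, base_score in baseline.items():
        if name not in candidate:
            return False
        if candidate[name] < base_score - max_regression:
            return False
    return True


def next_stage(current: str, gate_passed: bool) -> str:
    """Advance one registry stage only when the gate passed."""
    order = ["none", "staging", "production"]
    if not gate_passed or current == "production":
        return current
    return order[order.index(current) + 1]


baseline = {"task_accuracy": 0.82, "general_qa": 0.75}
good_run = {"task_accuracy": 0.88, "general_qa": 0.76}
forgetful_run = {"task_accuracy": 0.90, "general_qa": 0.60}

print(next_stage("none", passes_eval_gate(good_run, baseline)))
print(next_stage("none", passes_eval_gate(forgetful_run, baseline)))
```

Including a general-capability metric next to the task metric lets the same gate catch a run that improves on the target task while regressing elsewhere.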
Scheduling fit by mode
Fine-tuning's scheduling requirements differ sharply by mode, which is why mode selection must happen before infrastructure provisioning:
| Mode | Gang scheduling? | GPU sharing appropriate? |
|-----------------------------|-----------------------------------------------|------------------------------------------|
| LoRA / QLoRA single-GPU | No - single pod | Yes - MIG partition (small footprint) |
| Full fine-tune single-node | No - topology hints recommended | No - whole GPU(s) |
| Full fine-tune multi-node | Yes - all workers must start atomically | No - whole GPU(s) |
| DPO | No - single-model job | No - whole GPU(s) |
| RLHF PPO | Yes - policy + reward model co-schedule | No - whole GPU(s) |

Single-GPU LoRA and QLoRA jobs are quota-admitted as standard single-worker training jobs. For multi-node full fine-tunes and RLHF PPO, gang scheduling ensures all worker pods start atomically: a partial start, where some workers acquire GPUs and others queue, causes deadlock and wastes the allocated capacity. The batch scheduler (e.g. Volcano, or Kueue with gang semantics) coordinates this via a pod group that schedules only when all minimum members can be satisfied simultaneously.
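The all-or-nothing admission check behind gang scheduling can be illustrated in a few lines. This is a simplified model, not how Volcano or Kueue are implemented: it assumes identical workers, that a worker cannot span nodes, and it ignores topology and priorities.

```python
from typing import Sequence


def can_admit_gang(free_gpus_per_node: Sequence[int], min_members: int,
                   gpus_per_worker: int) -> bool:
    """All-or-nothing check: can every worker be placed right now?

    Assumes identical workers and that a worker cannot span nodes.
    """
    placeable = sum(free // gpus_per_worker for free in free_gpus_per_node)
    return placeable >= min_members


# 14 GPUs are free in total, but only one node can host an 8-GPU worker.
fragmented = can_admit_gang([8, 3, 3], min_members=2, gpus_per_worker=8)
# Two full nodes are free, so both workers can start together.
whole_nodes = can_admit_gang([8, 8, 1], min_members=2, gpus_per_worker=8)
print(fragmented, whole_nodes)
```

In the fragmented case a scheduler without gang semantics would start one worker and leave it holding eight GPUs while it waits for a peer, which is the partial-start deadlock described above.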
For high-frequency LoRA fine-tuning pipelines, GPU partitioning (MIG on capable hardware, or time-slicing on smaller cards) allows a single physical GPU to hold multiple concurrent small fine-tune jobs. The 12 GB peak VRAM of a 7B QLoRA job fits into a sub-partition of an 80 GB card, leaving the remainder available for other workloads. On-prem GPU pools are the natural home for this pattern — the cost is sunk and most fine-tuning datasets should not leave the network boundary.
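Matching a job to a partition is a simple lookup. The sketch below lists a subset of the MIG profiles available on an A100 80 GB with their nominal memory sizes; usable memory is slightly lower in practice, and the 20% headroom factor is an assumption to tune per workload.

```python
from typing import Optional

# Nominal memory per MIG profile on an A100 80 GB, in GB (subset of profiles).
# Usable memory is slightly lower, hence the headroom factor below.
A100_80GB_PROFILES = [
    ("1g.10gb", 10), ("2g.20gb", 20), ("3g.40gb", 40), ("7g.80gb", 80),
]


def smallest_profile(peak_vram_gb: float,
                     headroom: float = 1.2) -> Optional[str]:
    """Return the smallest profile whose memory covers peak * headroom."""
    required = peak_vram_gb * headroom
    for name, mem_gb in A100_80GB_PROFILES:
        if mem_gb >= required:
            return name
    return None  # does not fit on one card even unpartitioned


qlora_7b = smallest_profile(12)   # 7B QLoRA, roughly 12 GB peak
lora_7b = smallest_profile(26)    # 7B LoRA, mid-range of 24-28 GB
print(qlora_7b, lora_7b)
```

Choosing the smallest profile that fits maximises what remains for other tenants; a larger profile trades that density for more compute slices and memory headroom.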
Common pitfalls
No eval gate. Fine-tunes that skip evaluation and write directly to production are the most common source of silent model degradation. Gate every registration on a held-out evaluation; automated LLM-as-judge checks are acceptable for alignment-style training where human evaluation is too slow.
Wrong learning rate for the mode. LoRA tolerates higher learning rates than full fine-tuning: typical peak LRs for LoRA are 1e-4 to 3e-4; full fine-tunes run at 1e-5 to 5e-5. Applying a pre-training LR schedule (cosine with warmup sized for hundreds of thousands of steps) to a fine-tuning job of a few thousand steps is a path to divergence.
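A schedule sized to the job is straightforward to write down. This sketch uses linear warmup followed by cosine decay; the 3,000-step run, 100 warmup steps, and peak LR of 2e-4 are illustrative values for a LoRA job, and the pre-training figures in the comparison are likewise assumed.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup then cosine decay, both sized to total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Schedule sized to the fine-tune: reaches peak at step 100, decays to ~0.
sized_end = lr_at(3000, total_steps=3000, peak_lr=2e-4, warmup_steps=100)
# Pre-training schedule reused unchanged: the job ends almost at peak LR.
reused_end = lr_at(3000, total_steps=500_000, peak_lr=2e-4, warmup_steps=2000)
print(sized_end, reused_end)
```

With the reused schedule the run stops while the learning rate is still at its maximum, so the final weights never see the low-LR phase that normally settles training.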
Catastrophic forgetting in full fine-tunes. The model loses capability on base tasks while specialising. Mitigate with data mixing (interleave general instruction data with task-specific data), or use LoRA instead of full fine-tuning when the task does not require updating every layer.
RLHF reward hacking. The policy finds high-scoring outputs that do not match human intent. Guard with KL divergence penalties and periodic human spot-checks. If the team lacks a vetted annotation pipeline, DPO is lower-risk.
Unmerged LoRA adapters in high-QPS serving. For high-throughput single-adapter serving, merge the adapter into the base weights once (using PEFT's merge_and_unload()) and deploy the merged model. The unmerged path is correct for per-tenant multiplexing, not for single-adapter high-QPS workloads. Serving cost implications are in the inference article (series-order/11).
Checkpoint to node-local storage. A node preemption or hardware failure loses the entire run. Write checkpoints to durable object storage. Not optional on clusters with spot or preemptible nodes.
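Durable storage only helps if the job can find a usable checkpoint on restart. The sketch below assumes the object store is exposed as a mounted path, a hypothetical `step-<n>` directory naming convention, and a `COMPLETE` marker file written last; all three are illustrative conventions rather than features of a specific framework.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

STEP_DIR = re.compile(r"^step-(\d+)$")  # hypothetical naming convention


def latest_complete_checkpoint(root: Path) -> Optional[Path]:
    """Newest checkpoint directory that contains a COMPLETE marker file.

    The marker is written last, so a directory without it is a checkpoint
    that was interrupted mid-write and must be ignored on resume.
    """
    best_step, best_dir = -1, None
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        if match and child.is_dir() and (child / "COMPLETE").exists():
            step = int(match.group(1))
            if step > best_step:
                best_step, best_dir = step, child
    return best_dir


# Demo: step-1500 was being written when the node was preempted.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step, complete in [(500, True), (1000, True), (1500, False)]:
        ckpt = root / f"step-{step}"
        ckpt.mkdir()
        if complete:
            (ckpt / "COMPLETE").touch()
    resume_from = latest_complete_checkpoint(root).name

print(resume_from)
```

Comparing step numbers as integers rather than sorting directory names avoids resuming from `step-500` when `step-1000` exists.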
Fine-tuning when you need RAG. If the model gives wrong answers about frequently-changing or rarely-seen domain facts, adding a retrieval layer will outperform fine-tuning at a fraction of the GPU budget. Apply the decision rule above before authorising the job.
Serving implications
The adaptation mode chosen in training directly constrains the serving architecture. A merged full fine-tune artifact deploys as an ordinary model checkpoint — no special runtime required. An unmerged LoRA adapter requires the serving runtime to load both base model and adapter, and to support adapter hot-swapping for multi-tenant deployments. For LoRA multiplexing at scale, a serving runtime that supports in-memory adapter management (e.g. vLLM's LoRA support, or equivalent) is necessary to avoid cold-loading the adapter per request. The throughput, latency, and autoscaling implications of each serving mode are in the online and batch inference article (series-order/11).
References
- [S1] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arxiv.org/abs/2106.09685
- [S2] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arxiv.org/abs/2305.14314
- [S3] Ouyang, L., Wu, J., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). NeurIPS 2022. arxiv.org/abs/2203.02155
- [S4] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. arxiv.org/abs/2305.18290
- [S5] Balaguer, A., Benara, V., et al. (2024). RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. arxiv.org/abs/2401.08406
- [S6] Soudani, H., Kanoulas, E., & Hasibi, F. (2024). Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge. arxiv.org/abs/2403.01432
- [S7] Hugging Face. PEFT: Parameter-Efficient Fine-Tuning documentation. huggingface.co/docs/peft
- [S8] Hugging Face. TRL: Transformer Reinforcement Learning. github.com/huggingface/trl
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist