What is Fine-Tuning?
Fine-tuning is the process of further training a pre-trained foundation model on a smaller, task-specific dataset to specialize its behavior — adjusting its weights so it performs a particular task, follows a particular style, or handles a particular domain better than the base model. It is the bridge between a general-purpose foundation model and a production application that needs specialized behavior.
When you fine-tune vs. when you prompt
Most teams reach for prompting (system prompts, few-shot examples, RAG) before fine-tuning, because:
- Prompting is faster to iterate
- Prompting doesn't require infrastructure or labeled data
- Prompting changes are reversible
Fine-tuning is the right tool when:
- You have a large, high-quality labeled dataset that captures the desired behavior
- The behavior involves consistent style, format, or terminology that few-shot examples can't reliably enforce
- You need to reduce inference cost (a fine-tuned smaller model can match a prompted larger one)
- You need to encode domain knowledge that's too large or proprietary to fit in every prompt
Common fine-tuning approaches
- Full fine-tuning — update all model weights. Expensive, requires GPU clusters for any reasonably-sized model. Mostly reserved for foundation-model providers.
- LoRA (Low-Rank Adaptation) — train small additive matrices that modify the model's behavior without touching the base weights. Cheap, fast, the dominant technique for application-layer fine-tuning.
- QLoRA — LoRA with quantized base weights, runs on consumer GPUs.
- Instruction tuning — fine-tune on (instruction, response) pairs to make a base model better at following instructions.
- RLHF (Reinforcement Learning from Human Feedback) — fine-tune using human preferences over model outputs as the reward signal. Used by all major foundation-model providers for alignment.
- DPO (Direct Preference Optimization) — alternative to RLHF that uses preference pairs directly without an explicit reward model. Increasingly popular.
Security implications
Fine-tuning is itself an attack surface. Three concrete risks:
- Training data poisoning. Adversarial examples in the fine-tuning corpus can install backdoors that activate on specific triggers, or can systematically degrade the model's safety behavior on targeted inputs. Especially relevant when the corpus is sourced from user content.
- Safety regression. Fine-tuning on benign-looking task data can erode the safety training the foundation model carries. OpenAI, Anthropic, and Google have all documented cases where fine-tuning makes a model less safe even when no harmful content was in the training set.
- Distillation safety gaps. When a smaller model is trained to mimic a foundation model, it inherits the foundation's outputs but not necessarily its refusal training. Repello's research on DeepSeek-R1 distilled variants documented systematic gaps where distilled models produced harmful outputs the original would have blocked.
For any production fine-tuned model, the safety evaluation that was run on the foundation model needs to be re-run on the fine-tuned variant — assumptions don't transfer through training.