When should you use prompt engineering versus fine-tuning to adapt an LLM?
Prompt engineering is the right starting point when the task can be described in natural language, the required knowledge already exists in the base model, and iteration speed matters — no training required. Fine-tuning is warranted when you need consistent output format at scale, domain-specific style that prompts cannot reliably impose, or when latency and token costs from long system prompts are prohibitive.
How to think about it
The most common mistake practitioners make is jumping to fine-tuning before exhausting prompt engineering. Fine-tuning is slower, costlier, and harder to iterate on — but it wins decisively in several scenarios.
When prompt engineering is the better choice
- Rapid prototyping: no data collection, no training run; iterate in hours.
- Behavioural guidance: chain-of-thought, role assignment, and few-shot examples cover most reasoning and tone requirements.
- Knowledge is already in the model: asking a capable base model to summarise, classify, or translate needs no weight updates.
- Low volume: if you are running hundreds of queries per day, a long system prompt is affordable.
When fine-tuning earns its cost
- Rigid output format: structured JSON, SQL templates, or domain-specific schemas are hard to guarantee through prompting alone; fine-tuning reliably bakes in format adherence.
- Style and voice: a brand’s specific writing style, or a medical documentation standard, is difficult to capture in a prompt.
- Latency / cost at scale: a fine-tuned smaller model can match a prompted larger model at a fraction of the inference cost.
- Proprietary domain knowledge: niche jargon, internal codebases, or specialized terminology that the base model lacks.
Decision heuristic
1. Does a good prompt + few-shot examples already give acceptable results?
→ Yes: ship the prompt. No: continue.
2. Is the gap about format/style consistency, not missing knowledge?
→ Yes: fine-tune on (instruction, output) pairs.
No: add retrieval (RAG) for missing knowledge first, then re-evaluate.
3. Is the volume high enough to amortise training cost?
→ No: consider a prompted larger model. Yes: fine-tune a smaller one.
Parameter-efficient fine-tuning (PEFT)
LoRA and QLoRA significantly lower the barrier by updating only a small number of injected adapter weights, leaving the base model frozen. This reduces GPU memory and training time by an order of magnitude, making fine-tuning practical on single consumer GPUs.