Fine-Tuning
Permanently teach a model new behaviour. Learn how fine-tuning differs from prompting and RAG, when each is the right tool, how instruction-tuning and LoRA work, and how to source knowledge.
14.1 The Optimization Spectrum
Out of the box, LLMs are generalists. To build effective applications we guide them. There are three levels of optimization:
| Approach | What it is | Changes the model? |
|---|---|---|
| 1. Prompting (in-context learning) | Giving instructions/examples in the input window | No; fastest to implement |
| 2. RAG (Retrieval-Augmented Generation) | Injecting relevant external knowledge into the prompt | No; best for adding knowledge |
| 3. Fine-Tuning (weight adaptation) | Training the model on a dataset to permanently change its behaviour | Yes; best for complex behaviours |
The context window
Everything in prompting and RAG happens inside the context window — the limited amount of text the model can consider at once. It holds the system instructions, conversation history, user input and any retrieved knowledge. A huge system prompt leaves less room for the conversation.
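The trade-off can be made concrete with a little arithmetic. The sketch below is only illustrative: the 8,000-token limit and the per-part token counts are made-up numbers, not properties of any particular model, and `remaining_tokens` is a hypothetical helper.

```python
# Illustrative token budget for one request. All numbers are assumptions
# chosen for the example, not limits of a specific model.
CONTEXT_LIMIT = 8_000  # hypothetical context window size, in tokens


def remaining_tokens(system: int, history: int, retrieved: int, user: int,
                     limit: int = CONTEXT_LIMIT) -> int:
    """Return how many tokens are left once the prompt parts are counted."""
    used = system + history + retrieved + user
    if used > limit:
        raise ValueError(f"prompt uses {used} tokens, limit is {limit}")
    return limit - used


# Same conversation, same retrieved text; only the system prompt size differs.
room_with_short_prompt = remaining_tokens(system=200, history=1_500, retrieved=2_000, user=100)
room_with_huge_prompt = remaining_tokens(system=4_000, history=1_500, retrieved=2_000, user=100)
print(room_with_short_prompt, room_with_huge_prompt)
```

With these assumed numbers the huge system prompt leaves only a small fraction of the window free, which is the squeeze described above.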
14.2 Fine-Tuning vs Prompting (and RAG)
| Need | Best approach |
|---|---|
| New knowledge (news, company data) | RAG — retrieval |
| A specific output format (JSON, code) | Prompting — few-shot |
| A consistent tone / style | Fine-Tuning |
| The prompt is too long / expensive | Fine-Tuning (bake instructions in) |
| Fixing occasional reasoning errors | Chain-of-Thought prompting |
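As a rough illustration of the "Prompting, few-shot" row, the sketch below builds a prompt that demonstrates a JSON output format by example. The task, the example sentences and the helper name `build_few_shot_prompt` are all invented for illustration; the resulting string would be sent to whatever model you use.

```python
import json

# Hypothetical examples: each pairs an input with the exact JSON we want back.
EXAMPLES = [
    ("Alice is 30 years old.", {"name": "Alice", "age": 30}),
    ("Bob just turned 25.", {"name": "Bob", "age": 25}),
]


def build_few_shot_prompt(text: str) -> str:
    """Build a prompt that shows the desired JSON format by example."""
    parts = ["Extract the person's name and age as JSON."]
    for source, target in EXAMPLES:
        parts.append(f"Text: {source}\nJSON: {json.dumps(target)}")
    # End with the new input and an open "JSON:" slot for the model to fill.
    parts.append(f"Text: {text}\nJSON:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt("Carol is 41.")
print(prompt)
```

Nothing about the model changes here; the format is taught entirely inside the context window, which is why this is the first thing to try.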
RAG vs Fine-Tuning for knowledge
| Feature | Fine-Tuning | RAG (Knowledge Base) |
|---|---|---|
| Knowledge update | Slow — needs retraining | Instant — just update the database |
| Accuracy on facts | Prone to hallucination | High — grounded in retrieved facts |
| Citations | Difficult | Easy — direct references |
| Privacy | Data baked into the model | Data stays separate & controlled |
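The RAG column can be sketched in a few lines. This is a toy under stated assumptions: the documents and ids are invented, and plain word overlap stands in for the embedding-based search a real system would normally use. It only shows the shape of the idea: retrieve, inject into the prompt with an id for citation, and update knowledge by editing the store.

```python
# Toy knowledge base; the documents and ids are invented for illustration.
DOCS = {
    "policy-1": "Refunds are accepted within 30 days of delivery.",
    "policy-2": "Standard shipping takes 3 to 5 business days.",
    "policy-3": "Support is available by email on weekdays.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase and split into a set of words, dropping basic punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the ids of the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(DOCS, key=lambda doc_id: len(q & tokenize(DOCS[doc_id])), reverse=True)
    return ranked[:k]


def build_prompt(question: str) -> str:
    """Inject the retrieved text, tagged with its id so the answer can cite it."""
    context = "\n".join(f"[{doc_id}] {DOCS[doc_id]}" for doc_id in retrieve(question))
    return f"Answer using only the context and cite the id.\n\n{context}\n\nQuestion: {question}"


# Updating knowledge is just a dictionary write; no retraining is involved.
DOCS["policy-1"] = "Refunds are accepted within 60 days of delivery."
rag_prompt = build_prompt("How many days do I have for refunds?")
print(rag_prompt)
```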
14.3 Situations for Fine-Tuning
Fine-tune when you need:
- A specialised behaviour or style — e.g. teaching a model to consistently write in legal language ("legalese") or a brand voice.
- A consistent output format that few-shot prompting cannot reliably enforce.
- Lower cost & latency — bake long instructions into the weights so prompts can be short (saving tokens on every call).
- Small-model performance — fine-tune a small model (e.g. Llama-7B) so it performs as well as a large model on one specific task.
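The cost argument is simple arithmetic, sketched below. Every number is a made-up assumption (price, prompt lengths, traffic), and the calculation ignores the one-off training cost and any price difference for a fine-tuned model, so treat it as a template rather than an estimate.

```python
# Back-of-the-envelope prompt cost. Every number here is a made-up assumption;
# substitute your provider's real prices and your own traffic.
PRICE_PER_1K_INPUT_TOKENS = 0.001  # hypothetical price in dollars


def monthly_prompt_cost(tokens_per_call: int, calls_per_month: int) -> float:
    """Dollars spent per month on input tokens alone."""
    return tokens_per_call / 1000 * PRICE_PER_1K_INPUT_TOKENS * calls_per_month


# Long instructions in every prompt vs. instructions baked into the weights.
long_cost = monthly_prompt_cost(tokens_per_call=2_000, calls_per_month=1_000_000)
short_cost = monthly_prompt_cost(tokens_per_call=200, calls_per_month=1_000_000)
print(f"long prompt: ${long_cost:,.2f}  short prompt: ${short_cost:,.2f}")
```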
14.4 Instruction-Tuning & LoRA
Instruction-Tuning
Instruction-tuning trains a base model on (instruction → desired output) pairs so that it behaves like a helpful, instruction-following assistant.
The dataset — JSONL format
Fine-tuning needs a high-quality dataset, usually in JSONL (one JSON object per line). Quality matters more than quantity.
{"messages": [{"role": "user", "content": "Refund my order"}, {"role": "assistant", "content": "I'm sorry to hear that. Could you share your order ID?"}]}
{"messages": [{"role": "user", "content": "Where is my package?"}, {"role": "assistant", "content": "Let me check. Please provide your tracking number."}]}
For style transfer, as few as 50–100 high-quality examples can work; complex reasoning tasks may need 1,000 or more.
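Because quality matters so much, it can help to check the file before training. The sketch below is one possible sanity check, not any provider's official validation: the rules it applies (each non-blank line parses as JSON, has a non-empty `messages` list, and ends with an assistant message) are assumptions about a sensible minimum, and `validate_jsonl` is a hypothetical helper.

```python
import json


def validate_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the data looks fine."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: not valid JSON ({error.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing 'messages' list")
            continue
        last = messages[-1]
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append(f"line {number}: last message should be the assistant reply")
    return problems


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
broken = '{"messages": [{"role": "user", "content": "Hi"},'  # object cut off mid-line
print(validate_jsonl(good + "\n" + broken))
```

A JSON object that wraps onto a second line is the most common way to break the "one object per line" rule, and the check above reports it by line number.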
PEFT & LoRA
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) method: it freezes the base weights and trains only a small adapter (about 1% of the parameters), so it runs on consumer GPUs, trains fast, and is modular. The code below does not configure LoRA itself; it starts a managed fine-tuning job through the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# 1. Upload the JSONL training file
with open("training_data.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="fine-tune")

# 2. Start the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
)
print("Fine-tuning job started:", job.id)
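Starting the job is not the end of the workflow: the job runs asynchronously, so a script usually waits for it to finish. The helper below is a minimal sketch, assuming the `client` from the code above; `wait_for_job`, its 30-second default and the injectable `sleep` parameter are my own choices for illustration, not part of the OpenAI library.

```python
import time


def wait_for_job(client, job_id: str, poll_seconds: float = 30.0, sleep=time.sleep) -> str:
    """Poll a fine-tuning job until it finishes and return the new model's name."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"fine-tuning job {job_id} ended with status {job.status}")
        sleep(poll_seconds)  # still validating, queued, or running
```

The returned name can then be passed as the `model` argument of `client.chat.completions.create(...)` in place of the base model name.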
14.5 Knowledge Sourcing
The knowledge gap
LLMs are frozen in time — they only know what they learned during pre-training. They do not know your private company data or news from yesterday, and they can hallucinate details about specific documents. The right knowledge source fixes this:
- Pre-trained knowledge — general world knowledge; free but static and possibly outdated.
- RAG (retrieved knowledge) — for facts that change or are private; instantly updatable, grounded, citable.
- Fine-tuned knowledge — for deeply embedding behaviour/style; permanent but slow to update.
- State-of-the-art: combine them — fine-tune a model to be excellent at using retrieved context (instruction-tuning), then use RAG to feed it the latest data.
Knowing when to prompt vs RAG vs fine-tune is the most-tested idea here.
Fine-tuning differs from prompting because fine-tuning:
Prompting is temporary, in-context guidance. Fine-tuning actually retrains the model, permanently changing its parameters and behaviour.
You need the model to answer using your company's latest internal documents. The best approach is:
For new or frequently-changing knowledge, RAG is best — it retrieves the documents at query time and is instantly updatable, unlike slow retraining.
Fine-tuning is the best choice when you need:
Fine-tuning bakes in a behaviour/style and lets you shorten prompts (saving tokens). Changing facts → RAG; latest news → RAG.
Instruction-tuning trains a model on:
Instruction-tuning uses (instruction → desired output) pairs, teaching a base model to behave like a helpful, instruction-following assistant.
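To connect this with the JSONL format from 14.4, here is a small sketch that turns (instruction, output) pairs into chat-style training lines. The two pairs and the helper name `to_jsonl` are invented for illustration.

```python
import json

# Hypothetical (instruction, desired output) pairs.
PAIRS = [
    ("Summarise: The meeting moved to Friday.", "The meeting is now on Friday."),
    ("Translate to French: Good morning", "Bonjour"),
]


def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Turn each pair into one chat-format JSON object per line."""
    lines = []
    for instruction, output in pairs:
        record = {"messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


jsonl_text = to_jsonl(PAIRS)
print(jsonl_text)
```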
LoRA (Low-Rank Adaptation) makes fine-tuning cheaper by:
LoRA freezes the base weights and trains only a small adapter (~1% of parameters), so it runs on consumer GPUs, trains fast, and is modular.
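The mechanics can be sketched in plain Python. Assumptions: a single weight matrix `W` stays frozen, the adapter is the product of two small matrices `B` and `A`, and the scaling factor used in practice is left out for brevity. The matrix sizes and values are toy numbers, and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x):
    """y = W x + B (A x): frozen base output plus the low-rank adapter's correction."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters as a fraction of the frozen d_out x d_in matrix."""
    full = d_out * d_in              # frozen base weights W
    adapter = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return adapter / full


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (toy 2 x 2)
A = [[1.0, 1.0]]              # rank-1 adapter, 1 x 2
x = [2.0, 3.0]

untrained = lora_forward(W, A, [[0.0], [0.0]], x)  # B starts at zero: no change
trained = lora_forward(W, A, [[0.5], [0.0]], x)    # only A and B would be updated
fraction = lora_trainable_fraction(d_out=4096, d_in=4096, rank=8)
print(untrained, trained, f"{fraction:.2%}")
```

Because `W` is never modified, the adapter can be stored, swapped or removed separately, which is what "modular" refers to.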
For fine-tuning datasets, the guiding rule is:
A small, clean, high-quality dataset (even 50–100 examples for style) outperforms a large messy one. Bad examples teach bad behaviour.
Does fine-tuning reliably stop hallucinations?
Fine-tuning teaches behaviour, not factual grounding — errors in the training data can worsen hallucinations. RAG grounds answers in retrieved text, making it better for factuality.
According to the recommended optimization strategy, you should:
Prompting solves ~80% of problems and is fastest. Add RAG for external knowledge. Fine-tune last — only when you must reduce cost/latency or deeply bake in behaviour.
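The ordering can be written down as a tiny decision helper. This is only a sketch of the rule of thumb above; `choose_approach` and its two flags are hypothetical simplifications, and real projects weigh more factors.

```python
def choose_approach(needs_external_knowledge: bool,
                    needs_baked_in_behaviour_or_lower_cost: bool) -> list[str]:
    """Return the techniques to apply, in the order the strategy suggests."""
    plan = ["prompting"]  # always the first and fastest step
    if needs_external_knowledge:
        plan.append("RAG")  # add retrieval for changing or private facts
    if needs_baked_in_behaviour_or_lower_cost:
        plan.append("fine-tuning")  # last, and only when really needed
    return plan


print(choose_approach(needs_external_knowledge=True,
                      needs_baked_in_behaviour_or_lower_cost=False))
```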
A startup wants its chatbot to (a) always reply in a friendly brand voice and (b) answer using its constantly-updated product catalogue. Which technique for each, and why?
(a) A consistent brand voice/style is a behaviour → fine-tuning, which permanently bakes the tone into the model. (b) A constantly-updated catalogue is changing knowledge → RAG, which retrieves the current catalogue at query time and can be updated instantly without retraining. The state-of-the-art is to combine both.
Write two lines of a JSONL fine-tuning dataset that teach a model to answer customer-support questions politely.
{"messages": [{"role": "user", "content": "My order is late"}, {"role": "assistant", "content": "I'm sorry for the delay. Could you share your order ID so I can check?"}]}
{"messages": [{"role": "user", "content": "I want a refund"}, {"role": "assistant", "content": "I understand. I'd be happy to help - may I have your order number?"}]}
Each line is one JSON object pairing a user message with the desired assistant reply. Quality and consistency of these examples matter most.
Write code to upload a JSONL file and start an OpenAI fine-tuning job.
from openai import OpenAI

client = OpenAI()

# Step 1: upload the dataset
with open("data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Step 2: create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print("Job ID:", job.id)
Why are LLMs said to be "frozen in time", and what are the two main ways to give them new knowledge?
An LLM only knows what was in its pre-training data up to a cutoff date — after training, its weights are fixed ("frozen"), so it does not know recent events or private data. The two ways to add knowledge are RAG (retrieve external documents at query time — instant, citable, best for changing facts) and fine-tuning (retrain the weights — permanent, slow to update, better for behaviour than facts).