Fine-Tuning Models for Your Domain
TL;DR
Fine-tune when you need to change model behavior, not just knowledge. Use LoRA for parameter-efficient training with as few as 100–500 high-quality examples.
Fine-tuning adapts a pretrained language model to your specific domain, output format, or reasoning style. It is the right tool when prompting alone cannot achieve the behavior you need — but it requires careful dataset preparation and evaluation to produce reliable results.
When to Fine-Tune
Fine-tuning changes model behavior. It does not inject new factual knowledge effectively. Before committing to fine-tuning, determine whether your problem is a knowledge problem or a behavior problem.
| Problem Type | Solution | Example |
|---|---|---|
| Model lacks domain facts | RAG (retrieval-augmented generation) | Querying internal documentation |
| Model output format is wrong | Fine-tuning | Generating structured JSON in a specific schema |
| Model reasoning style is wrong | Fine-tuning | Adopting domain-specific analytical patterns |
| Model lacks recent information | RAG | Answering questions about current events |
| Model tone or style is wrong | Fine-tuning | Matching brand voice or technical writing style |
| Need both facts and behavior | RAG + Fine-tuning | Domain-expert assistant with custom output format |
Fine-tune when you need to change how the model responds. Use RAG when you need to change what the model knows. These are complementary techniques, not alternatives.
Dataset Preparation
The quality of your fine-tuning dataset determines the quality of your fine-tuned model. Every example should be a complete, correct demonstration of the behavior you want.
Dataset Format
Most fine-tuning frameworks expect JSONL (JSON Lines) format, with each line containing a complete conversation:
```jsonl
{"messages": [{"role": "system", "content": "You are a security audit assistant. Analyze code for vulnerabilities and output structured findings."}, {"role": "user", "content": "Review this function:\n\ndef get_user(id):\n query = f\"SELECT * FROM users WHERE id = {id}\"\n return db.execute(query)"}, {"role": "assistant", "content": "## Findings\n\n| Severity | Type | Line | Description |\n|---|---|---|---|\n| Critical | SQL Injection | 2 | User input directly interpolated into SQL query. Use parameterized queries. |\n\n## Remediation\n\n```python\ndef get_user(id):\n query = \"SELECT * FROM users WHERE id = ?\"\n return db.execute(query, (id,))\n```"}]}
{"messages": [{"role": "system", "content": "You are a security audit assistant. Analyze code for vulnerabilities and output structured findings."}, {"role": "user", "content": "Review this endpoint:\n\n@app.route('/search')\ndef search():\n q = request.args.get('q')\n return f'<h1>Results for {q}</h1>'"}, {"role": "assistant", "content": "## Findings\n\n| Severity | Type | Line | Description |\n|---|---|---|---|\n| High | XSS | 4 | User input rendered directly in HTML without escaping. Use template engine with auto-escaping. |\n\n## Remediation\n\n```python\n@app.route('/search')\ndef search():\n q = request.args.get('q')\n return render_template('search.html', query=q)\n```"}]}
```
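Malformed lines in a training file are easy to introduce and hard to spot by eye, so it is worth validating the file before training. A minimal sketch, assuming the conversation layout shown above (the function name and the exact checks are illustrative, not from any framework):

```python
import json

KNOWN_ROLES = ("system", "user", "assistant")

def validate_jsonl_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL training example."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    roles = [m.get("role") for m in messages]
    # Expect the conversation to start with a system prompt and
    # end with the assistant response the model should imitate.
    if roles[0] != "system":
        problems.append("first message is not a system prompt")
    if roles[-1] != "assistant":
        problems.append("last message is not an assistant response")
    for i, m in enumerate(messages):
        if m.get("role") not in KNOWN_ROLES:
            problems.append(f"message {i} has unknown role {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            problems.append(f"message {i} has empty content")
    return problems

# A well-formed example produces no problems; broken JSON is reported.
good = json.dumps({"messages": [
    {"role": "system", "content": "You are a security audit assistant."},
    {"role": "user", "content": "Review this function: ..."},
    {"role": "assistant", "content": "## Findings\n..."},
]})
assert validate_jsonl_line(good) == []
assert validate_jsonl_line("{not json") != []
```

Run a check like this over every line of `dataset.jsonl` before starting a training job; a single bad line can crash tokenization midway through a run.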
Dataset Guidelines
Follow these rules to build an effective training set:
- Minimum 100 examples — training on fewer risks overfitting; aim for 100–500 for LoRA
- Diverse inputs — cover the full range of inputs the model will encounter
- Consistent output format — every example should follow the exact same structure
- No contradictions — conflicting examples confuse the model during training
- System prompt consistency — use the same system prompt across all examples
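Two of these rules — system prompt consistency and input diversity — can be checked mechanically. A sketch of a dataset audit, assuming in-memory JSONL lines (the function and report keys are illustrative):

```python
import json
from collections import Counter

def audit_dataset(lines):
    """Summarize a JSONL training set against the guidelines above."""
    system_prompts = Counter()
    user_inputs = Counter()
    n = 0
    for line in lines:
        n += 1
        for m in json.loads(line)["messages"]:
            if m["role"] == "system":
                system_prompts[m["content"]] += 1
            elif m["role"] == "user":
                user_inputs[m["content"]] += 1
    return {
        "examples": n,
        "distinct_system_prompts": len(system_prompts),  # should be exactly 1
        "duplicate_inputs": [t for t, c in user_inputs.items() if c > 1],
    }

rows = [
    json.dumps({"messages": [
        {"role": "system", "content": "You are a security audit assistant."},
        {"role": "user", "content": f"Review snippet {i}"},
        {"role": "assistant", "content": "## Findings\n..."},
    ]})
    for i in range(3)
]
report = audit_dataset(rows)
```

A `distinct_system_prompts` count above 1 or a non-empty `duplicate_inputs` list means the set violates the consistency guidelines before training even starts.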
Validation Split
Always hold out 10–20% of your data for evaluation. Never train on evaluation data.
```python
import json
import random

random.seed(42)  # fixed seed so the split is reproducible

with open("dataset.jsonl") as f:
    data = [json.loads(line) for line in f]

random.shuffle(data)
split = int(len(data) * 0.85)
train_data = data[:split]
eval_data = data[split:]

with open("train.jsonl", "w") as f:
    for item in train_data:
        f.write(json.dumps(item) + "\n")

with open("eval.jsonl", "w") as f:
    for item in eval_data:
        f.write(json.dumps(item) + "\n")

print(f"Train: {len(train_data)}, Eval: {len(eval_data)}")
```
Fine-Tuning with LoRA
LoRA (Low-Rank Adaptation) trains small adapter matrices that modify the model’s behavior without changing the original weights. Because gradients and optimizer state are needed only for the adapters, it trains roughly 1000x fewer parameters and cuts GPU memory requirements several-fold compared to full fine-tuning.
Comparing Fine-Tuning Approaches
| Approach | Parameters Trained | GPU Memory | Training Time | Quality |
|---|---|---|---|---|
| Full fine-tuning | All (~8B) | 60+ GB | Hours | Highest |
| LoRA | ~0.1% (8–32M) | 8–16 GB | 30–60 min | Near-full |
| QLoRA | ~0.1% (quantized base) | 4–8 GB | 45–90 min | Good |
| Prompt tuning | ~0.001% (soft tokens) | 4 GB | Minutes | Limited |
For most use cases, LoRA or QLoRA provides the best tradeoff between quality, cost, and speed.
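The LoRA parameter counts in the table can be sanity-checked by hand: an adapter on a weight of shape (d_out, d_in) adds two matrices totaling r·(d_in + d_out) entries. Using the published Llama-3.1-8B dimensions (32 layers, hidden size 4096, grouped-query key/value dimension 1024), a rank-16 adapter on the four attention projections reproduces exactly the count that print_trainable_parameters() reports in the training script below:

```python
# LoRA adds a down-projection A (r x d_in) and an up-projection B (d_out x r)
# per adapted weight, i.e. r * (d_in + d_out) trainable parameters each.
def lora_params(shapes, r=16, n_layers=32):
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Llama-3.1-8B attention projections as (d_in, d_out) per layer.
# k_proj / v_proj are smaller because of grouped-query attention.
shapes = [
    (4096, 4096),  # q_proj
    (4096, 1024),  # k_proj
    (4096, 1024),  # v_proj
    (4096, 4096),  # o_proj
]
total = lora_params(shapes)
print(total)                            # 13631488 trainable adapter parameters
print(f"{total / 8_043_149_312:.2%}")   # 0.17% of the 8B base model
```

Doubling the rank doubles this count, which is why rank is the main knob for trading capacity against memory.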
Training Configuration
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# Load base model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    r=16,            # Rank — higher = more capacity, more memory
    lora_alpha=32,   # Scaling factor
    target_modules=[
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,149,312 || 0.17%

# Training arguments
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=2e-4,
    warmup_steps=50,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    bf16=True,
)

# Train — train_dataset / eval_dataset are loaded from the
# train.jsonl / eval.jsonl files produced by the split above
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
)
trainer.train()
```
Start with a low learning rate (1e-4 to 2e-4) and 2–3 epochs. Monitor validation loss — if it starts increasing while training loss decreases, you are overfitting.
Key Hyperparameters
- LoRA rank (`r`) — controls adapter capacity. Start with 16; increase to 32–64 for complex tasks.
- Learning rate — 1e-4 to 2e-4 for LoRA. Too high causes catastrophic forgetting.
- Epochs — 2–3 for most datasets. More epochs with small datasets lead to overfitting.
- Batch size — use gradient accumulation to simulate larger batches on limited hardware.
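The overfitting signal described above — validation loss rising while training loss keeps falling — can be turned into a simple check on the logged loss curves. A framework-independent sketch (the function name and window heuristic are illustrative):

```python
def is_overfitting(history, window=3):
    """Flag overfitting: eval loss strictly rising over the last
    `window` evaluation steps while train loss keeps strictly falling.
    `history` is a list of (train_loss, eval_loss) pairs in log order."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    train = [t for t, _ in recent]
    evals = [e for _, e in recent]
    train_falling = all(b < a for a, b in zip(train, train[1:]))
    eval_rising = all(b > a for a, b in zip(evals, evals[1:]))
    return train_falling and eval_rising

# Healthy run: both losses falling together
assert not is_overfitting([(2.0, 2.1), (1.8, 1.9), (1.6, 1.8), (1.5, 1.7)])
# Overfitting: train loss falls while eval loss climbs
assert is_overfitting([(2.0, 1.9), (1.6, 2.0), (1.3, 2.2), (1.1, 2.5)])
```

When the check fires, stop training and restore the checkpoint with the lowest validation loss rather than the final one.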
Evaluation
Fine-tuning without rigorous evaluation is guesswork. Define metrics before training and measure them consistently.
Automated Evaluation
- Validation loss — the primary training metric. Should decrease and stabilize.
- Task-specific accuracy — for classification or extraction tasks, measure precision, recall, and F1.
- Format compliance — percentage of outputs that match your expected schema.
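Format compliance is the easiest of these to automate. A sketch measuring the fraction of outputs that match the security-audit report schema from the dataset section; the patterns are specific to that example and would need adapting to your own format:

```python
import re

def format_compliance(outputs):
    """Fraction of outputs containing a '## Findings' section with a
    markdown table and a '## Remediation' section with a code fence."""
    pattern = re.compile(
        r"## Findings\s*\n.*?\|.*?\|"      # findings header plus a table row
        r".*?## Remediation\s*\n.*?```",   # remediation header plus a fence
        re.DOTALL,
    )
    if not outputs:
        return 0.0
    return sum(bool(pattern.search(o)) for o in outputs) / len(outputs)

good = ("## Findings\n\n| Severity | Type |\n|---|---|\n| High | XSS |\n\n"
        "## Remediation\n\n```python\npass\n```")
bad = "Looks fine to me!"
```

Track this number for both the base and fine-tuned model on the eval split; a large jump in compliance is usually the clearest evidence the fine-tune worked.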
Human Evaluation
For open-ended generation tasks, automated metrics are insufficient. Conduct blind comparisons between the base model and fine-tuned model on a held-out test set. Rate outputs on:
- Correctness — is the information accurate?
- Format adherence — does the output match the expected structure?
- Completeness — are all required fields present?
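To keep the comparison blind, the rater must not know which output came from which model. A sketch of preparing a randomized A/B sheet with a hidden key for unblinding afterwards (function name and structure are illustrative):

```python
import random

def blind_pairs(prompts, base_outputs, tuned_outputs, seed=0):
    """For each prompt, randomly assign the base and fine-tuned outputs
    to slots A and B; keep a hidden key to unblind votes after rating."""
    rng = random.Random(seed)
    sheet, key = [], []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        if rng.random() < 0.5:
            sheet.append({"prompt": prompt, "A": base, "B": tuned})
            key.append({"A": "base", "B": "tuned"})
        else:
            sheet.append({"prompt": prompt, "A": tuned, "B": base})
            key.append({"A": "tuned", "B": "base"})
    return sheet, key

sheet, key = blind_pairs(["p1", "p2"], ["b1", "b2"], ["t1", "t2"])
```

Raters see only `sheet`; the `key` stays with whoever tallies the votes.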
FAQ
When should I fine-tune vs use RAG?
Fine-tune when you need to change the model’s output style, reasoning patterns, or domain expertise. Use RAG when you need to query specific, potentially changing, factual knowledge. Fine-tuning modifies model behavior — how it structures responses, what tone it uses, how it reasons through problems. RAG provides external knowledge at inference time. Many production systems combine both: a fine-tuned model that knows your output format and domain conventions, augmented with RAG for factual grounding.
What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains small adapter matrices instead of modifying all model weights, cutting the number of trained parameters by orders of magnitude. It works by decomposing each weight update into a pair of low-rank matrices that capture task-specific adaptations. The original model weights remain frozen, so you can swap different LoRA adapters in and out to specialize the same base model for different tasks without maintaining multiple full copies.
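The decomposition is easy to see in a toy numpy sketch: the adapted layer computes y = Wx + (α/r)·BA·x, where W is frozen and only the low-rank A and B are trained. Dimensions here are illustrative, not the real model's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16              # hidden size, LoRA rank, scaling

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trained down-projection
B = np.zeros((d, r))                 # trained up-projection, zero-initialized
x = rng.normal(size=d)

# Adapted forward pass: y = W x + (alpha / r) * B (A x)
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B initialized to zero the adapter is a no-op, so training starts
# from exactly the frozen base model's behavior.
assert np.allclose(y, W @ x)
```

As B is trained away from zero, the low-rank product BA steers the layer's output without the d×d weight W ever changing, which is what makes adapter swapping cheap.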
How much training data do I need?
Quality matters more than quantity. 100–500 high-quality, diverse examples are often sufficient for LoRA fine-tuning on specific tasks. Each example should be a perfect demonstration of the behavior you want. Poorly formatted, inconsistent, or incorrect examples actively harm model performance. For complex tasks like code generation or multi-step reasoning, aim for the higher end. For simpler tasks like classification or extraction, 100 well-chosen examples can produce strong results.