Fine-Tuning Models for Your Domain
TL;DR
Fine-tune when you need to change model behavior, not just knowledge. Use LoRA for parameter-efficient training with as few as 100–500 high-quality examples.
Fine-tuning adapts a pretrained language model to your specific domain, output format, or reasoning style. It is the right tool when prompting alone cannot achieve the behavior you need — but it requires careful dataset preparation and evaluation to produce reliable results.
When to Fine-Tune
Fine-tuning changes model behavior. It does not inject new factual knowledge effectively. Before committing to fine-tuning, determine whether your problem is a knowledge problem or a behavior problem.
| Problem Type | Solution | Example |
|---|---|---|
| Model lacks domain facts | RAG (retrieval-augmented generation) | Querying internal documentation |
| Model output format is wrong | Fine-tuning | Generating structured JSON in a specific schema |
| Model reasoning style is wrong | Fine-tuning | Adopting domain-specific analytical patterns |
| Model lacks recent information | RAG | Answering questions about current events |
| Model tone or style is wrong | Fine-tuning | Matching brand voice or technical writing style |
| Need both facts and behavior | RAG + Fine-tuning | Domain-expert assistant with custom output format |
Fine-tune when you need to change how the model responds. Use RAG when you need to change what the model knows. These are complementary techniques, not alternatives.
Dataset Preparation
The quality of your fine-tuning dataset determines the quality of your fine-tuned model. Every example should be a complete, correct demonstration of the behavior you want.
Dataset Format
Most fine-tuning frameworks expect JSONL (JSON Lines) format, with each line containing a complete conversation:
```jsonl
{"messages": [{"role": "system", "content": "You are a security audit assistant. Analyze code for vulnerabilities and output structured findings."}, {"role": "user", "content": "Review this function:\n\ndef get_user(id):\n query = f\"SELECT * FROM users WHERE id = {id}\"\n return db.execute(query)"}, {"role": "assistant", "content": "## Findings\n\n| Severity | Type | Line | Description |\n|---|---|---|---|\n| Critical | SQL Injection | 2 | User input directly interpolated into SQL query. Use parameterized queries. |\n\n## Remediation\n\n```python\ndef get_user(id):\n query = \"SELECT * FROM users WHERE id = ?\"\n return db.execute(query, (id,))\n```"}]}
{"messages": [{"role": "system", "content": "You are a security audit assistant. Analyze code for vulnerabilities and output structured findings."}, {"role": "user", "content": "Review this endpoint:\n\n@app.route('/search')\ndef search():\n q = request.args.get('q')\n return f'<h1>Results for {q}</h1>'"}, {"role": "assistant", "content": "## Findings\n\n| Severity | Type | Line | Description |\n|---|---|---|---|\n| High | XSS | 4 | User input rendered directly in HTML without escaping. Use template engine with auto-escaping. |\n\n## Remediation\n\n```python\n@app.route('/search')\ndef search():\n q = request.args.get('q')\n return render_template('search.html', query=q)\n```"}]}
```
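Malformed lines in a training file are easy to introduce and hard to spot by eye, so it is worth validating the file before training. A minimal sketch, assuming the conversation layout shown above (the function name and the exact checks are illustrative, not from any framework):

```python
import json

KNOWN_ROLES = ("system", "user", "assistant")

def validate_jsonl_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL training example."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    roles = [m.get("role") for m in messages]
    # Expect the conversation to start with a system prompt and
    # end with the assistant response the model should imitate.
    if roles[0] != "system":
        problems.append("first message is not a system prompt")
    if roles[-1] != "assistant":
        problems.append("last message is not an assistant response")
    for i, m in enumerate(messages):
        if m.get("role") not in KNOWN_ROLES:
            problems.append(f"message {i} has unknown role {m.get('role')!r}")
        if not isinstance(m.get("content"), str) or not m["content"].strip():
            problems.append(f"message {i} has empty content")
    return problems

# A well-formed example produces no problems; broken JSON is reported.
good = json.dumps({"messages": [
    {"role": "system", "content": "You are a security audit assistant."},
    {"role": "user", "content": "Review this function: ..."},
    {"role": "assistant", "content": "## Findings\n..."},
]})
assert validate_jsonl_line(good) == []
assert validate_jsonl_line("{not json") != []
```

Run a check like this over every line of `dataset.jsonl` before starting a training job; a single bad line can crash tokenization midway through a run.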
Dataset Guidelines
Follow these rules to build an effective training set:
- Minimum 100 examples — training on fewer risks overfitting; aim for 100–500 for LoRA
- Diverse inputs — cover the full range of inputs the model will encounter
- Consistent output format — every example should follow the exact same structure
- No contradictions — conflicting examples confuse the model during training
- System prompt consistency — use the same system prompt across all examples
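Two of these rules — system prompt consistency and input diversity — can be checked mechanically. A sketch of a dataset audit, assuming in-memory JSONL lines (the function and report keys are illustrative):

```python
import json
from collections import Counter

def audit_dataset(lines):
    """Summarize a JSONL training set against the guidelines above."""
    system_prompts = Counter()
    user_inputs = Counter()
    n = 0
    for line in lines:
        n += 1
        for m in json.loads(line)["messages"]:
            if m["role"] == "system":
                system_prompts[m["content"]] += 1
            elif m["role"] == "user":
                user_inputs[m["content"]] += 1
    return {
        "examples": n,
        "distinct_system_prompts": len(system_prompts),  # should be exactly 1
        "duplicate_inputs": [t for t, c in user_inputs.items() if c > 1],
    }

rows = [
    json.dumps({"messages": [
        {"role": "system", "content": "You are a security audit assistant."},
        {"role": "user", "content": f"Review snippet {i}"},
        {"role": "assistant", "content": "## Findings\n..."},
    ]})
    for i in range(3)
]
report = audit_dataset(rows)
```

A `distinct_system_prompts` count above 1 or a non-empty `duplicate_inputs` list means the set violates the consistency guidelines before training even starts.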
Validation Split
Always hold out 10–20% of your data for evaluation. Never train on evaluation data.
```python
import json
import random

random.seed(42)  # fixed seed so the split is reproducible

with open("dataset.jsonl") as f:
    data = [json.loads(line) for line in f]

random.shuffle(data)
split = int(len(data) * 0.85)
train_data = data[:split]
eval_data = data[split:]

with open("train.jsonl", "w") as f:
    for item in train_data:
        f.write(json.dumps(item) + "\n")

with open("eval.jsonl", "w") as f:
    for item in eval_data:
        f.write(json.dumps(item) + "\n")

print(f"Train: {len(train_data)}, Eval: {len(eval_data)}")
```
Fine-Tuning with LoRA
LoRA (Low-Rank Adaptation) trains small adapter matrices that modify the model’s behavior without changing the original weights. Because gradients and optimizer state are needed only for the adapters, it trains roughly 1000x fewer parameters and cuts GPU memory requirements several-fold compared to full fine-tuning.
Comparing Fine-Tuning Approaches
| Approach | Parameters Trained | GPU Memory | Training Time | Quality |
|---|---|---|---|---|
| Full fine-tuning | All (~8B) | 60+ GB | Hours | Highest |
| LoRA | ~0.1% (8–32M) | 8–16 GB | 30–60 min | Near-full |
| QLoRA | ~0.1% (quantized base) | 4–8 GB | 45–90 min | Good |
| Prompt tuning | ~0.001% (soft tokens) | 4 GB | Minutes | Limited |
For most use cases, LoRA or QLoRA provides the best tradeoff between quality, cost, and speed.
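The LoRA parameter counts in the table can be sanity-checked by hand: an adapter on a weight of shape (d_out, d_in) adds two matrices totaling r·(d_in + d_out) entries. Using the published Llama-3.1-8B dimensions (32 layers, hidden size 4096, grouped-query key/value dimension 1024), a rank-16 adapter on the four attention projections reproduces exactly the count that print_trainable_parameters() reports in the training script below:

```python
# LoRA adds a down-projection A (r x d_in) and an up-projection B (d_out x r)
# per adapted weight, i.e. r * (d_in + d_out) trainable parameters each.
def lora_params(shapes, r=16, n_layers=32):
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Llama-3.1-8B attention projections as (d_in, d_out) per layer.
# k_proj / v_proj are smaller because of grouped-query attention.
shapes = [
    (4096, 4096),  # q_proj
    (4096, 1024),  # k_proj
    (4096, 1024),  # v_proj
    (4096, 4096),  # o_proj
]
total = lora_params(shapes)
print(total)                            # 13631488 trainable adapter parameters
print(f"{total / 8_043_149_312:.2%}")   # 0.17% of the 8B base model
```

Doubling the rank doubles this count, which is why rank is the main knob for trading capacity against memory.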
Training Configuration
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# Load base model
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure LoRA
lora_config = LoraConfig(
    r=16,            # Rank — higher = more capacity, more memory
    lora_alpha=32,   # Scaling factor
    target_modules=[
        "q_proj", "k_proj",
        "v_proj", "o_proj",
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,149,312 || 0.17%

# Training arguments
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=2e-4,
    warmup_steps=50,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    bf16=True,
)

# Train — train_dataset / eval_dataset are loaded from the
# train.jsonl / eval.jsonl files produced by the split above
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
)
trainer.train()
```
Start with a low learning rate (1e-4 to 2e-4) and 2–3 epochs. Monitor validation loss — if it starts increasing while training loss decreases, you are overfitting.
Key Hyperparameters
- LoRA rank (`r`) — controls adapter capacity. Start with 16; increase to 32–64 for complex tasks.
- Learning rate — 1e-4 to 2e-4 for LoRA. Too high causes catastrophic forgetting.
- Epochs — 2–3 for most datasets. More epochs with small datasets lead to overfitting.
- Batch size — use gradient accumulation to simulate larger batches on limited hardware.
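The overfitting signal described above — validation loss rising while training loss keeps falling — can be turned into a simple check on the logged loss curves. A framework-independent sketch (the function name and window heuristic are illustrative):

```python
def is_overfitting(history, window=3):
    """Flag overfitting: eval loss strictly rising over the last
    `window` evaluation steps while train loss keeps strictly falling.
    `history` is a list of (train_loss, eval_loss) pairs in log order."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    train = [t for t, _ in recent]
    evals = [e for _, e in recent]
    train_falling = all(b < a for a, b in zip(train, train[1:]))
    eval_rising = all(b > a for a, b in zip(evals, evals[1:]))
    return train_falling and eval_rising

# Healthy run: both losses falling together
assert not is_overfitting([(2.0, 2.1), (1.8, 1.9), (1.6, 1.8), (1.5, 1.7)])
# Overfitting: train loss falls while eval loss climbs
assert is_overfitting([(2.0, 1.9), (1.6, 2.0), (1.3, 2.2), (1.1, 2.5)])
```

When the check fires, stop training and restore the checkpoint with the lowest validation loss rather than the final one.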
Evaluation
Fine-tuning without rigorous evaluation is guesswork. Define metrics before training and measure them consistently.
Automated Evaluation
- Validation loss — the primary training metric. Should decrease and stabilize.
- Task-specific accuracy — for classification or extraction tasks, measure precision, recall, and F1.
- Format compliance — percentage of outputs that match your expected schema.
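Format compliance is the easiest of these to automate. A sketch measuring the fraction of outputs that match the security-audit report schema from the dataset section; the patterns are specific to that example and would need adapting to your own format:

```python
import re

def format_compliance(outputs):
    """Fraction of outputs containing a '## Findings' section with a
    markdown table and a '## Remediation' section with a code fence."""
    pattern = re.compile(
        r"## Findings\s*\n.*?\|.*?\|"      # findings header plus a table row
        r".*?## Remediation\s*\n.*?```",   # remediation header plus a fence
        re.DOTALL,
    )
    if not outputs:
        return 0.0
    return sum(bool(pattern.search(o)) for o in outputs) / len(outputs)

good = ("## Findings\n\n| Severity | Type |\n|---|---|\n| High | XSS |\n\n"
        "## Remediation\n\n```python\npass\n```")
bad = "Looks fine to me!"
```

Track this number for both the base and fine-tuned model on the eval split; a large jump in compliance is usually the clearest evidence the fine-tune worked.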
Human Evaluation
For open-ended generation tasks, automated metrics are insufficient. Conduct blind comparisons between the base model and fine-tuned model on a held-out test set. Rate outputs on:
- Correctness — is the information accurate?
- Format adherence — does the output match the expected structure?
- Completeness — are all required fields present?
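To keep the comparison blind, the rater must not know which output came from which model. A sketch of preparing a randomized A/B sheet with a hidden key for unblinding afterwards (function name and structure are illustrative):

```python
import random

def blind_pairs(prompts, base_outputs, tuned_outputs, seed=0):
    """For each prompt, randomly assign the base and fine-tuned outputs
    to slots A and B; keep a hidden key to unblind votes after rating."""
    rng = random.Random(seed)
    sheet, key = [], []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        if rng.random() < 0.5:
            sheet.append({"prompt": prompt, "A": base, "B": tuned})
            key.append({"A": "base", "B": "tuned"})
        else:
            sheet.append({"prompt": prompt, "A": tuned, "B": base})
            key.append({"A": "tuned", "B": "base"})
    return sheet, key

sheet, key = blind_pairs(["p1", "p2"], ["b1", "b2"], ["t1", "t2"])
```

Raters see only `sheet`; the `key` stays with whoever tallies the votes.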
FAQ
When should I fine-tune vs use RAG?
Fine-tune when you need to change the model’s output style, reasoning patterns, or domain expertise. Use RAG when you need to query specific, potentially changing, factual knowledge. Fine-tuning modifies model behavior — how it structures responses, what tone it uses, how it reasons through problems. RAG provides external knowledge at inference time. Many production systems combine both: a fine-tuned model that knows your output format and domain conventions, augmented with RAG for factual grounding.
What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains small adapter matrices instead of modifying all model weights, cutting the number of trained parameters by orders of magnitude. It works by decomposing each weight update into a pair of low-rank matrices that capture task-specific adaptations. The original model weights remain frozen, so you can swap different LoRA adapters in and out to specialize the same base model for different tasks without maintaining multiple full copies.
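The decomposition is easy to see in a toy numpy sketch: the adapted layer computes y = Wx + (α/r)·BA·x, where W is frozen and only the low-rank A and B are trained. Dimensions here are illustrative, not the real model's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16              # hidden size, LoRA rank, scaling

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trained down-projection
B = np.zeros((d, r))                 # trained up-projection, zero-initialized
x = rng.normal(size=d)

# Adapted forward pass: y = W x + (alpha / r) * B (A x)
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B initialized to zero the adapter is a no-op, so training starts
# from exactly the frozen base model's behavior.
assert np.allclose(y, W @ x)
```

As B is trained away from zero, the low-rank product BA steers the layer's output without the d×d weight W ever changing, which is what makes adapter swapping cheap.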
How much training data do I need?
Quality matters more than quantity. 100–500 high-quality, diverse examples are often sufficient for LoRA fine-tuning on specific tasks. Each example should be a perfect demonstration of the behavior you want. Poorly formatted, inconsistent, or incorrect examples actively harm model performance. For complex tasks like code generation or multi-step reasoning, aim for the higher end. For simpler tasks like classification or extraction, 100 well-chosen examples can produce strong results.