Fine-tuning LLMs on domain-specific data is one of the most effective ways to improve task performance beyond what prompting and RAG can achieve. But full fine-tuning of a 7B+ parameter model requires over a hundred gigabytes of GPU memory, well beyond most teams' budgets.
Parameter-efficient fine-tuning (PEFT) methods like LoRA and QLoRA have made fine-tuning accessible. But how much quality do you sacrifice? When should you use LoRA rank 8 vs rank 64? When is full fine-tuning genuinely worth the cost? This post documents our systematic experiments across three production tasks.
The Core Idea: Low-Rank Adaptation
LoRA (Hu et al., 2021) is based on an elegant observation: the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix W (d × d), LoRA decomposes the update into two small matrices:
Standard fine-tuning:
W' = W + ΔW where ΔW is d × d (millions of params)
LoRA:
W' = W + BA where B is d × r, A is r × d
r << d (typically r = 8 to 64)
For LLaMA 7B (d = 4096, r = 16):
Full update: 4096 × 4096 = 16.8M params per layer
LoRA update: 4096 × 16 + 16 × 4096 = 131K params per layer
Reduction: 128× fewer trainable parameters
The base model weights W are frozen. Only the small A and B matrices are trained. At inference time, you merge BA into W — zero additional latency.
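The parameter arithmetic and the "merge BA into W" claim can both be checked with a minimal NumPy sketch (illustrative sizes for the matmul demo; only the 4096/16 counts come from the text):

```python
import numpy as np

# Parameter counts from the text (LLaMA 7B projection, d=4096, r=16):
d_llama, r_llama = 4096, 16
assert d_llama * d_llama == 16_777_216                        # full update, ~16.8M
assert d_llama * r_llama + r_llama * d_llama == 131_072       # LoRA update, ~131K
assert (d_llama * d_llama) // (d_llama * r_llama * 2) == 128  # 128x reduction

# Tiny numerical demo: the adapter path and the merged weight produce
# identical outputs, which is why merging adds zero inference latency.
rng = np.random.default_rng(0)
d, r = 256, 8
W = rng.standard_normal((d, d)).astype(np.float32) * 0.02  # frozen base weight
A = rng.standard_normal((r, d)).astype(np.float32) * 0.02  # trainable
B = rng.standard_normal((d, r)).astype(np.float32) * 0.02  # trainable

x = rng.standard_normal((4, d)).astype(np.float32)
y_adapter = x @ W.T + (x @ A.T) @ B.T  # base path + low-rank path
y_merged = x @ (W + B @ A).T           # W' = W + BA, a single matmul

assert np.allclose(y_adapter, y_merged, atol=1e-4)
```

During training you pay for the extra low-rank path; before serving, the one-time merge `W + B @ A` collapses it back into a single matrix multiply.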
Our Experimental Setup
We evaluated three fine-tuning approaches across three tasks:
Approaches
- Full fine-tuning: All parameters unfrozen. Requires roughly 16 bytes of GPU memory per parameter (FP16 weights and gradients plus FP32 Adam optimizer states) — about 112 GB for a 7B model.
- LoRA: Base model frozen in FP16. Only LoRA adapters trained. Applied to Q, K, V, and O projection matrices in all transformer layers.
- QLoRA: Base model quantized to 4-bit NormalFloat (NF4). LoRA adapters trained in FP16 on top of the quantized model. Dramatically reduces memory.
Tasks
- IAM Policy Generation: Given a natural language description of permissions needed, generate a valid AWS/GCP/Azure IAM policy JSON. 5K training examples from our EPAM policy database.
- Employee Sentiment Classification: Classify employee feedback text into sentiment categories (positive, negative, neutral, mixed). 8K labeled examples from our Amazon HR dataset.
- Technical Document Summarization: Summarize AWS documentation pages into concise, actionable summaries. 3K human-written summaries.
Results: LLaMA 2 7B
Task 1: IAM Policy Generation (5K examples)
Method | Policy Accuracy | Valid JSON Rate | GPU Memory | Train Time
--- | --- | --- | --- | ---
Full FT | 89.2% | 97.8% | 112 GB | 6.5 hr
LoRA r=16 | 86.4% | 96.1% | 18 GB | 1.8 hr
LoRA r=64 | 88.1% | 97.2% | 22 GB | 2.4 hr
QLoRA r=16 | 85.1% | 95.4% | 8 GB | 2.2 hr
QLoRA r=64 | 87.3% | 96.8% | 10 GB | 3.0 hr
Task 2: Sentiment Classification (8K examples)
Method | F1 Score | Accuracy | GPU Memory | Train Time
--- | --- | --- | --- | ---
Full FT | 0.91 | 92.3% | 112 GB | 4.2 hr
LoRA r=16 | 0.89 | 91.1% | 18 GB | 1.2 hr
LoRA r=64 | 0.90 | 91.8% | 22 GB | 1.6 hr
QLoRA r=16 | 0.88 | 90.2% | 8 GB | 1.5 hr
QLoRA r=64 | 0.89 | 91.0% | 10 GB | 2.0 hr
Task 3: Document Summarization (3K examples)
Method | ROUGE-L | Human Pref Rate | GPU Memory | Train Time
--- | --- | --- | --- | ---
Full FT | 0.42 | 78% | 112 GB | 8.1 hr
LoRA r=16 | 0.38 | 68% | 18 GB | 2.5 hr
LoRA r=64 | 0.41 | 74% | 22 GB | 3.2 hr
QLoRA r=16 | 0.36 | 64% | 8 GB | 3.0 hr
QLoRA r=64 | 0.39 | 71% | 10 GB | 3.8 hr
Analysis: When Does Full Fine-Tuning Win?
Task complexity matters
For classification (Task 2), LoRA r=64 achieved 99% of full fine-tuning quality. The gap was negligible. Classification is a "narrow" task — the model mostly needs to adjust its output distribution, not learn fundamentally new capabilities.
For generation tasks (Task 1, 3), the gap was larger. Policy generation requires learning precise JSON syntax rules and multi-cloud API specifics. Summarization requires learning a new condensation style. These "broad" tasks benefit from updating more parameters.
"Rule of thumb: if your task is classification or extraction (the model already 'knows' how to do it, just needs to learn your categories), LoRA is sufficient. If your task requires generating in a new format or learning new domain-specific patterns, consider higher rank or full fine-tuning."
Rank selection
Higher rank = more capacity = better quality, but diminishing returns:
LoRA Rank Impact (Policy Generation Task):
Rank | Params (M) | Accuracy | Delta vs r=16
--- | --- | --- | ---
r=4 | 0.8 | 83.7% | -2.7%
r=8 | 1.6 | 85.2% | -1.2%
r=16 | 3.1 | 86.4% | baseline
r=32 | 6.3 | 87.4% | +1.0%
r=64 | 12.6 | 88.1% | +1.7%
r=128 | 25.1 | 88.4% | +2.0%
Diminishing returns set in after r=64: r=128 barely improved over r=64 while doubling the trainable parameter count.
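The "capacity grows linearly, quality doesn't" trade-off can be sketched with a hypothetical helper (`n_matrices` is an illustrative count of targeted d × d projections, not our exact layer configuration):

```python
def lora_trainable_params(d: int, r: int, n_matrices: int) -> int:
    """Trainable params when rank-r LoRA is applied to n_matrices d x d projections."""
    return n_matrices * (d * r + r * d)

# One 4096 x 4096 projection at r=16 gives the 131K figure from earlier.
assert lora_trainable_params(4096, 16, 1) == 131_072

# Doubling the rank exactly doubles the trainable parameters, while the
# measured accuracy gains flatten past r=64 — hence diminishing returns.
assert lora_trainable_params(4096, 128, 8) == 2 * lora_trainable_params(4096, 64, 8)
```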
Target modules matter
Which layers you apply LoRA to significantly affects quality:
LoRA Target Module Ablation (r=16, Policy Generation):
Target Modules | Accuracy | Trainable Params
--- | --- | ---
Q, V only | 84.8% | 1.6M
Q, K, V, O | 86.4% | 3.1M
Q, K, V, O + MLP | 87.6% | 8.4M
All linear layers | 88.0% | 12.1M
Applying LoRA to the MLP layers (gate, up, and down projections) added ~1.2% accuracy at the cost of 2.7× more trainable parameters.
QLoRA: The Budget Option
QLoRA (Dettmers et al., 2023) quantizes the base model to 4-bit precision and trains LoRA adapters on top. The memory savings are dramatic:
- LLaMA 7B full FT: 112 GB (need 2× A100 80GB)
- LoRA on FP16 base: 18 GB (1× A100 40GB)
- QLoRA on NF4 base: 8 GB (1× RTX 4090 or even RTX 3090)
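These figures can be sanity-checked with back-of-envelope byte counts (a rough sketch of weight and optimizer memory only; the 18 GB and 8 GB totals above also include adapters, adapter optimizer state, and activations):

```python
# Rough per-parameter memory accounting for a 7B model.
N = 7e9  # parameters

full_ft_gb = N * 16 / 1e9     # FP16 weights + grads, FP32 Adam states: ~16 B/param
lora_base_gb = N * 2 / 1e9    # frozen FP16 base model: 2 B/param
qlora_base_gb = N * 0.5 / 1e9 # NF4-quantized base model: ~0.5 B/param

assert full_ft_gb == 112.0   # matches the 112 GB in the results tables
assert lora_base_gb == 14.0  # base weights before adapters/activations
assert qlora_base_gb == 3.5  # why QLoRA fits on a single consumer GPU
```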
The quality trade-off: QLoRA consistently scored 1-3% below standard LoRA across our tasks. For the sentiment classification task, this gap was negligible (0.89 vs 0.88 F1). For policy generation, it was more noticeable (86.4% vs 85.1%).
Our recommendation: use QLoRA for experimentation and prototyping, LoRA for production fine-tuning. The cost savings of an RTX 4090 over an A100 for a few hours of training don't justify the quality gap in production systems.
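For reference, a QLoRA setup on the Hugging Face stack typically looks like the configuration sketch below (the model ID and LoRA hyperparameters are illustrative, not our exact production values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # prepares the quantized model for training

# FP16/BF16 LoRA adapters are trained on top of the 4-bit base.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```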
Training Recipes That Worked
```python
# Our standard LoRA fine-tuning configuration
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,           # rank (sweet spot for most tasks)
    lora_alpha=64,  # scaling (2x rank is a safe default)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Training hyperparameters (keyword arguments for transformers.TrainingArguments)
training_args = {
    "learning_rate": 2e-4,             # higher than full FT (1e-5 to 5e-5)
    "warmup_ratio": 0.03,
    "num_train_epochs": 3,             # 3-5 epochs, watch val loss closely
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,  # effective batch size = 32
    "bf16": True,                      # always use bf16 on Ampere+
    "optim": "adamw_torch",
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
}
```
Common Pitfalls
- Learning rate too low: LoRA adapters train differently than full parameters. Use 2e-4 to 5e-4 for LoRA vs 1e-5 to 5e-5 for full fine-tuning. With too low a learning rate, the adapters barely change.
- Too many epochs: LoRA overfits faster than full fine-tuning because the small adapter, trained at a higher learning rate, fits a modest training set quickly. Monitor validation loss and stop at the first sign of increase.
- Forgetting the alpha scaling: the low-rank update is multiplied by lora_alpha / r at forward time, so the ratio acts like a learning-rate multiplier on the adapter. Setting alpha = rank is conservative. We found alpha = 2× rank worked best.
- Not targeting enough layers: Q and V only (the original paper's recommendation) leaves performance on the table. Include K, O, and MLP projections for best results.
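The alpha pitfall above can be made concrete with a small numeric sketch of the scaling applied to the low-rank path (illustrative sizes; `lora_delta` is a hypothetical helper mirroring how the adapter output is scaled):

```python
import numpy as np

def lora_delta(x, A, B, alpha, r):
    # The adapter contribution to the output: (alpha / r) * B A x.
    return (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(1)
d, r = 64, 8
x = rng.standard_normal(d).astype(np.float32)
A = rng.standard_normal((r, d)).astype(np.float32)
B = rng.standard_normal((d, r)).astype(np.float32)

# alpha = r gives a scale of 1.0 (conservative); alpha = 2r doubles the
# adapter's effect without touching the learned weights themselves.
d1 = lora_delta(x, A, B, alpha=r, r=r)
d2 = lora_delta(x, A, B, alpha=2 * r, r=r)
assert np.allclose(d2, 2 * d1)
```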
When to Use What: Decision Framework
- Prompting + RAG first — if you can get acceptable quality without fine-tuning, do that. Fine-tuning adds training infrastructure, data management, and model versioning overhead.
- QLoRA for rapid experiments — test whether fine-tuning helps at all. Runs on consumer GPUs. If QLoRA shows no improvement, full fine-tuning is unlikely to either.
- LoRA r=32-64 for production — the sweet spot. 95-98% of full fine-tuning quality at 6-10× lower cost. This is what we deploy.
- Full fine-tuning for critical tasks — when you need every last percent of accuracy and have the GPU budget. Policy generation for financial or security contexts. Medical summarization. Tasks where errors have high cost.
"LoRA didn't just make fine-tuning cheaper — it made it practical. Before LoRA, fine-tuning a 7B model was a multi-GPU, multi-day affair that most teams couldn't justify. Now it's a single-GPU, few-hour experiment. That changes the economics of when fine-tuning is worthwhile."