Fine-tuning LLMs on domain-specific data is one of the most effective ways to improve task performance beyond what prompting and RAG can achieve. But full fine-tuning of a 7B+ parameter model requires GPU memory measured in hundreds of gigabytes — well beyond most teams' budgets.

Parameter-efficient fine-tuning (PEFT) methods like LoRA and QLoRA have made fine-tuning accessible. But how much quality do you sacrifice? When should you use LoRA rank 8 vs rank 64? When is full fine-tuning genuinely worth the cost? This post documents our systematic experiments across three production tasks.

The Core Idea: Low-Rank Adaptation

LoRA (Hu et al., 2021) is based on an elegant observation: the weight updates during fine-tuning have low intrinsic rank. Instead of updating the full weight matrix W (d × d), LoRA decomposes the update into two small matrices:

Standard fine-tuning:
  W' = W + ΔW          where ΔW is d × d (millions of params)

LoRA:
  W' = W + BA           where B is d × r,  A is r × d
                         r << d (typically r = 8 to 64)

For LLaMA 7B (d = 4096, r = 16):
  Full update: 4096 × 4096 = 16.8M params per layer
  LoRA update: 4096 × 16 + 16 × 4096 = 131K params per layer
  Reduction: 128× fewer trainable parameters

The base model weights W are frozen. Only the small A and B matrices are trained. At inference time, you merge BA into W — zero additional latency.
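The arithmetic above can be checked with a minimal sketch — plain NumPy, toy random matrices, and the alpha/r scaling factor that LoRA applies to BA omitted for clarity:

```python
import numpy as np

d, r = 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)           # frozen base weight
B = (rng.standard_normal((d, r)) * 0.01).astype(np.float32)  # trainable, d x r
A = (rng.standard_normal((r, d)) * 0.01).astype(np.float32)  # trainable, r x d

# Trainable-parameter arithmetic from the text
full_params = d * d          # 16,777,216
lora_params = d * r + r * d  # 131,072 -> 128x reduction
assert full_params // lora_params == 128

x = rng.standard_normal(d).astype(np.float32)

# Training-time forward pass: frozen path plus low-rank path
y_train = W @ x + B @ (A @ x)

# Inference-time merge: fold BA into W once, so no extra latency
W_merged = W + B @ A
y_merged = W_merged @ x
assert np.allclose(y_train, y_merged, atol=1e-2)
```

The zero-initialization detail (real LoRA initializes B to zeros so BA starts as a no-op) is skipped here since the point is only the shapes and the merge.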

Our Experimental Setup

We evaluated three fine-tuning approaches across three tasks:

Approaches

  1. Full fine-tuning: all base model weights updated.
  2. LoRA: frozen base weights with adapters at rank 16 and rank 64.
  3. QLoRA: 4-bit quantized base model with LoRA adapters at rank 16 and rank 64.

Tasks

  1. IAM Policy Generation: Given a natural language description of permissions needed, generate a valid AWS/GCP/Azure IAM policy JSON. 5K training examples from our EPAM policy database.
  2. Employee Sentiment Classification: Classify employee feedback text into sentiment categories (positive, negative, neutral, mixed). 8K labeled examples from our Amazon HR dataset.
  3. Technical Document Summarization: Summarize AWS documentation pages into concise, actionable summaries. 3K human-written summaries.

Results: LLaMA 2 7B

Task 1: IAM Policy Generation (5K examples)
Method          | Policy Accuracy | Valid JSON Rate | GPU Memory | Train Time
Full FT         |     89.2%       |     97.8%       |   112 GB   |   6.5 hr
LoRA r=16       |     86.4%       |     96.1%       |    18 GB   |   1.8 hr
LoRA r=64       |     88.1%       |     97.2%       |    22 GB   |   2.4 hr
QLoRA r=16      |     85.1%       |     95.4%       |     8 GB   |   2.2 hr
QLoRA r=64      |     87.3%       |     96.8%       |    10 GB   |   3.0 hr

Task 2: Sentiment Classification (8K examples)
Method          | F1 Score | Accuracy | GPU Memory | Train Time
Full FT         |   0.91   |  92.3%   |   112 GB   |   4.2 hr
LoRA r=16       |   0.89   |  91.1%   |    18 GB   |   1.2 hr
LoRA r=64       |   0.90   |  91.8%   |    22 GB   |   1.6 hr
QLoRA r=16      |   0.88   |  90.2%   |     8 GB   |   1.5 hr
QLoRA r=64      |   0.89   |  91.0%   |    10 GB   |   2.0 hr

Task 3: Document Summarization (3K examples)
Method          | ROUGE-L | Human Pref Rate | GPU Memory | Train Time
Full FT         |  0.42   |     78%         |   112 GB   |   8.1 hr
LoRA r=16       |  0.38   |     68%         |    18 GB   |   2.5 hr
LoRA r=64       |  0.41   |     74%         |    22 GB   |   3.2 hr
QLoRA r=16      |  0.36   |     64%         |     8 GB   |   3.0 hr
QLoRA r=64      |  0.39   |     71%         |    10 GB   |   3.8 hr

Analysis: When Does Full Fine-Tuning Win?

Task complexity matters

For classification (Task 2), LoRA r=64 achieved 99% of full fine-tuning quality. The gap was negligible. Classification is a "narrow" task — the model mostly needs to adjust its output distribution, not learn fundamentally new capabilities.

For generation tasks (Task 1, 3), the gap was larger. Policy generation requires learning precise JSON syntax rules and multi-cloud API specifics. Summarization requires learning a new condensation style. These "broad" tasks benefit from updating more parameters.

"Rule of thumb: if your task is classification or extraction (the model already 'knows' how to do it, just needs to learn your categories), LoRA is sufficient. If your task requires generating in a new format or learning new domain-specific patterns, consider higher rank or full fine-tuning."

Rank selection

Higher rank = more capacity = better quality, but diminishing returns:

LoRA Rank Impact (Policy Generation Task):
  Rank  | Params (M) | Accuracy | Delta vs r=16
  r=4   |    0.8     |  83.7%   |   -2.7%
  r=8   |    1.6     |  85.2%   |   -1.2%
  r=16  |    3.1     |  86.4%   |   baseline
  r=32  |    6.3     |  87.4%   |   +1.0%
  r=64  |   12.6     |  88.1%   |   +1.7%
  r=128 |   25.1     |  88.4%   |   +2.0%

Diminishing returns after r=64. r=128 barely improved over r=64
while doubling trainable parameters.

Target modules matter

Which layers you apply LoRA to significantly affects quality:

LoRA Target Module Ablation (r=16, Policy Generation):
  Target Modules        | Accuracy | Trainable Params
  Q, V only             |  84.8%   |    1.6M
  Q, K, V, O            |  86.4%   |    3.1M
  Q, K, V, O + MLP      |  87.6%   |    8.4M
  All linear layers     |  88.0%   |   12.1M

Applying LoRA to MLP layers (gate, up, down projections)
added ~1.2% accuracy at the cost of 2.7× more params.

QLoRA: The Budget Option

QLoRA (Dettmers et al., 2023) quantizes the base model to 4-bit precision and trains LoRA adapters on top. The memory savings are dramatic: in our runs, QLoRA fit in 8-10 GB of GPU memory, versus 18-22 GB for standard LoRA and 112 GB for full fine-tuning.
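A back-of-the-envelope sketch of where the savings come from (decimal GB, 7B parameters; activations, KV cache, and CUDA overhead are not modeled, which is why measured totals run higher):

```python
params = 7e9  # LLaMA 2 7B

bf16_weights_gb = params * 2 / 1e9    # 16-bit base weights: 14.0 GB
int4_weights_gb = params * 0.5 / 1e9  # 4-bit quantized base:  3.5 GB

# LoRA adapters are tiny either way: 12.6M params at r=64 (from the rank
# table above), each held as 4 fp32 tensors during training
# (weight, gradient, and two AdamW moments)
adapter_gb = 12.6e6 * 4 * 4 / 1e9     # ~0.2 GB
```

So 4-bit quantization alone removes roughly 10 GB of weight memory, and the trainable adapter state is a rounding error on top of either base.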

The quality trade-off: QLoRA consistently scored 1-3% below standard LoRA across our tasks. For the sentiment classification task, this gap was negligible (0.89 vs 0.88 F1). For policy generation, it was more noticeable (86.4% vs 85.1%).

Our recommendation: use QLoRA for experimentation and prototyping, LoRA for production fine-tuning. The savings from running a few hours of training on an RTX 4090 instead of an A100 aren't worth the quality gap in a production system.
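For reference, a QLoRA setup sketch along the lines described above, using Hugging Face transformers + peft. The model ID and exact flag values are illustrative, not a verified recipe:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (Dettmers et al., 2023)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # de-quantize into bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters (trained in higher precision) on top of the 4-bit base
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))
```

From there, training proceeds exactly as with standard LoRA; only the base model's storage format changes.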

Training Recipes That Worked

# Our standard LoRA fine-tuning configuration
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                              # rank (sweet spot for most tasks)
    lora_alpha=64,                     # scaling (2× rank is a safe default)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Training hyperparameters
training_args = {
    "learning_rate": 2e-4,             # higher than full FT (1e-5 to 5e-5)
    "warmup_ratio": 0.03,
    "num_train_epochs": 3,             # 3-5 epochs, watch val loss closely
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,  # effective batch size = 32
    "bf16": True,                      # always use bf16 on Ampere+
    "optim": "adamw_torch",
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
}
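As a sanity check on the schedule these hyperparameters imply, assuming the 5K-example policy task as the dataset:

```python
import math

per_device_bs, grad_accum = 4, 8
effective_bs = per_device_bs * grad_accum  # 32 sequences per optimizer step

n_examples, epochs, warmup_ratio = 5000, 3, 0.03
steps_per_epoch = math.ceil(n_examples / effective_bs)  # 157
total_steps = steps_per_epoch * epochs                  # 471
warmup_steps = int(warmup_ratio * total_steps)          # ~14 warmup steps
```

A run this short is why the warmup_ratio matters more than it looks: a fixed warmup step count tuned for a large dataset would eat a meaningful fraction of these 471 steps.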

Common Pitfalls

When to Use What: Decision Framework

  1. Prompting + RAG first — if you can get acceptable quality without fine-tuning, do that. Fine-tuning adds training infrastructure, data management, and model versioning overhead.
  2. QLoRA for rapid experiments — test whether fine-tuning helps at all. Runs on consumer GPUs. If QLoRA shows no improvement, full fine-tuning won't either.
  3. LoRA r=32-64 for production — the sweet spot. 95-98% of full fine-tuning quality at 6-10× lower cost. This is what we deploy.
  4. Full fine-tuning for critical tasks — when you need every last percent of accuracy and have the GPU budget. Policy generation for financial or security contexts. Medical summarization. Tasks where errors have high cost.

"LoRA didn't just make fine-tuning cheaper — it made it practical. Before LoRA, fine-tuning a 7B model was a multi-GPU, multi-day affair that most teams couldn't justify. Now it's a single-GPU, few-hour experiment. That changes the economics of when fine-tuning is worthwhile."