Rutvik Acharya

For most of 2022, fine-tuning large language models felt like something only labs and well-funded companies could do. The recipe was expensive: load the whole model, update billions of parameters, keep optimizer states in memory, and hope your GPUs did not melt the budget.

Parameter-efficient fine-tuning changed the conversation.

LoRA made it possible to adapt models by training small low-rank adapter matrices instead of updating every parameter. QLoRA pushed the idea further by combining quantization with LoRA, making it realistic to fine-tune useful open-weight models on much smaller hardware.

That mattered because the question changed from “Can we train a model?” to “Can we cheaply specialize a model for our task?”

The Problem With Full Fine-Tuning

Full fine-tuning updates all model weights. For a 7B parameter model, that means billions of trainable values plus gradients and optimizer state. With Adam-style optimizers, memory usage can balloon far beyond the raw model size.
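To put a rough number on "far beyond the raw model size", here is a back-of-the-envelope estimate. It assumes a common mixed-precision Adam setup: 16-bit weights and gradients, plus 32-bit master weights and two 32-bit Adam moments. Those byte counts are assumptions about the training setup, and activation memory is ignored entirely.

```python
# Rough memory estimate for full fine-tuning with Adam in mixed precision.
# Byte counts per parameter are assumptions about a typical setup.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def full_finetune_gb(num_params: int) -> float:
    """Estimated training-state memory in GB (10**9 bytes), excluding activations."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9


weights_only_gb = 7_000_000_000 * 2 / 1e9      # 16-bit weights alone
training_gb = full_finetune_gb(7_000_000_000)  # weights + gradients + optimizer state
print(f"weights only: {weights_only_gb:.0f} GB, training state: {training_gb:.0f} GB")
# weights only: 14 GB, training state: 112 GB
```

Under these assumptions the training state is about eight times the size of the 16-bit weights, before a single activation is stored.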

That is painful for three reasons:

  • You need expensive GPUs
  • Every task produces a full copy of the model
  • Experimentation becomes slow and risky

If all you want is a support assistant that follows a company’s tone and schema, updating every parameter is overkill.

LoRA In Plain English

LoRA stands for Low-Rank Adaptation. Instead of changing the original weight matrix directly, LoRA freezes the base model and learns a small update.

Conceptually:

original output = W x
adapted output = W x + B A x

The original matrix W stays frozen. The trainable matrices A and B are much smaller because they are low rank: if W is d × k, then B is d × r and A is r × k, with the rank r far smaller than d or k.

That means:

  • Fewer trainable parameters
  • Less GPU memory
  • Faster experiments
  • Small adapter files that can be swapped in and out
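The parameter savings follow directly from the matrix shapes. As an illustration, assume a square 4096 x 4096 projection (a size picked for the example, not taken from a specific model) and rank 8.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full_matrix_params, lora_params) for one weight matrix.

    W is d_out x d_in; LoRA adds B (d_out x r) and A (r x d_in).
    """
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora


full, lora = lora_param_counts(d_out=4096, d_in=4096, r=8)
print(full, lora, f"{lora / full:.2%}")
# 16777216 65536 0.39%
```

For this one matrix, the adapter trains well under one percent of the values a full update would touch.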

The base model keeps its general language ability. The adapter nudges it toward a specific behavior.
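A minimal sketch of that forward pass in plain Python, using tiny hand-written matrices. The `scale` argument and the zero-initialized B are assumptions drawn from the usual LoRA setup, where B starts at zero so the adapter is a no-op before training; real implementations operate on tensors, not lists.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]          # frozen 2x3 base weight
A = [[0.5, -1.0, 2.0]]         # 1x3, so rank r = 1
B_zero = [[0.0], [0.0]]        # 2x1, zero init: adapter starts as a no-op
B_trained = [[1.0], [-2.0]]    # made-up values standing in for a trained B
x = [1.0, 1.0, 1.0]

print(lora_forward(W, A, B_zero, x))     # [3.0, 4.0], identical to W x
print(lora_forward(W, A, B_trained, x))  # [4.5, 1.0]
```

W never changes between the two calls. Only the small B matrix differs, which is why an adapter can be stored and swapped separately from the base model.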

What QLoRA Added

QLoRA made the frozen base model cheaper to keep in memory by loading it in low precision, commonly 4-bit quantization, while still training LoRA adapters.

The rough idea:

  • Store the base model in a compressed 4-bit representation
  • Freeze those quantized weights
  • Train small LoRA adapter weights
  • Backpropagate through the quantized model into the adapters
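To make the first step less abstract, here is a simplified quantization sketch. QLoRA itself uses a 4-bit NormalFloat (NF4) data type. The uniform absmax scheme below is a deliberately simpler stand-in, and the function names are hypothetical. It only shows the pattern: store small integer codes plus one scale per block, then dequantize when the weights are needed for computation.

```python
def quantize_block(values, levels=7):
    """Quantize one block of floats to integers in [-levels, levels] plus one scale.

    With levels=7 there are 15 possible codes, which fits in 4 bits.
    """
    absmax = max(abs(v) for v in values) or 1.0   # avoid dividing by zero
    scale = absmax / levels
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats; this is what the forward pass would use."""
    return [c * scale for c in codes]


weights = [1.75, -0.75, 0.3, 0.0]       # toy values chosen for exact arithmetic
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
print(codes)      # [7, -3, 1, 0]
print(restored)   # [1.75, -0.75, 0.25, 0.0]
```

The third weight comes back as 0.25 instead of 0.3. That rounding error is the price of the compression, and the adapters are trained in higher precision on top of these approximate frozen weights.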

This made fine-tuning a 7B or 13B class model far more accessible. It did not make training free, and it did not remove the need for good data, but it changed what an individual developer or small team could attempt.

The important nuance is that QLoRA does not train a tiny model. It trains a small number of adapter parameters while still using the representational power of the larger frozen model. You are not asking a 300M parameter model to become a domain expert. You are taking a capable base model and teaching it a new behavioral layer.

That is why QLoRA works best when the base model already has the underlying capability. If the model can already read contracts, classify support tickets, or produce JSON with prompting, an adapter can make that behavior more reliable. If the base model cannot do the task at all, QLoRA will not magically add deep reasoning ability.

Rank, Alpha, and Where To Attach LoRA

LoRA has a few knobs that matter more than they first appear.

The rank r controls the size of the adapter matrices. Higher rank gives the adapter more capacity, but also increases memory use and overfitting risk.

r = 4 or 8: small behavior nudge
r = 16 or 32: common general-purpose range
r = 64+: more capacity, more risk, more memory

lora_alpha scales the adapter contribution. A high alpha makes the adapter more aggressive. A low alpha keeps it more conservative.
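In the standard LoRA formulation, the adapter update is multiplied by lora_alpha divided by the rank, so alpha only means something relative to r. The helper below is a hypothetical illustration of that ratio, not a library function, and some variants use a different scaling rule.

```python
def lora_scale(lora_alpha: float, r: int) -> float:
    """Scaling applied to the adapter update B A x in standard LoRA."""
    return lora_alpha / r


for r, alpha in [(8, 16), (16, 16), (16, 32), (64, 16)]:
    print(f"r={r:<3} alpha={alpha:<3} scale={lora_scale(alpha, r)}")
```

The four settings give scales of 2.0, 1.0, 2.0, and 0.25. Raising the rank while leaving alpha fixed quietly shrinks the adapter's contribution, which is one reason to change one knob at a time.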

You also choose which modules receive adapters. Many LLM fine-tunes target attention projection layers like q_proj, k_proj, v_proj, and o_proj. Some recipes also adapt MLP projection layers for more capacity.
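The choice of target modules changes the adapter size more than it might seem. The sketch below assumes a 7B-class decoder with hidden size 4096, MLP size 11008, and 32 layers. Those shapes, and the MLP module names gate_proj, up_proj, and down_proj, are illustrative assumptions and are not read from a real checkpoint.

```python
# Assumed shapes for a 7B-class decoder; illustrative, not from a real checkpoint.
HIDDEN, MLP, LAYERS = 4096, 11008, 32
MODULE_SHAPES = {                      # (d_out, d_in) per module
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (MLP, HIDDEN),
    "up_proj": (MLP, HIDDEN),
    "down_proj": (HIDDEN, MLP),
}


def trainable_params(target_modules, r):
    """Total LoRA parameters when each listed module in every layer gets an adapter."""
    per_layer = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_layer * LAYERS


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
with_mlp = attention_only + ["gate_proj", "up_proj", "down_proj"]
for r in (8, 16, 64):
    print(r, trainable_params(attention_only, r), trainable_params(with_mlp, r))
```

Under these assumptions, rank 16 on the attention projections trains about 16.8 million parameters. Adding the MLP projections brings that to roughly 40 million, still a small fraction of a 7B model.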

There is no universal best setting. The practical method is to start with a known stable recipe, hold out evals, and change one knob at a time.

When Fine-Tuning Beats Prompting

Prompting is still the first tool to reach for. It is faster, reversible, and does not require a training pipeline.

Fine-tuning starts to make sense when the behavior is repetitive and hard to express compactly in a prompt.

Good candidates:

  • Consistent output schemas
  • Domain-specific extraction
  • Style and tone adaptation
  • Classification with unusual labels
  • Multi-turn behavior patterns
  • Following narrow business rules

Bad candidates:

  • Teaching new factual knowledge that changes often
  • Fixing poor retrieval
  • Making a weak base model reason like a much stronger model
  • Solving a problem you cannot evaluate

Fine-tuning changes behavior. Retrieval supplies knowledge. Mixing those up is one of the most common mistakes teams make with LLM systems.

The Data Matters More Than The Method

LoRA made fine-tuning cheaper, but it did not make bad datasets useful.

A small, clean dataset often beats a large messy one. For instruction tuning, the examples should show exactly the behavior you want:

{
  "instruction": "Extract the renewal date and cancellation window.",
  "input": "The subscription renews on March 15. Cancellation requires 30 days notice.",
  "output": {
    "renewal_date": "March 15",
    "cancellation_notice_days": 30
  }
}

The model learns the pattern from the examples. If the examples are inconsistent, verbose in random places, or full of contradictory labels, the adapter learns that too.

Dataset formatting is part of the training signal. If production prompts use a chat format, train in a chat format. If the model should answer with strict JSON, every example should demonstrate strict JSON. If refusals matter, include refusals.

For an assistant-style model, a training example should usually include:

  • A system message that resembles production
  • A user request
  • Optional retrieved context or tool output
  • The exact assistant behavior you want

{
  "messages": [
    {"role": "system", "content": "You extract renewal terms from contracts."},
    {"role": "user", "content": "Extract the renewal date and notice period.\n\nContract: ..."},
    {"role": "assistant", "content": "{\"renewal_date\":\"March 15\",\"notice_days\":30}"}
  ]
}
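One way to keep the two formats aligned is to generate the chat version from the instruction-style record with a single function, so that every target is serialized the same way. The function name and the choice of compact JSON separators are assumptions for this sketch.

```python
import json


def to_chat_example(record: dict, system_prompt: str) -> dict:
    """Convert an instruction/input/output record into a chat-style training example."""
    user_content = f"{record['instruction']}\n\n{record['input']}"
    # Compact separators so every target has the same strict JSON shape.
    assistant_content = json.dumps(record["output"], separators=(",", ":"))
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": assistant_content},
        ]
    }


record = {
    "instruction": "Extract the renewal date and cancellation window.",
    "input": "The subscription renews on March 15. Cancellation requires 30 days notice.",
    "output": {"renewal_date": "March 15", "cancellation_notice_days": 30},
}
example = to_chat_example(record, "You extract renewal terms from contracts.")
print(example["messages"][-1]["content"])
# {"renewal_date":"March 15","cancellation_notice_days":30}
```

Serializing the target in code, instead of writing it by hand, removes one easy source of inconsistent whitespace and key order across examples.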

The easiest way to ruin a fine-tune is to mix incompatible behaviors: some examples answer in prose, some in JSON, some include citations, some do not. The model averages the mess.
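A quick audit can catch mixed formats before training. The sketch below only distinguishes "parses as a JSON object" from "anything else", which is an assumed, deliberately crude definition of format. A real audit would likely check keys, citations, and length as well.

```python
import json
from collections import Counter


def assistant_format(content: str) -> str:
    """Label an assistant message 'json' if it parses as a JSON object, else 'prose'."""
    try:
        return "json" if isinstance(json.loads(content), dict) else "prose"
    except json.JSONDecodeError:
        return "prose"


def audit_formats(examples: list[dict]) -> Counter:
    """Count assistant output formats across chat-style examples."""
    counts = Counter()
    for example in examples:
        for message in example["messages"]:
            if message["role"] == "assistant":
                counts[assistant_format(message["content"])] += 1
    return counts


examples = [
    {"messages": [{"role": "assistant", "content": "{\"renewal_date\":\"March 15\",\"notice_days\":30}"}]},
    {"messages": [{"role": "assistant", "content": "The contract renews on March 15."}]},
]
print(audit_formats(examples))  # more than one key means the dataset mixes formats
```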

Overfitting Looks Different For LLMs

Classic overfitting means the model does well on training data and poorly on held-out data. That still applies, but LLM fine-tuning adds behavioral overfitting.

Symptoms:

  • The model uses the same phrasing too often
  • It follows the training schema even when the user asks for something else
  • It becomes worse at general instruction following
  • It refuses too often because the dataset over-represented refusals
  • It memorizes sensitive strings from examples

This is why the held-out set should include realistic variation. Do not only test examples that look like your training rows. Include messy user phrasing, missing fields, irrelevant context, and adversarial formatting.

Also keep an eye on base-model regression. After training, test a few general capabilities you still need: summarization, simple reasoning, following length constraints, and tool-call formatting.
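A regression check does not need much machinery. The sketch below assumes the model is wrapped as a function from prompt to reply. The cases, prompts, and the `fake_tuned_model` stand-in are all placeholders, and the word-count and exact-match checks are far cruder than a real eval.

```python
from typing import Callable


def within_word_limit(limit: int) -> Callable[[str], bool]:
    """Build a check that passes when the reply has at most `limit` words."""
    return lambda reply: len(reply.split()) <= limit


# Each case: (name, prompt, check). Prompts and limits are placeholders.
REGRESSION_CASES = [
    ("length constraint", "Summarize in at most 5 words: LoRA trains small adapters.", within_word_limit(5)),
    ("simple reasoning", "What is 2 + 3? Answer with the number only.", lambda reply: reply.strip() == "5"),
]


def run_regression(generate: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through `generate` and report pass/fail per case."""
    return {name: check(generate(prompt)) for name, prompt, check in REGRESSION_CASES}


def fake_tuned_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers in the training schema."""
    return '{"renewal_date": null, "notice_days": null}'


print(run_regression(fake_tuned_model))
# {'length constraint': True, 'simple reasoning': False}
```

The stand-in mimics the behavioral overfitting described above: it emits the training schema no matter what was asked, so it fails the reasoning case. Running the same cases against the base model before training gives the comparison point.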

A Practical Workflow

A good lightweight workflow looked like this:

  1. Start with a capable open-weight base model
  2. Build 200 to 2,000 high-quality examples
  3. Hold out a small evaluation set
  4. Train a LoRA or QLoRA adapter
  5. Compare against a prompting-only baseline
  6. Test on real edge cases before deployment
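Step 3 is worth doing before any training run, with a fixed seed so the held-out set stays the same across experiments. The 10 percent fraction and the function name are assumptions for this sketch.

```python
import random


def split_examples(examples: list, eval_fraction: float = 0.1, seed: int = 0):
    """Shuffle with a fixed seed and split into (train, held_out)."""
    shuffled = list(examples)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


examples = [{"id": i} for i in range(500)]
train, held_out = split_examples(examples)
print(len(train), len(held_out))  # 450 50
```

A random split only guards against classic overfitting. Hand-written edge cases still need to be added to the held-out set separately.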

The baseline is important. Many teams fine-tuned before checking whether a better prompt or retrieval setup solved the problem.

In practice, the comparison should include at least three candidates:

| Candidate | What it tells you |
| --- | --- |
| Base model + best prompt | Whether fine-tuning is needed at all |
| Base model + few-shot prompt | Whether examples in context are enough |
| Fine-tuned adapter + shorter prompt | Whether training improved reliability or cost |

That last candidate is often the win. Fine-tuning may not produce a dramatically smarter model, but it can reduce prompt length, stabilize formatting, and lower per-request cost.
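A comparison like this can be scored with the same function for every candidate. The sketch below assumes each candidate is a function from prompt to reply, reuses the renewal_date and notice_days schema from the earlier chat example, and uses word count as a rough proxy for prompt cost. The `stub_model` is a placeholder, not a real model call, so the printed numbers say nothing about real performance.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"renewal_date", "notice_days"}  # schema assumed for this sketch


def is_valid(reply: str) -> bool:
    """True when the reply is a JSON object with exactly the required keys."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == REQUIRED_KEYS


def score(candidate: Callable[[str], str], prompt_template: str, inputs: list[str]) -> dict:
    """Report schema-valid rate and a rough prompt size for one candidate."""
    replies = [candidate(prompt_template.format(contract=text)) for text in inputs]
    return {
        "valid_rate": sum(is_valid(r) for r in replies) / len(replies),
        "prompt_words": len(prompt_template.split()),
    }


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; returns a fixed, valid reply."""
    return '{"renewal_date": "March 15", "notice_days": 30}'


result = score(stub_model, "Extract renewal terms as JSON.\n\nContract: {contract}", ["Renews on March 15."])
print(result)
```

Running `score` once per row of the table, on the same held-out inputs, puts reliability and prompt length side by side.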

Serving Adapters

Training is only half the story. You also need to serve the adapter.

There are two common deployment patterns:

  • Merged model: combine the adapter with the base weights and serve one model artifact
  • Dynamic adapter: load a base model once and swap adapters per task or tenant

Merged models are simpler operationally. Dynamic adapters are more flexible if you have many specialized behaviors. For example, one base model might serve separate adapters for contract extraction, support triage, and SQL query drafting.
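Merging is just matrix arithmetic: the low-rank product is added into the base weight once, and the result is served as an ordinary matrix. The sketch below uses tiny hand-written matrices and a hypothetical `merge_lora` helper. The `scale` argument stands for whatever scaling the adapter was trained with.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A): a single matrix with the adapter folded in."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]      # frozen 2x3 base weight
A = [[0.5, -1.0, 2.0]]     # 1x3
B = [[1.0], [-2.0]]        # 2x1
merged = merge_lora(W, A, B)
print(merged)  # [[1.5, 1.0, 2.0], [-1.0, 3.0, -1.0]]
```

After merging there is no extra matrix multiply at inference time, but the adapter can no longer be swapped out without reloading the original base weights.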

The tradeoff is complexity. Dynamic adapter routing requires careful versioning, warmup, memory planning, and evals per adapter. It is powerful, but it can become its own platform.
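The memory-planning part of that tradeoff can be sketched as a small least-recently-used cache. Everything here is hypothetical: the `AdapterCache` class, the loader callback, and the `name@version` identifiers are illustrations, not the API of any serving framework.

```python
from collections import OrderedDict
from typing import Callable


class AdapterCache:
    """Keep at most `capacity` adapters loaded, evicting the least recently used."""

    def __init__(self, load_adapter: Callable[[str], object], capacity: int = 2):
        self._load = load_adapter          # hypothetical loader, e.g. reads adapter weights
        self._capacity = capacity
        self._loaded = OrderedDict()       # adapter_id -> loaded adapter

    def get(self, adapter_id: str) -> object:
        if adapter_id in self._loaded:
            self._loaded.move_to_end(adapter_id)      # mark as recently used
        else:
            if len(self._loaded) >= self._capacity:
                self._loaded.popitem(last=False)      # evict least recently used
            self._loaded[adapter_id] = self._load(adapter_id)
        return self._loaded[adapter_id]

    def loaded_ids(self) -> list[str]:
        return list(self._loaded)


cache = AdapterCache(load_adapter=lambda adapter_id: f"weights:{adapter_id}")
for task in ["contract-extraction@v3", "support-triage@v1", "contract-extraction@v3", "sql-drafting@v2"]:
    cache.get(task)
print(cache.loaded_ids())  # ['contract-extraction@v3', 'sql-drafting@v2']
```

Pinning a version in the identifier keeps each adapter tied to the eval results that justified deploying it.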

Where It Still Fails

QLoRA lowered the barrier, but it did not remove the hard parts.

Fine-tuned models can overfit. They can become worse at general instruction following. They can learn formatting quirks from the dataset. They can appear better on examples similar to training data while failing on realistic inputs.

There is also deployment complexity. Training and serving a quantized model with adapters is easier than running a full fine-tune, but it is still more operational work than calling an API.

The Takeaway

LoRA and QLoRA were a turning point because they made customization feel practical.

They did not replace prompting, retrieval, or evaluation. They gave builders another lever: when the base model is broadly capable but not quite shaped for your workflow, train a small adapter instead of dragging the whole model through a full fine-tune.

The right question is not “Should we fine-tune?” It is:

Do we have a repeated behavior, enough clean examples, and an eval that proves the adapter helps?

If yes, QLoRA made that experiment cheap enough to try.