LoRA (Low-Rank Adaptation)

The most widely used technique for efficiently fine-tuning large language models - adapting models with billions of parameters to new tasks by training only a small set of added weights, a tiny fraction of the total parameter count.

Added May 18, 2026 · 3 min read

When you fine-tune a large language model by updating all its parameters, you are effectively asking: what is the specific change to every weight in this model that makes it perform better on this task? For a 70-billion-parameter model, that is 70 billion numbers you need to find and update.

LoRA is based on the empirical observation that the weight changes needed to adapt a model to a specific task have low intrinsic rank. In linear algebra terms, a low-rank matrix is one that can be expressed as the product of two much smaller matrices. Instead of representing a 4096 x 4096 weight change (about 16.8 million numbers), you can approximate it as the product of a 4096 x 16 and a 16 x 4096 matrix (about 131,000 numbers in total) - a reduction of more than 99% in parameters, with surprisingly little loss in quality.
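The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch, not part of any LoRA library: the names `A`, `B` and `delta` are illustrative, and the 64 x 4 sizes in the second half are hypothetical values chosen only to keep the rank check fast.

```python
import numpy as np

d, r = 4096, 16                      # dimensions from the example above

full_update = d * d                  # one number per entry of the 4096 x 4096 change
low_rank_update = d * r + r * d      # entries in the 4096 x 16 and 16 x 4096 factors
reduction = 1 - low_rank_update / full_update

print(f"full update:     {full_update:,} numbers")
print(f"low-rank update: {low_rank_update:,} numbers")
print(f"reduction:       {reduction:.1%}")

# Small-scale check (hypothetical sizes): the product of a 64 x 4 and a 4 x 64
# matrix is a full-sized 64 x 64 matrix, but its rank is at most 4.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 4))
B = rng.standard_normal((4, 64))
delta = A @ B
print(delta.shape, np.linalg.matrix_rank(delta))
```

The product has the same shape as a full weight update, yet it is fully described by the two thin factors, which is what makes the saving possible.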

In practice, LoRA adds these small matrix pairs alongside key weight matrices in the model (typically the query and value projection weights in attention). The original weights are frozen completely. Only the small adapter matrices are updated during training. When you compute the model's output, you add the adapter contribution (the product of the two small matrices) to the frozen weight matrix's contribution.
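The forward pass described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: the layer sizes are hypothetical, the function name `lora_forward` is made up for this example, and the `alpha / rank` scaling and the zero initialisation of `B` follow common LoRA practice rather than anything stated in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4        # small hypothetical sizes
alpha = 8.0                          # assumed scaling hyperparameter

W = rng.standard_normal((d_in, d_out))          # frozen pretrained weight, never updated
A = rng.standard_normal((d_in, rank)) * 0.01    # trainable factor, small random init (assumption)
B = np.zeros((rank, d_out))                     # trainable factor, zero init (assumption)

def lora_forward(x, W, A, B, alpha, rank):
    """Frozen-weight output plus the scaled low-rank adapter output."""
    base = x @ W                                # contribution of the frozen weight matrix
    adapter = (x @ A) @ B * (alpha / rank)      # contribution of the two small matrices
    return base + adapter

x = rng.standard_normal((2, d_in))              # a batch of two input vectors
y_start = lora_forward(x, W, A, B, alpha, rank)

# With B at zero the adapter contributes nothing, so the adapted model
# initially behaves exactly like the base model.
print(np.allclose(y_start, x @ W))

# Stand-in for training: only A and B would receive gradient updates.
B = rng.standard_normal((rank, d_out)) * 0.01
y_trained = lora_forward(x, W, A, B, alpha, rank)
print(np.allclose(y_trained, x @ W))
```

Computing `(x @ A) @ B` rather than `x @ (A @ B)` keeps the intermediate result at width `rank`, so the full-sized update matrix never has to be materialised during training.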

The "rank" in low-rank refers to the size of the intermediate dimension (16 in the example above). Lower rank means fewer parameters and lower memory use, but potentially lower adaptation quality. Higher rank captures more nuanced adaptations but costs more. Rank is a hyperparameter you choose based on your task and compute budget. Common choices are 4, 8, 16, or 32.
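To make the trade-off concrete, the short sketch below counts the trainable numbers for one adapted 4096 x 4096 matrix at each of the common ranks. The helper `lora_param_count` is a hypothetical name introduced for this illustration, and the count covers a single matrix only, not a whole model.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable numbers in one adapter pair: (d_in x rank) plus (rank x d_out)."""
    return d_in * rank + rank * d_out

d = 4096
full = d * d                                   # size of the frozen 4096 x 4096 matrix
counts = {rank: lora_param_count(d, d, rank) for rank in (4, 8, 16, 32)}

for rank, n in counts.items():
    print(f"rank {rank:>2}: {n:>8,} trainable numbers ({n / full:.2%} of the full matrix)")
```

The count grows linearly with rank, so doubling the rank doubles the adapter's size and memory cost for that matrix.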

After training, you can either keep the LoRA weights separate and load them on top of the base model, or merge them into the base model weights permanently. The separate approach means you can share adapters without sharing the full model, and combine multiple adapters for different purposes at inference time.
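The two deployment options give the same outputs, which a small NumPy sketch can confirm. The sizes here are hypothetical, the random `A` and `B` are stand-ins for trained adapter factors, and any scaling factor is assumed to be already folded into them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 64, 4                          # small hypothetical sizes
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((d, rank))       # stand-ins for trained adapter factors
B = rng.standard_normal((rank, d))
x = rng.standard_normal((3, d))          # a batch of three input vectors

# Option 1: keep the adapter separate and add its contribution at run time.
y_separate = x @ W + (x @ A) @ B

# Option 2: merge the adapter into the base weight once, then use a single matrix.
W_merged = W + A @ B
y_merged = x @ W_merged

print(np.allclose(y_separate, y_merged))   # same outputs up to floating-point error
print(A.size + B.size, "adapter numbers vs", W.size, "base numbers")
```

Keeping the factors separate means only `A` and `B` need to be stored and shared, while merging removes the extra matrix products at inference time at the cost of producing a full-sized copy of the weights.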

Analogy

Learning a new dialect of a language you already speak fluently. You do not rebuild your entire understanding of the language from scratch - you add a focused set of new patterns and usages on top of what you already know. LoRA does this for neural networks: it preserves all the existing knowledge in the frozen base model and adds a small targeted overlay of new patterns.

Real-world example

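When the research community wanted to know whether large models like LLaMA could be fine-tuned for instruction following, projects such as "Alpaca" and "Vicuna" produced fine-tuned versions of LLaMA that behaved like conversational assistants. The original Alpaca and Vicuna releases used full fine-tuning on multiple data-centre GPUs; the community "Alpaca-LoRA" project then reproduced Alpaca-style results with LoRA, training on a single consumer GPU in a few hours, compared to the weeks of large-scale multi-GPU training that produced the base models. The resulting models were reported to be qualitatively competitive with early GPT-3.5 on many tasks, although those comparisons were informal rather than rigorous benchmarks.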

Why it matters

LoRA became the standard method for community-driven model adaptation because it is efficient, accessible, and produces adapter files small enough to share easily. The ability to layer adapters on top of base models without modifying them - and to potentially combine multiple adapters at inference time - opens up a flexible ecosystem for specialised model variants that would be impractical if every customisation required a full model copy.

Related concepts

Fine-tuning

Taking a general-purpose AI model and giving it additional training on a specific subject, so it becomes noticeably better at that particular domain.

Foundation Model

A large AI model trained on vast amounts of general data, designed to be the starting point for many different applications rather than built for a single task.

Parameter-Efficient Fine-Tuning (PEFT)

A family of techniques for adapting large language models to specific tasks by updating only a small fraction of their parameters - making fine-tuning accessible without massive compute budgets.
