Concept
LoRA (Low-Rank Adaptation)
The most widely used technique for efficiently fine-tuning large language models - adapting billions of parameters to new tasks by updating only a tiny fraction of the total weight count.
Added May 18, 2026
When you fine-tune a large language model by updating all its parameters, you are effectively asking: what is the specific change to every weight in this model that makes it perform better on this task? For a 70-billion parameter model, that is 70 billion numbers you need to find and update.
LoRA is based on the empirical observation that the weight changes needed to adapt a model to a specific task have low intrinsic rank. In linear algebra terms, a low-rank matrix is one that can be expressed as the product of two much smaller matrices. Instead of representing a 4096 x 4096 weight change (16 million numbers), you can approximate it as the product of a 4096 x 16 and a 16 x 4096 matrix (131,000 numbers) - a 99% reduction in parameters, with surprisingly little loss in quality.
In practice, LoRA adds these small matrix pairs alongside key weight matrices in the model (typically the query and value projection weights in attention). The original weights are frozen completely. Only the small adapter matrices are updated during training. When you compute the model's output, you add the adapter contribution (the product of the two small matrices) to the frozen weight matrix's contribution.
The "rank" in low-rank refers to the size of the intermediate dimension (16 in the example above). Lower rank means fewer parameters and lower memory use, but potentially lower adaptation quality. Higher rank captures more nuanced adaptations but costs more. Rank is a hyperparameter you choose based on your task and compute budget. Common choices are 4, 8, 16, or 32.
After training, you can either keep the LoRA weights separate and load them on top of the base model, or merge them into the base model weights permanently. The separate approach means you can share adapters without sharing the full model, and combine multiple adapters for different purposes at inference time.
Analogy
Learning a new dialect of a language you already speak fluently. You do not rebuild your entire understanding of the language from scratch - you add a focused set of new patterns and usages on top of what you already know. LoRA does this for neural networks: it preserves all the existing knowledge in the frozen base model and adds a small targeted overlay of new patterns.
Real-world example
When the research community needed to evaluate whether large models like LLaMA could be fine-tuned for instruction following, researchers used LoRA to produce "Alpaca" and "Vicuna" models - fine-tuned versions of LLaMA that behaved like conversational assistants. The LoRA training ran on a single GPU over a few hours, compared to the weeks of multi-GPU training that produced the base models. The resulting models were qualitatively competitive with early GPT-3.5 for many tasks.
Why it matters
LoRA became the standard method for community-driven model adaptation because it is efficient, accessible, and produces adapter files small enough to share easily. The ability to layer adapters on top of base models without modifying them - and to potentially combine multiple adapters at inference time - opens up a flexible ecosystem for specialised model variants that would be impractical if every customisation required a full model copy.
In the news
No recent coverage - check back later.
Related concepts