latentbrief

Grouped Query Attention (GQA)

A more memory-efficient variant of multi-head attention in which groups of query heads share key-value heads, cutting KV-Cache memory without meaningfully hurting quality.

Added May 18, 2026 · 2 min read

Standard multi-head attention maintains a separate set of key and value vectors for every attention head. If a model has 32 attention heads, each layer stores 32 key vectors and 32 value vectors in the KV-Cache for every token in the context. Because the cache grows linearly with both context length and batch size, it becomes a significant memory burden for models serving long contexts or large batches.
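To make the memory burden concrete, here is a rough sketch of the cache-size arithmetic. Only the 32 heads come from the text above; the layer count, head dimension, context length, and 16-bit storage are assumptions chosen for illustration, and `kv_cache_bytes` is a hypothetical helper rather than a function from any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token in every layer."""
    keys_and_values = 2  # one key vector and one value vector per head
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape: 32 layers, 32 heads of dimension 128,
# an 8192-token context, and a 16-bit (2-byte) cache.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                           seq_len=8192)
print(f"{mha_bytes / 2**30:.1f} GiB per sequence")  # 4.0 GiB per sequence
```

Every factor in the product is linear, so doubling the batch size or the context length doubles the cache.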

Grouped Query Attention is a middle-ground solution between two extremes. Multi-head attention (MHA) gives every query head its own keys and values - maximum expressiveness, maximum memory. Multi-query attention (MQA) goes to the other extreme: all query heads share a single set of keys and values - minimum memory, but some loss of representational richness. GQA groups the query heads into clusters, with each group sharing a set of keys and values. A model with 32 query heads might have 8 key-value heads, with every 4 query heads sharing one KV set.
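The grouping can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it handles a single query position, ignores masking and batching, and assumes that consecutive query heads form a group, which is a common convention but not something the text above specifies. The names `kv_head_for` and `gqa_single_query` are hypothetical.

```python
import math


def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the key-value head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size  # assumed: consecutive heads share a group


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_single_query(queries, keys, values):
    """Attention output for one query position.

    queries: [num_query_heads][head_dim]
    keys, values: [num_kv_heads][seq_len][head_dim]
    returns: [num_query_heads][head_dim]
    """
    num_query_heads, num_kv_heads = len(queries), len(keys)
    head_dim = len(queries[0])
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for head, query in enumerate(queries):
        group = kv_head_for(head, num_query_heads, num_kv_heads)
        # Every query head in the group scores against the same cached keys.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in keys[group]]
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, values[group]))
                        for d in range(head_dim)])
    return outputs


# MHA and MQA are the two extremes of the same mapping.
print([kv_head_for(h, 8, 8) for h in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
print([kv_head_for(h, 8, 2) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
print([kv_head_for(h, 8, 1) for h in range(8)])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Only the key and value tensors shrink. The number of query heads, and therefore the shape of the attention output, is unchanged.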

The practical result is a dramatic reduction in KV-Cache size. The cache shrinks by the ratio of query heads to key-value heads, typically 4x to 8x relative to full multi-head attention, while maintaining nearly all of the quality. Empirically, grouped query attention models perform almost identically to full multi-head attention models on standard benchmarks, while being significantly cheaper to run at inference time.
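The size of the reduction follows directly from the head counts. As a small sketch, assuming everything else about the model stays fixed (layers, head dimension, context length, precision), the cache shrinks by the number of query heads divided by the number of key-value heads. `kv_cache_reduction` is a hypothetical helper, and the 32-head model is the example used earlier.

```python
def kv_cache_reduction(num_query_heads, num_kv_heads):
    """How many times smaller the KV cache is than under full MHA."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    return num_query_heads // num_kv_heads


# A 32-query-head model under different key-value head counts.
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("GQA", 4), ("MQA", 1)]:
    factor = kv_cache_reduction(32, kv_heads)
    print(f"{name} with {kv_heads:>2} KV heads: cache is {factor}x smaller than MHA")
```

With 32 query heads, 8 key-value heads give the 4x end of the range and 4 key-value heads give the 8x end.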

GQA has been adopted by most recent high-performance open models, including the LLaMA 3 family, Mistral, and Qwen. It is one of a cluster of architectural improvements that emerged specifically from the need to make large models practical to deploy, rather than from a desire to improve quality on benchmarks. The improvements are invisible to users but directly determine whether a given model can be served at reasonable cost.

Analogy

A group of detectives sharing evidence boards. In standard multi-head attention, every detective has their own dedicated evidence board. In grouped query attention, pairs of detectives share one board between them. They bring different questions to the same evidence, and slightly less information is available per detective, but the investigation proceeds nearly as effectively while using half the wall space.

Real-world example

LLaMA 3, Meta's widely used open-weight model family, uses grouped query attention with 8 key-value heads for its 70-billion parameter model. With 64 query heads, this makes the KV-Cache 8x smaller than it would be under full multi-head attention. That frees GPU memory for longer contexts and larger batches, and it can reduce the number of GPUs a deployment needs, directly affecting who can afford to deploy the model and at what scale.
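A back-of-the-envelope calculation shows what this saves. The 8 key-value heads come from the text above; the other shape values (80 layers, 64 query heads, head dimension 128) are assumptions taken from the publicly released 70B model configuration, and the 16-bit cache and 8192-token context are illustrative choices. The figures cover only the KV-Cache, not the model weights.

```python
# Assumed LLaMA 3 70B shape (from the public model config, not this article).
NUM_LAYERS = 80
NUM_QUERY_HEADS = 64
NUM_KV_HEADS = 8       # stated in the text above
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # assumed 16-bit cache
CONTEXT_TOKENS = 8192  # illustrative context length


def kv_bytes_per_token(num_kv_heads):
    """Bytes of keys plus values cached for one token across all layers."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE


gqa_per_token = kv_bytes_per_token(NUM_KV_HEADS)     # as shipped
mha_per_token = kv_bytes_per_token(NUM_QUERY_HEADS)  # hypothetical MHA variant

for name, per_token in [("GQA", gqa_per_token), ("MHA", mha_per_token)]:
    total_gib = per_token * CONTEXT_TOKENS / 2**30
    print(f"{name}: {per_token / 2**10:.0f} KiB per token, "
          f"{total_gib:.1f} GiB per {CONTEXT_TOKENS}-token sequence")
# GQA: 320 KiB per token, 2.5 GiB per 8192-token sequence
# MHA: 2560 KiB per token, 20.0 GiB per 8192-token sequence
```

Under these assumptions, each concurrent full-length sequence costs 2.5 GiB of cache instead of 20 GiB, which is where the savings in batch size and hardware come from.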

Why it matters

GQA represents an important shift in how the field thinks about model design: quality on benchmarks is not the only axis that matters. A model that achieves 99% of the performance of the best architecture at 25% of the memory cost is a substantially better product for the people who actually have to run it. Architectural efficiency has become a first-class design concern.
