latentbrief
← Back to concepts

Concept

SwiGLU

The activation function used inside the feed-forward layers of most modern language models - a small but significant architectural detail that measurably improves model quality.

Added May 18, 2026

Every transformer layer contains two main components: the attention mechanism, which allows tokens to communicate with each other, and a feed-forward network, which processes each token independently. The feed-forward network is where the model does most of its "thinking" about individual tokens - transforming their representations in complex non-linear ways that encode the model's learned knowledge.

At the heart of any feed-forward network is an activation function - a non-linear transformation that gives the network its expressive power. If activation functions were removed, the whole stack of layers would collapse into a single linear transformation, incapable of learning complex relationships. The choice of activation function has a measurable effect on how well the model learns.

SwiGLU combines two ideas: the Swish activation function (which is a smooth, differentiable alternative to the popular ReLU, producing slightly better gradients) and Gated Linear Units (GLU), which add a learned multiplicative gate to the transformation. Instead of simply applying a fixed non-linear transformation, a GLU uses one linear transformation to compute an input and another to compute a gate, then multiplies them together. The gate learns to selectively allow or block information flowing through the network at each position.

The combination produces a feed-forward transformation that is slightly more expensive than vanilla alternatives but empirically better. Google's PaLM paper, which popularised SwiGLU for large language models, showed consistent improvements in perplexity (a measure of how well the model predicts text) compared to standard alternatives. Meta's LLaMA models adopted SwiGLU, and it has since become standard in essentially all high-performance open models.

Analogy

A very fine-grained filter in a water purification system - one that does not just allow or block water, but adjusts its own mesh dynamically based on what it detects in the water. Standard activation functions have a fixed shape; SwiGLU's gating mechanism adapts what it passes based on learned patterns, giving the network more nuanced control over information flow.

Real-world example

When researchers ablate architectural choices in language models - replacing SwiGLU with older activation functions while keeping everything else equal - they consistently see measurable degradation in the model's performance on language benchmarks. The effect is small per layer but compounds across dozens of layers. For models of any significant scale, SwiGLU is simply the better choice.

Why it matters

SwiGLU is an example of how language model improvements often come from many small, carefully validated choices rather than single dramatic breakthroughs. No single change makes an enormous difference, but a model that gets 10 small details right will substantially outperform one that gets them wrong. This kind of architectural hygiene is one reason well-resourced labs maintain persistent advantages.

In the news

Related concepts