Mixture of Experts (MoE)

An architecture that routes each token through only a small fraction of the model's total parameters - enabling massive scale without proportionally massive compute costs.

Added May 18, 2026

Training a larger model tends to produce a better model - but in a standard dense model it also costs proportionally more to run. Every parameter takes part in processing every token, so a dense model with twice as many parameters needs roughly twice as much compute per token. This trade-off is what Mixture of Experts (MoE) was designed to break.

In an MoE model, the dense feed-forward layers found in standard transformers are replaced with a collection of independent "expert" networks - typically 8, 16, 64, or more, arranged in parallel. A learned routing function, called the gating network (or router), scores every expert for each token and selects the few highest-scoring ones (often just one or two) to process it. Only those selected experts run; the rest contribute nothing for that token. The outputs of the active experts are combined as a weighted sum, with the weights taken from the gating scores.
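The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the linear gate, the toy "experts" that simply scale their input, and all names (`moe_layer`, `router_weights`, `make_expert`) are assumptions made for the example. It applies the softmax over the selected experts' scores only; some implementations instead apply it over all experts before picking the top ones.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_weights, experts, top_k=2):
    """Route one token vector through its top_k experts and mix the outputs.

    token: list of floats (the hidden vector)
    router_weights: one weight vector per expert (hypothetical linear gate)
    experts: list of callables, each mapping a vector to a vector
    """
    # Gating network: one score (logit) per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalise the gate scores over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    # Only the chosen experts run; the others are never called.
    output = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        expert_out = experts[i](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen, gates

# Toy setup: 4 experts that just scale the input by different factors.
calls = []  # records which experts actually ran

def make_expert(scale, idx):
    def expert(vec):
        calls.append(idx)
        return [scale * v for v in vec]
    return expert

experts = [make_expert(s, i) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
output, chosen, gates = moe_layer([1.0, 2.0], router_weights, experts)
print(chosen, [round(g, 3) for g in gates], sorted(calls))
```

In this toy run only experts 2 and 1 are called; experts 0 and 3 do no work for the token, which is where the compute saving comes from.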

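The result is a model that has a very large number of total parameters - the sum of all experts plus the shared layers - but processes each token through only a small fraction of them. A model with 64 experts and top-2 routing runs only 2 of its 64 experts per token, about 3% of its expert parameters. The attention layers and embeddings are shared and always active, so the active fraction of the whole model is somewhat higher than 2/64. Either way, the total parameter count can be enormous while the compute per token stays manageable.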
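To make the arithmetic concrete, the sketch below separates always-active shared parameters from expert parameters. The sizes are hypothetical round numbers chosen only for illustration, and `active_fraction` is a helper invented for this example.

```python
def active_fraction(shared_params, params_per_expert, n_experts, top_k):
    """Return (total, active, active/total) parameter counts for a toy MoE model."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active, active / total

# Hypothetical sizes, chosen only to illustrate the arithmetic.
total, active, frac = active_fraction(
    shared_params=2_000_000_000,      # attention, embeddings (always used)
    params_per_expert=1_000_000_000,  # one expert's weights, summed over all layers
    n_experts=64,
    top_k=2,
)
print(f"total={total:,} active={active:,} fraction={frac:.3f}")
```

With these made-up numbers the model holds 66 billion parameters but touches 4 billion per token, roughly 6% of the total, compared with 2/64 (about 3%) if only the expert weights are counted.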

This matters because parameters are a rough proxy for knowledge capacity. A model with more parameters can store more facts, relationships, and reasoning patterns. MoE lets you scale up that knowledge capacity - making the model "know more" - without proportionally scaling the cost of every inference call. Experts do tend to specialise during training, although the split is rarely as tidy as "one expert for code, another for factual questions": published analyses, such as the one accompanying Mixtral, found routing that follows syntax and token-level patterns more than subject area.

The engineering challenges are real. Routing tokens across many experts introduces communication overhead, especially in distributed training where experts may live on different hardware. Load balancing is non-trivial - you want all experts used roughly equally, not all traffic flooding two popular experts while the rest sit idle. Modern MoE implementations include auxiliary losses during training specifically to encourage balanced routing.
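One widely used form of auxiliary balancing loss, introduced with the Switch Transformer, multiplies the fraction of tokens sent to each expert by the average routing probability that expert received, sums over experts, and scales by the number of experts. The sketch below computes that quantity on toy data; the function name and inputs are illustrative, and a given model may use a different formulation. In real training the value is multiplied by a small coefficient and added to the main loss.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary loss of the form n_experts * sum_i f_i * P_i.

    router_probs: per token, a list of n_experts routing probabilities
    assignments: per token, the list of expert indices it was sent to
    """
    n_tokens = len(router_probs)
    n_slots = sum(len(a) for a in assignments)
    # f_i: fraction of routed token slots that went to expert i.
    f = [0.0] * n_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / n_slots
    # P_i: mean routing probability the gate gave expert i.
    p = [sum(tok[i] for tok in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens spread evenly over four experts.
balanced = load_balance_loss(
    router_probs=[[0.25] * 4] * 4,
    assignments=[[0], [1], [2], [3]],
    n_experts=4,
)
# Four tokens all flooding expert 0.
collapsed = load_balance_loss(
    router_probs=[[0.97, 0.01, 0.01, 0.01]] * 4,
    assignments=[[0], [0], [0], [0]],
    n_experts=4,
)
print(round(balanced, 2), round(collapsed, 2))
```

Perfectly even routing gives the minimum value of 1.0, while the collapsed case scores close to 4, so minimising this term pushes the router toward spreading tokens across experts.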

Analogy

A hospital with many specialised consultants. When a patient arrives, a triage system (the gating network) reads their case and routes them to the two most relevant specialists (the experts). No single doctor needs to master every speciality - that breadth is covered by the collective of consultants. The patient receives highly specialised attention without the entire hospital mobilising for every case.

Real-world example

Mistral's Mixtral 8x7B model demonstrated the practical power of MoE. It has roughly 47 billion total parameters but uses only about 13 billion of them per token, and Mistral reported that it matched or outperformed much larger dense models such as Llama 2 70B on most benchmarks. The total is lower than the 8 x 7 = 56 billion the name suggests because only the feed-forward blocks are replicated per expert; the attention layers are shared. Google has described Gemini 1.5 as an MoE model, and GPT-4 is widely believed to use an MoE architecture as well, though OpenAI has not officially confirmed this.
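The gap between total and per-token parameters can be reproduced with a back-of-envelope count. The configuration values below are assumed from the publicly released Mixtral 8x7B model configuration and should be checked against it; small terms such as normalisation weights are ignored, so the result is an approximation rather than an official figure.

```python
# Assumed Mixtral 8x7B configuration values (verify against the released config).
DIM = 4096          # hidden size
FFN_DIM = 14336     # expert feed-forward inner size
N_LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8      # grouped-query attention: fewer key/value heads
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

# One expert = three weight matrices (gate, up, and down projections).
expert = 3 * DIM * FFN_DIM
# Attention: query and output projections are full size, key and value are smaller.
attention = 2 * DIM * (N_HEADS * HEAD_DIM) + 2 * DIM * (N_KV_HEADS * HEAD_DIM)
# Router: one score per expert for each hidden vector.
router = DIM * N_EXPERTS
# Input embedding plus output projection; small norm weights are ignored.
embeddings = 2 * VOCAB * DIM

shared = N_LAYERS * (attention + router) + embeddings
total = shared + N_LAYERS * N_EXPERTS * expert
active = shared + N_LAYERS * TOP_K * expert

print(f"total  ~ {total / 1e9:.1f}B")
print(f"active ~ {active / 1e9:.1f}B")
```

Under these assumptions the count lands at about 46.7 billion total and 12.9 billion active parameters. Nearly all of the total sits in the expert feed-forward blocks, while the shared attention and embedding weights are counted once, which is why the total falls short of 8 x 7 billion.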

Why it matters

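MoE is one of the most important efficiency advances in large language models. It breaks the roughly linear relationship between a model's parameter count and its per-token compute, which is what makes it possible to build extremely capable models that are still economically viable to run. The saving is in compute rather than memory: every expert still has to be stored and kept available, so an MoE model needs as much memory as a dense model of the same total size. As model providers compete on both quality and cost, MoE architectures are likely to become increasingly prevalent.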
