Concept
Inference
The process of actually using an AI model to get an answer - as distinct from the training process that built the model in the first place.
Building an AI model and using an AI model are two completely different operations, with different costs, different infrastructure, and different challenges. Training is the process of building the model - feeding it enormous amounts of data, adjusting its internal parameters billions of times until it develops useful capabilities. This happens once (or occasionally, for updates). Inference is everything that happens after: every time someone sends a prompt and gets a response.
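To make the distinction concrete, here is a toy sketch in PyTorch - a tiny stand-in layer, not a real language model. Training computes gradients and updates the weights; inference only reads them.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # toy stand-in for a real model's billions of parameters

# Training: compute a loss, backpropagate, and update the weights.
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(1, 8)).sum()
loss.backward()
optimiser.step()          # the weights change

# Inference: a forward pass only - the weights stay exactly as they are.
with torch.no_grad():
    output = model(torch.randn(1, 8))
```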
When you send a message to an AI, what happens is straightforward in concept but computationally intensive in practice. The model takes your input, processes it through many layers of mathematical operations, and produces an output - one token at a time, generated sequentially until the response is complete. This happens on specialised hardware - typically graphics cards or custom AI chips - that can perform the required calculations very quickly.
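A minimal sketch of that sequential loop, using the small open-source GPT-2 model via the Hugging Face transformers library as a stand-in for the much larger models in production. This uses greedy decoding with no caching or sampling, purely to show the token-at-a-time structure; real serving stacks are far more optimised.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Inference is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                   # one iteration per generated token
        logits = model(ids).logits        # a full pass through every layer
        next_id = logits[0, -1].argmax()  # greedy: take the most likely token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```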
Speed and cost of inference are the dominant economic challenges for AI companies. Training a frontier model costs hundreds of millions of dollars - but that is a one-off expense. Inference costs accumulate with every single request, across millions of users, every day. A response that takes three seconds to generate at low cost is a very different business proposition from one that takes ten seconds and costs ten times as much. Getting inference fast and cheap is one of the most important engineering challenges in the industry.
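A back-of-envelope illustration of how those costs accumulate - every number below is a made-up assumption for illustration, not any real provider's pricing or traffic.

```python
# All figures are hypothetical, for illustration only.
price_per_million_tokens = 3.00  # assumed $/1M generated tokens
tokens_per_response = 500        # assumed average response length
requests_per_day = 10_000_000    # assumed traffic

cost_per_response = tokens_per_response / 1_000_000 * price_per_million_tokens
daily_cost = cost_per_response * requests_per_day

print(f"${cost_per_response:.4f} per response")  # $0.0015
print(f"${daily_cost:,.0f} per day, every day")  # $15,000
```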
This is why there is so much work on inference optimisation - techniques for making models faster and cheaper to run without making them worse at their tasks. Some approaches compress the model so it needs less memory, for example by storing weights at lower numerical precision (quantisation). Others speed up generation by having a small, fast model draft several tokens ahead and letting the large model verify them in one parallel pass - a technique known as speculative decoding. Custom hardware designed specifically for AI inference, rather than repurposed graphics chips, is another active area.
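As a toy illustration of the compression idea, here is naive post-training int8 quantisation of a single weight matrix - a simplified sketch of the principle, not any production scheme.

```python
import numpy as np

def quantise_int8(w):
    """Map float32 weights onto int8 - 1 byte per weight instead of 4."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantise(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # one layer's weight matrix
q, scale = quantise_int8(w)

print(f"memory: {w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")  # 67 -> 17
print(f"worst-case error: {np.abs(w - dequantise(q, scale)).max():.4f}")
```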
Latency - how fast the model responds - matters enormously for user experience. An AI assistant that takes 30 seconds to respond to every message is frustrating to use, regardless of how good the answers are. The race to make inference faster is not just a cost-reduction exercise; it is directly connected to whether AI products feel good to use.
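This is also why chat interfaces stream their output: what users perceive is mostly the time to the first token, not the total generation time. A simulated sketch of the two measurements - the per-token delay here is an invented figure, not a real model's speed.

```python
import time

def fake_stream(tokens, delay=0.05):
    """Simulated streaming model output; the delay is a made-up figure."""
    for token in tokens:
        time.sleep(delay)  # pretend per-token compute
        yield token

start = time.perf_counter()
first = None
for token in fake_stream(["Inference", " is", " the", " hot", " path", "."]):
    if first is None:
        first = time.perf_counter() - start
total = time.perf_counter() - start

print(f"time to first token: {first * 1000:.0f} ms")  # ~50 ms
print(f"total latency:       {total * 1000:.0f} ms")  # ~300 ms
```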
Analogy
Baking bread versus eating bread. Training is the baking - intensive, happens in advance, done once. Inference is every time someone takes a slice - fast, repetitive, where the ongoing cost accumulates. The oven runs once; the bread gets eaten continuously.
Real-world example
When you send a message to Claude or ChatGPT, the model's underlying parameters are fixed - no learning is happening. The model is just computing: given this input, what output should follow? It does this one small piece at a time until your response is complete. That computation is inference.
Why it matters
The business economics of AI products are primarily inference economics. Training is a capital expense. Inference is an ongoing operational cost that scales with every user interaction. Driving inference cost down is what makes AI products financially sustainable - and what determines whether AI can reach hundreds of millions of users or stays expensive and niche.
In the news
DeepSeek V4 Revolutionizes AI Coding with Affordable Pricing
Hacker News · 7h ago
AI Compute Control Concentrates in Few Hands
Fortune · 12h ago
AI Breakthrough Makes LLMs Faster Without Losing Accuracy
Amazon Science · 1d ago
Cerebras AI Chips Challenge Nvidia with Unique Design
Yahoo Finance · 2d ago
NVIDIA Introduces AI-Powered Inference That Changes How Models Respond
NVIDIA Dev Blog · 2d ago
Related concepts