Inference
Using a trained AI model to make predictions or generate outputs - the fast, cheap counterpart to training's slow, expensive computation.
Added May 17, 2026 · 2 min read
Once a model has been trained, the parameters are set. Inference is the act of running the model with those fixed parameters to produce outputs for new inputs. It is what happens when you send a message to ChatGPT, generate an image on Midjourney, or get a product recommendation on an e-commerce site. The model takes your input, passes it through its fixed mathematical operations, and produces an output.
Per request, inference is orders of magnitude cheaper than training. Training a frontier model like GPT-4 reportedly cost upwards of $100 million and took months of compute. A single inference request typically costs a fraction of a cent and completes in seconds. The training cost is amortised across the billions of inference calls made by users worldwide.
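The amortisation argument can be made concrete with a small calculation. This is a minimal sketch using made-up round numbers: the training cost, call volume, and per-call serving cost below are illustrative assumptions, not figures for any real model, and `cost_per_call` is a hypothetical helper.

```python
def cost_per_call(training_cost, total_calls, serving_cost_per_call):
    """Return (amortised training share, total cost) for one inference call."""
    amortised_training = training_cost / total_calls
    return amortised_training, amortised_training + serving_cost_per_call


# Hypothetical figures chosen only to illustrate the arithmetic.
TRAINING_COST = 100_000_000      # one-off training bill, in dollars
TOTAL_CALLS = 10_000_000_000     # inference calls served over the model's lifetime
SERVING_COST = 0.002             # compute cost of a single call, in dollars

training_share, total = cost_per_call(TRAINING_COST, TOTAL_CALLS, SERVING_COST)
print(f"training share per call: ${training_share:.4f}")
print(f"total cost per call:     ${total:.4f}")
```

Under these assumed numbers the one-off training bill adds about one cent to each call. Serving ten times fewer calls would raise that share tenfold, which is why call volume matters so much to the economics.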
This cost asymmetry has significant implications. Large AI companies can afford to train enormous models once, then serve them to millions of users at low per-query cost. Smaller organisations can run inference on models they did not train themselves - using open models hosted on Hugging Face or API access from OpenAI, Anthropic, or Google - without needing the resources to train.
The core computational operation in inference is the forward pass: data flows through the network's layers, each of which multiplies its inputs by weights and applies an activation function, until the final layer produces an output. During training, an additional backward pass computes gradients, which are then used to update the weights. Inference is just the forward pass - no gradient computation, no weight updates.
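A forward pass is small enough to write out by hand. The sketch below assumes a tiny two-layer network in plain Python; the weight values, layer sizes, and function names are hypothetical and chosen only for illustration. It runs an input through fixed parameters and then checks that those parameters are unchanged afterwards.

```python
import copy
import math

# Hypothetical fixed parameters, standing in for the result of training.
W1 = [[0.5, -0.2], [0.1, 0.8]]   # layer 1: 2 inputs -> 2 hidden units
B1 = [0.0, 0.1]
W2 = [[1.0, -1.0]]               # layer 2: 2 hidden units -> 1 output
B2 = [0.05]


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of the inputs plus a bias, then an activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs):
    """Inference: a forward pass only. Parameters are read, never written."""
    hidden = dense(inputs, W1, B1, relu)
    return dense(hidden, W2, B2, sigmoid)


params_before = copy.deepcopy((W1, B1, W2, B2))
output = forward([1.0, 2.0])
print("output:", output)
print("parameters unchanged:", params_before == (W1, B1, W2, B2))
```

Nothing in `forward` computes a gradient or assigns to a weight, so calling it any number of times with the same input gives the same output.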
Inference latency - how fast the model can produce a response - matters a great deal for user experience. This drives significant engineering investment: quantisation, batching, custom hardware, and caching are all used to reduce inference cost, latency, or both. Smaller models that respond in milliseconds are often chosen over larger, more capable models that take seconds.
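Of the techniques listed, quantisation is the easiest to show in a few lines. The following is a minimal sketch of symmetric 8-bit quantisation in plain Python, assuming a single scale factor for the whole weight list. Production libraries use more refined schemes, and the function names and weight values here are hypothetical.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantised]


weights = [0.82, -1.27, 0.003, 0.5]   # illustrative values only
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

print("quantised:", q)
print("largest rounding error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is stored as a small integer plus one shared scale, at the price of a rounding error of at most half a scale step per weight. Smaller weights mean less memory to move, which is one reason quantised models tend to be cheaper to serve.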
Analogy
Think of a musician who spends years practising a piece (training) and then performs it in concert (inference). The practice is slow, iterative, and effortful. The performance is fast and polished, and it draws entirely on what practice built. Inference is the performance.
Real-world example
When you ask a question on Claude.ai, inference is what happens. Claude's parameters were fixed after training; inference runs your question through those parameters to generate an answer. The same parameters handle every question every user asks - there is no learning happening in real time.
Why it matters
Inference is the moment AI becomes useful to people. Understanding the distinction between training (where models learn) and inference (where they are used) clears up common misconceptions - like whether AI learns from your conversations (most do not, because inference does not update weights) and why AI models have knowledge cutoffs (their knowledge is fixed at training time).
Related concepts
Machine Learning
A way of teaching computers by showing them examples, rather than writing explicit rules - the engine behind almost everything labelled AI today.
Parameters
The numbers inside a neural network that get adjusted during training and define everything the model knows and can do.
Training
The process of teaching an AI model by adjusting its internal parameters until it gets better at its task - the computational work that creates intelligence.