Test-Time Compute

Spending more computation during inference - at the moment of answering - to improve quality, rather than only investing compute during training.

Added May 18, 2026 · 3 min read

The conventional view of AI model capability is that it is fixed by training: a larger model trained on more data for longer will be more capable, and once training is done, the capability is locked in. Test-time compute challenges this framing. It turns out that you can extract significantly better performance from a fixed model by spending more computation at the moment it generates an answer.

The simplest form of test-time compute is sampling multiple times and taking a majority vote - this is self-consistency. But more sophisticated approaches go further. You can prompt the model to reason step by step before answering (chain-of-thought). You can have the model generate a candidate answer, then critique it and revise it. You can run multiple specialised agents in parallel on different aspects of a problem and aggregate their findings. You can implement search over a tree of possible reasoning paths, exploring more branches when the model is uncertain.
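As a rough illustration of the simplest technique above, self-consistency can be sketched in a few lines of Python. The `sample_answer` callable is a hypothetical stand-in for whatever model call you use; `make_toy_model` is an invented toy that answers correctly 60% of the time, not a real API.

```python
import random
from collections import Counter
from typing import Callable


def self_consistency(
    sample_answer: Callable[[str], str], question: str, n_samples: int = 5
) -> str:
    """Sample n_samples answers and return the one that appears most often."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_toy_model(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: correct ("42") 60% of the time."""
    rng = random.Random(seed)

    def sample_answer(question: str) -> str:
        return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

    return sample_answer


if __name__ == "__main__":
    # More samples means more compute per question, but a more stable vote.
    print(self_consistency(make_toy_model(), "What is 6 * 7?", n_samples=25))
```

The only knob here is `n_samples`: raising it spends more inference compute on the same fixed model, which is the core idea of test-time compute.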

OpenAI's o1 and o3 models represent the most prominent current example. These models are trained to use extended reasoning - generating long internal chains of thought before producing a final answer - and the quality of their outputs scales with how much computation is allocated to the reasoning process. Allocating more compute at test time (allowing longer thinking) consistently improves performance, including on tasks that were previously considered at the limits of what language models could do.

This has significant implications for AI capability development. If test-time compute can substitute for training compute to some degree, then a moderately-sized model with smart inference could match or exceed a much larger model with naive inference. It also suggests that some tasks are better addressed by investing in better reasoning at query time than in training larger base models.
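A toy calculation can make the substitution argument concrete. The sketch below assumes a deliberately simplified setting that real models do not satisfy exactly: every sample is either right or wrong, samples are independent, and each is right with the same probability `p`. Under those assumptions, the accuracy of a majority vote over `n` samples follows from the binomial distribution.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct,
    when each sample is correct with probability p. Use odd n to avoid ties."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical comparison: a weaker model (p = 0.6) given more samples.
    for n in (1, 5, 15, 45):
        print(n, round(majority_vote_accuracy(0.6, n), 3))
```

In this idealised model, voting only helps when `p` is above 0.5; below that, extra samples make the vote worse. Correlated errors between samples, which are common in practice, also reduce the gain.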

The cost trade-off is real: more test-time compute means slower responses and higher inference costs. But for high-stakes decisions where quality matters more than speed - debugging complex code, solving research problems, making difficult medical or legal assessments - the trade-off often favours spending more compute per query.
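The trade-off can be estimated with simple arithmetic. The sketch below is illustrative only: the class, its field names, and the price and speed figures are all hypothetical, and it assumes samples are generated one after another rather than in parallel.

```python
from dataclasses import dataclass


@dataclass
class QueryBudget:
    n_samples: int              # how many reasoning attempts per query
    tokens_per_sample: int      # length of each attempt, including reasoning
    price_per_1k_tokens: float  # hypothetical price, in any currency
    tokens_per_second: float    # hypothetical decoding speed

    def cost(self) -> float:
        """Total price of all generated tokens for one query."""
        total_tokens = self.n_samples * self.tokens_per_sample
        return total_tokens * self.price_per_1k_tokens / 1000

    def sequential_latency_s(self) -> float:
        """Seconds to generate every sample back to back (no parallelism)."""
        return self.n_samples * self.tokens_per_sample / self.tokens_per_second


if __name__ == "__main__":
    quick = QueryBudget(1, 500, 0.01, 50.0)
    careful = QueryBudget(10, 2000, 0.01, 50.0)
    for name, b in (("quick", quick), ("careful", careful)):
        print(name, b.cost(), b.sequential_latency_s())
```

Both cost and latency grow with the product of sample count and reasoning length, so doubling either one doubles the bill. Running samples in parallel reduces latency but not cost.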

Analogy

The difference between a snap judgement and a considered analysis. A snap judgement is fast but may miss nuances. Taking an hour to think through a problem, explore alternatives, and stress-test your reasoning typically produces better conclusions. Test-time compute gives AI models the equivalent of that extended thinking time - and scales the quality of output with how much thinking time is allowed.

Real-world example

When OpenAI introduced o1, it reported that the model solved about 74% of problems on the 2024 AIME (the American Invitational Mathematics Examination, a qualifying exam for the USA Mathematical Olympiad) with a single attempt per problem, and about 83% when taking a majority vote over 64 samples. GPT-4o solved about 12% on the same exams. The key change was extended test-time reasoning: o1 was trained to think for much longer before answering, and OpenAI reported that its accuracy on hard mathematical problems rose as more test-time compute was allocated.

Why it matters

Test-time compute represents a fundamental shift in how AI capability is understood and developed. It decouples capability from model size to some degree, opening up new design space: smaller models with smarter inference can compete with larger models on hard tasks. It also creates a new axis of competition - not just who has the biggest model, but who has the best approach to reasoning at inference time.

Related concepts

Inference

Using a trained AI model to make predictions or generate outputs - the fast, cheap counterpart to training's slow, expensive computation.

Prompt Engineering

The practice of carefully crafting the instructions you give an AI to get better, more reliable results - it turns out how you ask matters enormously.

Self-Consistency

A prompting technique that generates multiple independent reasoning paths to the same question and selects the answer that appears most often - dramatically improving accuracy on complex reasoning tasks.
