Concept

Self-Consistency

A prompting technique that generates multiple independent reasoning paths to the same question and selects the answer that appears most often - dramatically improving accuracy on complex reasoning tasks.

Added May 18, 2026

Language models are probabilistic: the same prompt can produce different outputs on different runs, because each token is sampled from a probability distribution. For most tasks this variability is fine or even desirable. But for tasks requiring precise reasoning - maths problems, logic puzzles, multi-step question answering - variability is a liability. Any single reasoning chain can make a mistake at any step, and a wrong intermediate conclusion produces a wrong final answer.

Self-consistency, introduced by researchers at Google Brain in 2022, exploits the probabilistic nature of language models as a feature rather than a bug. Instead of generating one answer, generate many - typically 10 to 40 different reasoning chains for the same question. Each chain may take different approaches, make different intermediate inferences, or phrase steps differently. Then look at the final answers across all chains and select the one that appears most often.

The intuition is that there are many ways to make a mistake but fewer ways to get the right answer. Different reasoning paths will make different errors, and if you have enough of them, the correct answer will tend to be the one that multiple independent chains converge on. Errors are distributed across paths while correct answers concentrate.

The empirical results were striking. On mathematical reasoning benchmarks, self-consistency improved chain-of-thought prompting by 10 to 20 percentage points - without any changes to the model, just by sampling multiple times and taking a majority vote. On grade-school maths, accuracy jumped from around 56% with single chain-of-thought to over 74% with self-consistency. Similar improvements appeared across arithmetic, commonsense reasoning, and symbolic tasks.

The trade-off is cost: generating 20 reasoning chains uses 20 times the tokens of generating one, which means 20 times the inference cost. For tasks where accuracy is critical - medical reasoning, mathematical problem-solving, legal analysis - this cost may be justified. For general conversational use, it is not. Self-consistency sits at the precision end of a quality-cost trade-off.

Analogy

Asking five independent experts the same question and going with whichever answer the majority gives. Any individual expert might make an error. But if four out of five arrive at the same conclusion through different reasoning, you can be substantially more confident in that answer than in any single expert's opinion. Self-consistency applies this intuition to the many possible reasoning paths a language model can generate.

Real-world example

Medical diagnosis AI systems sometimes apply self-consistency principles: rather than generating a single diagnostic reasoning chain, the system produces multiple chains that each consider different symptom combinations and differential diagnoses. If eight out of ten chains arrive at the same primary diagnosis despite taking different paths, that convergence is a stronger signal than any single chain. The approach is particularly valuable where false confidence in an incorrect answer carries high stakes.

Why it matters

Self-consistency reveals that reasoning quality is not a fixed property of a model but can be extracted more or less effectively depending on how you query it. The same model produces substantially different accuracy levels depending on whether you sample once or many times and aggregate. This insight - that inference-time compute can substitute for model capability - has become foundational to a broader understanding of test-time scaling.

In the news

Swiss Cloud Company Transfers Majority Voting Rights to Public-Interest Foundation
Hacker News · 4h ago

Related concepts

Hallucination Prompt Engineering Test-Time Compute

← Back to concepts