
Self-Consistency

Improve reasoning accuracy by sampling multiple chain-of-thought paths and taking a majority vote on the final answer.

Tags: self-consistency, voting, sampling, reasoning, reliability, ensemble

Overview

Self-Consistency extends Chain-of-Thought by recognizing that complex problems often have multiple valid reasoning paths that lead to the correct answer. Instead of relying on a single greedy CoT decode, you sample multiple reasoning chains (using temperature > 0) and take a majority vote on the final answers.

This simple technique significantly boosts accuracy on math, logic, and commonsense reasoning tasks.

When to Use

  • Tasks where Chain-of-Thought alone still makes errors
  • Math and logic problems with deterministic correct answers
  • Workloads where you can afford the extra inference cost (N calls instead of 1)
  • Decision-making where you want a confidence estimate
  • Critical outputs where reliability matters more than speed

Architecture

flowchart TB
    Q[Question + CoT Prompt] --> S1[Sample 1<br>Path A → Answer: 42]
    Q --> S2[Sample 2<br>Path B → Answer: 42]
    Q --> S3[Sample 3<br>Path C → Answer: 38]
    Q --> S4[Sample 4<br>Path D → Answer: 42]
    Q --> S5[Sample 5<br>Path E → Answer: 42]
    
    S1 --> V[Majority Vote]
    S2 --> V
    S3 --> V
    S4 --> V
    S5 --> V
    
    V --> A[Final Answer: 42<br>Confidence: 80%]
    
    style S3 fill:#1c2128,stroke:#f85149,color:#e6edf3

How It Works

  1. Prompt the LLM with a CoT prompt (zero-shot or few-shot)
  2. Sample N responses with temperature > 0 (typically 0.5-0.8)
  3. Extract the final answer from each response
  4. Vote — the most common answer wins
  5. Confidence = (count of majority answer) / N
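Steps 3–5 reduce to a frequency count. A minimal sketch of the vote and confidence computation with `collections.Counter` (the answer strings are illustrative):

```python
from collections import Counter

def majority_vote(answers):
    """Return (winning_answer, confidence) for a list of extracted answers."""
    counts = Counter(answers)
    winner, count = counts.most_common(1)[0]
    return winner, count / len(answers)

# Five sampled chains produced these final answers (as in the diagram above):
answer, confidence = majority_vote(["42", "42", "38", "42", "42"])
# answer == "42", confidence == 0.8
```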

Implementation

Example (Python)

Gotchas & Best Practices

🚨 Cost Multiplier

N=5 means 5x the cost, and 5x the latency if the calls run sequentially (parallel calls keep latency flat but still cost 5x). Balance accuracy gains against budget. For most tasks, N=3-5 is sufficient; beyond N=10, gains diminish.

⚠️ Answer Extraction Is Tricky

Different reasoning paths may express the same answer differently (“60 mph”, “60”, “sixty”). Normalize answers before voting — strip units, convert numbers, etc.
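One possible normalization pass, run on each extracted answer before voting. The word-to-digit table and regex rules here are illustrative, not exhaustive:

```python
import re

def normalize_answer(raw: str) -> str:
    """Canonicalize an extracted answer so equivalent forms vote together."""
    text = raw.strip().lower().rstrip(".")
    words_to_digits = {"sixty": "60", "forty-two": "42"}  # extend as needed
    text = words_to_digits.get(text, text)
    # Strip trailing units and thousands separators: "60 mph" -> "60"
    match = re.match(r"^\$?([\d,]+(?:\.\d+)?)", text)
    if match:
        return match.group(1).replace(",", "")
    return text

# "60 mph", "60", and "Sixty" now all count as the same vote:
assert {normalize_answer(a) for a in ["60 mph", "60", "Sixty"]} == {"60"}
```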

💡 Use Confidence as a Signal

Low consensus confidence (e.g., 40% on a 5-sample vote) is a reliable signal that the question is hard or ambiguous. Route these to a human reviewer or a more capable model.

💡 Parallelize for Speed

Send all N requests in parallel rather than sequentially. This reduces wall-clock time to approximately a single inference call while still getting the voting benefit.
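A sketch of the parallel fan-out with the standard library's `ThreadPoolExecutor`; `sample_chain` and `extract` are hypothetical callables (your API call and answer parser), as in the main example:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def self_consistency_parallel(prompt: str, sample_chain, extract, n: int = 5):
    """Fire all n samples concurrently, so wall-clock time is roughly
    one inference call. Threads suit I/O-bound API requests."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        completions = list(pool.map(sample_chain, [prompt] * n))
    answers = [extract(c) for c in completions]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

With an async client, `asyncio.gather` over n coroutines achieves the same effect.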

Variations

  • Majority Vote — Most common answer wins
  • Weighted Vote — Weight by model confidence/log-probability
  • Universal Self-Consistency — For open-ended tasks where exact answer matching fails
  • Adaptive Sampling — Stop early if consensus is already strong
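The Adaptive Sampling variation can be sketched as: draw one chain at a time and stop once the leading answer cannot be overtaken by the remaining sampling budget. `sample_chain` and `extract` are hypothetical callables, as in the earlier examples:

```python
from collections import Counter

def adaptive_self_consistency(prompt, sample_chain, extract, max_n=10):
    """Sample chains one at a time; stop early when consensus is decided."""
    counts = Counter()
    for i in range(1, max_n + 1):
        counts[extract(sample_chain(prompt))] += 1
        top = counts.most_common(2)
        lead = top[0][1]
        runner_up = top[1][1] if len(top) > 1 else 0
        if lead - runner_up > max_n - i:  # lead is already unassailable
            break
    winner, count = counts.most_common(1)[0]
    return winner, count / sum(counts.values())
```

Note that confidence is computed over the chains actually sampled, so a unanimous early stop reports 100% even though fewer than `max_n` chains ran.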

Further Reading