Self-Consistency
Improve reasoning accuracy by sampling multiple chain-of-thought paths and taking a majority vote on the final answer.
Overview
Self-Consistency extends Chain-of-Thought by recognizing that complex problems often have multiple valid reasoning paths that lead to the correct answer. Instead of relying on a single greedy CoT decode, you sample multiple reasoning chains (using temperature > 0) and take a majority vote on the final answers.
This simple technique significantly boosts accuracy on math, logic, and commonsense reasoning tasks.
When to Use
- Tasks where Chain-of-Thought alone still makes errors
- Math and logic problems with deterministic correct answers
- You can afford extra inference cost (N calls instead of 1)
- Decision-making where you want a confidence estimate
- Critical outputs where reliability matters more than speed
Architecture
flowchart TB
Q[Question + CoT Prompt] --> S1[Sample 1<br>Path A → Answer: 42]
Q --> S2[Sample 2<br>Path B → Answer: 42]
Q --> S3[Sample 3<br>Path C → Answer: 38]
Q --> S4[Sample 4<br>Path D → Answer: 42]
Q --> S5[Sample 5<br>Path E → Answer: 42]
S1 --> V[Majority Vote]
S2 --> V
S3 --> V
S4 --> V
S5 --> V
V --> A[Final Answer: 42<br>Confidence: 80%]
style S3 fill:#1c2128,stroke:#f85149,color:#e6edf3
How It Works
- Send the model a CoT prompt (zero-shot or few-shot)
- Sample N responses with temperature > 0 (typically 0.5-0.8)
- Extract the final answer from each response
- Vote — the most common answer wins
- Confidence = (count of majority answer) / N
Implementation
Gotchas & Best Practices
N=5 means 5x the inference cost, and 5x the latency if the calls run sequentially (parallel calls keep latency near a single request but still cost 5x). Balance accuracy gains against budget: for most tasks N=3-5 is sufficient, and beyond N=10 the gains diminish.
Different reasoning paths may express the same answer differently (“60 mph”, “60”, “sixty”). Normalize answers before voting — strip units, convert numbers, etc.
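A sketch of such a normalizer. The unit list and number-word map below are illustrative placeholders, not an exhaustive set; extend them for your domain.

```python
import re

def normalize_answer(raw: str) -> str:
    """Canonicalize an extracted answer so equivalent forms vote together.
    Handles case, whitespace, a few trailing units, number words, and
    numeric formatting; the mappings here are examples only."""
    text = raw.strip().lower()
    # Map common number words to digits (extend as needed).
    words = {"zero": "0", "one": "1", "two": "2", "sixty": "60"}
    text = words.get(text, text)
    # Strip a trailing unit like "mph", "km", "%".
    text = re.sub(r"\s*(mph|km/h|km|kg|%|dollars?)$", "", text)
    # Drop thousands separators and a trailing ".0".
    text = text.replace(",", "")
    if re.fullmatch(r"\d+\.0+", text):
        text = text.split(".")[0]
    return text.strip()
```

Run every extracted answer through this before counting votes, so "60 mph", "Sixty", and "60" all land in the same bucket.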
Low consensus confidence (e.g., 40% on a 5-sample vote) is a reliable signal that the question is hard or ambiguous. Route these to a human reviewer or a more capable model.
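A small routing helper along those lines. The 0.6 threshold is an illustrative default to tune per task, not a recommended value from any benchmark.

```python
def route(answer, confidence: float, threshold: float = 0.6) -> dict:
    """Accept the voted answer only when consensus is strong enough;
    otherwise escalate (e.g., to a human reviewer or a stronger model).
    threshold is an assumed default; calibrate it on your own data."""
    if answer is not None and confidence >= threshold:
        return {"action": "accept", "answer": answer}
    return {"action": "escalate", "answer": answer, "confidence": confidence}
```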
Send all N requests in parallel rather than sequentially. This reduces wall-clock time to approximately a single inference call while still getting the voting benefit.
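One way to do that with the standard library. Threads work here because LLM calls are I/O-bound; `sample_chain` again stands in for a real (thread-safe) API client call.

```python
from concurrent.futures import ThreadPoolExecutor

def sample_parallel(sample_chain, question: str, n: int = 5) -> list[str]:
    """Fire all n sampling requests concurrently and collect the
    completions. Wall-clock time is roughly one call's latency."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(sample_chain, question) for _ in range(n)]
        return [f.result() for f in futures]
```

The returned list feeds directly into the extract-and-vote step.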
Variations
- Majority Vote — Most common answer wins
- Weighted Vote — Weight by model confidence/log-probability
- Universal Self-Consistency — For open-ended tasks where exact answer matching fails
- Adaptive Sampling — Stop early if consensus is already strong
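The adaptive variant can be sketched as follows: stop once the leading answer's margin exceeds the remaining sample budget, so no later vote could change the outcome. The helper names (`sample_chain`, `extract`) are hypothetical stand-ins for an LLM call and an answer parser.

```python
from collections import Counter

def adaptive_self_consistency(sample_chain, extract, question: str,
                              max_n: int = 10, min_n: int = 3):
    """Early-stopping self-consistency: draw samples one at a time and
    halt when the leader can no longer be overtaken by the remaining
    budget. Returns (answer, confidence, samples_used)."""
    counts = Counter()
    i = 0
    for i in range(1, max_n + 1):
        ans = extract(sample_chain(question))
        if ans is not None:
            counts[ans] += 1
        if i >= min_n and counts:
            (leader, lead), = counts.most_common(1)
            runner_up = max((c for a, c in counts.items() if a != leader),
                            default=0)
            remaining = max_n - i
            if lead > runner_up + remaining:  # lead is now unassailable
                break
    if not counts:
        return None, 0.0, i
    (winner, count), = counts.most_common(1)
    return winner, count / i, i
```

With a unanimous model and `max_n=10`, this stops after 6 samples (a lead of 6 beats the 4 samples still in the budget), saving 40% of the calls.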