
Few-Shot, Chain-of-Thought, Tree-of-Thought

> Telling a model what to do is prompting. Showing it how to think is engineering. The gap between 78% and 91% accuracy on the same model, same task, same data is not a better model. It is a better reasoning strategy.

Type: Build

Languages: Python

Prerequisites: Lesson 11.01 (Prompt Engineering)

Time: ~45 minutes

Learning Objectives

Implement few-shot prompting by selecting and formatting example demonstrations that maximize task accuracy
Apply chain-of-thought (CoT) reasoning to improve accuracy on multi-step problems like math word problems
Build a tree-of-thought prompt that explores multiple reasoning paths and selects the best one
Measure the accuracy improvement from zero-shot vs few-shot vs CoT on a standard benchmark

The Problem

You build a math tutoring app. Your prompt says: "Solve this word problem." GPT-5 gets it right 94% of the time on GSM8K, the standard grade-school math benchmark. You think you have already peaked. You have not: chain-of-thought still adds 3-4 points.

Add five words -- "Let's think step by step" -- and accuracy rises to 97%. Add a few worked examples and it reaches 98%. Same model. Same temperature. Nearly the same API cost. The only difference is that you gave the model scratch paper.

This is not a hack. It is how reasoning works. Humans do not solve multi-step problems in one mental leap. Neither do transformers. When you force a model to generate intermediate tokens, those tokens become part of the context for the next token. Each reasoning step feeds the next. The model literally computes its way to the answer.

But "think step by step" is the beginning, not the end. What if you sampled five reasoning paths and took a majority vote? What if you let the model explore a tree of possibilities, evaluating and pruning branches? What if you interleaved reasoning with tool use? These are not hypotheticals. They are published techniques with measured improvements, and you will build all of them in this lesson.

The Concept

Zero-Shot vs Few-Shot: When Examples Beat Instructions

Zero-shot prompting gives the model a task and nothing else. Few-shot prompting gives it examples first.

Wei et al. (2022) measured this across 8 benchmarks. For simple tasks like sentiment classification, zero-shot and few-shot performed within 2% of each other. For complex tasks like multi-step arithmetic and symbolic reasoning, few-shot improved accuracy by 10-25%.

The intuition: examples are compressed instructions. Instead of describing the output format, you show it. Instead of explaining the reasoning process, you demonstrate it. The model pattern-matches on the examples more reliably than it interprets abstract instructions.

graph TD
    subgraph Comparison["Zero-Shot vs Few-Shot"]
        direction LR
        Z["Zero-Shot\n'Classify this review'\nModel guesses format\n78% on GSM8K"]
        F["Few-Shot\n'Here are 3 examples...\nNow classify this review'\nModel matches pattern\n85% on GSM8K"]
    end
    Z ~~~ F
    style Z fill:#1a1a2e,stroke:#e94560,color:#fff
    style F fill:#1a1a2e,stroke:#51cf66,color:#fff

When few-shot wins: format-sensitive tasks, classification, structured extraction, domain-specific jargon, any task where the model needs to match a specific pattern.

When zero-shot wins: simple factual questions, creative tasks where examples constrain creativity, tasks where finding good examples is harder than writing good instructions.

Example Selection: Similar Beats Random

Not all examples are equal. Choosing examples similar to the target input outperforms random selection by 5-15% on classification tasks (Liu et al., 2022). Three principles:

Semantic similarity: pick examples closest to the input in embedding space
Label diversity: cover all output categories in your examples
Difficulty matching: match the complexity level of the target problem

The optimal number of examples for most tasks is 3-5. Below 3, the model does not have enough signal to extract the pattern. Above 5, you hit diminishing returns and waste context window tokens. For classification with many labels, use one example per label.
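The similarity principle can be sketched in a few lines. This is a minimal illustration, not the lesson's implementation: it uses bag-of-words cosine similarity as a dependency-free stand-in for embedding similarity, and the names `bow_vector`, `cosine`, `select_examples`, and the toy `pool` are all hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=3):
    """Return the k examples whose questions are most similar to the query."""
    q = bow_vector(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow_vector(ex["question"])), reverse=True)
    return ranked[:k]


# Toy example pool (hypothetical data)
pool = [
    {"question": "A train travels 60 km in 1 hour. How far does it travel in 3 hours?", "answer": "180"},
    {"question": "Tom has 5 apples and buys 7 more apples. How many apples does he have?", "answer": "12"},
    {"question": "A rectangle is 4 m wide and 6 m long. What is its area?", "answer": "24"},
]

query = "Sara has 3 apples and buys 4 more apples. How many apples does she have?"
chosen = select_examples(query, pool, k=1)
print(chosen[0]["question"])
```

In a real pipeline the `bow_vector` step would be replaced by an embedding model, but the ranking logic stays the same.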

Chain-of-Thought: Giving Models Scratch Paper

Chain-of-Thought (CoT) prompting was introduced by Wei et al. (2022) at Google Brain. The idea is simple: instead of asking the model for just the answer, ask it to show its reasoning steps first.

graph LR
    subgraph Standard["Standard Prompting"]
        Q1["Q: Roger has 5 balls.\nHe buys 2 cans of 3.\nHow many balls?"] --> A1["A: 11"]
    end
    subgraph CoT["Chain-of-Thought Prompting"]
        Q2["Q: Roger has 5 balls.\nHe buys 2 cans of 3.\nHow many balls?"] --> R2["Roger starts with 5.\n2 cans of 3 = 6.\n5 + 6 = 11."] --> A2["A: 11"]
    end
    style Q1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style A1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style Q2 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style R2 fill:#1a1a2e,stroke:#ffa500,color:#fff
    style A2 fill:#1a1a2e,stroke:#51cf66,color:#fff

Why does this work mechanically? Each token a transformer generates becomes context for the next token. Without CoT, the model must compress all reasoning into the hidden state of a single forward pass. With CoT, the model externalizes intermediate computations as tokens. Each reasoning token extends the effective computation depth.

GSM8K benchmarks (grade-school math, 8.5K problems):

Model	Zero-Shot	Zero-Shot CoT	Few-Shot CoT
GPT-4o	78%	91%	95%
GPT-5	94%	97%	98%
o4-mini (reasoning)	97%	—	—
Claude Opus 4.7	93%	97%	98%
Gemini 3 Pro	92%	96%	98%
Llama 4 70B	80%	89%	94%
DeepSeek-V3.1	89%	94%	96%

Note on reasoning models. Models like OpenAI's o-series (o3, o4-mini) and DeepSeek-R1 run chain-of-thought internally before emitting their answer. Adding "Let's think step by step" to a reasoning model is redundant and sometimes counterproductive — they have already done it.

Two flavors of CoT:

Zero-shot CoT: append "Let's think step by step" to the prompt. No examples needed. Kojima et al. (2022) showed this single sentence improves accuracy across arithmetic, commonsense, and symbolic reasoning tasks.

Few-shot CoT: provide examples that include reasoning steps. More effective than zero-shot CoT because the model sees the exact reasoning format you expect.

When CoT hurts: simple factual recall ("What is the capital of France?"), single-step classification, tasks where speed matters more than accuracy. CoT adds 50-200 tokens of reasoning overhead per query. For high-throughput, low-complexity tasks, that is wasted cost.
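As a small illustration of the zero-shot flavor, the sketch below builds the same prompt with and without the trigger phrase. The function names and the `Q:`/`A:` layout are assumptions chosen for the example.

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_prompt(question):
    """Plain prompt: the model is asked for the answer directly."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question):
    """Same prompt with the reasoning trigger appended after 'A:'."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


question = "Roger has 5 balls. He buys 2 cans of 3. How many balls does he have now?"
print(zero_shot_cot_prompt(question))
```

The only difference between the two prompts is the trigger sentence, which makes A/B comparisons on the same question set straightforward.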

Self-Consistency: Sample Many, Vote Once

Wang et al. (2023) introduced self-consistency. The insight: a single CoT path might contain reasoning errors. But if you sample N independent reasoning paths (using temperature > 0) and take the majority vote on the final answer, errors cancel out.

graph TD
    P["Problem: 'A store has 48 apples.\nThey sell 1/3 on Monday\nand 1/4 of the rest on Tuesday.\nHow many are left?'"]
    P --> Path1["Path 1: 48 - 16 = 32\n32 - 8 = 24\nAnswer: 24"]
    P --> Path2["Path 2: 1/3 of 48 = 16\nRemaining: 32\n1/4 of 32 = 8\n32 - 8 = 24\nAnswer: 24"]
    P --> Path3["Path 3: 48/3 = 16 sold\n48 - 16 = 32\n32/4 = 8 sold\n32 - 8 = 24\nAnswer: 24"]
    P --> Path4["Path 4: Sell 1/3: 48 - 12 = 36\nSell 1/4: 36 - 9 = 27\nAnswer: 27"]
    P --> Path5["Path 5: Monday: 48 * 2/3 = 32\nTuesday: 32 * 3/4 = 24\nAnswer: 24"]
    Path1 --> V["Majority Vote\n24: 4 votes\n27: 1 vote\nFinal: 24"]
    Path2 --> V
    Path3 --> V
    Path4 --> V
    Path5 --> V
    style P fill:#1a1a2e,stroke:#ffa500,color:#fff
    style Path1 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style Path2 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style Path3 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style Path4 fill:#1a1a2e,stroke:#e94560,color:#fff
    style Path5 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style V fill:#1a1a2e,stroke:#51cf66,color:#fff

Self-consistency improved GSM8K accuracy from 56.5% (single CoT) to 74.4% with N=40 on the original PaLM 540B experiments. On GPT-5 the improvement is small (97% to 98%) because base accuracy is already saturated. The technique shines most on models with 60-85% base CoT accuracy -- the sweet spot where single-path errors are frequent but not systematic. For reasoning models (o-series, R1) self-consistency is subsumed by the built-in internal sampling.

The tradeoff: N samples means Nx the API cost and latency. In practice, N=5 captures most of the benefit. N=3 is the minimum for a meaningful vote. N > 10 has diminishing returns for most tasks.
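The voting step itself is tiny. The sketch below applies it to the five final answers from the apple problem in the diagram above; `majority_vote` is a hypothetical helper name, and the answers are hard-coded rather than sampled from a model.

```python
from collections import Counter


def majority_vote(answers):
    """Return (winning answer, share of votes), or (None, 0.0) if there are none."""
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Final answers of the five sampled paths in the apple problem
paths = ["24", "24", "24", "27", "24"]
answer, agreement = majority_vote(paths)
print(answer, agreement)  # 24 0.8
```

The agreement share doubles as a rough confidence signal: one faulty path out of five lowers it to 0.8 without changing the final answer.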

Tree-of-Thought: Branching Exploration

Yao et al. (2023) introduced Tree-of-Thought (ToT). Where CoT follows one linear reasoning path, ToT explores multiple branches and evaluates which are most promising before continuing.

graph TD
    Root["Problem"] --> B1["Thought 1a"]
    Root --> B2["Thought 1b"]
    Root --> B3["Thought 1c"]
    B1 --> E1["Eval: 0.8"]
    B2 --> E2["Eval: 0.3"]
    B3 --> E3["Eval: 0.9"]
    E1 -->|Continue| B1a["Thought 2a"]
    E1 -->|Continue| B1b["Thought 2b"]
    E3 -->|Continue| B3a["Thought 2a"]
    E3 -->|Continue| B3b["Thought 2b"]
    E2 -->|Prune| X["X"]
    B1a --> E4["Eval: 0.7"]
    B3a --> E5["Eval: 0.95"]
    E5 -->|Best path| Final["Solution"]
    style Root fill:#1a1a2e,stroke:#ffa500,color:#fff
    style E2 fill:#1a1a2e,stroke:#e94560,color:#fff
    style X fill:#1a1a2e,stroke:#e94560,color:#fff
    style E5 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style Final fill:#1a1a2e,stroke:#51cf66,color:#fff
    style B1 fill:#1a1a2e,stroke:#808080,color:#fff
    style B2 fill:#1a1a2e,stroke:#808080,color:#fff
    style B3 fill:#1a1a2e,stroke:#808080,color:#fff
    style B1a fill:#1a1a2e,stroke:#808080,color:#fff
    style B1b fill:#1a1a2e,stroke:#808080,color:#fff
    style B3a fill:#1a1a2e,stroke:#808080,color:#fff
    style B3b fill:#1a1a2e,stroke:#808080,color:#fff
    style E1 fill:#1a1a2e,stroke:#808080,color:#fff
    style E3 fill:#1a1a2e,stroke:#808080,color:#fff
    style E4 fill:#1a1a2e,stroke:#808080,color:#fff

ToT has three components:

Thought generation: produce multiple candidate next-steps
State evaluation: score each candidate (can use the LLM itself as evaluator)
Search algorithm: BFS or DFS through the tree, pruning low-scoring branches

On the Game of 24 task (combine 4 numbers using arithmetic to make 24), GPT-4 with standard prompting solves 7.3% of problems. With CoT, 4.0% (CoT actually hurts here because the search space is wide). With ToT, 74%.

ToT is expensive. Each node in the tree requires an LLM call. A tree with branching factor 3 and depth 3 requires up to 39 LLM calls. Use it only for problems where the search space is large but evaluatable -- planning, puzzle solving, creative problem-solving with constraints.
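The 39 comes from counting every node in a full tree: 3 + 9 + 27. The sketch below reproduces that count and, as an assumption for comparison, shows how pruning to the top 2 candidates per level shrinks it. Both helper names are hypothetical, and the counts cover nodes only; if each node needs one call to generate and another to score, real call counts can be higher.

```python
def full_tree_nodes(branching, depth):
    """Non-root nodes in a full tree: b + b^2 + ... + b^depth."""
    return sum(branching ** level for level in range(1, depth + 1))


def beam_nodes(branching, depth, beam_width):
    """Nodes generated when only the top `beam_width` thoughts are expanded per level."""
    total = branching  # level 1: expand the root
    frontier = min(beam_width, branching)
    for _ in range(depth - 1):
        generated = frontier * branching
        total += generated
        frontier = min(beam_width, generated)
    return total


print(full_tree_nodes(3, 3))  # 3 + 9 + 27 = 39
print(beam_nodes(3, 3, 2))    # 3 + 6 + 6 = 15
```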

ReAct: Thinking + Doing

Yao et al. (2022) combined reasoning traces with actions. The model alternates between thinking (generating reasoning) and acting (calling tools, searching, computing).

graph LR
    Q["Question:\nWhat is the\npopulation of the\ncountry where\nthe Eiffel Tower\nis located?"]
    T1["Thought: I need to\nfind which country\nhas the Eiffel Tower"]
    A1["Action: search\n'Eiffel Tower location'"]
    O1["Observation:\nParis, France"]
    T2["Thought: Now I need\nFrance's population"]
    A2["Action: search\n'France population 2024'"]
    O2["Observation:\n68.4 million"]
    T3["Thought: I have\nthe answer"]
    F["Answer:\n68.4 million"]
    Q --> T1 --> A1 --> O1 --> T2 --> A2 --> O2 --> T3 --> F
    style Q fill:#1a1a2e,stroke:#ffa500,color:#fff
    style T1 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style A1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style O1 fill:#1a1a2e,stroke:#808080,color:#fff
    style T2 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style A2 fill:#1a1a2e,stroke:#e94560,color:#fff
    style O2 fill:#1a1a2e,stroke:#808080,color:#fff
    style T3 fill:#1a1a2e,stroke:#51cf66,color:#fff
    style F fill:#1a1a2e,stroke:#51cf66,color:#fff

ReAct can outperform pure CoT on knowledge-intensive tasks because it grounds its reasoning in real data. On HotpotQA (multi-hop question answering), the best configuration in the original paper's PaLM 540B experiments, ReAct backed by CoT with self-consistency, reaches 35.1% exact match vs 29.4% for CoT alone. The real power is that reasoning errors get corrected by observations -- the model can update its plan mid-execution.

ReAct is the foundation of modern AI agents. Every agent framework (LangChain, CrewAI, AutoGen) implements some variant of the Thought-Action-Observation loop. You will build full agents in Phase 14. This lesson covers the prompting pattern.
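The Thought-Action-Observation loop can be sketched without a real model. In the example below, `scripted_model` is a hypothetical stand-in for an LLM call, the `search` tool is backed by a fixed table that mirrors the Eiffel Tower diagram above, and the `Action: tool[argument]` text format is an assumption chosen for easy parsing.

```python
import re

# Toy "search" tool backed by a fixed table (hypothetical demo data)
FACTS = {
    "Eiffel Tower location": "Paris, France",
    "France population 2024": "68.4 million",
}


def search(query):
    return FACTS.get(query, "no result")


TOOLS = {"search": search}
ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
ANSWER_RE = re.compile(r"Answer:\s*(.+)")


def react_loop(question, model, max_steps=5):
    """Alternate model turns and tool calls until the model emits an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        turn = model(transcript)
        transcript += turn + "\n"
        answer = ANSWER_RE.search(turn)
        if answer:
            return answer.group(1).strip(), transcript
        action = ACTION_RE.search(turn)
        if action:
            tool, arg = action.group(1), action.group(2)
            observation = TOOLS[tool](arg) if tool in TOOLS else f"unknown tool {tool}"
            # The observation is appended so the next turn can reason over it
            transcript += f"Observation: {observation}\n"
    return None, transcript


def scripted_model(transcript):
    """Stand-in for an LLM: picks the next turn from what has been observed so far."""
    if "68.4 million" in transcript:
        return "Thought: I have the answer.\nAnswer: 68.4 million"
    if "Paris, France" in transcript:
        return "Thought: Now I need France's population.\nAction: search[France population 2024]"
    return "Thought: I need to find which country has the Eiffel Tower.\nAction: search[Eiffel Tower location]"


answer, trace = react_loop(
    "What is the population of the country where the Eiffel Tower is located?",
    scripted_model,
)
print(trace)
```

Swapping `scripted_model` for a function that sends the transcript to a real model turns this into a working, if minimal, ReAct loop; `max_steps` guards against a model that never answers.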

Structured Prompting: XML Tags, Delimiters, Headers

As prompts get complex, structure prevents the model from confusing sections. Three approaches:

XML tags (works best with Claude, solid everywhere):

<context>
You are reviewing a pull request.
The codebase uses TypeScript and React.
</context>

<task>
Review the following diff for bugs, security issues, and style violations.
</task>

<diff>
{diff_content}
</diff>

<output_format>
List each issue with: file, line, severity (critical/warning/info), description.
</output_format>

Markdown headers (universal):

## Role
Senior security engineer at a fintech company.

## Task
Analyze this API endpoint for vulnerabilities.

## Input
{api_code}

## Rules
- Focus on OWASP Top 10
- Rate each finding: critical, high, medium, low
- Include remediation steps

Delimiters (minimal but effective):

---INPUT---
{user_text}
---END INPUT---

---INSTRUCTIONS---
Summarize the above in 3 bullet points.
---END INSTRUCTIONS---
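Building these prompts in code keeps the structure consistent across calls. A minimal sketch for the XML-tag variant, using the pull-request review sections shown above; `xml_section` and `build_review_prompt` are hypothetical helper names.

```python
def xml_section(tag, body):
    """Wrap body text in a matching pair of XML-style tags."""
    return f"<{tag}>\n{body.strip(chr(10))}\n</{tag}>"


def build_review_prompt(diff_content):
    """Assemble the four tagged sections into one prompt string."""
    sections = [
        ("context", "You are reviewing a pull request.\nThe codebase uses TypeScript and React."),
        ("task", "Review the following diff for bugs, security issues, and style violations."),
        ("diff", diff_content),
        ("output_format", "List each issue with: file, line, severity (critical/warning/info), description."),
    ]
    return "\n\n".join(xml_section(tag, body) for tag, body in sections)


prompt = build_review_prompt("- const x = 1\n+ const x = '1'")
print(prompt)
```

Only surrounding newlines are stripped from each body, so leading spaces in diff context lines survive. Untrusted input that contains a closing tag such as `</diff>` could still blur section boundaries, so it may be worth escaping or rejecting in production.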

Prompt Chaining: Sequential Decomposition

Some tasks are too complex for a single prompt. Prompt chaining breaks them into steps, where the output of one prompt becomes the input of the next.

graph LR
    I["Raw Input"] --> P1["Prompt 1:\nExtract\nkey facts"]
    P1 --> O1["Facts"]
    O1 --> P2["Prompt 2:\nAnalyze\nfacts"]
    P2 --> O2["Analysis"]
    O2 --> P3["Prompt 3:\nGenerate\nrecommendation"]
    P3 --> F["Final Output"]
    style I fill:#1a1a2e,stroke:#808080,color:#fff
    style P1 fill:#1a1a2e,stroke:#e94560,color:#fff
    style O1 fill:#1a1a2e,stroke:#ffa500,color:#fff
    style P2 fill:#1a1a2e,stroke:#e94560,color:#fff
    style O2 fill:#1a1a2e,stroke:#ffa500,color:#fff
    style P3 fill:#1a1a2e,stroke:#e94560,color:#fff
    style F fill:#1a1a2e,stroke:#51cf66,color:#fff

Chaining beats single-prompt for three reasons:

Each step is simpler: the model handles one focused task instead of juggling everything
Intermediate outputs are inspectable: you can validate and correct between steps
Different steps can use different models: use a cheap model for extraction, an expensive one for reasoning
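A chain runner that keeps every intermediate output makes the second point concrete. In this sketch the three steps are plain Python stubs standing in for LLM calls (an assumption to keep it runnable offline); `run_chain` and the step functions are hypothetical names.

```python
def run_chain(raw_input, steps):
    """Feed each step's output into the next and keep every intermediate result."""
    outputs = []
    current = raw_input
    for name, step in steps:
        current = step(current)
        outputs.append((name, current))  # inspectable between steps
    return current, outputs


# Stub steps; each would normally format a prompt, call a model, and parse its reply
def extract_facts(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def analyze(facts):
    return {"fact_count": len(facts), "mentions_delay": any("late" in f for f in facts)}


def recommend(analysis):
    return "Offer a refund" if analysis["mentions_delay"] else "No action needed"


final, trace = run_chain(
    "Order arrived two weeks late. Customer asked for help.",
    [("extract", extract_facts), ("analyze", analyze), ("recommend", recommend)],
)
for name, output in trace:
    print(name, "->", output)
```

Because each step is just a callable, a validation check or a different model can be slotted in at any position without touching the others.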

Performance Comparison

Technique	Best For	GSM8K Accuracy (GPT-5)	API Calls	Token Overhead	Complexity
Zero-Shot	Simple tasks	94%	1	None	Trivial
Few-Shot	Format matching	96%	1	200-500 tokens	Low
Zero-Shot CoT	Quick reasoning boost	97%	1	50-200 tokens	Trivial
Few-Shot CoT	Maximum single-call accuracy	98%	1	300-600 tokens	Low
Self-Consistency (N=5)	High-stakes reasoning	98.5%	5	5x token cost	Medium
Reasoning model (o4-mini)	Drop-in CoT replacement	97%	1	hidden (2-10x internal)	Trivial
Tree-of-Thought	Search/planning problems	N/A (74% on Game of 24)	10-40+	10-40x token cost	High
ReAct	Knowledge-grounded reasoning	N/A (35.1% on HotpotQA)	3-10+	Variable	High
Prompt Chaining	Complex multi-step tasks	96% (pipeline)	2-5	2-5x token cost	Medium

The right technique depends on three factors: accuracy requirement, latency budget, and cost tolerance. For most production systems, few-shot CoT with a 3-sample self-consistency fallback covers 90% of use cases.

Build It

We will build a math problem solver that combines few-shot prompting, chain-of-thought reasoning, and self-consistency voting into a single pipeline. Then we will add tree-of-thought for hard problems.

The full implementation is in code/advanced_prompting.py. Here are the key components.

Step 1: Few-Shot Example Store

The first component manages few-shot examples and selects the most relevant ones for a given problem.

GSM8K_EXAMPLES = [
    {
        "question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells every egg at the farmers' market for $2. How much does she make every day at the farmers' market?",
        "reasoning": "Janet's ducks lay 16 eggs per day. She eats 3 and bakes 4, using 3 + 4 = 7 eggs. So she has 16 - 7 = 9 eggs left. She sells each for $2, so she makes 9 * 2 = $18 per day.",
        "answer": "18"
    },
    ...
]

Each example has three parts: the question, the reasoning chain, and the final answer. The reasoning chain is what transforms a regular few-shot example into a CoT few-shot example.

Step 2: Chain-of-Thought Prompt Builder

The prompt builder assembles a system message, few-shot examples with reasoning chains, and the target question into a single prompt.

def build_cot_prompt(question, examples, num_examples=3):
    system = (
        "You are a math problem solver. "
        "For each problem, show your step-by-step reasoning, "
        "then give the final numerical answer on the last line "
        "in the format: 'The answer is [number]'."
    )

    example_text = ""
    for ex in examples[:num_examples]:
        example_text += f"Q: {ex['question']}\n"
        example_text += f"A: {ex['reasoning']} The answer is {ex['answer']}.\n\n"

    user = f"{example_text}Q: {question}\nA:"
    return system, user

The format constraint ("The answer is [number]") is critical. Without it, self-consistency cannot extract and compare answers across samples.
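The later steps call an `extract_answer` helper whose implementation is not shown in this section. One possible version, offered as a sketch rather than the code in code/advanced_prompting.py, is a regex that keys on the "The answer is" marker; the pattern and the normalization rules are assumptions.

```python
import re

# Matches "The answer is 18", "the answer is $1,250.50", "The answer is -5"
ANSWER_PATTERN = re.compile(
    r"The answer is\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)", re.IGNORECASE
)


def extract_answer(text):
    """Return the last 'The answer is N' number as a normalized string, or None."""
    matches = ANSWER_PATTERN.findall(text or "")
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Normalize so "18" and "18.0" count as the same vote
    return str(int(value)) if value.is_integer() else str(value)


print(extract_answer("She has 9 eggs left, so 9 * 2 = 18. The answer is 18."))
print(extract_answer("I am not sure."))
```

Taking the last match matters for few-shot outputs, where a model sometimes restates an intermediate "the answer is" before its final line. Returning `None` for unparseable text lets the caller skip that sample instead of voting on garbage.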

Step 3: Self-Consistency Voting

Sample N reasoning paths and take the majority answer.

from collections import Counter


def self_consistency_solve(question, examples, client, model, n_samples=5):
    system, user = build_cot_prompt(question, examples)

    answers = []
    reasonings = []
    for _ in range(n_samples):
        # temperature > 0 so each sample can follow a different reasoning path
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user}
            ],
            temperature=0.7
        )
        text = response.choices[0].message.content
        reasonings.append(text)
        answer = extract_answer(text)
        if answer is not None:
            answers.append(answer)

    # Majority vote over the answers that could be parsed
    vote_counts = Counter(answers)
    best_answer = vote_counts.most_common(1)[0][0] if vote_counts else None
    # Compare against None explicitly so a legitimate answer of 0 still counts
    confidence = vote_counts[best_answer] / len(answers) if best_answer is not None else 0.0

    return best_answer, confidence, reasonings, vote_counts

Temperature 0.7 is important. At temperature 0.0, all N samples would be identical, defeating the purpose. You need enough randomness for diverse reasoning paths but not so much that the model produces gibberish.

Step 4: Tree-of-Thought Solver

For problems where linear reasoning fails, ToT explores multiple approaches and evaluates which direction is most promising.

def tree_of_thought_solve(question, client, model, breadth=3, depth=3):
    # Level 1: generate candidate first steps and score each with the LLM evaluator
    thoughts = generate_initial_thoughts(question, client, model, breadth)
    scored = [(t, evaluate_thought(t, question, client, model)) for t in thoughts]
    scored.sort(key=lambda x: x[1], reverse=True)

    for _ in range(1, depth):
        next_thoughts = []
        # Expand only the two highest-scoring thoughts; the rest are pruned
        for thought, _score in scored[:2]:
            extensions = extend_thought(thought, question, client, model, breadth)
            for ext in extensions:
                ext_score = evaluate_thought(ext, question, client, model)
                next_thoughts.append((ext, ext_score))
        if not next_thoughts:
            # Nothing could be extended; keep the best thoughts found so far
            break
        scored = sorted(next_thoughts, key=lambda x: x[1], reverse=True)

    best_thought = scored[0][0] if scored else ""
    return extract_answer(best_thought), best_thought

The evaluator is itself an LLM call. You ask the model: "On a scale of 0.0 to 1.0, how promising is this reasoning path for solving the problem?" This is the key insight of ToT -- the model evaluates its own partial solutions.
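The evaluator's reply is free text, so the score has to be parsed defensively. A minimal sketch of the two pieces involved; `build_eval_prompt` and `parse_score` are hypothetical helpers, not the lesson's `evaluate_thought`, and the parser assumes the model replies with little more than the number.

```python
import re


def build_eval_prompt(question, thought):
    """Ask the model to rate a partial reasoning path with a single number."""
    return (
        f"Problem: {question}\n"
        f"Partial reasoning: {thought}\n"
        "On a scale of 0.0 to 1.0, how promising is this reasoning path "
        "for solving the problem? Reply with the number only."
    )


def parse_score(evaluator_text, default=0.0):
    """Take the first number in the reply and clamp it to [0.0, 1.0]."""
    match = re.search(r"\d*\.?\d+", evaluator_text or "")
    if not match:
        return default
    return max(0.0, min(1.0, float(match.group())))


print(parse_score("0.8"))
print(parse_score("Score: 0.95 (promising)"))
print(parse_score("no idea"))
```

Falling back to a low default means an unparseable evaluation gets pruned rather than crashing the search. Taking the first number is fragile if the model echoes the scale back, which is why the prompt asks for the number only.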

Step 5: Full Pipeline

The pipeline combines all techniques with an escalation strategy.

def solve_with_escalation(question, examples, client, model):
    # Stage 1: few-shot CoT with self-consistency (5 samples)
    sc_answer, confidence, _, _ = self_consistency_solve(
        question, examples, client, model, n_samples=5
    )

    # At least 4 of 5 samples agree: accept the majority answer
    if confidence >= 0.8:
        return sc_answer, "self_consistency", confidence

    # Stage 2: low agreement, so spend more compute on tree search
    tot_answer, _ = tree_of_thought_solve(question, client, model)
    return tot_answer, "tree_of_thought", None

The escalation logic: run the cheaper technique first. If self-consistency confidence is below 0.8 (fewer than 4 of 5 samples agree), escalate to ToT. This balances cost and accuracy -- most problems stop after five samples, and only hard problems get more compute.

Use It

With LangChain

LangChain provides built-in support for prompt templates and output parsing that simplify few-shot and CoT patterns:

from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import ChatOpenAI

example_prompt = PromptTemplate(
    input_variables=["question", "reasoning", "answer"],
    template="Q: {question}\nA: {reasoning} The answer is {answer}."
)

few_shot_prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    suffix="Q: {input}\nA: Let's think step by step.",
    input_variables=["input"]
)

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
chain = few_shot_prompt | llm
result = chain.invoke({"input": "If a train travels 120 km in 2 hours..."})

LangChain also has ExampleSelector classes for semantic similarity selection:

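from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# from_examples needs a vector store class to index the examples;
# InMemoryVectorStore avoids extra dependencies (FAISS or Chroma also work)
selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    InMemoryVectorStore,
    k=3
)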

With DSPy

DSPy treats prompting strategies as optimizable modules. Instead of handcrafting CoT prompts, you define a signature and let DSPy optimize the prompt:

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o", temperature=0.7))

class MathSolver(dspy.Module):
    def __init__(self):
        super().__init__()
        # "question -> answer" is the signature: one input field, one output field
        self.solve = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.solve(question=question)

solver = MathSolver()
result = solver(question="Janet's ducks lay 16 eggs per day...")

DSPy's ChainOfThought automatically adds reasoning traces. dspy.majority implements self-consistency:

result = dspy.majority(
    [solver(question=q) for _ in range(5)],
    field="answer"
)

Comparison: From-Scratch vs Frameworks

Feature	From-Scratch (this lesson)	LangChain	DSPy
Control over prompt format	Full	Template-based	Automatic
Self-consistency	Manual voting	Manual	Built-in (dspy.majority)
Example selection	Custom logic	ExampleSelector	dspy.BootstrapFewShot
Tree-of-Thought	Custom tree search	Community chains	Not built-in
Prompt optimization	Manual iteration	Manual	Automatic compilation
Best for	Learning, custom pipelines	Standard workflows	Research, optimization

Ship It

This lesson produces two artifacts.

1. Reasoning Chain Prompt (outputs/prompt-reasoning-chain.md): a production-ready prompt template for few-shot CoT with self-consistency. Plug in your examples and problem domain.

2. CoT Pattern Selection Skill (outputs/skill-cot-patterns.md): a decision framework for choosing the right reasoning technique based on task type, accuracy requirements, and cost constraints.

Exercises

Measure the gap: Take 10 GSM8K problems. Solve each with zero-shot, few-shot, zero-shot CoT, and few-shot CoT. Record accuracy for each. Which technique gives the biggest lift on your model?

Example selection experiment: For the same 10 problems, compare random example selection vs hand-picked similar examples. Measure accuracy difference. At what point does example quality matter more than example quantity?

Self-consistency cost curve: Run self-consistency with N=1, 3, 5, 7, 10 on 20 GSM8K problems. Plot accuracy vs cost (total tokens). Where is the knee of the curve for your model?

Build a ReAct loop: Extend the pipeline with a calculator tool. When the model generates a math expression, execute it with Python's eval() (in a sandbox) and feed the result back. Measure if tool-grounded reasoning outperforms pure CoT.

ToT for creative tasks: Adapt the Tree-of-Thought solver for a creative writing task: "Write a 6-word story that is both funny and sad." Use the LLM as evaluator. Does branching exploration produce better creative outputs than single-shot generation?

Key Terms

Term	What people say	What it actually means
Few-shot prompting	"Give it some examples"	Including input-output demonstrations in the prompt to anchor the model's output format and behavior
Chain-of-Thought	"Make it think step by step"	Eliciting intermediate reasoning tokens that extend the model's effective computation before producing a final answer
Self-Consistency	"Run it multiple times"	Sampling N diverse reasoning paths at temperature > 0 and selecting the most common final answer by majority vote
Tree-of-Thought	"Let it explore options"	Structured search over reasoning branches where each partial solution is evaluated and only promising paths are expanded
ReAct	"Thinking + tool use"	Interleaving reasoning traces with external actions (search, compute, API calls) in a Thought-Action-Observation loop
Prompt chaining	"Break it into steps"	Decomposing a complex task into sequential prompts where each output feeds the next input
Zero-shot CoT	"Just add 'think step by step'"	Appending a reasoning trigger phrase to a prompt without any examples, relying on the model's latent reasoning capability