CCA-F Study Day 14/20: Explicit Criteria & Few-Shot Prompting

Domain 4: Prompt Engineering & Structured Output (~20% of exam)

📌 Today's Focus

Welcome to Domain 4 — Prompt Engineering & Structured Output. This domain carries ~20% of the exam weight and is where architectural decisions meet prompt craftsmanship. Today's focus is the foundation layer: explicit criteria, few-shot (multishot) prompting, XML tag structuring, and chain of thought.

Why this matters: The exam presents scenarios where you need to design prompts that produce reliable, structured results at scale. The wrong answers will tempt you with "clever" shortcuts that work in demos but fail in production. The right answers always involve explicit criteria + examples + structure.

📚 Core Concepts

1. Explicit Criteria — Tell Claude Exactly What "Good" Looks Like

The #1 prompt engineering principle for production systems: be explicit about success criteria. This isn't about being verbose — it's about removing ambiguity.

Anthropic's framework for prompt engineering:

Define success criteria — What does a correct output look like?
Establish empirical tests — How will you measure correctness?
Write a first draft prompt — Then iterate based on test results

Key principles the exam tests:

Be clear and direct — state exactly what you want
Specify constraints: format, length, style, edge case behavior
Define what success AND failure look like
Use role prompting via system prompts for consistent persona

# ❌ Vague prompt (ANTI-PATTERN — exam trap!)
"Analyze this customer feedback and give me insights"

# ✅ Explicit criteria (CORRECT — what the exam wants)
"""Analyze the customer feedback below and produce a JSON object with:
- sentiment: one of "positive", "negative", "neutral", "mixed"
- topics: array of 1-3 topic codes from this list: [billing, product, shipping, support]
- urgency: "high" if the customer mentions a deadline, legal action, or cancellation within 24h; "medium" if frustrated; "low" otherwise
- action_required: boolean, true if the customer is requesting something specific

If the feedback is ambiguous, default to "neutral" sentiment and "low" urgency.
Never infer topics not explicitly mentioned in the feedback."""

🚨 Exam Trap: The exam will present a scenario where Claude's output is inconsistent. The wrong answer will be "add more examples." The right answer is often "make the criteria more explicit" — examples help, but clear rules are the foundation.

2. Few-Shot (Multishot) Prompting — Show, Don't Just Tell

Few-shot prompting means providing 3-5 examples of input→output pairs that demonstrate exactly what you want. Anthropic calls this "multishot prompting" and considers it one of the most powerful techniques for format consistency.

Why it works: Claude pattern-matches against your examples. If your examples are consistent, Claude's output will be consistent. If they're sloppy, Claude's output will be sloppy.

Best practices (exam-relevant):

Use XML tags to clearly delineate examples from instructions
Include edge cases in your examples (not just happy path)
Keep format identical across all examples — exact same JSON keys, same field order
Show boundary conditions (what happens with ambiguous input)
3-5 examples is optimal — more doesn't always help and consumes context

<examples>
  <example>
    <input>Customer says: "I want to cancel my subscription"</input>
    <ideal_output>{"intent": "cancellation", "sentiment": "neutral", "urgency": "medium", "action_required": true}</ideal_output>
    <reasoning>Cancellation is a specific action request (action_required=true). No emotional language, so neutral. Medium urgency because it implies pending change.</reasoning>
  </example>
  <example>
    <input>Customer says: "This is broken and I need it fixed NOW or I'm calling my lawyer"</input>
    <ideal_output>{"intent": "bug_report", "sentiment": "negative", "urgency": "high", "action_required": true}</ideal_output>
    <reasoning>Mentions legal action → high urgency. Strong negative language. Asks for a fix (action_required=true).</reasoning>
  </example>
  <example>
    <input>Customer says: "Just wondering if you have a mobile app"</input>
    <ideal_output>{"intent": "inquiry", "sentiment": "neutral", "urgency": "low", "action_required": false}</ideal_output>
    <reasoning>Pure information request, no action needed. No emotional indicators. Low urgency.</reasoning>
  </example>
</examples>

Pro tip for the exam: Notice the <reasoning> tag in examples above. Including reasoning in your few-shot examples teaches Claude why it should produce that output, not just what the output is. This dramatically improves generalization to novel inputs.

3. XML Tags for Prompt Structure

Claude is specifically trained to respond well to XML-structured prompts. The exam tests your knowledge of how to use XML tags to create clear, maintainable, production-grade prompts.

The canonical structure:

<instructions>
You are a customer feedback classifier. Analyze the input and produce structured output.
</instructions>

<rules>
- Always include a confidence score (0.0-1.0) based on keyword clarity
- If sentiment is unclear, default to "neutral"
- Maximum 3 topics per feedback item
- Never infer intent that isn't explicitly stated
</rules>

<output_format>
Return valid JSON matching this schema:
{
  "sentiment": "positive|negative|neutral|mixed",
  "topics": ["topic1", "topic2"],
  "urgency": "high|medium|low",
  "confidence": 0.0-1.0
}
</output_format>

<examples>
... (your few-shot examples here)
</examples>

<input>
{{CUSTOMER_FEEDBACK}}
</input>

Why XML over markdown or plain text?

XML provides unambiguous section boundaries — Claude never confuses instructions with examples
You can use variable interpolation (e.g., {{CUSTOMER_FEEDBACK}}) that's clearly separated from static prompt content
It's programmatically composable — you can build prompts by assembling XML sections
Claude's training data makes it particularly attentive to content within XML tags

4. Chain of Thought (Let Claude Think)

Chain of thought (CoT) prompting gives Claude space to reason through problems step-by-step before producing a final answer. This is not the same as extended thinking (which is controlled by effort levels) — CoT is a prompt-level technique.

When to use CoT:

Complex classification with multiple criteria
Math or logical reasoning
Any task where the final answer depends on intermediate reasoning steps
When you need to audit WHY Claude chose an answer

Pattern:

<instructions>
Analyze this support ticket for routing priority.

First, think through the classification step by step in <thinking> tags.
Consider: customer tier, issue severity, SLA implications, and business impact.

Then provide your final classification as JSON.
</instructions>

<input>
{{TICKET_CONTENT}}
</input>

Key distinction for the exam:

Chain of thought in prompts = you explicitly ask Claude to reason before answering (user-controlled)
Extended thinking / effort levels = Claude uses internal reasoning (model-controlled, API parameter)
They can be used together but serve different purposes

⚠️ Anti-Patterns & Exam Traps

Anti-Pattern (Wrong Answer)	Why It's Wrong	Correct Approach
❌ "Just use more examples"	More examples without clear criteria leads to overfitting to specific patterns, not generalizable behavior	✅ Define explicit rules FIRST, then use examples to illustrate those rules
❌ Relying on temperature to fix inconsistency	Temperature affects randomness, not understanding. Low temp with a vague prompt still gives inconsistent results.	✅ Fix the prompt (explicit criteria + examples) before touching parameters
❌ Asking Claude to "be confident" or "don't hallucinate"	These are vague meta-instructions that don't give Claude actionable guidance	✅ Give Claude explicit fallback behavior: "If unsure, respond with {"status": "uncertain", "reason": "..."}"
❌ Putting examples in plain text without delimiters	Claude may confuse example content with real instructions	✅ Always wrap examples in XML tags (<examples><example>...)
❌ Self-reported confidence scores as primary decision mechanism	Claude's self-assessed confidence is unreliable for making automated decisions	✅ Use structured validation criteria and programmatic checks

💻 Code Examples

Complete Production-Grade Classification Prompt

import anthropic

client = anthropic.Anthropic()

CLASSIFICATION_PROMPT = """<instructions>
You are a support ticket classifier. Analyze each ticket and produce structured output.
</instructions>

<rules>
- Classify into exactly one category: billing, technical, account, shipping, other
- Urgency levels: high (mentions deadline, legal, or cancellation within 24h), medium (frustrated or time-sensitive), low (informational)
- If ticket mentions multiple categories, use the PRIMARY intent (what they want resolved)
- Always include action_required: true if customer is requesting something specific
- If you cannot determine category with confidence, use "other" 
</rules>

<examples>
  <example>
    <input>Subject: Can't login since yesterday
Body: I've tried resetting my password 3 times and it keeps saying "invalid token". I have a presentation tomorrow and need access to my files ASAP.</input>
    <output>{"category": "technical", "urgency": "high", "action_required": true, "summary": "Login failure - password reset broken, time-sensitive"}</output>
  </example>
  <example>
    <input>Subject: Question about annual billing
Body: Hi, I'm currently on monthly and was wondering what the annual pricing looks like for our team of 12.</input>
    <output>{"category": "billing", "urgency": "low", "action_required": false, "summary": "Pricing inquiry - monthly to annual for team of 12"}</output>
  </example>
  <example>
    <input>Subject: WHERE IS MY ORDER???
Body: Order #98765 was supposed to arrive last week. I've contacted support twice with no response. If this isn't resolved today I'm disputing the charge with my bank.</input>
    <output>{"category": "shipping", "urgency": "high", "action_required": true, "summary": "Missing order - threatened chargeback, escalation risk"}</output>
  </example>
</examples>

<input>
{ticket_text}
</input>

Analyze the ticket above. First reason through your classification in <thinking> tags, then provide the JSON output."""

def classify_ticket(ticket_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": CLASSIFICATION_PROMPT.format(ticket_text=ticket_text)
        }]
    )
    
    # Extract the JSON from the response (after thinking tags)
    text = response.content[0].text
    import json
    json_start = text.rfind('{')
    json_end = text.rfind('}') + 1
    return json.loads(text[json_start:json_end])


# Usage
result = classify_ticket("""Subject: Account access for new team member
Body: We just hired Sarah and need to get her set up on our enterprise plan. 
Can you add her to our account? Her email is sarah@company.com""")

print(result)
# {"category": "account", "urgency": "low", "action_required": true, 
#  "summary": "New user provisioning request - enterprise plan"}

Few-Shot with Chain of Thought (Combined Pattern)

# This pattern shows few-shot examples WITH reasoning chains
# This is the gold standard for production classification systems

PROMPT_WITH_COT_EXAMPLES = """<instructions>
Classify the customer intent. Think step by step, then output JSON.
</instructions>

<examples>
  <example>
    <input>"I've been charged twice for my March subscription"</input>
    <thinking>
    1. Customer mentions being "charged twice" → billing issue
    2. They specify "March subscription" → they have specifics (not confused)  
    3. Double-charge implies they want a refund → action required
    4. No urgency language, no threats → medium urgency (money involved)
    </thinking>
    <output>{"category": "billing", "subcategory": "duplicate_charge", "urgency": "medium", "action_required": true}</output>
  </example>
</examples>

Now classify this input:
<input>{customer_message}</input>"""

🎬 Video Course to Watch

Building with the Claude API (Anthropic's official Skilljar course)

This comprehensive video course from Anthropic covers advanced prompting techniques, tool integration, and structured output patterns. The modules on "Advanced Prompting Techniques" and "Tool Integration" are directly relevant to today's content — they walk through few-shot prompting, XML structuring, and how to get reliable structured output from Claude. Free to access with a Skilljar account.

Also bookmark: Anthropic's Interactive Prompt Engineering Tutorial on GitHub — the notebook 07_Using_Examples_Few-Shot_Prompting.ipynb has hands-on exercises that map directly to exam content.

📖 Reading

Primary: Multishot Prompting Guide — Anthropic's official documentation on few-shot examples
Secondary: Chain of Thought Prompting — How to let Claude think step-by-step
Reference: Prompt Engineering Overview — The full prompt engineering guide from Anthropic

🛠️ Hands-On Exercise (20-30 minutes)

Build a Production Ticket Classifier:

Open the Anthropic Console or your local Python environment
Write a prompt that classifies support tickets into 5 categories: billing, technical, account, shipping, other
Include:
- Explicit rules for each category (when to choose one vs another)
- 3 few-shot examples wrapped in XML tags
- At least 1 edge case example (ambiguous ticket)
- Chain of thought instruction (<thinking> before output)
Test with 5 novel tickets not in your examples
Measure consistency: Run the same ticket 3 times. Is the output identical each time? If not, what's ambiguous in your criteria?

Bonus: After getting it working, deliberately REMOVE the explicit rules and keep only examples. Compare consistency. This demonstrates why rules + examples > examples alone.

📝 Quick Quiz

Question 1: A team is building a customer feedback classifier, but outputs are inconsistent across similar inputs. Their current prompt has 10 few-shot examples but no explicit classification rules. What is the MOST effective first step to improve consistency?

A) Increase to 20 few-shot examples B) Lower the temperature parameter to 0 C) Add explicit classification criteria and rules before the examples D) Switch to a larger model (e.g., Opus instead of Sonnet)

Question 2: When using few-shot prompting with Claude, what is the recommended way to structure examples in the prompt?

A) Use numbered lists (1. Input: ... Output: ...) B) Use XML tags with clear input/output delimiters (<examples><example>...) C) Use markdown code blocks with JSON D) Use plain text separated by blank lines

Question 3: An architect needs Claude to classify tickets AND explain its reasoning for audit purposes. Which approach gives the best combination of structured output and explainability?

A) Ask Claude to "be confident and explain your reasoning" at the end of the prompt B) Use chain of thought (<thinking> tags) followed by structured JSON output, with few-shot examples showing the thinking+output pattern C) Set effort level to "max" and parse the extended thinking output D) Make two separate API calls — one for classification, one for explanation

Answers: Q1: C — Explicit criteria are the foundation. Examples illustrate rules; they don't replace them. More examples without rules just overfits to patterns. Q2: B — XML tags provide unambiguous boundaries. Claude is specifically trained to respond well to XML-structured content. This is stated in Anthropic's docs. Q3: B — CoT with few-shot examples that SHOW the thinking+output pattern teaches Claude the format. This gives auditable reasoning AND structured output in a single call.

🔮 Tomorrow's Preview

Tomorrow (Day 15) we'll tackle tool_use for Structured Output — the technique of defining a "fake" tool whose input schema IS your desired output structure, forcing Claude to produce guaranteed-schema-compliant JSON. This is the most reliable method for structured extraction and a heavy exam topic.