CCA-F Study Day 17/20: Multi-Pass Review & Structured Data Extraction Scenario
Domain 4: Prompt Engineering & Structured Output (~20% of exam)
📌 Today's Focus
Yesterday you nailed JSON schema design and validation-retry loops — the mechanics of getting structured data out of Claude reliably. Today we level up to the architectural pattern for production extraction systems: multi-pass review. This is also the day we deep-dive into the Structured Data Extraction exam scenario — one of the 6 scenarios you might face on exam day.
Multi-pass review is the pattern that separates "works in a demo" from "works in production." It's also one of the exam's favorite testing grounds for the separate-session anti-pattern.
🧠 Core Concepts
1. Multi-Pass Review: The Architecture
Multi-pass review means running multiple independent passes over the same content, where each pass focuses on a single dimension. Think of it like code review — you don't try to catch bugs, style issues, and architecture problems all in one read.
The canonical 4-pass extraction pipeline:
| Pass | Focus | Session | Output |
|---|---|---|---|
| 1. Extraction | Pull all fields from document | Session A | Raw structured data |
| 2. Validation | Check extracted values against source document | Session B (separate!) | List of discrepancies |
| 3. Confidence Scoring | Assign per-field confidence scores from structured criteria | Programmatic (no LLM) | Confidence annotations |
| 4. Human Review Routing | Flag low-confidence fields for manual verification | Programmatic (no LLM) | Review queue |
Why separate sessions? This is the #1 exam trap here. If you run the validator in the same session as the extractor, the model has reasoning context bias — it already "decided" what the data should be, and will tend to confirm its own extraction rather than catch errors. A fresh session sees the document with no prior commitment.
2. Why Each Pass Gets Its Own Session
Let's make this concrete with an analogy:
- Same-session review = asking the person who wrote the code to also review it. They'll gloss over their own mistakes because they already "know" what it should do.
- Separate-session review = asking a fresh reviewer. They read the actual text on screen, not what was "intended."
At the API level, a session is just the conversation history sent with each request, so in a same-session review the extraction turn (including any extraction reasoning) stays in context. A "verifier" in that session can see why the extractor chose each value, which invites confirmation bias.
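A minimal sketch of the difference, assuming the Messages API convention used in the pipeline code later in this guide (a session is whatever `messages` list you send). The function names and prompt wording are illustrative only, and the extractor's turn is simplified to plain JSON text rather than a `tool_use` block.

```python
import json


def same_session_messages(document: str, extraction: dict) -> list[dict]:
    """Anti-pattern: the verifier request replays the extractor's own turn."""
    return [
        {"role": "user", "content": f"Extract the invoice fields.\n<document>\n{document}\n</document>"},
        {"role": "assistant", "content": json.dumps(extraction)},
        {"role": "user", "content": "Now verify your extraction is correct."},
    ]


def separate_session_messages(document: str, extraction: dict) -> list[dict]:
    """Fresh context: one user turn holding only the source and the candidate data."""
    prompt = (
        "Compare the extracted data against the document and list discrepancies.\n"
        f"<document>\n{document}\n</document>\n"
        f"<extracted_data>\n{json.dumps(extraction)}\n</extracted_data>"
    )
    return [{"role": "user", "content": prompt}]


doc = "Invoice INV-001, total due 155.00 USD"
data = {"invoice_number": "INV-001", "total_amount": 150.00}

biased = same_session_messages(doc, data)      # three turns, including the extractor's
fresh = separate_session_messages(doc, data)   # a single user turn, no prior commitment
print(len(biased), len(fresh))
```

In the second form the reviewer receives the extracted values as data to check, not as something it already said.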
3. Per-Field Confidence vs. Aggregate Accuracy
The exam distinguishes between two approaches:
- ❌ Aggregate accuracy: "The system is 94% accurate across all documents." This masks failures in specific document types. If you're 99% accurate on invoices but 60% on handwritten notes, the aggregate hides the problem.
- ✅ Per-document-type tracking: Track accuracy for each document category independently. This surfaces failure modes early.
- ✅ Per-field confidence: Each extracted field gets its own confidence score based on structured criteria (not self-reported confidence!).
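To see how an aggregate number hides a weak category, here is a small sketch with made-up evaluation results (100 invoices, 10 handwritten notes). The data, the `accuracy_by_type` helper, and the document-type labels are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_type(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each document type to its share of correct extractions."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for doc_type, is_correct in results:
        totals[doc_type] += 1
        correct[doc_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


# Made-up results: 100 invoices (99 correct), 10 handwritten notes (6 correct).
results = (
    [("invoice", True)] * 99 + [("invoice", False)] * 1
    + [("handwritten_note", True)] * 6 + [("handwritten_note", False)] * 4
)

aggregate = sum(ok for _, ok in results) / len(results)
per_type = accuracy_by_type(results)

print(f"aggregate: {aggregate:.1%}")
for doc_type, acc in per_type.items():
    print(f"{doc_type}: {acc:.1%}")
```

Because invoices dominate the sample, the aggregate lands around 95% even though handwritten notes are only 60% accurate. The per-type breakdown is what exposes the problem.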
4. The Structured Data Extraction Exam Scenario
This is one of the 6 scenarios randomly selected for your exam. Here's what it tests:
| What's Tested | Expected Knowledge |
|---|---|
| JSON schema design | Proper use of types, enums, required, additionalProperties: false |
| tool_use for extraction | Forced tool_choice to guarantee schema compliance |
| Validation-retry loops | Extract → validate → retry with error feedback |
| Multi-pass architecture | Separate sessions for extraction vs. validation |
| Few-shot examples | XML-structured examples for edge cases |
| Confidence scoring | Structured criteria, NOT self-reported confidence |
| Human-in-the-loop routing | Programmatic thresholds for human review |
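The "XML-structured examples" row can be sketched as a small prompt builder. This is one possible layout, not a prescribed format: the tag names (`<examples>`, `<example>`, `<expected_output>`), the helper name, and the sample edge case are assumptions for illustration.

```python
import json


def build_few_shot_prompt(examples: list[dict], document: str) -> str:
    """Wrap each example in XML tags, then append the real document."""
    blocks = []
    for ex in examples:
        blocks.append(
            "<example>\n"
            f"<document>\n{ex['document']}\n</document>\n"
            f"<expected_output>\n{json.dumps(ex['output'])}\n</expected_output>\n"
            "</example>"
        )
    examples_xml = "<examples>\n" + "\n".join(blocks) + "\n</examples>"
    return (
        "Extract the invoice fields. Follow the examples for edge cases.\n"
        f"{examples_xml}\n"
        f"<document>\n{document}\n</document>"
    )


# Hypothetical edge case: two dates appear; the invoice date is the one wanted.
examples = [
    {
        "document": "Invoice date: 2024-03-01. Due date: 2024-03-31. Total: 90.00 USD",
        "output": {"date": "2024-03-01", "total_amount": 90.00, "currency": "USD"},
    }
]

prompt = build_few_shot_prompt(examples, "Invoice date: 2024-05-10. Total: 12.50 EUR")
print(prompt)
```

Keeping examples and the live document in clearly separate tags makes it less likely that example values leak into the real extraction.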
5. Structured Confidence Scoring (The Right Way)
The exam loves to test this distinction:
- ❌ Self-reported confidence: Asking Claude "how confident are you?" — The model's self-assessment of confidence is unreliable and doesn't correlate well with actual accuracy.
- ✅ Structured criteria: Programmatically determining confidence based on measurable factors:
# Structured confidence scoring (deterministic).
# The four helper predicates are placeholders to implement per document type.
def calculate_field_confidence(field_name, extracted_value, source_document, all_fields):
    score = 1.0
    # Factor 1: Was the field found in the expected location?
    if not found_in_expected_section(field_name, source_document):
        score -= 0.3
    # Factor 2: Does the value pass format validation?
    if not passes_format_check(field_name, extracted_value):
        score -= 0.4
    # Factor 3: Is the value consistent with other extracted fields?
    if not cross_field_consistent(field_name, extracted_value, all_fields):
        score -= 0.2
    # Factor 4: Was there ambiguity in the source (multiple possible values)?
    if has_ambiguous_source(field_name, source_document):
        score -= 0.3
    # Clamp at zero: the penalties can sum to more than 1.0
    return max(0.0, score)
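As a concrete instance of Factor 3, a cross-field check for invoices might compare the line items against the stated total. This is a sketch under simplifying assumptions (no tax, discounts, or shipping outside the line items); the function names and the tolerance are hypothetical.

```python
import math


def line_items_match_total(invoice: dict, tolerance: float = 0.01) -> bool:
    """True when the line-item totals add up to total_amount within tolerance."""
    items_sum = sum(item["total"] for item in invoice.get("line_items", []))
    return math.isclose(items_sum, invoice["total_amount"], abs_tol=tolerance)


def line_item_arithmetic_ok(item: dict, tolerance: float = 0.01) -> bool:
    """True when quantity * unit_price equals the line total within tolerance."""
    return math.isclose(item["quantity"] * item["unit_price"], item["total"], abs_tol=tolerance)


invoice = {
    "total_amount": 155.00,
    "line_items": [
        {"description": "Widget", "quantity": 3, "unit_price": 45.00, "total": 135.00},
        {"description": "Shipping", "quantity": 1, "unit_price": 15.00, "total": 15.00},
    ],
}

# The items sum to 150.00 while the stated total is 155.00, so this check fails.
consistent = line_items_match_total(invoice)
# Each row is internally consistent (3 * 45.00 and 1 * 15.00).
rows_ok = all(line_item_arithmetic_ok(i) for i in invoice["line_items"])
print(consistent, rows_ok)
```

A failed check like this would lower the confidence of `total_amount` without asking the model how sure it feels.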
⚠️ Anti-Patterns & Exam Traps
| ❌ Wrong Answer (Exam Trap) | ✅ Correct Approach | Why It's Wrong |
|---|---|---|
| Run extraction and validation in the same session | Use separate sessions for each pass | Reasoning context bias — the model confirms its own extraction |
| Ask Claude "rate your confidence 1-10" | Use structured, programmatic confidence criteria | LLM self-reported confidence doesn't correlate with accuracy |
| Track only aggregate accuracy (94% overall) | Track per-document-type accuracy | Aggregate masks category-specific failures |
| Retry without specific error feedback | Include exact validation errors in retry prompt | Model needs to know WHAT failed to fix it |
| Single-pass extraction for production | Multi-pass with separate extraction, validation, confidence | Single pass has no error-catching mechanism |
| Use prompt instructions alone to enforce output format | Use tool_use with forced tool_choice or Structured Outputs | Prompting can't guarantee schema compliance |
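The retry row can be sketched as a loop that feeds the validator's exact errors back into the next attempt. The extractor and validator here are offline stand-ins (hypothetical `fake_extract` and `fake_validate`), so the control flow can be run without an API key.

```python
from typing import Callable


def extract_with_retry(
    extract: Callable[[str, list[str]], dict],
    validate: Callable[[dict], list[str]],
    document: str,
    max_retries: int = 3,
) -> tuple[dict, list[str], int]:
    """Re-run extraction until validation passes, feeding back the exact errors."""
    data = extract(document, [])          # first attempt: no feedback yet
    errors = validate(data)
    retries = 0
    while errors and retries < max_retries:
        data = extract(document, errors)  # pass the specific errors, not a bare "try again"
        errors = validate(data)
        retries += 1
    return data, errors, retries


# Stand-ins for the model calls so the loop can run offline.
def fake_extract(document: str, feedback: list[str]) -> dict:
    # Pretend the model only fixes the date format once it is told what was wrong.
    return {"date": "2024-03-01"} if feedback else {"date": "03/01/2024"}


def fake_validate(data: dict) -> list[str]:
    date = data["date"]
    if len(date) == 10 and date[4] == "-" and date[7] == "-":
        return []
    return [f"date: expected YYYY-MM-DD but got {date!r}"]


data, errors, retries = extract_with_retry(
    fake_extract, fake_validate, "Invoice dated March 1, 2024"
)
print(data, errors, retries)
```

If errors remain after `max_retries`, the loop returns them rather than raising, so a later stage can route the document to human review.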
💻 Code Examples
Complete Multi-Pass Extraction Pipeline
import anthropic
import json
from typing import Any
client = anthropic.Anthropic()
# ====== PASS 1: EXTRACTION (Session A) ======
# Defined at module level so the orchestrator's retry loop can reuse it.
extraction_tool = {
    "name": "extract_invoice",
    "description": "Extract structured invoice data from the document",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string", "description": "Company name of the vendor"},
            "invoice_number": {"type": "string", "description": "Invoice ID/number"},
            "date": {"type": "string", "description": "Invoice date in ISO 8601 (YYYY-MM-DD)"},
            "total_amount": {"type": "number", "description": "Total amount due"},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "CAD"]},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"}
                    },
                    "required": ["description", "quantity", "unit_price", "total"]
                }
            },
            "payment_terms": {"type": "string", "description": "e.g., Net 30, Due on receipt"}
        },
        "required": ["vendor_name", "invoice_number", "date", "total_amount", "currency", "line_items"],
        "additionalProperties": False  # Python literal; serialized as JSON false
    }
}

def extract_data(document_text: str) -> dict:
    """First pass: extract all fields from the document."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        tools=[extraction_tool],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=[{
            "role": "user",
            "content": f"""Extract all invoice data from this document.
Be precise: copy values exactly as they appear.
<document>
{document_text}
</document>"""
        }]
    )
    # Forced tool_choice means the response contains a tool_use block
    tool_use = next(b for b in response.content if b.type == "tool_use")
    return tool_use.input
# ====== PASS 2: VALIDATION (Session B, SEPARATE!) ======
def validate_extraction(document_text: str, extracted_data: dict) -> list[str]:
    """Second pass: verify extracted data against source. MUST be a new session."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""You are a data validation specialist. Compare the extracted data
against the original document. Report ANY discrepancies.
<original_document>
{document_text}
</original_document>
<extracted_data>
{json.dumps(extracted_data, indent=2)}
</extracted_data>
List each discrepancy as a specific, actionable error.
If everything matches, respond with "NO_ERRORS".
Format: one error per line, e.g., "total_amount: extracted 150.00 but document shows 155.00"
"""
        }]
    )
    result_text = response.content[0].text
    if "NO_ERRORS" in result_text:
        return []
    return [line.strip() for line in result_text.strip().split("\n") if line.strip()]
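# ====== PASS 3: CONFIDENCE SCORING (Programmatic, no LLM needed) ======
import re
from datetime import date

def is_valid_iso_date(value: object) -> bool:
    """Return True if value is a YYYY-MM-DD string naming a real calendar date."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

def score_confidence(extracted_data: dict, validation_errors: list[str]) -> dict:
    """Third pass: assign per-field confidence. Programmatic, not LLM-based."""
    confidence_scores = {}
    # Validator lines look like "field_name: message", so the prefix names the field
    error_fields = set()
    for error in validation_errors:
        field_name = error.split(":")[0].strip()
        error_fields.add(field_name)
    for field in extracted_data:
        score = 1.0
        if field in error_fields:
            score -= 0.5
        if extracted_data[field] is None or extracted_data[field] == "":
            score -= 0.3
        if field == "date" and not is_valid_iso_date(extracted_data[field]):
            score -= 0.4
        confidence_scores[field] = max(0.0, round(score, 2))
    return confidence_scores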
# ====== PASS 4: HUMAN ROUTING (Programmatic) ======
def route_for_review(confidence_scores: dict, threshold: float = 0.7) -> dict:
    """Fourth pass: flag low-confidence fields for human review."""
    return {
        "auto_approved": {k: v for k, v in confidence_scores.items() if v >= threshold},
        "needs_human_review": {k: v for k, v in confidence_scores.items() if v < threshold},
        "review_required": any(v < threshold for v in confidence_scores.values())
    }
# ====== ORCHESTRATOR ======
def run_extraction_pipeline(document_text: str) -> dict:
    """Full multi-pass pipeline with retry logic."""
    MAX_RETRIES = 3
    # Pass 1: Extract
    extracted = extract_data(document_text)
    # Pass 2: Validate (separate session!)
    errors = validate_extraction(document_text, extracted)
    # Retry loop if validation finds issues
    retry_count = 0
    while errors and retry_count < MAX_RETRIES:
        # Re-extract in a fresh request that shows the previous attempt and the
        # validator's specific errors, so the model can see what it is fixing
        error_list = "\n".join(errors)
        retry_prompt = f"""Extract all invoice data from this document.
<document>
{document_text}
</document>
A previous extraction produced:
<previous_extraction>
{json.dumps(extracted, indent=2)}
</previous_extraction>
A validator reported these errors:
<validation_errors>
{error_list}
</validation_errors>
Re-extract carefully, fixing these specific issues."""
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            tools=[extraction_tool],
            tool_choice={"type": "tool", "name": "extract_invoice"},
            messages=[{"role": "user", "content": retry_prompt}]
        )
        tool_use = next(b for b in response.content if b.type == "tool_use")
        extracted = tool_use.input
        # Each re-validation is again a separate session
        errors = validate_extraction(document_text, extracted)
        retry_count += 1
    # Pass 3: Confidence scoring (remaining errors lower the affected fields)
    confidence = score_confidence(extracted, errors)
    # Pass 4: Route
    routing = route_for_review(confidence)
    return {
        "extracted_data": extracted,
        "validation_errors": errors,
        "confidence_scores": confidence,
        "routing": routing,
        "retries_used": retry_count
    }
Why This Architecture Passes the Exam
- Separate sessions for extraction and validation (no reasoning context bias)
- Forced tool_choice guarantees structured output
- Validation-retry loop with specific error feedback
- Programmatic confidence (not self-reported)
- Human routing based on structured thresholds
- additionalProperties: false prevents schema drift
🎬 Video to Watch
"How We Build Effective Agents" — Barry Zhang (Anthropic), AI Engineer Summit 2025
Search on YouTube: "How We Build Effective Agents Barry Zhang Anthropic" (on the AI Engineer channel)
Barry covers the multi-agent patterns Anthropic uses in production, including the generator-verifier pattern with separate sessions and why simple composable patterns beat complex frameworks. The section on "think like the agent" is especially relevant to understanding why multi-pass review with session isolation works better than single-pass.
Also highly recommended reading: How We Built Our Multi-Agent Research System, Anthropic's engineering blog post on parallel agents in production (the multi-agent setup outperformed a single-agent baseline by 90.2% on Anthropic's internal research eval).
📖 Reading
- Primary: Anthropic Prompt Engineering Overview — Focus on the "Chain complex prompts" section about breaking tasks into subtasks
- Deep dive: How We Built Our Multi-Agent Research System — Anthropic's engineering blog on parallel agents with separate contexts
- Reference: Anthropic Courses: Structured Outputs Notebook
🛠️ Hands-On Exercise (20-30 minutes)
Build a 3-pass medical record extraction pipeline:
- Pass 1 (Extraction): Define a tool schema for extracting patient info (name, DOB, medications, allergies, diagnoses). Use forced tool_choice.
- Pass 2 (Validation): In a new API call (simulating a separate session), send the original text + extracted data and ask Claude to list discrepancies.
- Pass 3 (Confidence + Routing): Write a Python function that scores each field based on: (a) whether it passed validation, (b) format correctness, (c) cross-field consistency. Route fields below 0.7 confidence to a "human review" queue.
Bonus: Add a retry loop between Pass 1 and Pass 2 — if validation finds errors, re-extract with the error list appended to the prompt.
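If you want a starting point, here is one possible skeleton for the Pass 1 schema and the Pass 3 routing step. The field names, the tool name `extract_patient_record`, and the sample scores are assumptions; adjust them to your own records. The schema shape follows the invoice tool used earlier in this guide.

```python
patient_record_tool = {
    "name": "extract_patient_record",
    "description": "Extract structured patient information from a medical record",
    "input_schema": {
        "type": "object",
        "properties": {
            "patient_name": {"type": "string"},
            "date_of_birth": {"type": "string", "description": "ISO 8601 (YYYY-MM-DD)"},
            "medications": {"type": "array", "items": {"type": "string"}},
            "allergies": {"type": "array", "items": {"type": "string"}},
            "diagnoses": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["patient_name", "date_of_birth", "medications", "allergies", "diagnoses"],
        "additionalProperties": False,
    },
}

# Forcing this tool mirrors the tool_choice used in the invoice pipeline.
tool_choice = {"type": "tool", "name": patient_record_tool["name"]}


def route_fields(confidence: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return the field names that fall below the review threshold."""
    return sorted(name for name, score in confidence.items() if score < threshold)


# Hypothetical scores, as if produced by your own Pass 3 scoring function.
review_queue = route_fields({"patient_name": 1.0, "date_of_birth": 0.6, "allergies": 0.5})
print(review_queue)
```

The scoring function itself is left for you to write, since choosing the criteria is the point of the exercise.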
📝 Quick Quiz
Question 1: A team is building a document extraction pipeline. The extraction agent extracts invoice fields, then in the same conversation, is asked "Now verify your extraction is correct." What is the primary risk?
- A. The model will refuse to verify its own work
- B. The verification will exceed the context window
- C. Reasoning context bias: the model already committed to its extraction and will confirm rather than catch errors
- D. The token cost will be too high for production
Question 2: Which approach to confidence scoring is recommended for the Structured Data Extraction scenario?
- A. Ask Claude to rate its confidence on a scale of 1-10 for each field
- B. Use programmatic criteria: format validation, cross-field consistency, and source location checks
- C. Run the extraction 5 times and use majority vote
- D. Use the model's logprobs to determine confidence
Question 3: A system tracks overall extraction accuracy at 94% across all document types. Medical records have 62% accuracy while invoices have 99%. What anti-pattern does this demonstrate?
- A. Same-session self-review
- B. Aggregate accuracy masking per-document-type failures
- C. Too many tools per agent
- D. Prompt-based enforcement instead of hooks
Answers:
- Q1: C — Reasoning context bias. The model's extraction reasoning is in context, making it predisposed to confirm its own choices. The fix: separate sessions.
- Q2: B — Programmatic criteria. Self-reported confidence (A) is an explicit anti-pattern. Logprobs (D) aren't reliable for field-level confidence. Majority vote (C) is expensive and doesn't address root cause.
- Q3: B — Aggregate accuracy masking category-specific failures. The 94% overall looks good but hides that medical records are failing. The fix: track per-document-type metrics separately.
🔮 Tomorrow's Preview
Day 18 kicks off Domain 5: Context Management & Reliability — we'll tackle context windows, progressive summarization, and the 5 risks of condensing context. This is the domain that separates "toy demos" from "production systems" and is worth 15% of the exam.