CCA-F Study Day 17/20: Multi-Pass Review & Structured Data Extraction Scenario

Domain 4: Prompt Engineering & Structured Output (~20% of exam)

📌 Today's Focus

Yesterday you nailed JSON schema design and validation-retry loops — the mechanics of getting structured data out of Claude reliably. Today we level up to the architectural pattern for production extraction systems: multi-pass review. This is also the day we deep-dive into the Structured Data Extraction exam scenario — one of the 6 scenarios you might face on exam day.

Multi-pass review is the pattern that separates "works in a demo" from "works in production." It's also one of the exam's favorite testing grounds for the separate-session anti-pattern.

🧠 Core Concepts

1. Multi-Pass Review: The Architecture

Multi-pass review means running multiple independent passes over the same content, where each pass focuses on a single dimension. Think of it like code review — you don't try to catch bugs, style issues, and architecture problems all in one read.

The canonical 4-pass extraction pipeline:

Pass	Focus	Session	Output
1. Extraction	Pull all fields from document	Session A	Raw structured data
2. Validation	Check extracted values against source document	Session B (separate!)	List of discrepancies
3. Confidence Scoring	Assign per-field confidence scores from structured criteria	Programmatic (no LLM)	Confidence annotations
4. Human Review Routing	Flag low-confidence fields for manual verification	Programmatic (no LLM)	Review queue

Why separate sessions? This is the #1 exam trap here. If you run the validator in the same session as the extractor, the model has reasoning context bias — it already "decided" what the data should be, and will tend to confirm its own extraction rather than catch errors. A fresh session sees the document with no prior commitment.

2. Why Each Pass Gets Its Own Session

Let's make this concrete with an analogy:

Same-session review = asking the person who wrote the code to also review it. They'll gloss over their own mistakes because they already "know" what it should do.
Separate-session review = asking a fresh reviewer. They read the actual text on screen, not what was "intended."

The Messages API is stateless: a "session" is just the message history you send with each request. If the verifier request reuses the extractor's history, that history (including the extraction reasoning) stays in context. The "verifier" then literally has access to why the extractor chose each value, which creates confirmation bias.
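One way to make the difference concrete is to look at the message lists each approach would send. The sketch below builds both as plain data (no API call is made). The function names and prompt wording are illustrative assumptions, not part of any SDK.

```python
import json


def same_session_messages(document: str, extraction: dict) -> list[dict]:
    """Anti-pattern: the verifier inherits the extractor's full history."""
    return [
        {"role": "user", "content": f"Extract the invoice fields.\n<document>\n{document}\n</document>"},
        # The extractor's own answer (and any reasoning) stays in context.
        {"role": "assistant", "content": json.dumps(extraction)},
        {"role": "user", "content": "Now verify your extraction is correct."},
    ]


def fresh_session_messages(document: str, extraction: dict) -> list[dict]:
    """Preferred: a new message list holding only the source and the claimed values."""
    prompt = (
        "Compare the extracted data against the document and list discrepancies.\n"
        f"<document>\n{document}\n</document>\n"
        f"<extracted_data>\n{json.dumps(extraction)}\n</extracted_data>"
    )
    return [{"role": "user", "content": prompt}]


doc = "Invoice INV-7 from Acme Corp. Total due: 155.00 USD"
claimed = {"invoice_number": "INV-7", "total_amount": 150.00}

biased = same_session_messages(doc, claimed)
isolated = fresh_session_messages(doc, claimed)
print(len(biased), len(isolated))
```

The isolated list contains no assistant turn, so the validator has nothing to anchor on except the document and the claimed values.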

3. Per-Field Confidence vs. Aggregate Accuracy

The exam distinguishes between two approaches:

❌ Aggregate accuracy: "The system is 94% accurate across all documents." This masks failures in specific document types. If you're 99% accurate on invoices but 60% on handwritten notes, the aggregate hides the problem.
✅ Per-document-type tracking: Track accuracy for each document category independently. This surfaces failure modes early.
✅ Per-field confidence: Each extracted field gets its own confidence score based on structured criteria (not self-reported confidence!).
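To see how an aggregate number can hide a failing category, here is a minimal sketch that computes both views from the same results. The evaluation set is made up for illustration, and accuracy_by_type is a hypothetical helper name.

```python
from collections import defaultdict


def accuracy_by_type(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Return accuracy per document type from (doc_type, was_correct) pairs."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for doc_type, was_correct in results:
        totals[doc_type] += 1
        correct[doc_type] += int(was_correct)
    return {t: correct[t] / totals[t] for t in totals}


# Made-up evaluation set: 90 invoices (89 correct), 10 handwritten notes (6 correct).
results = (
    [("invoice", True)] * 89
    + [("invoice", False)] * 1
    + [("handwritten_note", True)] * 6
    + [("handwritten_note", False)] * 4
)

aggregate = sum(ok for _, ok in results) / len(results)
per_type = accuracy_by_type(results)

print(f"aggregate: {aggregate:.2f}")
for doc_type, acc in per_type.items():
    print(f"{doc_type}: {acc:.2f}")
```

With this data the aggregate comes out at 95%, while handwritten notes sit at 60%: the per-type view is what exposes the problem.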

4. The Structured Data Extraction Exam Scenario

This is one of the 6 scenarios randomly selected for your exam. Here's what it tests:

What's Tested	Expected Knowledge
JSON schema design	Proper use of types, enums, required, additionalProperties: false
tool_use for extraction	Forced tool_choice to guarantee schema compliance
Validation-retry loops	Extract → validate → retry with error feedback
Multi-pass architecture	Separate sessions for extraction vs. validation
Few-shot examples	XML-structured examples for edge cases
Confidence scoring	Structured criteria, NOT self-reported confidence
Human-in-the-loop routing	Programmatic thresholds for human review
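The "few-shot examples" row is easiest to picture as a prompt builder. The following is a sketch under stated assumptions: the tag names (examples, example, input, note, output), the edge cases, and the function name are all illustrative choices, not a required format.

```python
def build_few_shot_prompt(examples: list[dict], document: str) -> str:
    """Wrap each edge-case example in XML tags, then append the real document."""
    blocks = []
    for ex in examples:
        blocks.append(
            "<example>\n"
            f"<input>{ex['input']}</input>\n"
            f"<note>{ex['note']}</note>\n"
            f"<output>{ex['output']}</output>\n"
            "</example>"
        )
    examples_xml = "<examples>\n" + "\n".join(blocks) + "\n</examples>"
    return (
        "Extract the invoice fields. Follow the examples for edge cases.\n\n"
        f"{examples_xml}\n\n<document>\n{document}\n</document>"
    )


# Hypothetical edge cases a team might collect from real extraction failures.
EDGE_CASES = [
    {
        "input": "Total: 1.250,00 EUR",
        "note": "European decimal comma",
        "output": '{"total_amount": 1250.00, "currency": "EUR"}',
    },
    {
        "input": "Dated 03/04/2025 (UK vendor)",
        "note": "Day-first date",
        "output": '{"date": "2025-04-03"}',
    },
]

prompt = build_few_shot_prompt(EDGE_CASES, "Invoice #881, total 42.00 USD")
print(prompt)
```

Keeping the examples in their own tagged block separates them from the document being processed, so the model is less likely to extract values from an example by mistake.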

5. Structured Confidence Scoring (The Right Way)

The exam loves to test this distinction:

❌ Self-reported confidence: Asking Claude "how confident are you?" — The model's self-assessment of confidence is unreliable and doesn't correlate well with actual accuracy.
✅ Structured criteria: Programmatically determining confidence based on measurable factors:

# Structured confidence scoring (deterministic).
# The four helper predicates are placeholders for pipeline-specific checks.
def calculate_field_confidence(field_name, extracted_value, source_document, all_fields):
    score = 1.0
    
    # Factor 1: Was the field found in the expected location?
    if not found_in_expected_section(field_name, source_document):
        score -= 0.3
    
    # Factor 2: Does the value pass format validation?
    if not passes_format_check(field_name, extracted_value):
        score -= 0.4
    
    # Factor 3: Is the value consistent with other extracted fields?
    if not cross_field_consistent(field_name, extracted_value, all_fields):
        score -= 0.2
    
    # Factor 4: Was there ambiguity in the source (multiple possible values)?
    if has_ambiguous_source(field_name, source_document):
        score -= 0.3
    
    return max(0.0, score)
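The function above relies on four helper predicates that are not defined in this guide. As a rough sketch of what they might look like, here are deliberately simple, hypothetical implementations. The format rules, the 200-character "header" window, and the label-matching heuristic are all assumptions for illustration; a real pipeline would substitute its own checks.

```python
import re

# Hypothetical per-field format rules; a real pipeline would define its own.
FORMAT_RULES = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "invoice_number": re.compile(r"[A-Z]{2,4}-\d+"),
}


def found_in_expected_section(field_name: str, source_document: str) -> bool:
    """Assume header fields are labelled within the first 200 characters."""
    label = field_name.replace("_", " ")
    return label in source_document[:200].lower()


def passes_format_check(field_name: str, value) -> bool:
    """True when the value fully matches the field's rule, or no rule exists."""
    rule = FORMAT_RULES.get(field_name)
    return rule is None or rule.fullmatch(str(value)) is not None


def cross_field_consistent(field_name: str, value, all_fields: dict) -> bool:
    """Only total_amount is checked: it should equal the sum of line-item totals."""
    if field_name != "total_amount":
        return True
    line_sum = sum(item["total"] for item in all_fields.get("line_items", []))
    return abs(line_sum - value) < 0.01


def has_ambiguous_source(field_name: str, source_document: str) -> bool:
    """Treat a field as ambiguous when its label appears more than once."""
    label = field_name.replace("_", " ")
    return source_document.lower().count(label) > 1


doc = "Invoice Number: INV-1042\nDate: 2025-03-14\nTotal amount: 155.00 USD"
fields = {
    "invoice_number": "INV-1042",
    "date": "14/03/2025",   # wrong format on purpose
    "total_amount": 150.0,  # disagrees with the line items on purpose
    "line_items": [{"total": 100.0}, {"total": 55.0}],
}

for name in ("invoice_number", "date", "total_amount"):
    print(
        name,
        found_in_expected_section(name, doc),
        passes_format_check(name, fields[name]),
        cross_field_consistent(name, fields[name], fields),
        has_ambiguous_source(name, doc),
    )
```

Every check here is deterministic and inspects either the value or the source text, which is the property that distinguishes structured criteria from asking the model how confident it feels.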

⚠️ Anti-Patterns & Exam Traps

❌ Wrong Answer (Exam Trap)	✅ Correct Approach	Why It's Wrong
Run extraction and validation in the same session	Use separate sessions for each pass	Reasoning context bias — the model confirms its own extraction
Ask Claude "rate your confidence 1-10"	Use structured, programmatic confidence criteria	LLM self-reported confidence doesn't correlate with accuracy
Track only aggregate accuracy (94% overall)	Track per-document-type accuracy	Aggregate masks category-specific failures
Retry without specific error feedback	Include exact validation errors in retry prompt	Model needs to know WHAT failed to fix it
Single-pass extraction for production	Multi-pass with separate extraction, validation, confidence	Single pass has no error-catching mechanism
Use prompt instructions alone to enforce output format	Use tool_use with forced tool_choice or Structured Outputs	Prompting can't guarantee schema compliance
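The "retry without specific error feedback" row can be illustrated with a small local check that produces exact, field-level messages and feeds them into the retry prompt. This is a sketch only: find_errors and build_retry_prompt are hypothetical helpers, the rules cover just a few fields, and a production system would more likely validate against the full JSON schema.

```python
import json

REQUIRED = ["vendor_name", "invoice_number", "date", "total_amount", "currency"]
ALLOWED_CURRENCIES = ["USD", "EUR", "GBP", "CAD"]


def find_errors(data: dict) -> list[str]:
    """Return one specific, field-level message per problem found."""
    errors = [f"{name}: required field is missing" for name in REQUIRED if name not in data]
    if "currency" in data and data["currency"] not in ALLOWED_CURRENCIES:
        errors.append(f"currency: {data['currency']!r} is not one of {ALLOWED_CURRENCIES}")
    amount = data.get("total_amount")
    # bool is a subclass of int in Python, so reject it explicitly.
    if "total_amount" in data and (isinstance(amount, bool) or not isinstance(amount, (int, float))):
        errors.append(f"total_amount: expected a number, got {amount!r}")
    return errors


def build_retry_prompt(document: str, previous: dict, errors: list[str]) -> str:
    """Quote the previous output and every error so the model knows what to fix."""
    error_lines = "\n".join(f"- {e}" for e in errors)
    return (
        f"<document>\n{document}\n</document>\n\n"
        f"<previous_extraction>\n{json.dumps(previous, indent=2)}\n</previous_extraction>\n\n"
        f"<validation_errors>\n{error_lines}\n</validation_errors>\n\n"
        "Re-extract the invoice data, fixing each error listed above."
    )


bad = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-7",
    "total_amount": "155.00",  # string instead of number
    "currency": "usd",         # not in the allowed enum
}
errors = find_errors(bad)
retry_prompt = build_retry_prompt("Invoice INV-7 from Acme Corp. Total due: 155.00 USD", bad, errors)
print(retry_prompt)
```

A bare "try again" gives the model nothing to act on; quoting each error next to the previous output tells it exactly which values to revisit.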

💻 Code Examples

Complete Multi-Pass Extraction Pipeline

import anthropic
import json
from typing import Any

client = anthropic.Anthropic()

# Tool schema lives at module level so extract_data() and the retry loop can both use it.
extraction_tool = {
    "name": "extract_invoice",
    "description": "Extract structured invoice data from the document",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string", "description": "Company name of the vendor"},
            "invoice_number": {"type": "string", "description": "Invoice ID/number"},
            "date": {"type": "string", "description": "Invoice date in ISO 8601 (YYYY-MM-DD)"},
            "total_amount": {"type": "number", "description": "Total amount due"},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "CAD"]},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"},
                        "total": {"type": "number"}
                    },
                    "required": ["description", "quantity", "unit_price", "total"]
                }
            },
            "payment_terms": {"type": "string", "description": "e.g., Net 30, Due on receipt"}
        },
        "required": ["vendor_name", "invoice_number", "date", "total_amount", "currency", "line_items"],
        "additionalProperties": False  # Python boolean, serialized as JSON false
    }
}


# ====== PASS 1: EXTRACTION (Session A) ======
def extract_data(document_text: str) -> dict:
    """First pass: extract all fields from the document."""
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        tools=[extraction_tool],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=[{
            "role": "user",
            "content": f"""Extract all invoice data from this document. 
Be precise — copy values exactly as they appear.

<document>
{document_text}
</document>"""
        }]
    )
    
    tool_use = next(b for b in response.content if b.type == "tool_use")
    return tool_use.input


# ====== PASS 2: VALIDATION (Session B — SEPARATE!) ======
def validate_extraction(document_text: str, extracted_data: dict) -> list[str]:
    """Second pass: verify extracted data against source. MUST be a new session."""
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""You are a data validation specialist. Compare the extracted data 
against the original document. Report ANY discrepancies.

<original_document>
{document_text}
</original_document>

<extracted_data>
{json.dumps(extracted_data, indent=2)}
</extracted_data>

List each discrepancy as a specific, actionable error. 
If everything matches, respond with "NO_ERRORS".
Format: one error per line, e.g., "total_amount: extracted 150.00 but document shows 155.00"
"""
        }]
    )
    
    result_text = response.content[0].text
    if "NO_ERRORS" in result_text:
        return []
    return [line.strip() for line in result_text.strip().split("\n") if line.strip()]


# ====== PASS 3: CONFIDENCE SCORING (Programmatic — no LLM needed) ======
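from datetime import datetime


def is_valid_iso_date(value) -> bool:
    """Return True if value parses as a YYYY-MM-DD date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def score_confidence(extracted_data: dict, validation_errors: list[str]) -> dict: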
    """Third pass: assign per-field confidence. Programmatic, not LLM-based."""
    
    confidence_scores = {}
    error_fields = set()
    
    for error in validation_errors:
        field_name = error.split(":")[0].strip()
        error_fields.add(field_name)
    
    for field in extracted_data:
        score = 1.0
        if field in error_fields:
            score -= 0.5
        if extracted_data[field] is None or extracted_data[field] == "":
            score -= 0.3
        if field == "date" and not is_valid_iso_date(extracted_data.get(field, "")):
            score -= 0.4
        confidence_scores[field] = max(0.0, round(score, 2))
    
    return confidence_scores


# ====== PASS 4: HUMAN ROUTING (Programmatic) ======
def route_for_review(confidence_scores: dict, threshold: float = 0.7) -> dict:
    """Fourth pass: flag low-confidence fields for human review."""
    return {
        "auto_approved": {k: v for k, v in confidence_scores.items() if v >= threshold},
        "needs_human_review": {k: v for k, v in confidence_scores.items() if v < threshold},
        "review_required": any(v < threshold for v in confidence_scores.values())
    }


# ====== ORCHESTRATOR ======
def run_extraction_pipeline(document_text: str) -> dict:
    """Full multi-pass pipeline with retry logic."""
    MAX_RETRIES = 3
    
    # Pass 1: Extract
    extracted = extract_data(document_text)
    
    # Pass 2: Validate (separate session!)
    errors = validate_extraction(document_text, extracted)
    
    # Retry loop if validation finds issues
    retry_count = 0
    while errors and retry_count < MAX_RETRIES:
        # Re-extract with specific error feedback
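        # Fresh request: show the previous output and the exact errors so the
        # model knows which values to fix.
        error_list = "\n".join(f"- {e}" for e in errors)
        messages = [{
            "role": "user",
            "content": f"""Extract all invoice data from this document.

<document>
{document_text}
</document>

A previous extraction produced this result:
<previous_extraction>
{json.dumps(extracted, indent=2)}
</previous_extraction>

A separate validator reported these errors:
<validation_errors>
{error_list}
</validation_errors>

Re-extract carefully, fixing these specific issues."""
        }]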
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            tools=[extraction_tool],
            tool_choice={"type": "tool", "name": "extract_invoice"},
            messages=messages
        )
        tool_use = next(b for b in response.content if b.type == "tool_use")
        extracted = tool_use.input
        errors = validate_extraction(document_text, extracted)
        retry_count += 1
    
    # Pass 3: Confidence scoring
    confidence = score_confidence(extracted, errors)
    
    # Pass 4: Route
    routing = route_for_review(confidence)
    
    return {
        "extracted_data": extracted,
        "validation_errors": errors,
        "confidence_scores": confidence,
        "routing": routing,
        "retries_used": retry_count
    }
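One fragile spot in the pipeline above is that score_confidence takes the text before the first colon of each validator line as the field name. If the validator writes a preamble line, a bullet prefix, or a nested path such as line_items[0].quantity, that text will not match any top-level field. The sketch below shows one possible way to normalize validator lines before scoring; error_fields is a hypothetical helper and the regular expression is an assumption about how the validator tends to format its output.

```python
import re

# Optional bullet prefix, then an identifier at the start of the line.
_FIELD_PREFIX = re.compile(r"\s*(?:[-*]\s*)?([A-Za-z_]\w*)")


def error_fields(validation_errors: list[str], known_fields: set[str]) -> set[str]:
    """Map validator lines to top-level field names, ignoring unrelated lines."""
    fields = set()
    for line in validation_errors:
        match = _FIELD_PREFIX.match(line)
        # Keep the name only if it is a field we actually extracted.
        if match and match.group(1) in known_fields:
            fields.add(match.group(1))
    return fields


validator_lines = [
    "Here are the discrepancies:",
    "- total_amount: extracted 150.00 but document shows 155.00",
    "line_items[0].quantity: extracted 2 but document shows 3",
]
known = {"vendor_name", "invoice_number", "date", "total_amount", "line_items"}

flagged = error_fields(validator_lines, known)
print(sorted(flagged))
```

A more robust alternative is to have the validator report discrepancies through its own forced tool call, so field names arrive as structured data rather than free text.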

Why This Architecture Passes the Exam

Separate sessions for extraction and validation (no reasoning context bias)
Forced tool_choice guarantees structured output
Validation-retry loop with specific error feedback
Programmatic confidence (not self-reported)
Human routing based on structured thresholds
additionalProperties: false prevents schema drift

🎬 Video to Watch

"How We Build Effective Agents" — Barry Zhang (Anthropic), AI Engineer Summit 2025

Search on YouTube: "How We Build Effective Agents Barry Zhang Anthropic" (on the AI Engineer channel)

Barry covers the multi-agent patterns Anthropic uses in production, including the generator-verifier pattern with separate sessions and why simple composable patterns beat complex frameworks. The section on "think like the agent" is especially relevant to understanding why multi-pass review with session isolation works better than single-pass.

Also highly recommended reading: How We Built Our Multi-Agent Research System, Anthropic's engineering blog post on parallel agents in production (it reports a 90.2% improvement over a single-agent baseline on an internal research evaluation).

📖 Reading

Primary: Anthropic Prompt Engineering Overview — Focus on the "Chain complex prompts" section about breaking tasks into subtasks
Deep dive: How We Built Our Multi-Agent Research System — Anthropic's engineering blog on parallel agents with separate contexts
Reference: Anthropic Courses: Structured Outputs Notebook

🛠️ Hands-On Exercise (20-30 minutes)

Build a 3-pass medical record extraction pipeline:

Pass 1 (Extraction): Define a tool schema for extracting patient info (name, DOB, medications, allergies, diagnoses). Use forced tool_choice.
Pass 2 (Validation): In a new API call (simulating a separate session), send the original text + extracted data and ask Claude to list discrepancies.
Pass 3 (Confidence + Routing): Write a Python function that scores each field based on: (a) whether it passed validation, (b) format correctness, (c) cross-field consistency. Route fields below 0.7 confidence to a "human review" queue.

Bonus: Add a retry loop between Pass 1 and Pass 2 — if validation finds errors, re-extract with the error list appended to the prompt.

📝 Quick Quiz

Question 1: A team is building a document extraction pipeline. The extraction agent extracts invoice fields, then in the same conversation, is asked "Now verify your extraction is correct." What is the primary risk?

A. The model will refuse to verify its own work
B. The verification will exceed the context window
C. Reasoning context bias: the model already committed to its extraction and will confirm rather than catch errors
D. The token cost will be too high for production

Question 2: Which approach to confidence scoring is recommended for the Structured Data Extraction scenario?

A. Ask Claude to rate its confidence on a scale of 1-10 for each field
B. Use programmatic criteria: format validation, cross-field consistency, and source location checks
C. Run the extraction 5 times and use majority vote
D. Use the model's logprobs to determine confidence

Question 3: A system tracks overall extraction accuracy at 94% across all document types. Medical records have 62% accuracy while invoices have 99%. What anti-pattern does this demonstrate?

A. Same-session self-review
B. Aggregate accuracy masking per-document-type failures
C. Too many tools per agent
D. Prompt-based enforcement instead of hooks

Answers:

Q1: C — Reasoning context bias. The model's extraction reasoning is in context, making it predisposed to confirm its own choices. The fix: separate sessions.
Q2: B — Programmatic criteria. Self-reported confidence (A) is an explicit anti-pattern. Logprobs (D) aren't reliable for field-level confidence. Majority vote (C) is expensive and doesn't address root cause.
Q3: B — Aggregate accuracy masking category-specific failures. The 94% overall looks good but hides that medical records are failing. The fix: track per-document-type metrics separately.

🔮 Tomorrow's Preview

Day 18 kicks off Domain 5: Context Management & Reliability — we'll tackle context windows, progressive summarization, and the 5 risks of condensing context. This is the domain that separates "toy demos" from "production systems" and is worth 15% of the exam.