CCA-F Study Day 20/20: Human Review, Information Provenance & Final Review

Domain 5: Context Management & Reliability (~15% of exam)

📌 Today's Focus

Congratulations — you've reached the final day of the 20-day study plan! 🎉

Yesterday you mastered escalation patterns and error propagation — the circuit breaker, structured escalation triggers, and cascading failure prevention. Today we close out Domain 5 with three critical concepts: human-in-the-loop patterns, information provenance tracking, and context positioning effects. Then we'll do a comprehensive review of ALL 10 anti-patterns and all 6 exam scenarios — the single most exam-relevant material across the entire certification.

This day is designed to be your "capstone" — tying together everything from the past 19 days into a unified mental model for the exam.

📚 Core Concepts

1. Human-in-the-Loop Patterns

Production agentic systems are NOT fully autonomous — they require strategic human involvement. The exam tests your ability to design these touch-points architecturally, not as afterthoughts.

Five canonical HITL patterns:

Pattern	When Used	Implementation
Approval Gates	Before high-stakes actions (refunds >$500, data deletion)	Hook-based: PreToolUse hook pauses execution, queues for human
Review Loops	Before delivery of customer-facing content	Agent generates → human reviews → approve/reject/edit
Escalation Paths	When agent cannot proceed (policy gap, repeated failure)	Structured triggers based on programmatic criteria
Audit Trails	Post-hoc compliance verification	PostToolUse hooks log every decision with full context
Confidence Thresholds	When extraction quality is uncertain	Programmatic validation checks (NOT self-reported confidence)

Critical exam distinction: "Confidence thresholds" here means programmatic checks, such as a validation schema that rejects partial matches or a business rule that flags amounts below a certain threshold. It does NOT mean asking the model "how confident are you?"; that is anti-pattern #4.
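To make the distinction concrete, here is a minimal sketch of a programmatic confidence check for an extracted invoice record. The field names, the invoice-number pattern, and the `MIN_PLAUSIBLE_TOTAL` rule are illustrative assumptions, not part of any exam material or library.

```python
import re

# Hypothetical rules for an invoice extraction; real rules would come from your own schema.
REQUIRED_FIELDS = ("invoice_number", "customer_name", "total")
INVOICE_NUMBER_PATTERN = re.compile(r"INV-\d{4}-\d{3}")
MIN_PLAUSIBLE_TOTAL = 1.00  # assumed business rule: totals below this get flagged


def validate_extraction(record: dict) -> list[str]:
    """Return the names of failed checks; an empty list means every check passed."""
    failures = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            failures.append(f"missing_field:{field}")
    number = record.get("invoice_number") or ""
    # fullmatch rejects partial matches such as "INV-2024-0"
    if number and not INVOICE_NUMBER_PATTERN.fullmatch(number):
        failures.append("invoice_number_format")
    total = record.get("total")
    if isinstance(total, (int, float)) and total < MIN_PLAUSIBLE_TOTAL:
        failures.append("total_below_threshold")
    return failures


def needs_human_review(record: dict) -> bool:
    """Route to a human when any deterministic check fails."""
    return bool(validate_extraction(record))
```

The decision to involve a human depends only on rules that can be unit-tested; the model is never asked how sure it is.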

2. Information Provenance

As data flows through a multi-agent system, it becomes easy to lose track of where each piece of information came from. Provenance tracking counters this by attaching source metadata to every piece of information.

Why it matters for the exam:

Hallucination detection: If a "fact" has no provenance, it may be hallucinated
Human review efficiency: Reviewers can verify claims by checking sources
Error tracing: When output is wrong, you can trace which tool/source introduced the error
Compliance: Regulated industries need audit trails showing data lineage

# Provenance-tracked extraction result
extraction_result = {
    "field": "customer_name",
    "value": "John Smith",
    "source": "invoice_2024_001.pdf",
    "page": 1,
    "confidence": 0.95,           # From programmatic validation, NOT self-reported
    "extraction_method": "tool_use",
    "verified_by": None,           # Will be set when human reviews
    "retrieved_at": "2026-06-11T10:30:00Z"
}

# Full pipeline provenance
pipeline_result = {
    "final_answer": "Customer has 3 active orders",
    "provenance_chain": [
        {"step": "tool_call", "tool": "search_database", 
         "query": "SELECT COUNT(*) FROM orders WHERE customer_id=123 AND status='active'",
         "timestamp": "2026-06-11T10:30:00Z"},
        {"step": "validation", "check": "result_not_empty", "passed": True},
        {"step": "formatting", "template": "customer_order_count"}
    ]
}
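The hallucination-detection point above can be sketched as a simple filter: any claim that arrives without the expected provenance keys is set aside for review instead of being trusted. The set of required keys and the function names are assumptions chosen to match the example records in this section.

```python
# Keys every claim must carry before it is trusted downstream (an assumed minimum).
REQUIRED_PROVENANCE_KEYS = ("source", "extraction_method", "retrieved_at")


def missing_provenance(claim: dict) -> list[str]:
    """Return the provenance keys that are absent or empty on a claim."""
    return [key for key in REQUIRED_PROVENANCE_KEYS if not claim.get(key)]


def partition_claims(claims: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split claims into (traceable, needs_review) by provenance completeness."""
    traceable, needs_review = [], []
    for claim in claims:
        gaps = missing_provenance(claim)
        if gaps:
            # Keep the claim but record why it was flagged, so a reviewer can act on it
            needs_review.append({**claim, "missing_provenance": gaps})
        else:
            traceable.append(claim)
    return traceable, needs_review


claims = [
    {"field": "customer_name", "value": "John Smith",
     "source": "invoice_2024_001.pdf", "extraction_method": "tool_use",
     "retrieved_at": "2026-06-11T10:30:00Z"},
    {"field": "customer_tier", "value": "gold"},  # no provenance: possibly hallucinated
]
traceable, needs_review = partition_claims(claims)
```

Here `customer_name` lands in `traceable`, while `customer_tier` lands in `needs_review` with all three keys listed as missing.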

3. Context Positioning: Primacy & Recency Effects

Where you place information in the context window directly affects how much attention the model gives it:

Position	Attention Level	Best For
Beginning (system prompt)	🟢 Highest (primacy effect)	Critical rules, persona, constraints, immutable instructions
End (recent turns)	🟢 High (recency effect)	Current task context, latest user instructions
Middle	🔴 Lowest ("lost in the middle")	Reference data, examples — accessed but given less weight

Practical implications:

Put immutable rules in the system prompt — they survive compaction
Put current task requirements in the most recent user message
If a critical instruction was given 20 turns ago, repeat it in the latest message
Use XML tags to delimit critical sections within long contexts so they stand out from surrounding reference material
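As a rough illustration of these placement rules, the sketch below assembles the arguments for a Messages API call: immutable rules go in `system`, and a critical instruction from earlier in the conversation is repeated inside the latest user message. The helper name `build_request` and the `<critical_instruction>` tag are assumptions made for this example; any consistent tag name would do.

```python
from typing import Optional


def build_request(system_rules: str, history: list[dict], latest_user_text: str,
                  critical_reminder: Optional[str] = None) -> dict:
    """Assemble keyword arguments for a Messages API call.

    Immutable rules go in `system` (start of context); the current task and
    any repeated critical instruction go in the final user message (end of context).
    """
    content = latest_user_text
    if critical_reminder:
        # <critical_instruction> is an arbitrary tag name chosen for this sketch
        content = (
            f"<critical_instruction>{critical_reminder}</critical_instruction>\n\n"
            f"{latest_user_text}"
        )
    return {
        "system": system_rules,
        "messages": [*history, {"role": "user", "content": content}],
    }


request = build_request(
    system_rules="Never issue refunds above $500 without human approval.",
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello, how can I help?"}],
    latest_user_text="Please refund order 42.",
    critical_reminder="Reply in English only.",
)
```

The resulting dict can be passed along with a model and token limit, for example `client.messages.create(model=..., max_tokens=..., **request)`.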

4. Extended Thinking & Context

Key facts about thinking tokens and context:

Thinking tokens count toward the context window during generation
Previous thinking blocks are automatically stripped from subsequent turns — they don't accumulate
During tool-use cycles, thinking blocks MUST be preserved until the cycle completes
Thinking tokens are billed once during generation, not carried forward
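The third fact is the one most easily broken in code, so here is a small sketch of continuing a tool-use cycle without dropping thinking blocks. It works on plain dicts shaped like Messages API content blocks; the helper name `append_tool_cycle` and the sample values are assumptions for illustration.

```python
def append_tool_cycle(messages: list[dict], assistant_content: list[dict],
                      tool_results: dict[str, str]) -> list[dict]:
    """Return a new message list that continues a tool-use cycle.

    `assistant_content` is passed back unmodified, so any thinking blocks stay
    attached to the tool_use blocks they preceded.
    `tool_results` maps each tool_use id to its result string.
    """
    tool_use_ids = [b["id"] for b in assistant_content if b["type"] == "tool_use"]
    missing = [i for i in tool_use_ids if i not in tool_results]
    if missing:
        raise ValueError(f"No result supplied for tool_use ids: {missing}")
    return [
        *messages,
        {"role": "assistant", "content": assistant_content},  # thinking kept as-is
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": i, "content": tool_results[i]}
            for i in tool_use_ids
        ]},
    ]


# Placeholder content blocks standing in for a real API response
assistant_content = [
    {"type": "thinking", "thinking": "I should look up the orders.", "signature": "placeholder"},
    {"type": "tool_use", "id": "toolu_01", "name": "search_database", "input": {"customer_id": 123}},
]
messages = append_tool_cycle(
    [{"role": "user", "content": "How many active orders do I have?"}],
    assistant_content,
    {"toolu_01": '{"count": 3}'},
)
```

The key design choice is that the function never filters `assistant_content`: removing the thinking block before sending the tool result is exactly the mistake this fact warns about.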

🚫 Anti-Patterns & Exam Traps — THE COMPLETE LIST

This is the single most important table for exam day. The exam presents these anti-patterns as plausible-sounding wrong answers. Memorize all 10:

#	❌ Anti-Pattern (Wrong Answer)	✅ Correct Approach	Domain
1	Parsing natural language for loop termination	Check stop_reason field ("tool_use" vs "end_turn")	D1
2	Arbitrary iteration caps as primary stopping mechanism	Let loop terminate naturally via stop_reason	D1
3	Prompt-based enforcement for critical business rules	Programmatic hooks (deterministic, can't be bypassed)	D1/D3
4	Self-reported confidence scores for escalation	Structured criteria + programmatic checks	D5
5	Sentiment-based escalation ("customer sounds angry")	Task complexity, policy gaps, financial thresholds	D5
6	Generic error messages ("Operation failed")	Rich errors: errorCategory, isRetryable, context	D2
7	Silently suppressing errors (empty results as success)	Explicitly distinguish failures from empty results	D2
8	Too many tools per agent (18+)	4-5 tools per agent, distributed across subagents	D2
9	Same-session self-review	Separate sessions to avoid reasoning context bias	D1/D4
10	Aggregate accuracy metrics only	Per-document-type accuracy tracking	D4
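Rows 6 and 7 are easier to remember with a concrete shape in mind. The sketch below wraps a query function so that an empty result is reported as a success with zero rows, while an exception becomes a structured error. The `status` field, the "transient"/"permanent" categories, and the helper names are assumptions for this example; only `errorCategory`, `isRetryable`, and `context` come from the table.

```python
def tool_success(rows: list) -> dict:
    """A successful call; an empty `rows` list is a valid answer, not a failure."""
    return {"status": "ok", "rows": rows, "row_count": len(rows)}


def tool_failure(exc: Exception) -> dict:
    """A structured error the agent can reason about, instead of 'Operation failed'."""
    retryable = isinstance(exc, (TimeoutError, ConnectionError))
    return {
        "status": "error",
        "errorCategory": "transient" if retryable else "permanent",
        "isRetryable": retryable,
        "context": f"{type(exc).__name__}: {exc}",
    }


def run_query(query_fn, *args) -> dict:
    """Run a query and never let a failure masquerade as an empty result."""
    try:
        return tool_success(query_fn(*args))
    except Exception as exc:
        return tool_failure(exc)
```

With this shape, "the customer has no orders" and "the database timed out" produce visibly different tool results, so the agent cannot confuse one for the other.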

Exam strategy: When you see a question, first identify which anti-pattern the wrong answers represent. Usually 2-3 of the 4 choices will be recognizable anti-patterns.
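Row 10 of the table is worth one worked example, because the aggregate number can look healthy while one document type is failing. This is a minimal sketch; the record shape (`doc_type`, `correct`) and the sample data are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_type(results: list[dict]) -> dict[str, float]:
    """Compute accuracy per document type from records with `doc_type` and `correct`."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record["doc_type"]] += 1
        hits[record["doc_type"]] += int(record["correct"])
    return {doc_type: hits[doc_type] / totals[doc_type] for doc_type in totals}


# Invented sample: 8 invoices (all correct) and 2 handwritten forms (1 correct)
results = (
    [{"doc_type": "invoice", "correct": True}] * 8
    + [{"doc_type": "handwritten_form", "correct": True},
       {"doc_type": "handwritten_form", "correct": False}]
)
aggregate = sum(r["correct"] for r in results) / len(results)
per_type = accuracy_by_type(results)
```

In this sample the aggregate accuracy is 0.9, yet `per_type` shows handwritten forms at 0.5, which is the failure the aggregate metric hides.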

🎯 All 6 Exam Scenarios — Domain Mapping

The exam gives you 4 of these 6 randomly. For each, know which domains they test:

Scenario	Primary Domains	Key Concepts Tested
1. Customer Support Agent	D1 + D2 + D5	Agentic loop, hooks for compliance, structured escalation (NOT sentiment), MCP tools
2. Code Generation (Claude Code)	D3 + D1	CLAUDE.md hierarchy, plan mode, slash commands, TDD iteration, permissions
3. Multi-Agent Research	D1 + D5	Hub-and-spoke, context isolation, error propagation, separate sessions for verification
4. Developer Productivity	D2 + D3	Built-in tools (Read/Grep/Glob), MCP integration, tool selection logic
5. CI/CD with Claude Code	D3 + D4	-p flag, --output-format json, batch API, multi-pass review (separate sessions!)
6. Structured Data Extraction	D4 + D2	JSON schemas, tool_use for structured output, validation-retry loops, few-shot
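For Scenario 6, the validation-retry loop can be sketched without any API calls. Below, `extract` is a hypothetical callable standing in for a model call that receives the document plus the previous attempt's validation errors; the `REQUIRED` schema and the return shapes are assumptions for this example.

```python
from typing import Callable

REQUIRED = {"customer_name": str, "total": (int, float)}  # assumed target schema


def validate(record: dict) -> list[str]:
    """Return human-readable problems; empty means the record matches the schema."""
    problems = []
    for field, expected in REQUIRED.items():
        if field not in record:
            problems.append(f"missing field '{field}'")
        elif not isinstance(record[field], expected):
            problems.append(f"field '{field}' has wrong type")
    return problems


def extract_with_retry(extract: Callable[[str, list[str]], dict], document: str,
                       max_attempts: int = 3) -> dict:
    """Call `extract`, feeding validation errors back in until the record passes.

    Validation success is the primary stop condition; `max_attempts` is only a
    safety bound that hands the document to a human when it is exhausted.
    """
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        record = extract(document, feedback)
        feedback = validate(record)
        if not feedback:
            return {"status": "ok", "record": record, "attempts": attempt}
    # Retry budget exhausted: surface the failure instead of returning bad data
    return {"status": "needs_review", "errors": feedback, "attempts": max_attempts}
```

Passing the specific validation errors back into the next attempt is what separates this from blind retrying: each retry gives the extractor new information about what was wrong.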

💻 Code Example: Complete Human-in-the-Loop Agent

import json
from anthropic import Anthropic
from datetime import datetime, timezone

client = Anthropic()

# NOTE: execute_tool, classify_error, is_retryable, and get_recovery_suggestion
# are application-specific helpers assumed to be defined elsewhere.


def utc_timestamp() -> str:
    """Current UTC time as an ISO 8601 string with a trailing 'Z'."""
    # datetime.utcnow() is deprecated since Python 3.12; use an aware datetime instead
    return datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")


# === PROVENANCE TRACKING ===
class ProvenanceTracker:
    """Track source of every piece of information through the pipeline."""
    
    def __init__(self):
        self.chain = []
    
    def record(self, step_type: str, details: dict):
        self.chain.append({
            "step": step_type,
            "timestamp": utc_timestamp(),
            **details
        })
    
    def get_chain(self):
        return self.chain


# === HUMAN REVIEW GATE ===
class HumanReviewGate:
    """Approval gate for high-stakes actions."""
    
    # Tools that always need approval. Refunds are gated by amount below,
    # so "issue_refund" must NOT be listed here or the threshold check is dead code.
    HIGH_STAKES_TOOLS = {"delete_account", "modify_subscription"}
    FINANCIAL_THRESHOLD = 500  # dollars
    
    def requires_approval(self, tool_name: str, tool_input: dict) -> bool:
        """Programmatic check, NOT sentiment or self-reported confidence."""
        if tool_name in self.HIGH_STAKES_TOOLS:
            return True
        if tool_name == "issue_refund" and tool_input.get("amount", 0) > self.FINANCIAL_THRESHOLD:
            return True
        return False
    
    def request_approval(self, tool_name: str, tool_input: dict) -> dict:
        """Queue for human approval. In production, this would notify a human."""
        return {
            "status": "pending_approval",
            "action": tool_name,
            "details": tool_input,
            "queued_at": utc_timestamp(),
            "reason": f"Action '{tool_name}' requires human approval"
        }


# === ESCALATION LOGIC ===
class EscalationManager:
    """Structured, programmatic escalation — NEVER sentiment-based."""
    
    def __init__(self):
        self.tool_failures = {}
    
    def check_escalation(self, context: dict) -> tuple[bool, str]:
        """Returns (should_escalate, reason)."""
        
        # ✅ Trigger 1: Repeated tool failure (circuit breaker)
        tool_name = context.get("last_tool_called")
        if tool_name:
            self.tool_failures[tool_name] = self.tool_failures.get(tool_name, 0) + 1
            if self.tool_failures[tool_name] >= 3:
                return True, f"Tool '{tool_name}' failed 3 consecutive times"
        
        # ✅ Trigger 2: Policy gap detected
        if context.get("question_not_in_knowledge_base"):
            return True, "Policy gap: question not covered by knowledge base"
        
        # ✅ Trigger 3: Financial threshold
        if context.get("refund_amount", 0) > 500:
            return True, f"Refund amount ${context['refund_amount']} exceeds threshold"
        
        # ✅ Trigger 4: Explicit user request
        if context.get("user_requested_human"):
            return True, "User explicitly requested human agent"
        
        # ❌ NOT THIS: sentiment-based escalation
        # if detect_sentiment(message) == "angry":
        #     escalate_to_human()  # WRONG!
        
        return False, ""
    
    def record_success(self, tool_name: str):
        """Reset failure counter on success."""
        self.tool_failures[tool_name] = 0


# === MAIN AGENTIC LOOP WITH HITL ===
def run_agent_with_hitl(user_message: str, tools: list):
    provenance = ProvenanceTracker()
    review_gate = HumanReviewGate()
    escalation = EscalationManager()
    
    messages = [{"role": "user", "content": user_message}]
    provenance.record("user_input", {"message": user_message})
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )
    
    # ✅ CORRECT: Use stop_reason for loop termination
    while response.stop_reason == "tool_use":
        # A response can contain several tool_use blocks; each one needs a tool_result
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        
        # Check human approval gate before executing anything from this turn
        for tool_use in tool_uses:
            if review_gate.requires_approval(tool_use.name, tool_use.input):
                approval_result = review_gate.request_approval(tool_use.name, tool_use.input)
                provenance.record("human_gate", {"action": tool_use.name, "status": "queued"})
                return {"status": "awaiting_approval", "details": approval_result, 
                        "provenance": provenance.get_chain()}
        
        tool_results = []
        for tool_use in tool_uses:
            tool_result = {"type": "tool_result", "tool_use_id": tool_use.id}
            
            # Execute tool
            try:
                result = execute_tool(tool_use.name, tool_use.input)
                provenance.record("tool_call", {
                    "tool": tool_use.name, 
                    "input": tool_use.input,
                    "success": True
                })
                escalation.record_success(tool_use.name)
            except Exception as e:
                # ✅ Rich error response
                result = {
                    "is_error": True,
                    "errorCategory": classify_error(e),
                    "isRetryable": is_retryable(e),
                    "context": str(e),
                    "suggestion": get_recovery_suggestion(e)
                }
                tool_result["is_error"] = True  # also flag the failure on the tool_result block
                provenance.record("tool_error", {"tool": tool_use.name, "error": str(e)})
                
                # Check escalation
                should_escalate, reason = escalation.check_escalation({
                    "last_tool_called": tool_use.name
                })
                if should_escalate:
                    provenance.record("escalation", {"reason": reason})
                    return {"status": "escalated", "reason": reason, 
                            "provenance": provenance.get_chain()}
            
            tool_result["content"] = json.dumps(result)
            tool_results.append(tool_result)
        
        # Continue the loop: one user message carrying a result for every tool_use
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
        
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
    
    # Extract final response
    final_text = next((b.text for b in response.content if b.type == "text"), "")
    provenance.record("final_response", {"length": len(final_text)})
    
    return {
        "status": "complete",
        "response": final_text,
        "provenance": provenance.get_chain()
    }

🎬 Video to Watch

Code with Claude London: "Agents That Remember" — This workshop from Anthropic's Code with Claude 2026 London event (May 19) covers persistent memory, session management, and context recovery strategies for long-running agents. Most relevant section: the "Dreaming" batch-consolidation pattern for structuring recall across sessions, which directly relates to today's provenance and context management concepts.

Also worth watching: Code with Claude SF 2026 Opening Keynote — covers Managed Agents and the "lean harness" philosophy of keeping agent loops simple with deterministic guardrails (hooks), directly reinforcing the anti-patterns.

📖 Reading

Primary: Context Windows Documentation — Official Anthropic docs on context management, compaction, and context editing
Secondary: Trustworthy Agents in Practice (April 2026) — Anthropic's framework for responsible agent development, including human oversight patterns
Bonus: Context Engineering Cookbook — Practical compaction, memory, and tool clearing strategies

🛠️ Hands-On Exercise (30 min): "Full Scenario Practice"

For your final exercise, simulate an exam scenario end-to-end:

Pick the Customer Support scenario (it touches the most domains: D1 + D2 + D5)
Design the architecture:
- Draw the agentic loop with stop_reason termination
- Define 4-5 tools with proper descriptions and error schemas
- Add a PreToolUse hook for compliance (blocks PII operations without approval)
- Implement escalation triggers (3 programmatic criteria, NOT sentiment)
- Add provenance tracking to tool results
For each component, identify which anti-pattern the "obvious but wrong" approach would be
Write the escalation logic — use the structured trigger pattern, not sentiment

This exercise synthesizes Days 1, 4, 6, 7, 19, and 20 into a single coherent design.

❓ Quick Quiz

Question 1: A customer support agent needs to determine when to hand off to a human. Which approach is correct?

A) Check the customer's sentiment: if "angry" or "frustrated", escalate
B) Ask Claude to rate its own confidence 1-10: if below 7, escalate
C) Escalate when: tool fails 3x consecutively, policy gap detected, or financial threshold exceeded
D) Set a timer: if the conversation exceeds 5 minutes, escalate

Question 2: In a multi-agent extraction pipeline, an agent returns the customer name "John Smith" from an invoice. What should the tool result include for production reliability?

A) Just the value: {"customer_name": "John Smith"}
B) The value plus a self-reported confidence: {"customer_name": "John Smith", "confidence": 0.9}
C) The value with provenance metadata: source file, page number, extraction method, and retrieval timestamp
D) The value with a natural language explanation of how it was found

Question 3: Where should critical, immutable instructions be placed to survive context compaction in a long-running agent?

A) In the first user message of the conversation
B) In the system prompt
C) Repeated in every tool result
D) In a CLAUDE.md file referenced via @import

Answers:

1. C — Escalation must be based on structured, programmatic criteria. A and B are anti-patterns #5 and #4 respectively. D is an arbitrary cap (anti-pattern #2 variant).

2. C — Information provenance requires source attribution, timestamps, and extraction method. B is an anti-pattern (self-reported confidence). A lacks traceability. D is unstructured.

3. B — The system prompt has the highest priority position (primacy effect) and survives compaction. Early user messages (A) may be summarized away. C wastes context. D is for Claude Code, not API-based agents.

🔮 What's Next

You've completed the 20-day study plan! 🏆

Starting tomorrow, we enter review mode: cross-domain scenario questions, timed practice, and targeted drills on the concepts you found hardest. The exam is scenario-based — so we'll practice identifying which domain(s) each question targets and eliminating anti-pattern distractors.

Your exam readiness checklist:

☐ Can you list all 10 anti-patterns from memory?
☐ Can you map each scenario to its primary domains?
☐ Can you write the canonical agentic loop from memory?
☐ Can you explain why hooks > prompts for enforcement?
☐ Can you describe MCP's 3-layer architecture?

You've built a strong foundation over 20 days. Trust your preparation. 💪