CCA-F Study Day 20/20: Human Review, Information Provenance & Final Review
Domain 5: Context Management & Reliability (~15% of exam)
📌 Today's Focus
Congratulations — you've reached the final day of the 20-day study plan! 🎉
Yesterday you mastered escalation patterns and error propagation — the circuit breaker, structured escalation triggers, and cascading failure prevention. Today we close out Domain 5 with three critical concepts: human-in-the-loop patterns, information provenance tracking, and context positioning effects. Then we'll do a comprehensive review of ALL 10 anti-patterns and all 6 exam scenarios — the single most exam-relevant material across the entire certification.
This day is designed to be your "capstone" — tying together everything from the past 19 days into a unified mental model for the exam.
📚 Core Concepts
1. Human-in-the-Loop Patterns
Production agentic systems are NOT fully autonomous — they require strategic human involvement. The exam tests your ability to design these touch-points architecturally, not as afterthoughts.
Five canonical HITL patterns:
| Pattern | When Used | Implementation |
|---|---|---|
| Approval Gates | Before high-stakes actions (refunds >$500, data deletion) | Hook-based: PreToolUse hook pauses execution, queues for human |
| Review Loops | Before delivery of customer-facing content | Agent generates → human reviews → approve/reject/edit |
| Escalation Paths | When agent cannot proceed (policy gap, repeated failure) | Structured triggers based on programmatic criteria |
| Audit Trails | Post-hoc compliance verification | PostToolUse hooks log every decision with full context |
| Confidence Thresholds | When extraction quality is uncertain | Programmatic validation checks (NOT self-reported confidence) |
Critical exam distinction: "Confidence thresholds" here means programmatic checks, such as a validation schema that rejects partial matches or a business rule that flags amounts above a set threshold. It does NOT mean asking the model "how confident are you?"; that is the anti-pattern.
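As a rough illustration of a programmatic confidence check, the sketch below validates an extraction against fixed rules and returns the reasons it should be routed to a human. The field names, the invoice-number format, and the 10,000 review threshold are assumptions made up for this example; they do not come from the exam guide or any library.

```python
import re

# Hypothetical rules for an invoice-extraction pipeline (illustrative only).
REQUIRED_FIELDS = ("invoice_number", "customer_name", "total")
INVOICE_NUMBER_PATTERN = re.compile(r"INV-\d{4}-\d{3}")
REVIEW_AMOUNT_THRESHOLD = 10_000  # assumed business rule


def needs_human_review(extraction: dict) -> list[str]:
    """Return the reasons an extraction should go to a human (empty list = pass)."""
    reasons = []
    for field in REQUIRED_FIELDS:
        if extraction.get(field) in (None, ""):
            reasons.append(f"missing field: {field}")
    number = extraction.get("invoice_number") or ""
    # fullmatch rejects partial matches such as "INV-2024-0"
    if number and not INVOICE_NUMBER_PATTERN.fullmatch(number):
        reasons.append("invoice_number failed format check")
    total = extraction.get("total")
    if isinstance(total, (int, float)) and total > REVIEW_AMOUNT_THRESHOLD:
        reasons.append("total exceeds review threshold")
    return reasons


clean = {"invoice_number": "INV-2024-001", "customer_name": "John Smith", "total": 120.0}
print(needs_human_review(clean))  # no reasons, so no review needed
```

The decision depends only on the data and fixed rules, so the same input always produces the same routing, which is what separates it from a self-reported score.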
2. Information Provenance
As data flows through multi-agent systems, you lose track of where information came from. Provenance tracking combats this by attaching metadata to every piece of information.
Why it matters for the exam:
- Hallucination detection: If a "fact" has no provenance, it may be hallucinated
- Human review efficiency: Reviewers can verify claims by checking sources
- Error tracing: When output is wrong, you can trace which tool/source introduced the error
- Compliance: Regulated industries need audit trails showing data lineage
```python
# Provenance-tracked extraction result
extraction_result = {
    "field": "customer_name",
    "value": "John Smith",
    "source": "invoice_2024_001.pdf",
    "page": 1,
    "confidence": 0.95,  # From programmatic validation, NOT self-reported
    "extraction_method": "tool_use",
    "verified_by": None,  # Will be set when human reviews
    "retrieved_at": "2026-06-11T10:30:00Z",
}

# Full pipeline provenance
pipeline_result = {
    "final_answer": "Customer has 3 active orders",
    "provenance_chain": [
        {
            "step": "tool_call",
            "tool": "search_database",
            "query": "SELECT COUNT(*) FROM orders WHERE customer_id=123 AND status='active'",
            "timestamp": "2026-06-11T10:30:00Z",
        },
        {"step": "validation", "check": "result_not_empty", "passed": True},
        {"step": "formatting", "template": "customer_order_count"},
    ],
}
```
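The hallucination-detection point can be sketched as a filter that separates traceable records from ones with missing provenance. The set of required keys and both function names are assumptions chosen to match the record shape shown above, not a standard.

```python
# Keys a record must carry to count as traceable (assumed for this sketch).
REQUIRED_PROVENANCE_KEYS = ("source", "extraction_method", "retrieved_at")


def missing_provenance(record: dict) -> list[str]:
    """List the provenance keys that are absent or empty on a record."""
    return [key for key in REQUIRED_PROVENANCE_KEYS if not record.get(key)]


def split_by_provenance(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate records into (traceable, suspect) based on their provenance."""
    traceable, suspect = [], []
    for record in records:
        (suspect if missing_provenance(record) else traceable).append(record)
    return traceable, suspect


records = [
    {"field": "customer_name", "value": "John Smith",
     "source": "invoice_2024_001.pdf", "extraction_method": "tool_use",
     "retrieved_at": "2026-06-11T10:30:00Z"},
    {"field": "vat_number", "value": "GB123456789"},  # no provenance at all
]
traceable, suspect = split_by_provenance(records)
print([r["field"] for r in suspect])  # fields that should be verified first
```

A record landing in `suspect` is not proven wrong; it is simply the first place a reviewer should look.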
3. Context Positioning: Primacy & Recency Effects
Where you place information in the context window directly affects how much attention the model gives it:
| Position | Attention Level | Best For |
|---|---|---|
| Beginning (system prompt) | 🟢 Highest (primacy effect) | Critical rules, persona, constraints, immutable instructions |
| End (recent turns) | 🟢 High (recency effect) | Current task context, latest user instructions |
| Middle | 🔴 Lowest ("lost in the middle") | Reference data, examples — accessed but given less weight |
Practical implications:
- Put immutable rules in the system prompt — they survive compaction
- Put current task requirements in the most recent user message
- If a critical instruction was given 20 turns ago, repeat it in the latest message
- Use XML tags to visually highlight critical sections within long contexts
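These placement rules can be sketched as a small request builder: immutable rules go into the system prompt, and the current task plus any repeated older instructions go into the newest user message. The function name, tag names, and sample strings are illustrative assumptions. The returned keys match the `system` and `messages` parameters of the Messages API, and the sketch assumes `history` ends with an assistant turn so that roles still alternate.

```python
def build_request(system_rules: list[str], history: list[dict],
                  task: str, reminders: list[str]) -> dict:
    """Place immutable rules first and current requirements last."""
    # Beginning of context: critical rules, wrapped in a tag so they stand out
    rule_lines = "\n".join(f"- {rule}" for rule in system_rules)
    system = f"<critical_rules>\n{rule_lines}\n</critical_rules>"

    # End of context: the current task, plus older instructions repeated
    latest = task
    if reminders:
        reminder_lines = "\n".join(f"- {item}" for item in reminders)
        latest += f"\n\n<reminder>\n{reminder_lines}\n</reminder>"

    messages = history + [{"role": "user", "content": latest}]
    return {"system": system, "messages": messages}


request = build_request(
    system_rules=["Never issue refunds above $500 without approval"],
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello, how can I help?"}],
    task="Summarize the open tickets for customer 123.",
    reminders=["Reply in formal English (requested earlier in the conversation)"],
)
print(request["messages"][-1]["content"])
```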
4. Extended Thinking & Context
Key facts about thinking tokens and context:
- Thinking tokens count toward the context window during generation
- Previous thinking blocks are automatically stripped from subsequent turns — they don't accumulate
- During tool-use cycles, thinking blocks MUST be preserved until the cycle completes
- Thinking tokens are billed once during generation, not carried forward
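The tool-use rule can be sketched as a helper that appends the assistant turn unchanged, thinking block included, followed by one `tool_result` per `tool_use` block. The block shapes follow the Messages API content-block format as I understand it. The helper name, IDs, and signature value are made up for illustration.

```python
def append_tool_cycle(messages: list[dict], assistant_content: list[dict],
                      tool_outputs: dict[str, str]) -> list[dict]:
    """Return messages extended with the assistant turn and its tool results.

    assistant_content is passed back exactly as received: thinking blocks are
    not filtered out while the tool-use cycle is still in progress.
    """
    tool_use_ids = [b["id"] for b in assistant_content if b["type"] == "tool_use"]
    missing = [i for i in tool_use_ids if i not in tool_outputs]
    if missing:
        raise ValueError(f"No output for tool_use ids: {missing}")
    results = [
        {"type": "tool_result", "tool_use_id": i, "content": tool_outputs[i]}
        for i in tool_use_ids
    ]
    return messages + [
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": results},
    ]


# Placeholder blocks; a real response carries a server-generated signature.
assistant_content = [
    {"type": "thinking", "thinking": "Need the order status.", "signature": "sig-placeholder"},
    {"type": "tool_use", "id": "toolu_01", "name": "lookup_order", "input": {"order_id": "A1"}},
]
updated = append_tool_cycle(
    [{"role": "user", "content": "Where is my order?"}],
    assistant_content,
    {"toolu_01": "shipped"},
)
print([block["type"] for block in updated[1]["content"]])
```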
🚫 Anti-Patterns & Exam Traps — THE COMPLETE LIST
This is the single most important table for exam day. The exam presents these anti-patterns as plausible-sounding wrong answers. Memorize all 10:
| # | ❌ Anti-Pattern (Wrong Answer) | ✅ Correct Approach | Domain |
|---|---|---|---|
| 1 | Parsing natural language for loop termination | Check stop_reason field ("tool_use" vs "end_turn") | D1 |
| 2 | Arbitrary iteration caps as primary stopping mechanism | Let loop terminate naturally via stop_reason | D1 |
| 3 | Prompt-based enforcement for critical business rules | Programmatic hooks (deterministic, can't be bypassed) | D1/D3 |
| 4 | Self-reported confidence scores for escalation | Structured criteria + programmatic checks | D5 |
| 5 | Sentiment-based escalation ("customer sounds angry") | Task complexity, policy gaps, financial thresholds | D5 |
| 6 | Generic error messages ("Operation failed") | Rich errors: errorCategory, isRetryable, context | D2 |
| 7 | Silently suppressing errors (empty results as success) | Explicitly distinguish failures from empty results | D2 |
| 8 | Too many tools per agent (18+) | 4-5 tools per agent, distributed across subagents | D2 |
| 9 | Same-session self-review | Separate sessions to avoid reasoning context bias | D1/D4 |
| 10 | Aggregate accuracy metrics only | Per-document-type accuracy tracking | D4 |
Exam strategy: When you see a question, first identify which anti-pattern the wrong answers represent. Usually 2-3 of the 4 choices will be recognizable anti-patterns.
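Anti-pattern 10 is easiest to see with numbers. The sketch below uses invented evaluation results (90 invoices, 10 contracts) to show how a healthy aggregate score can hide a failing document type. The data and function name are illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_type(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute accuracy separately for each document type."""
    totals, correct = defaultdict(int), defaultdict(int)
    for doc_type, is_correct in results:
        totals[doc_type] += 1
        correct[doc_type] += int(is_correct)
    return {doc_type: correct[doc_type] / totals[doc_type] for doc_type in totals}


# Invented results: invoices extract perfectly, contracts mostly fail.
results = ([("invoice", True)] * 90
           + [("contract", True)] * 4
           + [("contract", False)] * 6)

aggregate = sum(ok for _, ok in results) / len(results)
per_type = accuracy_by_type(results)
print(aggregate)  # 0.94
print(per_type)   # {'invoice': 1.0, 'contract': 0.4}
```

The aggregate of 94% looks acceptable, while the per-type view shows contracts at 40%, which is the signal an aggregate-only metric hides.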
🎯 All 6 Exam Scenarios — Domain Mapping
The exam gives you 4 of these 6 randomly. For each, know which domains they test:
| Scenario | Primary Domains | Key Concepts Tested |
|---|---|---|
| 1. Customer Support Agent | D1 + D2 + D5 | Agentic loop, hooks for compliance, structured escalation (NOT sentiment), MCP tools |
| 2. Code Generation (Claude Code) | D3 + D1 | CLAUDE.md hierarchy, plan mode, slash commands, TDD iteration, permissions |
| 3. Multi-Agent Research | D1 + D5 | Hub-and-spoke, context isolation, error propagation, separate sessions for verification |
| 4. Developer Productivity | D2 + D3 | Built-in tools (Read/Grep/Glob), MCP integration, tool selection logic |
| 5. CI/CD with Claude Code | D3 + D4 | -p flag, --output-format json, batch API, multi-pass review (separate sessions!) |
| 6. Structured Data Extraction | D4 + D2 | JSON schemas, tool_use for structured output, validation-retry loops, few-shot |
💻 Code Example: Complete Human-in-the-Loop Agent
```python
import json
from datetime import datetime, timezone

from anthropic import Anthropic

client = Anthropic()

# NOTE: execute_tool, classify_error, is_retryable and get_recovery_suggestion
# are application-specific helpers that are not defined in this listing.


def utc_now() -> str:
    """ISO-8601 UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)."""
    return datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")


# === PROVENANCE TRACKING ===
class ProvenanceTracker:
    """Track source of every piece of information through the pipeline."""

    def __init__(self):
        self.chain = []

    def record(self, step_type: str, details: dict):
        self.chain.append({"step": step_type, "timestamp": utc_now(), **details})

    def get_chain(self):
        return self.chain


# === HUMAN REVIEW GATE ===
class HumanReviewGate:
    """Approval gate for high-stakes actions."""

    # Tools that always need approval, whatever their input.
    # issue_refund is deliberately NOT listed here: listing it would make the
    # amount check below unreachable and send every refund to a human.
    HIGH_STAKES_TOOLS = {"delete_account", "modify_subscription"}
    FINANCIAL_THRESHOLD = 500  # dollars

    def requires_approval(self, tool_name: str, tool_input: dict) -> bool:
        """Programmatic check, NOT sentiment or self-reported confidence."""
        if tool_name in self.HIGH_STAKES_TOOLS:
            return True
        if tool_name == "issue_refund" and tool_input.get("amount", 0) > self.FINANCIAL_THRESHOLD:
            return True
        return False

    def request_approval(self, tool_name: str, tool_input: dict) -> dict:
        """Queue for human approval. In production, this would notify a human."""
        return {
            "status": "pending_approval",
            "action": tool_name,
            "details": tool_input,
            "queued_at": utc_now(),
            "reason": f"Action '{tool_name}' requires human approval",
        }


# === ESCALATION LOGIC ===
class EscalationManager:
    """Structured, programmatic escalation. NEVER sentiment-based."""

    MAX_CONSECUTIVE_FAILURES = 3
    REFUND_THRESHOLD = 500  # dollars

    def __init__(self):
        self.tool_failures = {}

    def record_failure(self, tool_name: str):
        """Count a failure. Called only when a tool call actually fails."""
        self.tool_failures[tool_name] = self.tool_failures.get(tool_name, 0) + 1

    def record_success(self, tool_name: str):
        """Reset failure counter on success."""
        self.tool_failures[tool_name] = 0

    def check_escalation(self, context: dict) -> tuple[bool, str]:
        """Returns (should_escalate, reason). Reads counters, never changes them."""
        # ✅ Trigger 1: Repeated tool failure (circuit breaker)
        tool_name = context.get("last_tool_called")
        if tool_name and self.tool_failures.get(tool_name, 0) >= self.MAX_CONSECUTIVE_FAILURES:
            return True, f"Tool '{tool_name}' failed {self.MAX_CONSECUTIVE_FAILURES} consecutive times"
        # ✅ Trigger 2: Policy gap detected
        if context.get("question_not_in_knowledge_base"):
            return True, "Policy gap: question not covered by knowledge base"
        # ✅ Trigger 3: Financial threshold
        if context.get("refund_amount", 0) > self.REFUND_THRESHOLD:
            return True, f"Refund amount ${context['refund_amount']} exceeds threshold"
        # ✅ Trigger 4: Explicit user request
        if context.get("user_requested_human"):
            return True, "User explicitly requested human agent"
        # ❌ NOT THIS: sentiment-based escalation
        # if detect_sentiment(message) == "angry":
        #     escalate_to_human()  # WRONG!
        return False, ""


# === MAIN AGENTIC LOOP WITH HITL ===
def run_agent_with_hitl(user_message: str, tools: list):
    provenance = ProvenanceTracker()
    review_gate = HumanReviewGate()
    escalation = EscalationManager()
    messages = [{"role": "user", "content": user_message}]
    provenance.record("user_input", {"message": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )

    # ✅ CORRECT: Use stop_reason for loop termination
    while response.stop_reason == "tool_use":
        tool_results = []
        # One response can hold several tool_use blocks; each needs a tool_result
        for tool_use in (b for b in response.content if b.type == "tool_use"):
            # Check human approval gate
            if review_gate.requires_approval(tool_use.name, tool_use.input):
                approval_result = review_gate.request_approval(tool_use.name, tool_use.input)
                provenance.record("human_gate", {"action": tool_use.name, "status": "queued"})
                return {"status": "awaiting_approval", "details": approval_result,
                        "provenance": provenance.get_chain()}

            # Execute tool
            try:
                result = execute_tool(tool_use.name, tool_use.input)
                is_error = False
                provenance.record("tool_call", {
                    "tool": tool_use.name,
                    "input": tool_use.input,
                    "success": True,
                })
                escalation.record_success(tool_use.name)
            except Exception as e:
                # ✅ Rich error response
                result = {
                    "errorCategory": classify_error(e),
                    "isRetryable": is_retryable(e),
                    "context": str(e),
                    "suggestion": get_recovery_suggestion(e),
                }
                is_error = True
                escalation.record_failure(tool_use.name)
                provenance.record("tool_error", {"tool": tool_use.name, "error": str(e)})

            # Check escalation
            should_escalate, reason = escalation.check_escalation({
                "last_tool_called": tool_use.name
            })
            if should_escalate:
                provenance.record("escalation", {"reason": reason})
                return {"status": "escalated", "reason": reason,
                        "provenance": provenance.get_chain()}

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": json.dumps(result),
                "is_error": is_error,  # flags the failure on the tool_result itself
            })

        # Continue the loop
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )

    # Extract final response
    final_text = next((b.text for b in response.content if b.type == "text"), "")
    provenance.record("final_response", {"length": len(final_text)})
    return {
        "status": "complete",
        "response": final_text,
        "provenance": provenance.get_chain(),
    }
```
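The loop above calls four helpers that it never defines: `execute_tool`, `classify_error`, `is_retryable`, and `get_recovery_suggestion`. One possible minimal shape for them is sketched below. The `ToolError` class, the registry, the `lookup_order` tool, and the category names are all assumptions for illustration; a real system would route to its own tools and error taxonomy.

```python
class ToolError(Exception):
    """Hypothetical error type that carries its own category and retry hint."""

    def __init__(self, message: str, category: str, retryable: bool):
        super().__init__(message)
        self.category = category
        self.retryable = retryable


# Hypothetical registry mapping tool names to plain Python callables.
TOOL_REGISTRY = {
    "lookup_order": lambda tool_input: {"order_id": tool_input["order_id"], "status": "shipped"},
}


def execute_tool(name: str, tool_input: dict):
    """Dispatch to a registered tool, failing loudly on unknown names."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        raise ToolError(f"Unknown tool: {name}", category="validation", retryable=False)
    return handler(tool_input)


def classify_error(exc: Exception) -> str:
    if isinstance(exc, ToolError):
        return exc.category
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "transient"
    return "unknown"


def is_retryable(exc: Exception) -> bool:
    if isinstance(exc, ToolError):
        return exc.retryable
    return isinstance(exc, (TimeoutError, ConnectionError))


def get_recovery_suggestion(exc: Exception) -> str:
    if is_retryable(exc):
        return "Retry the call after a short delay."
    return "Do not retry; correct the input or escalate."


print(execute_tool("lookup_order", {"order_id": "A1"}))
```

Keeping classification separate from execution lets the loop return the same structured error shape regardless of which tool failed.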
🎬 Video to Watch
Code with Claude London: "Agents That Remember" — This workshop from Anthropic's Code with Claude 2026 London event (May 19) covers persistent memory, session management, and context recovery strategies for long-running agents. Most relevant section: the "Dreaming" batch-consolidation pattern for structuring recall across sessions, which directly relates to today's provenance and context management concepts.
Also worth watching: Code with Claude SF 2026 Opening Keynote — covers Managed Agents and the "lean harness" philosophy of keeping agent loops simple with deterministic guardrails (hooks), directly reinforcing the anti-patterns.
📖 Reading
- Primary: Context Windows Documentation — Official Anthropic docs on context management, compaction, and context editing
- Secondary: Trustworthy Agents in Practice (April 2026) — Anthropic's framework for responsible agent development, including human oversight patterns
- Bonus: Context Engineering Cookbook — Practical compaction, memory, and tool clearing strategies
🛠️ Hands-On Exercise (30 min): "Full Scenario Practice"
For your final exercise, simulate an exam scenario end-to-end:
- Pick the Customer Support scenario (it touches the most domains: D1 + D2 + D5)
- Design the architecture:
  - Draw the agentic loop with stop_reason termination
  - Define 4-5 tools with proper descriptions and error schemas
  - Add a PreToolUse hook for compliance (blocks PII operations without approval)
  - Implement escalation triggers (3 programmatic criteria, NOT sentiment)
  - Add provenance tracking to tool results
- For each component, identify which anti-pattern the "obvious but wrong" approach would be
- Write the escalation logic — use the structured trigger pattern, not sentiment
This exercise synthesizes Days 1, 4, 6, 7, 19, and 20 into a single coherent design.
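For the compliance step of this exercise, the decision logic of a pre-execution check could be sketched as below. This is a plain-Python stand-in for the hook's logic, not the actual hook interface. The tool names, the `approval_id` field, and the return shape are assumptions for illustration.

```python
# Hypothetical tools that touch personally identifiable information.
PII_TOOLS = {"export_customer_data", "update_contact_details"}


def pre_tool_use_check(tool_name: str, tool_input: dict, approved_ids: set[str]) -> dict:
    """Decide deterministically whether a tool call may run."""
    if tool_name not in PII_TOOLS:
        return {"decision": "allow"}
    if tool_input.get("approval_id") in approved_ids:
        return {"decision": "allow"}
    return {
        "decision": "block",
        "reason": f"'{tool_name}' touches PII and has no recorded approval",
    }


approved = {"APR-001"}
print(pre_tool_use_check("lookup_order", {"order_id": "A1"}, approved))
print(pre_tool_use_check("export_customer_data", {"customer_id": 123}, approved))
```

Because the check runs in code before the tool executes, prompt wording cannot talk the agent past it, which is the property anti-pattern 3 is about.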
❓ Quick Quiz
Question 1: A customer support agent needs to determine when to hand off to a human. Which approach is correct?
A) Check the customer's sentiment: if "angry" or "frustrated", escalate
B) Ask Claude to rate its own confidence 1-10: if below 7, escalate
C) Escalate when: tool fails 3x consecutively, policy gap detected, or financial threshold exceeded
D) Set a timer: if the conversation exceeds 5 minutes, escalate
Question 2: In a multi-agent extraction pipeline, an agent returns the customer name "John Smith" from an invoice. What should the tool result include for production reliability?
A) Just the value: {"customer_name": "John Smith"}
B) The value plus a self-reported confidence: {"customer_name": "John Smith", "confidence": 0.9}
C) The value with provenance metadata: source file, page number, extraction method, and retrieval timestamp
D) The value with a natural language explanation of how it was found
Question 3: Where should critical, immutable instructions be placed to survive context compaction in a long-running agent?
A) In the first user message of the conversation
B) In the system prompt
C) Repeated in every tool result
D) In a CLAUDE.md file referenced via @import
Answers:
1. C — Escalation must be based on structured, programmatic criteria. A and B are anti-patterns #5 and #4 respectively. D is an arbitrary cap (anti-pattern #2 variant).
2. C — Information provenance requires source attribution, timestamps, and extraction method. B is an anti-pattern (self-reported confidence). A lacks traceability. D is unstructured.
3. B — The system prompt has the highest priority position (primacy effect) and survives compaction. Early user messages (A) may be summarized away. C wastes context. D is for Claude Code, not API-based agents.
🔮 What's Next
You've completed the 20-day study plan! 🏆
Starting tomorrow, we enter review mode: cross-domain scenario questions, timed practice, and targeted drills on the concepts you found hardest. The exam is scenario-based — so we'll practice identifying which domain(s) each question targets and eliminating anti-pattern distractors.
Your exam readiness checklist:
- ☐ Can you list all 10 anti-patterns from memory?
- ☐ Can you map each scenario to its primary domains?
- ☐ Can you write the canonical agentic loop from memory?
- ☐ Can you explain why hooks > prompts for enforcement?
- ☐ Can you describe MCP's 3-layer architecture?
You've built a strong foundation over 20 days. Trust your preparation. 💪