
Hallucination Guard — 4-Layer AI Fabrication Defense

Detect and prevent AI agent hallucinations during task execution. Use when: (1) an agent claims to have created files, commits, or artifacts — verify them, (...
scytheshan-pixel
Uncategorized · clawhub · v1.0.0 · 1 version · Key: not required

Overview

Hallucination Guard

4-layer defense against agent fabrication. Each layer is independent — use one or combine.

When Hallucinations Happen

Highest risk conditions (apply more layers when these are present):

  • Extended sessions (>50 turns or >30 min of continuous work)
  • Tasks involving file creation, code, git, or data analysis
  • Agent reporting quantitative results (numbers, metrics, PnL)
  • Multiple sequential "successes" with no errors or retries

Layer 0: Context Hygiene (Prevention)

Reduce hallucination probability before it starts.

For long tasks (>10 steps):

  1. Break into segments of ≤8 steps each
  2. Between segments: flush working state to a file, reload from file (not from in-context memory)
  3. Start each segment by reading the state file; never trust carried-over context for facts
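
The segment-and-flush steps above can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the function names, the JSON format, and the state file name are all assumptions chosen for the example.

```python
import json
from pathlib import Path


def split_into_segments(steps: list[str], max_len: int = 8) -> list[list[str]]:
    """Break a long task into segments of at most max_len steps."""
    return [steps[i:i + max_len] for i in range(0, len(steps), max_len)]


def flush_state(state: dict, path: Path) -> None:
    """Write working state to disk at the end of a segment."""
    # Write to a temporary file and rename it, so a crash mid-write
    # does not leave a half-written state file behind.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(path)


def load_state(path: Path) -> dict:
    """Read state back from disk at the start of the next segment."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A segment would end with `flush_state(state, Path("state.json"))` and the next one would begin with `state = load_state(Path("state.json"))`, so facts always come from the file rather than from carried-over context.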

For data-intensive tasks:

  • Load source data from files at point of use, not from earlier context
  • If a number was mentioned 20+ turns ago, re-read the source before citing it
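
As a rough sketch of the two rules above, the helpers below read a figure from its source at the moment it is needed and decide when a previously mentioned number is too old to cite from memory. The names `row_count` and `must_reread` are hypothetical, and the CSV is assumed to have a single header row.

```python
import csv
from pathlib import Path


def row_count(csv_path: Path) -> int:
    """Count data rows by reading the file now, not from earlier context."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to exist)
        return sum(1 for _ in reader)


def must_reread(mentioned_turn: int, current_turn: int, max_age: int = 20) -> bool:
    """True when a number was mentioned max_age or more turns ago."""
    return current_turn - mentioned_turn >= max_age
```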

Cost: Zero. This is a workflow discipline, not an API call.

Layer 1: Claim-Evidence Protocol (Detection)

Every agent claim of physical action must include tool-verified evidence.

The Rule

CLAIM:    "I created/modified/committed X"
EVIDENCE: Tool output proving X exists and matches the claim
STATUS:   VERIFIED (evidence confirms) or UNVERIFIED (no evidence yet)

Verification Commands by Claim Type

| Claim | Verify With |
|---|---|
| Created file | ls -la {path} && head -20 {path} |
| Modified file | grep -n '{expected_content}' {path} |
| Git commit | git log --oneline -3 |
| Git push | git log --oneline origin/{branch} -3 |
| Ran tests | Show actual test output (pass AND fail counts) |
| API response | Show raw response body |
| Data analysis | Show wc -l of source + sample rows |
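
For agents that verify from Python rather than a shell, the two file checks above might look like the sketch below. The function names are hypothetical, the non-empty requirement is an added assumption, and the content check is a plain substring match rather than the regular-expression match that grep performs.

```python
from pathlib import Path


def verify_created(path: str) -> str:
    """Return VERIFIED only if the file exists and is non-empty (assumed rule)."""
    p = Path(path)
    if p.is_file() and p.stat().st_size > 0:
        return "VERIFIED"
    return "UNVERIFIED"


def verify_modified(path: str, expected_content: str) -> str:
    """Return VERIFIED only if expected_content appears in the file."""
    p = Path(path)
    if not p.is_file():
        return "UNVERIFIED"
    text = p.read_text(encoding="utf-8", errors="replace")
    return "VERIFIED" if expected_content in text else "UNVERIFIED"
```

Both functions return the STATUS values from The Rule, so the result can be attached directly to the claim.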

Red Flags (claim likely fabricated)

  • Claim references a file but no read/exec tool was called
  • Suspiciously specific figures (187 trades, +$126.50) with no cited source
  • "All tests passed" with no test output shown
  • Multiple consecutive successes with zero errors
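
Two of these red flags can be checked mechanically. The sketch below is one possible shape, assuming the caller has already extracted the paths a claim references and collected the raw tool outputs; the `Claim` type and `red_flags` function are hypothetical, and the substring checks are deliberately crude.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    paths: list[str] = field(default_factory=list)  # files the claim references


def red_flags(claim: Claim, tool_outputs: list[str]) -> list[str]:
    """Return a list of red-flag descriptions for one claim."""
    flags = []
    joined = "\n".join(tool_outputs)
    # Flag: the claim references a file that no tool output mentions.
    for path in claim.paths:
        if path not in joined:
            flags.append(f"no tool output mentions {path}")
    # Flag: test success is claimed but no output was captured at all.
    if "tests passed" in claim.text.lower() and not tool_outputs:
        flags.append("test success claimed with no test output")
    return flags
```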

Cost: ~50 tokens per claim. One exec call per physical claim.

Layer 2: Cross-Model Audit (Verification)

Spawn a second agent (different model) to independently verify claims.

When to Use

  • Critical outputs: financial reports, deployment decisions, data analysis
  • When L1 evidence exists but numbers need independent validation
  • After any task where the agent reported unusually perfect results

How to Run

See references/audit-prompt.md for the spawn template.

Key principles:

  1. Auditor receives ONLY the evidence (files, outputs) — not the original agent's conclusions
  2. Auditor independently extracts facts from evidence and compares to claims
  3. Auditor uses the cheapest model that can do the verification (flash for file checks, sonnet for logic)
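
The comparison in principle 2 could be sketched as below. This is not the template from references/audit-prompt.md; it assumes the auditor has already extracted numeric facts from the evidence into a dictionary, and it adds a MISMATCH status (not part of The Rule in Layer 1) to separate contradicted claims from claims with no evidence.

```python
def compare_claims(
    claimed: dict[str, float],
    extracted: dict[str, float],
    tol: float = 1e-9,
) -> dict[str, str]:
    """Compare the agent's claimed figures with figures the auditor extracted."""
    report = {}
    for key, value in claimed.items():
        if key not in extracted:
            report[key] = "UNVERIFIED"   # no evidence for this claim
        elif abs(extracted[key] - value) <= tol:
            report[key] = "VERIFIED"     # evidence confirms the claim
        else:
            report[key] = "MISMATCH"     # evidence contradicts the claim
    return report
```

Only `extracted` should come from the auditor's own reading of the evidence; passing the original agent's conclusions into the extraction step would defeat principle 1.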

Cost: 1 subagent spawn. Use flash/gemini for simple checks (~$0.001). Reserve sonnet/opus for complex logic verification.

Layer 3: Drift Detection (Monitoring)

Monitor long-running agent tasks for hallucination patterns.

When to Use

  • Tasks expected to take >15 minutes
  • Agent is working autonomously (coding agent, research agent)
  • High-stakes tasks where undetected fabrication causes real damage

Setup

See references/drift-monitor.md for implementation.

Core signals:

  • Claim/Tool Ratio: If claims > 3× tool calls → alert
  • Zero-Error Streak: 8+ consecutive "successes" with 0 errors → suspicious
  • Phantom References: Agent references files/branches never created → critical alert
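
The three signals can be sketched over a simple event log. This is an illustration under stated assumptions, not the implementation in references/drift-monitor.md: the `Event` shape is hypothetical, a "success" is counted as any claim with no intervening error, and a file counts as created only if a tool call in this log created it (so pre-existing files would need to be seeded separately).

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str              # "claim", "tool_call", or "error"
    ref: str = ""          # file or branch the event touches, if any
    creates: bool = False  # True when a tool call creates `ref`


def drift_alerts(events: list[Event]) -> list[str]:
    alerts = []
    # Signal 1: claims outnumber tool calls by more than 3x.
    claims = sum(e.kind == "claim" for e in events)
    tools = sum(e.kind == "tool_call" for e in events)
    if claims > 3 * tools:
        alerts.append("claim/tool ratio")
    # Signal 2: 8 or more consecutive claims with no error in between.
    streak = longest = 0
    for e in events:
        if e.kind == "error":
            streak = 0
        elif e.kind == "claim":
            streak += 1
            longest = max(longest, streak)
    if longest >= 8:
        alerts.append("zero-error streak")
    # Signal 3: a claim references something no earlier tool call created.
    created = set()
    for e in events:
        if e.kind == "tool_call" and e.creates:
            created.add(e.ref)
        elif e.kind == "claim" and e.ref and e.ref not in created:
            alerts.append(f"phantom reference: {e.ref}")
    return alerts
```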

Cost: Periodic check via sessions_history. No extra model calls unless alert triggers.

Choosing Layers

| Scenario | Recommended |
|---|---|
| Quick file creation | L1 only |
| Data report from CSV | L0 + L1 |
| Multi-step coding task | L0 + L1 + L2 |
| Autonomous long-running agent | All four layers |
| Routine conversation | None needed |

Integration with Other Skills

  • War Room: Add L1 verification to each agent's output (verify cited data)
  • Coding agents: Wrap with L3 drift monitor for long sessions
  • Any task with sessions_spawn: Add L2 audit as a final verification step

Version History

1 version total

  • v1.0.0 (current)
    2026-05-12 06:12
