← 返回
数据分析

Incident Fupan (事故复盘) — Structured Root Cause Analysis

事故复盘 / Incident Fupan — structured root cause analysis for production failures, outages, bugs, and near-misses. Use when: (1) 事故复盘 or incident review is need...
事故复盘 / Incident Fupan — structured root cause analysis for production failures, outages, bugs, and near-misses. Use when: (1) 事故复盘 or incident review is need...
scytheshan-pixel
数据分析 clawhub v1.0.0 1 版本 100000 Key: 无需
★ 0
Stars
📥 747
下载
💾 17
安装
1
版本
#latest

概述

Incident Fupan (事故复盘)

Structured root cause analysis and prevention protocol for production incidents.

Language Rule

Report language matches the user's language. Chinese request → Chinese report. English → English.

When to Trigger

  • Production failure, outage, or unexpected loss
  • Agent mistake with real consequences (wrong deployment, fabricated output acted upon)
  • Near-miss that could have caused damage
  • Periodic review of past incidents to extract patterns

Postmortem Flow

Step 1: Gather Evidence

Before writing anything, collect raw evidence. Do NOT rely on memory or assumptions.

Required evidence (use tools to retrieve):

  • Logs: exec("grep -i 'error\|fatal\|exception' {logfile} | tail -50")
  • Git state: exec("git log --oneline -10"), exec("git diff HEAD~1 --stat")
  • Service state: exec("systemctl status {service}"), exec("ps aux | grep {process}")
  • Data files: read any CSVs, configs, or state files involved

Rule: Every factual claim in the postmortem must cite a source. Format: [Source: {filepath}:{line} or {command output}]

If evidence is unavailable (logs rotated, service restarted), explicitly mark: [Evidence unavailable: {reason}]

Step 2: Build Timeline

Construct a minute-by-minute (or hour-by-hour) timeline from evidence:

HH:MM UTC — {what happened} [Source: {evidence}]
HH:MM UTC — {what happened} [Source: {evidence}]
...
HH:MM UTC — Incident resolved / mitigated

Include: first symptom, detection, escalation, diagnosis, fix, verification.

Mark the detection gap: time between first symptom and human awareness. This is often the real problem.

Step 3: 5 Whys Root Cause

Drill down from symptom to root cause:

Why 1: {symptom happened} → Because {direct cause}
Why 2: {direct cause} → Because {deeper cause}
Why 3: {deeper cause} → Because {systemic issue}
Why 4: {systemic issue} → Because {process/design gap}
Why 5: {process/design gap} → Because {root cause}

Stop when you reach something you can change. If you reach "the model hallucinated" — that's not actionable. Go deeper: why was the output trusted without verification? Why was there no checkpoint?

Step 4: Write Report

Save to: ~/incidents/INC{NNN}_{TOPIC}_{YYYYMMDD}.md

Create ~/incidents/ if it doesn't exist.

See references/report-template.md for the full template with all required sections.

Report must include all 8 sections:

  1. Header (ID, severity, date, duration, status)
  2. Executive Summary (3 sentences max)
  3. Timeline with sources
  4. Impact Assessment (quantified: time lost, money lost, trust lost)
  5. 5 Whys Root Cause
  6. Fix (what was done immediately)
  7. Prevention (new rules, checks, or automation to prevent recurrence)
  8. Action Items (P0/P1/P2 with owners and deadlines)

Step 5: Extract Defensive Rules

The most valuable output of a postmortem is new rules that prevent recurrence.

Pattern: Incident → Rule → Enforcement mechanism

Examples from real incidents:

  • Unauthorized deployment → "All deploys require explicit human approval" → Gate in CI/CD
  • Fabricated data report → "All data claims must cite source file + row count" → L1 in Hallucination Guard
  • Session bloat causing errors → "Compact at 80% context usage" → Automated monitoring

See references/patterns.md for a library of incident-to-rule patterns.

Step 6: Reply and Store

  1. Save report file to ~/incidents/
  2. Reply to user with the full report
  3. Store key lessons to long-term memory
  4. If applicable: update AGENTS.md, TOOLS.md, or relevant skill with new rules

Severity Levels

LevelCriteriaResponse Time
--------------------------------
SEV1Money lost, data corrupted, security breachImmediate
SEV2Service down, wrong actions taken from bad dataWithin 1 hour
SEV3Degraded performance, near-miss, wasted time >2hWithin 24 hours
SEV4Minor issue, caught before impactNext convenient time

Integration

  • Hallucination Guard: Incidents caused by agent fabrication → add to L1/L3 detection rules
  • War Room: After postmortem, use War Room to evaluate proposed prevention strategy
  • Postmortem → new rules → enforcement → fewer incidents → feedback loop

References

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-03-30 07:51 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

Hallucination Guard — 4-Layer AI Fabrication Defense

scytheshan-pixel
检测并阻止 AI 代理在任务执行中的幻觉。适用场景:(1)代理声称已创建文件、提交或产物 — 验证它们,(
★ 0 📥 174
data-analysis

A股量化 AkShare

mbpz
A股量化数据分析工具,基于AkShare库获取A股行情、财务数据、板块信息等。用于回答关于A股股票查询、行情数据、财务分析、选股等问题。
★ 162 📥 59,666
data-analysis

Data Analysis

ivangdavila
{"answer":"数据分析与可视化。查询数据库、生成报告、自动化电子表格,将原始数据转化为清晰可行的见解。适用于:(1) 您……"}
★ 198 📥 64,850