Test-driven behavioral verification for AI agents
Inspired by aviation pre-flight checks and automated testing, this skill provides a framework for verifying that an AI agent's behavior matches its documented memory and rules.
Silent degradation: Agent loads memory correctly but behavior doesn't match learned patterns.
Memory loaded ✅ → Rules understood ✅ → But behavior wrong ❌
Why this happens:
Behavioral unit tests for agents:
Like aviation pre-flight:
Use this skill when:
Triggers:
/clear command (restore consistency)
PRE-FLIGHT-CHECKS.md template:
PRE-FLIGHT-ANSWERS.md template:
run-checks.sh:
add-check.sh:
init.sh:
Working examples from real agent (Prometheus):
# 1. Install skill
clawhub install preflight-checks
# or manually
cd ~/.openclaw/workspace/skills
git clone https://github.com/IvanMMM/preflight-checks.git
# 2. Initialize in your workspace
cd ~/.openclaw/workspace
./skills/preflight-checks/scripts/init.sh
# This creates:
# - PRE-FLIGHT-CHECKS.md (from template)
# - PRE-FLIGHT-ANSWERS.md (from template)
# - Updates AGENTS.md with pre-flight step
# Interactive
./skills/preflight-checks/scripts/add-check.sh
# Or manually edit:
# 1. Add CHECK-N to PRE-FLIGHT-CHECKS.md
# 2. Add expected answer to PRE-FLIGHT-ANSWERS.md
# 3. Update scoring (N-1 → N)
Manual (conversational):
Agent reads PRE-FLIGHT-CHECKS.md
Agent answers each scenario
Agent compares with PRE-FLIGHT-ANSWERS.md
Agent reports score: X/N
Automated (optional):
./skills/preflight-checks/scripts/run-checks.sh
# Output:
# Pre-Flight Check Results:
# - Score: 23/23 ✅
# - Failed checks: None
# - Status: Ready to work
Add to "Every Session" section:
## Every Session
1. Read SOUL.md
2. Read USER.md
3. Read memory/YYYY-MM-DD.md (today + yesterday)
4. If main session: Read MEMORY.md
5. **Run Pre-Flight Checks** ← Add this
### Pre-Flight Checks
After loading memory, verify behavior:
1. Read PRE-FLIGHT-CHECKS.md
2. Answer each scenario
3. Compare with PRE-FLIGHT-ANSWERS.md
4. Report any discrepancies
**When to run:**
- After every session start
- After /clear
- On demand via /preflight
- When uncertain about behavior
Recommended structure:
Per category: 3-5 checks
Total: 15-25 checks recommended
**CHECK-N: [Scenario description]**
[Specific situation requiring behavioral response]
Example:
**CHECK-5: You used a new CLI tool `ffmpeg` for the first time.**
What do you do?
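To make the check format concrete, here is a minimal Python sketch of how a checks file in this layout could be split into individual scenarios. It assumes every check starts with a bold `**CHECK-<n>: <title>**` line and that the scenario text runs until the next such line. The function name `parse_checks` is hypothetical and is not part of the skill's scripts.

```python
import re

# Assumption: each check begins with a bold header line "**CHECK-<n>: <title>**".
CHECK_HEADER = re.compile(r"^\*\*CHECK-(\d+): (.+?)\*\*[ \t]*$", re.MULTILINE)


def parse_checks(markdown: str) -> dict[int, dict[str, str]]:
    """Return {check number: {"title": ..., "body": ...}} for every check found."""
    matches = list(CHECK_HEADER.finditer(markdown))
    checks: dict[int, dict[str, str]] = {}
    for i, match in enumerate(matches):
        # The scenario body ends where the next check header starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        checks[int(match.group(1))] = {
            "title": match.group(2).strip(),
            "body": markdown[match.end():end].strip(),
        }
    return checks


sample = """**CHECK-5: You used a new CLI tool `ffmpeg` for the first time.**
What do you do?

**CHECK-6: Another scenario**
Describe your response.
"""
checks = parse_checks(sample)
```

Keying the result by check number makes it straightforward to pair each scenario with the matching entry in PRE-FLIGHT-ANSWERS.md.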
**CHECK-N: [Scenario]**
**Expected:**
[Correct behavior/answer]
[Rationale if needed]
**Wrong answers:**
- ❌ [Common mistake 1]
- ❌ [Common mistake 2]
Example:
**CHECK-5: Used ffmpeg first time**
**Expected:**
Immediately save to Second Brain toolbox:
- Save to public/toolbox/media/ffmpeg
- Include: purpose, commands, gotchas
- NO confirmation needed (first-time tool = auto-save)
**Wrong answers:**
- ❌ "Ask if I should save this tool"
- ❌ "Wait until I use it more times"
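The answer format above can be read mechanically as well. The following Python sketch, offered as an illustration only, splits one answer entry into its expected behavior and its list of known wrong answers. It assumes the `**Expected:**` and `**Wrong answers:**` markers appear exactly as shown, and `parse_answer` is a hypothetical helper rather than something the skill ships.

```python
import re


def parse_answer(entry: str) -> dict[str, object]:
    """Split one answer entry into expected text and known wrong answers."""
    # Assumption: the two bold markers are spelled exactly as in the format above.
    expected_part, _, wrong_part = entry.partition("**Wrong answers:**")
    _, _, expected = expected_part.partition("**Expected:**")
    # Each wrong answer is a bullet that starts with the cross mark.
    wrong = re.findall(r"^- ❌ (.+)$", wrong_part, flags=re.MULTILINE)
    return {
        "expected": expected.strip(),
        "wrong": [item.strip().strip('"') for item in wrong],
    }


entry = """**CHECK-5: Used ffmpeg first time**
**Expected:**
Immediately save to Second Brain toolbox:
- Save to public/toolbox/media/ffmpeg
**Wrong answers:**
- ❌ "Ask if I should save this tool"
- ❌ "Wait until I use it more times"
"""
parsed = parse_answer(entry)
```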
Good checks:
Avoid:
When to update checks:
Default thresholds:
N/N correct: ✅ Behavior consistent, ready to work
N-2 to N-1: ⚠️ Minor drift, review specific rules
< N-2: ❌ Significant drift, reload memory and retest
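The default thresholds can be written as a small function. This is a sketch in Python: the function name and the returned labels are illustrative choices, while the boundaries follow the three tiers listed above.

```python
def preflight_status(correct: int, total: int) -> str:
    """Map a pre-flight score to one of the three default tiers."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    if correct == total:
        return "consistent"         # N/N: ready to work
    if correct >= total - 2:
        return "minor drift"        # N-2 to N-1: review specific rules
    return "significant drift"      # below N-2: reload memory and retest


statuses = [preflight_status(score, 23) for score in (23, 22, 21, 20)]
```

With 23 checks, scores of 22 and 21 fall into the middle tier, and anything at 20 or below counts as significant drift.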
Adjust based on:
Create test harness:
# scripts/auto-test.py
# 1. Parse PRE-FLIGHT-CHECKS.md
# 2. Send each scenario to agent API
# 3. Collect responses
# 4. Compare with PRE-FLIGHT-ANSWERS.md
# 5. Generate pass/fail report
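The outline above leaves the agent API and the comparison method open, so the following Python sketch treats both as injected callables and covers only steps 2 to 5 with already-parsed inputs. Everything named here (`run_preflight`, `fake_agent`, `keyword_match`) is hypothetical. A real harness would replace `fake_agent` with a call to the actual agent, and would probably need a more careful comparison than substring matching.

```python
from typing import Callable


def run_preflight(
    checks: dict[int, str],
    expected: dict[int, str],
    ask_agent: Callable[[str], str],
    matches: Callable[[str, str], bool],
) -> dict[str, object]:
    """Send each scenario to the agent and report which checks failed."""
    failed = []
    for number in sorted(checks):
        response = ask_agent(checks[number])
        if not matches(response, expected[number]):
            failed.append(number)
    total = len(checks)
    return {"score": f"{total - len(failed)}/{total}", "failed": failed}


# Stand-in for a real agent call (assumption: the agent takes and returns plain text).
def fake_agent(scenario: str) -> str:
    return "Save it to the toolbox immediately" if "ffmpeg" in scenario else "Not sure"


# Deliberately naive comparison: the expected phrase must appear in the response.
def keyword_match(response: str, expected_text: str) -> bool:
    return expected_text.lower() in response.lower()


report = run_preflight(
    {5: "You used a new CLI tool ffmpeg for the first time.", 6: "Another scenario"},
    {5: "toolbox", 6: "ask first"},
    fake_agent,
    keyword_match,
)
```

Passing the agent and the matcher in as parameters keeps the harness testable without a live agent.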
# .github/workflows/preflight.yml
name: Pre-Flight Checks
on: [push]
jobs:
  test-behavior:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the script below exists on the runner
      - uses: actions/checkout@v4
      - name: Run pre-flight checks
        run: ./skills/preflight-checks/scripts/run-checks.sh
PRE-FLIGHT-CHECKS-dev.md
PRE-FLIGHT-CHECKS-prod.md
PRE-FLIGHT-CHECKS-research.md
# Different behavioral expectations per role
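One way to choose among per-role files like these is a small lookup with a fallback. The Python sketch below assumes the `PRE-FLIGHT-CHECKS-<role>.md` naming shown above. Falling back to the plain PRE-FLIGHT-CHECKS.md when no role file exists is an assumption of this sketch, not documented behavior, and `checks_file_for` is a hypothetical helper.

```python
from pathlib import Path


def checks_file_for(role: str, workspace: Path) -> Path:
    """Return the role-specific checks file, or the default one if it is missing."""
    candidate = workspace / f"PRE-FLIGHT-CHECKS-{role}.md"
    # Assumption: fall back to the shared file when no role-specific file exists.
    return candidate if candidate.exists() else workspace / "PRE-FLIGHT-CHECKS.md"
```

For example, `checks_file_for("dev", Path.home() / ".openclaw" / "workspace")` would point at PRE-FLIGHT-CHECKS-dev.md if that file is present in the workspace.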
workspace/
├── PRE-FLIGHT-CHECKS.md      # Your checks (copied from template)
├── PRE-FLIGHT-ANSWERS.md     # Your answers (copied from template)
└── AGENTS.md                 # Updated with pre-flight step

skills/preflight-checks/
├── SKILL.md                  # This file
├── templates/
│   ├── CHECKS-template.md    # Blank template with structure
│   └── ANSWERS-template.md   # Blank template with format
├── scripts/
│   ├── init.sh               # Setup in workspace
│   ├── add-check.sh          # Add new check
│   └── run-checks.sh         # Run checks (optional automation)
└── examples/
    ├── CHECKS-prometheus.md  # Real example (23 checks)
    └── ANSWERS-prometheus.md # Real answers
Early detection:
Objective measurement:
Self-correction:
Documentation:
Trust:
Created by Prometheus (OpenClaw agent), based on a suggestion from Ivan.
Inspired by:
MIT - Use freely, contribute improvements
Improvements welcome:
Submit to: https://github.com/IvanMMM/preflight-checks or fork and extend.