Model Tester

Overview

Use scripts/model_tester.py to run repeatable test prompts and compare requested vs actual model usage from OpenClaw logs.

Run

From the skill directory (or pass absolute paths):

python3 scripts/model_tester.py --agent menial --case extract-emails
python3 scripts/model_tester.py --model openai/gpt-4.1 --case math-reasoning
python3 scripts/model_tester.py --agent chat --model openai/gpt-4.1 --case all --out /tmp/model-test.json

Inputs

--agent : Target agent (chat, menial, coder, etc.)
--model : Requested model alias/name to test
--case : Case from references/test-cases.json or all
--timeout : Per-case timeout (default 120)
--out : Optional JSON output file

At least one of --agent or --model is required.
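The flags above, including the "at least one of --agent or --model" rule, could be parsed roughly as follows. This is a minimal sketch rather than the actual contents of scripts/model_tester.py: the function names build_parser and parse_args are hypothetical, and treating --case as required with an integer --timeout is an assumption (the source only states the default of 120).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (names from the Inputs list)."""
    parser = argparse.ArgumentParser(prog="model_tester.py")
    parser.add_argument("--agent", help="Target agent (chat, menial, coder, etc.)")
    parser.add_argument("--model", help="Requested model alias/name to test")
    # Assumption: --case is required; the source does not state a default.
    parser.add_argument("--case", required=True,
                        help="Case from references/test-cases.json, or 'all'")
    parser.add_argument("--timeout", type=int, default=120,
                        help="Per-case timeout (default 120)")
    parser.add_argument("--out", help="Optional JSON output file")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and enforce that --agent or --model is present."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not (args.agent or args.model):
        # parser.error prints the usage text and exits with status 2.
        parser.error("at least one of --agent or --model is required")
    return args
```

Doing the check after parsing keeps both flags individually optional while still rejecting a call that supplies neither.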

What the runner does

Load test cases from references/test-cases.json.
Start openclaw logs --follow --json in parallel.
Run openclaw agent --json with a bounded test prompt (the prompt asks the agent to delegate the task to a subagent).
Parse the agent response and the tailed logs.
Emit machine-readable JSON and a short human summary.
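The "tail logs in parallel, then run the agent" steps above might look like the sketch below. It is written against generic command lists so it makes no claims about OpenClaw's CLI beyond what this document states; the helper name run_with_log_tail and its return shape are hypothetical.

```python
import subprocess
import threading


def run_with_log_tail(agent_cmd, logs_cmd, timeout=120):
    """Run agent_cmd while collecting stdout lines emitted by logs_cmd."""
    log_lines = []
    # Start the log follower first so no lines from the run are missed.
    tail = subprocess.Popen(logs_cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, text=True)

    def pump():
        # Ends when the follower's stdout closes (after terminate()).
        for line in tail.stdout:
            log_lines.append(line.rstrip("\n"))

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    try:
        proc = subprocess.run(agent_cmd, capture_output=True,
                              text=True, timeout=timeout)
        status = "ok" if proc.returncode == 0 else "error"
        stdout, stderr = proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        status, stdout, stderr = "error", "", f"timed out after {timeout}s"
    finally:
        # Always stop the follower, even on timeout.
        tail.terminate()
        tail.wait()
        reader.join(timeout=5)
    return status, stdout, stderr, log_lines
```

In the tester, logs_cmd would presumably be ["openclaw", "logs", "--follow", "--json"] and agent_cmd would start with ["openclaw", "agent", "--json"]; the remaining agent arguments are not given here, so they are left out.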

Output format

Top-level JSON:

tool
timestamp
agent
requested_model
results[]

Each result entry contains:

test_case
agent
requested_model
actual_model (parsed from logs when available)
status (ok/error)
result_summary
runtime_seconds
tokens (when discoverable)
errors[]
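For reference, a report with the fields listed above could be assembled like this. It is only a sketch: the helper names make_result and make_report, the "model_tester" value for tool, and the ISO 8601 UTC timestamp format are assumptions, since the source lists the keys but not their exact values.

```python
import json
from datetime import datetime, timezone


def make_result(test_case, agent, requested_model, actual_model=None,
                status="ok", result_summary="", runtime_seconds=0.0,
                tokens=None, errors=None):
    """Build one results[] entry; actual_model and tokens may be None."""
    return {
        "test_case": test_case,
        "agent": agent,
        "requested_model": requested_model,
        "actual_model": actual_model,   # parsed from logs when available
        "status": status,               # "ok" or "error"
        "result_summary": result_summary,
        "runtime_seconds": runtime_seconds,
        "tokens": tokens,               # when discoverable
        "errors": list(errors or []),
    }


def make_report(agent, requested_model, results):
    """Build the top-level JSON object."""
    return {
        "tool": "model_tester",  # assumed value, not confirmed by the source
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
        "agent": agent,
        "requested_model": requested_model,
        "results": results,
    }


if __name__ == "__main__":
    entry = make_result("math-reasoning", "chat", "openai/gpt-4.1")
    print(json.dumps(make_report("chat", "openai/gpt-4.1", [entry]), indent=2))
```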

Privacy & Safety

The tester spawns isolated subagent tasks with predefined test prompts only; no user data is passed to models. It tails OpenClaw logs to extract:

which model was actually selected (routing validation)
token usage statistics
runtime metrics

Log extraction uses regex patterns to find model and token fields. Only structured fields related to the test execution are captured; no personally identifiable information or arbitrary log content is kept.
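Regex-based extraction of this kind could be sketched as below. The field names ("model", "input_tokens", "output_tokens", "total_tokens") are hypothetical stand-ins, because the real log keys vary by OpenClaw and provider version; the point is that only matched fields are kept and the rest of each line is discarded.

```python
import re

# Hypothetical field names; real OpenClaw log keys may differ.
MODEL_RE = re.compile(r'"model"\s*:\s*"([^"]+)"')
TOKENS_RE = re.compile(
    r'"(input_tokens|output_tokens|total_tokens)"\s*:\s*(\d+)')


def extract_fields(log_lines):
    """Return (actual_model, tokens) from log lines, or None when absent."""
    actual_model = None
    tokens = {}
    for line in log_lines:
        match = MODEL_RE.search(line)
        if match:
            actual_model = match.group(1)  # the last matching line wins
        for name, value in TOKENS_RE.findall(line):
            tokens[name] = int(value)
    # Nothing else from the line is stored, only the matched fields.
    return actual_model, (tokens or None)
```

Returning None for missing values matches the best-effort nature of the extraction: a result can still be reported when the logs carry no model or token fields.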

Notes

Model extraction and token extraction are best-effort because log fields may vary by OpenClaw/provider version.
If the openclaw config is invalid or the gateway is unavailable, the script returns status=error with stderr details.
Edit references/test-cases.json to add custom prompts for your benchmark set.
All test cases are generic; no workspace or user data is baked in.

Version history

1 version in total

v1.0.0 (current)

2026-03-31 01:27