Use scripts/model_tester.py to run repeatable test prompts and compare requested vs actual model usage from OpenClaw logs.
From the skill directory (or pass absolute paths):
python3 scripts/model_tester.py --agent menial --case extract-emails
python3 scripts/model_tester.py --model openai/gpt-4.1 --case math-reasoning
python3 scripts/model_tester.py --agent chat --model openai/gpt-4.1 --case all --out /tmp/model-test.json
--agent : Target agent (chat, menial, coder, etc.)
--model : Requested model alias/name to test
--case : Case from references/test-cases.json or all
--timeout : Per-case timeout (default 120)
--out : Optional JSON output file
Require at least one of --agent or --model.
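The flag rules above can be sketched with argparse. This is a minimal illustration, not the actual contents of scripts/model_tester.py: the function names build_parser and parse_args, the integer type for --timeout, and the "all" default for --case are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="model_tester.py")
    parser.add_argument("--agent", help="Target agent (chat, menial, coder, etc.)")
    parser.add_argument("--model", help="Requested model alias/name to test")
    # Assumption: the default for --case is not documented; "all" is a guess.
    parser.add_argument("--case", default="all",
                        help="Case from references/test-cases.json or all")
    # Assumption: the timeout is parsed as an integer.
    parser.add_argument("--timeout", type=int, default=120,
                        help="Per-case timeout (default 120)")
    parser.add_argument("--out", help="Optional JSON output file")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Documented rule: at least one of --agent or --model must be given.
    if not (args.agent or args.model):
        parser.error("require at least one of --agent or --model")
    return args


if __name__ == "__main__":
    print(parse_args(["--agent", "menial", "--case", "extract-emails"]))
```

Checking the rule after parsing keeps both flags optional on their own, and parser.error exits with the usual argparse usage message when neither is supplied.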
Test cases are loaded from references/test-cases.json.
openclaw logs --follow --json runs in parallel.
openclaw agent --json is run with a bounded test prompt (the prompt asks the agent to use a subagent for the task).
Top-level JSON:
tool
timestamp
agent
requested_model
results[]
Each result entry returns:
test_case
agent
requested_model
actual_model (parsed from logs when available)
status (ok/error)
result_summary
runtime_seconds
tokens (when discoverable)
errors[]
The tester spawns isolated subagent tasks with predefined test prompts only; no user data is passed to models. It tails OpenClaw logs to extract the model and token fields for each test run.
Log extraction uses regex patterns to find model/token fields. No personally identifiable information or arbitrary log content is captured — only structured fields related to the test execution.
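As a rough sketch of regex-based extraction that keeps only model and token fields, consider the following. The patterns, the extract_fields helper, and the sample log lines are all hypothetical; the real field names in OpenClaw logs may differ.

```python
import re

# Hypothetical patterns: the actual log field names are an assumption.
MODEL_RE = re.compile(r'"model"\s*:\s*"([^"]+)"')
TOKENS_RE = re.compile(r'"(?:total_)?tokens"\s*:\s*(\d+)')


def extract_fields(log_lines):
    """Return the last model and token count seen; nothing else is kept."""
    actual_model = None
    tokens = None
    for line in log_lines:
        model_match = MODEL_RE.search(line)
        if model_match:
            actual_model = model_match.group(1)
        tokens_match = TOKENS_RE.search(line)
        if tokens_match:
            tokens = int(tokens_match.group(1))
    return {"actual_model": actual_model, "tokens": tokens}


# Made-up log lines for illustration only.
sample = [
    '{"level":"info","msg":"run started","model":"openai/gpt-4.1"}',
    '{"level":"info","msg":"run finished","tokens":321}',
]
print(extract_fields(sample))
```

Because only the captured groups are stored, the rest of each log line is discarded, which matches the stated goal of capturing structured fields rather than arbitrary log content.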
If openclaw config is invalid or the gateway is unavailable, the script returns status=error with stderr details.
Edit references/test-cases.json to add custom prompts for your benchmark set.
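One way to compare requested and actual models in a saved report is sketched below. It assumes the report uses the field names listed above and that errors[] holds printable values; the summarize helper and the sample report are hypothetical, and the plain string comparison does not resolve model aliases.

```python
def summarize(report):
    """Build one verdict line per result entry in a report dict."""
    lines = []
    for result in report.get("results", []):
        requested = result.get("requested_model")
        actual = result.get("actual_model")
        if result.get("status") == "error":
            details = "; ".join(str(e) for e in result.get("errors", []))
            verdict = "error: " + details
        elif actual is None:
            verdict = "unknown (actual model not found in logs)"
        elif requested is None:
            verdict = "no model requested (used " + actual + ")"
        elif actual == requested:
            verdict = "match"
        else:
            verdict = "mismatch (requested " + requested + ", got " + actual + ")"
        lines.append(str(result.get("test_case")) + ": " + verdict)
    return lines


# Made-up report for illustration; real output may carry more fields.
sample_report = {
    "requested_model": "openai/gpt-4.1",
    "results": [
        {"test_case": "extract-emails", "requested_model": "openai/gpt-4.1",
         "actual_model": "openai/gpt-4.1", "status": "ok", "errors": []},
        {"test_case": "math-reasoning", "requested_model": "openai/gpt-4.1",
         "actual_model": None, "status": "error",
         "errors": ["gateway unavailable"]},
    ],
}
for line in summarize(sample_report):
    print(line)
```

To run this against a real report, load the file written with --out (for example with json.load) and pass the resulting dict to summarize.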