Measure and evaluate ExpertPack quality. Companion to the core expertpack skill.
Note: This skill makes external API calls to OpenRouter for blind probing and LLM-as-judge scoring. Requires an API key.
Blind-probe frontier models to measure what percentage of a pack's propositions they cannot answer without the pack loaded:
python3 {skill_dir}/scripts/eval-ek.py <pack-path> [--models model1,model2] [--sample N] [--output FILE]
Requires the OPENROUTER_API_KEY environment variable.

Interpretation:
| EK Ratio | Meaning |
|---|---|
| 0.80+ | Exceptional — almost entirely esoteric |
| 0.60–0.79 | Strong — majority esoteric |
| 0.40–0.59 | Mixed — significant GK padding |
| 0.20–0.39 | Weak — most content already in weights |
| < 0.20 | Minimal value-add |
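The bands above can be applied programmatically, for example when gating a release on pack quality. The following is a minimal sketch, not part of the skill's scripts: the function name `interpret_ek_ratio` is hypothetical, and it assumes each band's lower bound is inclusive, so a value such as 0.795 falls in the "Strong" band.

```python
def interpret_ek_ratio(ratio: float) -> str:
    """Map an EK ratio in [0, 1] to the band label from the table above."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError(f"EK ratio must be between 0 and 1, got {ratio}")
    # Lower bounds are treated as inclusive (an assumption; the table
    # lists two-decimal ranges and leaves values like 0.795 unassigned).
    bands = [
        (0.80, "Exceptional"),
        (0.60, "Strong"),
        (0.40, "Mixed"),
        (0.20, "Weak"),
    ]
    for lower_bound, label in bands:
        if ratio >= lower_bound:
            return label
    return "Minimal value-add"


print(interpret_ek_ratio(0.72))
```

With the example ratio of 0.72 used in this document, the function returns the "Strong" band.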
Add measured ratio to manifest.yaml:
ek_ratio:
  value: 0.72
  measured: "2026-03-12"
  models: ["gpt-4.1-mini", "claude-sonnet-4-6", "gemini-2.0-flash"]
  propositions_tested: 142
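If you prefer to generate this entry rather than type it by hand, a sketch along the following lines may help. It is an illustration only: the helper `ek_ratio_block` is hypothetical, it assumes the ratio is the number of propositions the models could not answer divided by the number tested, and the count of 102 is an illustrative value chosen to reproduce the 0.72 example, not a measured result. How `eval-ek.py` aggregates results across models is not specified here.

```python
import json
from datetime import date


def ek_ratio_block(unanswered: int, tested: int,
                   models: list[str], measured: date) -> str:
    """Render an ek_ratio manifest entry as YAML text (hypothetical helper)."""
    if tested <= 0 or not 0 <= unanswered <= tested:
        raise ValueError("need tested > 0 and 0 <= unanswered <= tested")
    # Assumed definition: share of tested propositions the models missed.
    ratio = round(unanswered / tested, 2)
    lines = [
        "ek_ratio:",
        f"  value: {ratio:.2f}",
        f'  measured: "{measured.isoformat()}"',
        # json.dumps yields a double-quoted list, which is valid YAML flow syntax.
        f"  models: {json.dumps(models)}",
        f"  propositions_tested: {tested}",
    ]
    return "\n".join(lines) + "\n"


# Illustrative inputs only; 102 of 142 rounds to 0.72.
print(ek_ratio_block(
    unanswered=102,
    tested=142,
    models=["gpt-4.1-mini", "claude-sonnet-4-6", "gemini-2.0-flash"],
    measured=date(2026, 3, 12),
))
```

The output can be pasted into manifest.yaml. Writing the YAML as plain text keeps the sketch free of third-party dependencies.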
Automated eval against a pack-powered agent endpoint:
python3 {skill_dir}/scripts/run-eval.py \
--questions <eval-set.yaml> \
--endpoint <ws://host:port/path> \
--output <results.yaml> \
--label "baseline"
Learn more: expertpack.ai · GitHub