
SOTA Agent

SOTA Agent is a public ClawHub SOTA-campaign skill for CV and DS work. Use it when the user says "sota agent", "state of the art benchmark scouting", or needs the latest benchmarks.
zack-dev-cm
Uncategorized · clawhub · v1.4.4 · 3 versions · 99863.4 · Key: not required
Stars: 2 · Downloads: 691 · Installs: 2
#ablation-planning #agentic-workflows #benchmarking #claim-review #colab #computer-vision #data-science #gpu-training #kaggle #latest #mlops #openclaw #reproducibility

Overview

SOTA Agent

Search intent: sota agent, state of the art benchmark scouting, cv benchmark campaign, gpu vm research workflow

Goal

Turn a vague "beat the benchmark" request into a disciplined campaign:

  • fixed target metric and split
  • explicit literature and leaderboard snapshot
  • bounded reproduction plan
  • explicit handoff to the separate execution lane when runs need external tools
  • evidence requirements that can be reviewed without relying on live session state
  • ablations that answer one question at a time
  • promotion only when the claim survives review

This skill is the frontier-planning and candidate-selection layer.

For execution artifacts or promotion evidence, pair it with data-science-cv-repro-lab; this skill stays focused on planning and claim review.
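
As an illustration of the "fixed target metric and split" goal, the sketch below freezes a benchmark contract in an immutable record so the split or metric cannot drift silently mid-campaign. The class name, field names, and placeholder values are assumptions made for this example; they are not the schema written by the bundled scripts.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class BenchmarkContract:
    """Hypothetical frozen claim target; not the skill's own file format."""
    task: str
    dataset: str
    metric: str
    split: str
    target_score: float
    baseline_score: float
    higher_is_better: bool = True

    def gain_over_baseline(self, score: float) -> float:
        """Signed improvement over the trusted baseline; positive means better."""
        delta = score - self.baseline_score
        return delta if self.higher_is_better else -delta


# Placeholder values, for illustration only.
contract = BenchmarkContract(
    task="image-classification",
    dataset="example-dataset",
    metric="top1_accuracy",
    split="val",
    target_score=0.80,
    baseline_score=0.76,
)

try:
    contract.split = "test"  # attempting to change the split after freezing
except FrozenInstanceError:
    print("contract is frozen; open a new campaign to change the split")
```

A frozen dataclass is only one option; a read-only JSON file checked into the campaign workspace would serve the same purpose.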

Use This Skill When

  • the user wants a CV or DS system pushed toward state-of-the-art results
  • the task involves reproducing or surpassing recent papers
  • the workflow needs paper triage, leaderboard tracking, or claim review
  • the workflow needs a clean handoff to an execution skill after the benchmark contract is frozen
  • the user needs experiment management across local runs, notebooks, and long-running jobs
  • the question is whether execution evidence supports a SOTA candidate
  • the question is whether a candidate is a real SOTA step or only noise, leakage, or benchmark overfitting

If the campaign includes serious execution or release review, use this skill to choose and rank candidates, then use data-science-cv-repro-lab as the execution lane.
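
For the question of whether a candidate is a real step or only noise, one rough first check is to compare the mean gain against the seed-to-seed spread. The sketch below is a crude heuristic under stated assumptions (higher scores are better, at least two seeds per arm); the function name and the threshold `k` are hypothetical, and it is not a substitute for a proper statistical test or for the leakage review.

```python
from statistics import mean, stdev


def delta_exceeds_seed_noise(baseline_scores, candidate_scores, k=2.0):
    """Return True if the mean gain exceeds k times the larger per-arm seed std.

    Assumes higher scores are better. k is an illustrative threshold,
    not a value prescribed by the skill.
    """
    if len(baseline_scores) < 2 or len(candidate_scores) < 2:
        raise ValueError("need at least two seeds per arm to estimate noise")
    gain = mean(candidate_scores) - mean(baseline_scores)
    noise = max(stdev(baseline_scores), stdev(candidate_scores))
    return gain > k * noise


# Made-up scores from three seeds per arm, for illustration only.
baseline = [0.760, 0.764, 0.758]
candidate = [0.765, 0.759, 0.768]
print(delta_exceeds_seed_noise(baseline, candidate))
```

A candidate that fails this check is best described as "best internal result so far" until reruns say otherwise.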

Quick Start

  1. Freeze the claim target before touching recipes.
    • Name the task, dataset, metric, split, and target score.
    • Name the current trusted baseline.
    • Name the claim threshold for "match", "beat", or "not enough".
  2. Initialize the campaign records immediately.
    • Use python3 {baseDir}/scripts/init_sota_campaign.py --root --campaign-id --title.
    • Use python3 {baseDir}/scripts/init_sota_leaderboard_snapshot.py --out <json> --task <task> --dataset <dataset> --metric <metric> --split <split>.
    • Use python3 {baseDir}/scripts/init_sota_paper_triage.py --out <json> --campaign-id <id> --task <task>.
    • Use python3 {baseDir}/scripts/init_sota_program.py --out <json> --campaign-id <id> --task <task> --dataset <dataset> --metric <metric> --split <split> when you need one machine-readable benchmark, rerun, delegation, and auth plan.
    • Use python3 {baseDir}/scripts/init_sota_candidate_card.py --out <json> --candidate-id <id> --campaign-id <id> --objective <goal>.
    • If execution review depends on synced QA runs, runtime sweeps, or benchmark panels, store the paired data-science-cv-repro-lab review dashboard path in the program and candidate records before the claim review starts.
    • If external execution evidence exists, record the reviewed artifact manifest path in the program and candidate records instead of acting through a live session.
    • If the review surface needs manual or visual QA, use python3 {baseDir}/scripts/init_sota_validation_scorecard.py --out <json> --scorecard-id <id> --surface <surface>.
    • If an external export bundle matters, use python3 {baseDir}/scripts/init_sota_artifact_manifest.py --out <json> --bundle-root <dir>.
    • If a long execution run is involved, record only sanitized summaries and artifact references in the SOTA campaign files.
  3. Separate the campaign roles even if one agent performs all of them.
    • Scout: papers, leaderboards, repos, and benchmark rules.
    • Reproducer: baseline and top-paper reproduction.
    • Ablator: controlled change sets and compute allocation.
    • Reviewer: contamination, metric drift, and claim integrity.
    • Promoter: final claim or hold decision.
    • Keep the benchmark definition and final claim wording fixed.
    • Use bounded scouting and review lanes for literature triage, repo inspection, per-paper extraction, and hard-case review.
    • For repeated audits, batch over a manifest or CSV instead of free-form context accumulation.
  4. Pick the execution lane explicitly.
    • Execution handoff lane: use data-science-cv-repro-lab for external runs and artifact capture.
    • Local lane: cheap falsification, tiny reruns, and artifact review.
  5. Keep file writes inside one campaign workspace.
    • Create one dedicated campaign root and keep every --out, --bundle-root, and --output-root path under it.
    • Do not point the bundled scripts at unrelated home-directory or system paths.
    • Treat scripts/sota_public_safety.py as the canonical public-redaction layer for URLs, refs, and paths.
  6. Work the SOTA ladder in order.
    • Freeze the benchmark definition and auth rule before using more compute.
    • Reproduce the trusted baseline first.
    • Reproduce one relevant reference result or a close public checkpoint.
    • Build a hypothesis backlog from literature gaps, not vibes.
    • Run narrow ablations before broad recipe churn.
    • Stress the best candidate on the fixed review surfaces.
  7. Claim only on full-surface wins.
    • Fixed benchmark score
    • Reproduced baseline delta
    • Compute or cost context
    • Browser or GUI evidence if that lane mattered
    • Failure-case review
    • Exact evidence bundle
    • Render the final review with python3 {baseDir}/scripts/render_sota_claim_summary.py --candidate-card <json> --out <md>.

Operating Rules

Campaign rules

  • One campaign has one target benchmark contract.
  • Do not let the target metric or split drift midstream.
  • Keep a short hypothesis backlog and kill low-information ideas quickly.
  • Record why each experiment exists before running it.

Codex multi-agent rules

  • Main thread owns the benchmark contract, stop conditions, and final claim decision.
  • Subagents should do bounded work only: scout, reproduce, ablate, or review.
  • Do not let one exploratory thread silently rewrite the campaign contract.
  • For repeated claim checks or literature extraction, prefer manifest-driven fanout over conversational drift.

Literature rules

  • Read only the papers or repos that change the candidate plan.
  • Extract the minimum useful fields: task, metric, split, data, compute, architecture, augmentations, training tricks, and caveats.
  • Prefer a reproduced strong baseline over copying five tricks from five papers without control.
  • Do not treat leaderboard rows as ground truth without checking task definition and split rules.

Ablation rules

  • Change one meaningful variable at a time when the goal is causal understanding.
  • If several knobs move together, label the run as a package change, not an ablation.
  • Keep one canonical baseline recipe alive for comparison.
  • Require the first winning candidate to survive at least one rerun or adjacent-seed check before escalating the claim.

Compute rules

  • Spend cheap compute on reproduction and short falsification first.
  • Do not push a long run unless the hypothesis would matter if it wins.
  • Record training cost, wall time, and hardware for every serious candidate.
  • Cut branches that cannot plausibly clear the target with the remaining budget.

Runtime and auth rules

  • This public skill does not require API keys, account tokens, live sessions, or account-bound credentials.
  • Prefer local files, public URLs, and user-supplied artifacts over account-bound execution paths.
  • Do not require or recommend OPENAI_API_KEY, other vendor API keys, or paid inference APIs as the default campaign runtime path.
  • If a third-party framework only works through paid API keys, treat it as reference material unless it can run through local tools or public artifacts.

External execution rules

  • Execution rules live in data-science-cv-repro-lab.
  • In this skill, record only the benchmark contract, candidate rationale, review status, and sanitized artifact references.

Claim safety rules

  • No SOTA claim without a fixed metric, split, and baseline.
  • No SOTA claim on a contaminated benchmark or hidden train-on-test path.
  • If the execution story depends on a dashboard or synced review surface, keep the dashboard path, source audit, and leakage audit in the claim packet.
  • If a candidate wins only on one slice while regressing important surfaces, hold it.
  • Report uncertainty honestly: "best internal result so far" is not the same as "new SOTA".
  • Small deltas need rerun or adjacent-seed support before they become claim language.

References

Read only the reference that matches the task:

  • references/sota-campaign-playbook.md: Full campaign structure, role separation, and stop conditions.
  • references/sota-program-rules.md: Rules for queues, stage discipline, ablations, and promotion gating.
  • references/campaign-harness-stack.md: What to reuse from Codex subagents, harness engineering, OpenEvolve, Symphony, Paperclip, and OptiLLM under a local-first campaign rule.
  • references/benchmark-discipline.md: How to avoid contamination, metric drift, and invalid comparisons.
  • references/paper-triage.md: How to filter papers and extract only decision-relevant details.
  • references/public-research-lane.md: How to review public literature and leaderboard pages without private sessions.
  • references/external-evidence-handoff.md: How to record sanitized evidence from external notebook or UI runs without controlling a live session.
  • references/execution-evidence-summary.md: How to summarize execution evidence that belongs in the paired execution skill.
  • references/claim-safety.md: Review rules for whether a candidate deserves a SOTA claim at all.
  • references/public-safety.md: Publication review rules for secrets, private refs, and raw notebook paths.

Bundled Scripts

  • scripts/sota_public_safety.py: Pure local helpers for path, URL, ref, env, and command redaction. No network I/O or subprocess execution.
  • scripts/init_sota_campaign.py: Create a reusable campaign folder with benchmark, program, agent, research, leaderboard, plan, ablation, evidence, and claim files.
  • scripts/init_sota_program.py: Create a machine-readable program record with the fixed benchmark, baselines, rerun policy, bounded subagent roles, and local-first runtime rules.
  • scripts/init_sota_leaderboard_snapshot.py: Create a machine-readable snapshot of the target benchmark contract and current reference scores.
  • scripts/init_sota_paper_triage.py: Create a machine-readable literature queue for paper screening and extraction.
  • scripts/init_sota_browser_run_card.py: Create a sanitized external-evidence record for notebook or UI run artifacts.
  • scripts/init_sota_validation_scorecard.py: Create a machine-readable GUI or notebook validation scorecard when visible state matters to the campaign.
  • scripts/init_sota_artifact_manifest.py: Create a machine-readable export-bundle manifest for external artifacts with redacted public path metadata.
  • scripts/init_sota_candidate_card.py: Create a machine-readable card for a serious candidate, its execution lane, auth mode, and claim state.
  • scripts/init_sota_candidate.py: Create a machine-readable candidate record with change set, risks, and redacted public artifact refs.
  • scripts/init_sota_ablation_queue.py: Create a focused ablation queue for one candidate family.
  • scripts/init_sota_vm_bootstrap_manifest.py: Create a redacted long-run summary manifest for already-approved execution artifacts.
  • scripts/update_sota_scoreboard.py: Refresh a ranked scoreboard for a fixed metric and goal direction.
  • scripts/init_sota_review_packet.py: Join the core artifacts for a promotion, hold, or cut decision.
  • scripts/render_sota_claim_summary.py: Render a concise markdown review from the machine-readable candidate card.
  • scripts/render_sota_program_summary.py: Render a concise markdown summary from the program, candidate, scoreboard, and review packet.

Version History

3 versions in total

  • v1.4.4 (current): 2026-05-07 03:37, safe, safe
  • v1.4.3: 2026-05-01 18:15, safe, safe
  • v1.2.3: 2026-03-30 15:39

Security Scan

  • Tencent Cloud Security (Keen): safe, no risk. Report: https://tix.qq.com/search/skill?keyword=7dc8b7298ccfa265839af34477da8ac1
  • Tencent Cloud Security (Sanbu): safe, no risk. Report: https://static.cloudsec.tencent.com/html-report-v2/2026/05/25/416057_ceb89cbce46e49e3de2467a06d759164.html?q-sign-algorithm=sha1&q-ak=AKID8JMG1bzBC1dz96qNhssfFftujT1NCoFi&q-sign-time=1781224861%3B1812760861&q-key-time=1781224861%3B1812760861&q-header-list=host&q-url-param-list=&q-signature=061adb88ceff512fccff69b49c15da154c8797b4