SOTA Agent

Search intent: sota agent, state of the art benchmark scouting, cv benchmark campaign, gpu vm research workflow

Goal

Turn a vague "beat the benchmark" request into a disciplined campaign:

fixed target metric and split
explicit literature and leaderboard snapshot
bounded reproduction plan
explicit handoff to the separate execution lane when runs need external tools
evidence requirements that can be reviewed without relying on live session state
ablations that answer one question at a time
promotion only when the claim survives review

This skill is the frontier-planning and candidate-selection layer.

For execution artifacts or promotion evidence, pair it with data-science-cv-repro-lab; this skill stays focused on planning and claim review.

Use This Skill When

the user wants a CV or DS system pushed toward state-of-the-art results
the task involves reproducing or surpassing recent papers
the workflow needs paper triage, leaderboard tracking, or claim review
the workflow needs a clean handoff to an execution skill after the benchmark contract is frozen
the user needs experiment management across local runs, notebooks, and long-running jobs
the question is whether execution evidence supports a SOTA candidate
the question is whether a candidate is a real SOTA step or only noise, leakage, or benchmark overfitting
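
One rough way to frame the "real SOTA step or only noise" question is to compare the candidate's mean gain against the seed-to-seed spread of the baseline. The sketch below is illustrative only: the function names, the example scores, and the `k = 2.0` cutoff are assumptions, not part of this skill's bundled scripts, and it assumes a higher-is-better metric with at least two baseline seeds.

```python
from statistics import mean, stdev


def gain_vs_seed_noise(baseline_scores, candidate_scores):
    """Return (mean gain, baseline seed spread) for a higher-is-better metric."""
    gain = mean(candidate_scores) - mean(baseline_scores)
    # Sample standard deviation across seeds; needs at least two baseline runs.
    spread = stdev(baseline_scores)
    return gain, spread


def looks_like_noise(baseline_scores, candidate_scores, k=2.0):
    """Heuristic (assumed cutoff): a gain within k baseline std devs is treated as noise."""
    gain, spread = gain_vs_seed_noise(baseline_scores, candidate_scores)
    return gain <= k * spread


# Hypothetical seed runs, invented purely for illustration.
baseline = [0.750, 0.754, 0.748]
small_gain = [0.751, 0.756, 0.752]
large_gain = [0.780, 0.782, 0.779]

print(looks_like_noise(baseline, small_gain))
print(looks_like_noise(baseline, large_gain))
```

A check like this does not detect leakage or benchmark overfitting; those still need the contamination and split review described in the rest of this skill.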

If the campaign includes serious execution or release review, use this skill to choose and rank candidates, then use data-science-cv-repro-lab as the execution lane.

Quick Start

Freeze the claim target before touching recipes.

Name the task, dataset, metric, split, and target score.
Name the current trusted baseline.
Name the claim threshold for "match", "beat", or "not enough".
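
The three items above could be captured in one immutable record so the target cannot drift midstream. This is a minimal sketch under stated assumptions: `BenchmarkContract`, its field names, the `match_tolerance` idea, and the example values are all hypothetical, and the records actually written by the bundled init scripts may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkContract:
    """Hypothetical frozen claim target; field names are illustrative only."""
    task: str
    dataset: str
    metric: str
    split: str
    target_score: float
    baseline_score: float
    match_tolerance: float  # how close to the target still counts as a "match"
    higher_is_better: bool = True


def classify(contract: BenchmarkContract, score: float) -> str:
    """Map a candidate score to "beat", "match", or "not enough"."""
    # Signed margin: positive means the candidate is better than the target.
    margin = score - contract.target_score
    if not contract.higher_is_better:
        margin = -margin
    if margin > contract.match_tolerance:
        return "beat"
    if abs(margin) <= contract.match_tolerance:
        return "match"
    return "not enough"


# Example values are invented for illustration.
contract = BenchmarkContract(
    task="image-classification",
    dataset="example-dataset",
    metric="top1_accuracy",
    split="val",
    target_score=0.80,
    baseline_score=0.75,
    match_tolerance=0.005,
)
print(classify(contract, 0.82))
```

Because the dataclass is frozen, any attempt to reassign `metric` or `split` after creation raises an error, which mirrors the rule that the contract is fixed before recipes are touched.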

Initialize the campaign records immediately.

Use python3 {baseDir}/scripts/init_sota_campaign.py --root --campaign-id --title.
Use python3 {baseDir}/scripts/init_sota_leaderboard_snapshot.py --out <json> --task <task> --dataset <dataset> --metric <metric> --split <split>.
Use python3 {baseDir}/scripts/init_sota_paper_triage.py --out <json> --campaign-id <id> --task <task>.
Use python3 {baseDir}/scripts/init_sota_program.py --out <json> --campaign-id <id> --task <task> --dataset <dataset> --metric <metric> --split <split> when you need one machine-readable benchmark, rerun, delegation, and auth plan.
Use python3 {baseDir}/scripts/init_sota_candidate_card.py --out <json> --candidate-id <id> --campaign-id <id> --objective <goal>.
If execution review depends on synced QA runs, runtime sweeps, or benchmark panels, store the paired data-science-cv-repro-lab review dashboard path in the program and candidate records before the claim review starts.
If external execution evidence exists, record the reviewed artifact manifest path in the program and candidate records instead of acting through a live session.
If the review surface needs manual or visual QA, use python3 {baseDir}/scripts/init_sota_validation_scorecard.py --out <json> --scorecard-id <id> --surface <surface>.
If an external export bundle matters, use python3 {baseDir}/scripts/init_sota_artifact_manifest.py --out <json> --bundle-root <dir>.
If a long execution run is involved, record only sanitized summaries and artifact references in the SOTA campaign files.

Separate the campaign roles even if one agent performs all of them.

Scout: papers, leaderboards, repos, and benchmark rules.
Reproducer: baseline and top-paper reproduction.
Ablator: controlled change sets and compute allocation.
Reviewer: contamination, metric drift, and claim integrity.
Promoter: final claim or hold decision.
Keep the benchmark definition and final claim wording fixed.
Use bounded scouting and review lanes for literature triage, repo inspection, per-paper extraction, and hard-case review.
For repeated audits, batch over a manifest or CSV instead of free-form context accumulation.

Pick the execution lane explicitly.

Execution handoff lane: use data-science-cv-repro-lab for external runs and artifact capture.
Local lane: cheap falsification, tiny reruns, and artifact review.

Keep file writes inside one campaign workspace.

Create one dedicated campaign root and keep every --out, --bundle-root, and --output-root path under it.
Do not point the bundled scripts at unrelated home-directory or system paths.
Treat scripts/sota_public_safety.py as the canonical public-redaction layer for URLs, refs, and paths.

Work the SOTA ladder in order.

Freeze the benchmark definition and auth rule before using more compute.
Reproduce the trusted baseline first.
Reproduce one relevant reference result or a close public checkpoint.
Build a hypothesis backlog from literature gaps, not vibes.
Run narrow ablations before broad recipe churn.
Stress the best candidate on the fixed review surfaces.

Claim only on full-surface wins.

Fixed benchmark score
Reproduced baseline delta
Compute or cost context
Browser or GUI evidence if that lane mattered
Failure-case review
Exact evidence bundle
Render the final review with python3 {baseDir}/scripts/render_sota_claim_summary.py --candidate-card <json> --out <md>.

Operating Rules

Campaign rules

One campaign has one target benchmark contract.
Do not let the target metric or split drift midstream.
Keep a short hypothesis backlog and kill low-information ideas quickly.
Record why each experiment exists before running it.

Codex multi-agent rules

Main thread owns the benchmark contract, stop conditions, and final claim decision.
Subagents should do bounded work only: scout, reproduce, ablate, or review.
Do not let one exploratory thread silently rewrite the campaign contract.
For repeated claim checks or literature extraction, prefer manifest-driven fanout over conversational drift.

Literature rules

Read only the papers or repos that change the candidate plan.
Extract the minimum useful fields: task, metric, split, data, compute, architecture, augmentations, training tricks, and caveats.
Prefer a reproduced strong baseline over copying five tricks from five papers without control.
Do not treat leaderboard rows as ground truth without checking task definition and split rules.

Ablation rules

Change one meaningful variable at a time when the goal is causal understanding.
If several knobs move together, label the run as a package change, not an ablation.
Keep one canonical baseline recipe alive for comparison.
Require the first winning candidate to survive at least one rerun or adjacent-seed check before escalating the claim.

Compute rules

Spend cheap compute on reproduction and short falsification first.
Do not push a long run unless the hypothesis would matter if it wins.
Record training cost, wall time, and hardware for every serious candidate.
Cut branches that cannot plausibly clear the target with the remaining budget.

Runtime and auth rules

This public skill does not require API keys, account tokens, live sessions, or account-bound credentials.
Prefer local files, public URLs, and user-supplied artifacts over account-bound execution paths.
Do not require or recommend OPENAI_API_KEY, other vendor API keys, or paid inference APIs as the default campaign runtime path.
If a third-party framework only works through paid API keys, treat it as reference material unless it can run through local tools or public artifacts.

External execution rules

Execution rules live in data-science-cv-repro-lab.
In this skill, record only the benchmark contract, candidate rationale, review status, and sanitized artifact references.

Claim safety rules

No SOTA claim without a fixed metric, split, and baseline.
No SOTA claim on a contaminated benchmark or hidden train-on-test path.
If the execution story depends on a dashboard or synced review surface, keep the dashboard path, source audit, and leakage audit in the claim packet.
If a candidate wins only on one slice while regressing important surfaces, hold it.
Report uncertainty honestly: "best internal result so far" is not the same as "new SOTA".
Small deltas need rerun or adjacent-seed support before they become claim language.

References

Read only the reference that matches the task:

references/sota-campaign-playbook.md: Full campaign structure, role separation, and stop conditions.
references/sota-program-rules.md: Rules for queues, stage discipline, ablations, and promotion gating.
references/campaign-harness-stack.md: What to reuse from Codex subagents, harness engineering, OpenEvolve, Symphony, Paperclip, and OptiLLM under a local-first campaign rule.
references/benchmark-discipline.md: How to avoid contamination, metric drift, and invalid comparisons.
references/paper-triage.md: How to filter papers and extract only decision-relevant details.
references/public-research-lane.md: How to review public literature and leaderboard pages without private sessions.
references/external-evidence-handoff.md: How to record sanitized evidence from external notebook or UI runs without controlling a live session.
references/execution-evidence-summary.md: How to summarize execution evidence that belongs in the paired execution skill.
references/claim-safety.md: Review rules for whether a candidate deserves a SOTA claim at all.
references/public-safety.md: Publication review rules for secrets, private refs, and raw notebook paths.

Bundled Scripts

scripts/sota_public_safety.py: Pure local helpers for path, URL, ref, env, and command redaction. No network I/O or subprocess execution.
scripts/init_sota_campaign.py: Create a reusable campaign folder with benchmark, program, agent, research, leaderboard, plan, ablation, evidence, and claim files.
scripts/init_sota_program.py: Create a machine-readable program record with the fixed benchmark, baselines, rerun policy, bounded subagent roles, and local-first runtime rules.
scripts/init_sota_leaderboard_snapshot.py: Create a machine-readable snapshot of the target benchmark contract and current reference scores.
scripts/init_sota_paper_triage.py: Create a machine-readable literature queue for paper screening and extraction.
scripts/init_sota_browser_run_card.py: Create a sanitized external-evidence record for notebook or UI run artifacts.
scripts/init_sota_validation_scorecard.py: Create a machine-readable GUI or notebook validation scorecard when visible state matters to the campaign.
scripts/init_sota_artifact_manifest.py: Create a machine-readable export-bundle manifest for external artifacts with redacted public path metadata.
scripts/init_sota_candidate_card.py: Create a machine-readable card for a serious candidate, its execution lane, auth mode, and claim state.
scripts/init_sota_candidate.py: Create a machine-readable candidate record with change set, risks, and redacted public artifact refs.
scripts/init_sota_ablation_queue.py: Create a focused ablation queue for one candidate family.
scripts/init_sota_vm_bootstrap_manifest.py: Create a redacted long-run summary manifest for already-approved execution artifacts.
scripts/update_sota_scoreboard.py: Refresh a ranked scoreboard for a fixed metric and goal direction.
scripts/init_sota_review_packet.py: Join the core artifacts for a promotion, hold, or cut decision.
scripts/render_sota_claim_summary.py: Render a concise markdown review from the machine-readable candidate card.
scripts/render_sota_program_summary.py: Render a concise markdown summary from the program, candidate, scoreboard, and review packet.

Version History

3 versions in total

v1.4.4 (current): 2026-05-07 03:37, security badges: safe, safe
v1.4.3: 2026-05-01 18:15, security badges: safe, safe
v1.2.3: 2026-03-30 15:39

Security Check

Tencent Cloud Security (Keen): Safe, no risk. Report: https://tix.qq.com/search/skill?keyword=7dc8b7298ccfa265839af34477da8ac1
Tencent Cloud Security (Sanbu): Safe, no risk. Report: https://static.cloudsec.tencent.com/html-report-v2/2026/05/25/416057_ceb89cbce46e49e3de2467a06d759164.html?q-sign-algorithm=sha1&q-ak=AKID8JMG1bzBC1dz96qNhssfFftujT1NCoFi&q-sign-time=1781224861%3B1812760861&q-key-time=1781224861%3B1812760861&q-header-list=host&q-url-param-list=&q-signature=061adb88ceff512fccff69b49c15da154c8797b4

Skill工具集 © 2026
