Reddi Agent Evaluation

reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and produc...

reddi.tech 的 agent-evaluation 分支。用于测试和基准测试 LLM 智能体，涵盖行为测试、能力评估、可靠性指标及生产相关内容。

nissan

AI智能 clawhub v1.0.2 1 版本 99803.5 Key: 无需

★ 0

Stars

📥 508

下载

💾 7

安装

版本

#latest

概述

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in

production. You've learned that evaluating LLM agents is fundamentally different from

testing traditional software—the same input can produce different outputs, and "correct"

often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression

tests, capability assessments, and reliability metrics. You understand that the goal isn't

100% test pass rate—it

Capabilities

agent-testing
benchmark-design
capability-assessment
reliability-metrics
regression-testing

Requirements

testing-fundamentals
llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

Issue	Severity	Solution
-------	----------	----------
Agent scores well on benchmarks but fails in production	high	// Bridge benchmark and production evaluation
Same test passes sometimes, fails other times	high	// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual task	medium	// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or prompts	critical	// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

版本历史

共 1 个版本

v1.0.2 当前

2026-03-30 06:55 安全安全

安全检测

腾讯云安全 (Keen)

安全，无风险

查看报告

腾讯云安全 (Sanbu)

安全，无风险

查看报告

🔗 相关推荐

content-creation

Fact Checker

nissan

对照源数据验证 Markdown 草稿中的声明、数字和事实。适用场景：发布前审核博客文章、报告或文档的准确性。

★ 3 📥 2,127

ai-intelligence

ontology

oswalpalash

类型化知识图谱，用于结构化智能体记忆与可组合技能。适用于以下场景：创建/查询实体（人物、项目、任务、事件、文档）、关联相关对象、强制执行约束、将多步操作规划为图谱变换，或当技能需要共享状态时。触发关键词包括"记住""我知道关于什么""将X链

★ 716 📥 244,347

ai-intelligence

self-improving agent

pskoett

捕获经验教训、错误及修正内容，以实现持续改进。适用于以下场景：（1）命令或操作意外失败；（2）用户纠正Claude（如“不，那不对……”“实际上……”）；（3）用户请求的功能不存在；（4）外部API或工具出现故障；（5）Claude发现自身

★ 4,066 📥 802,758