Evaluation & Benchmarks
Tools for measuring agent quality, running evaluations, and tracking pass@k metrics over time.
skill eval
agent-eval
“How do I compare Claude Code vs Aider vs Codex on my tasks?”
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
Input YAML task definitions with success criteria
→ Output Comparative results: pass rate, cost, time per agent
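As a rough illustration of the kind of comparative output described above, the following sketch aggregates per-run records into per-agent pass rate, cost, time, and consistency. The `Run` record, the `summarize` helper, and the definition of consistency (the share of tasks whose repeated runs all had the same outcome) are assumptions made for this example, not agent-eval's actual schema or implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One attempt by one agent at one task (hypothetical record shape)."""
    agent: str
    task: str
    passed: bool
    cost_usd: float
    seconds: float


def summarize(runs):
    """Group runs by agent and compute simple comparison metrics."""
    by_agent = {}
    for run in runs:
        by_agent.setdefault(run.agent, []).append(run)

    summary = {}
    for agent, agent_runs in by_agent.items():
        # Collect the set of outcomes seen for each task across repeated runs.
        outcomes_by_task = {}
        for run in agent_runs:
            outcomes_by_task.setdefault(run.task, set()).add(run.passed)

        summary[agent] = {
            "pass_rate": mean(1.0 if r.passed else 0.0 for r in agent_runs),
            "mean_cost_usd": mean(r.cost_usd for r in agent_runs),
            "mean_seconds": mean(r.seconds for r in agent_runs),
            # Assumed definition: a task is consistent if every run agreed.
            "consistency": mean(
                1.0 if len(outcomes) == 1 else 0.0
                for outcomes in outcomes_by_task.values()
            ),
        }
    return summary
```

Calling `summarize` on a list of `Run` records returns one dictionary of metrics per agent, which can then be printed side by side.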
skill eval
benchmark
“How do I benchmark my application's performance?”
Use this skill to measure performance baselines, detect regressions before/after PRs, and compare stack alternatives.
Input Application with performance requirements
→ Output Performance benchmarking strategy and measurement approach
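A minimal sketch of the baseline-and-regression idea, using only the Python standard library. The helper names `measure` and `is_regression` and the 10% tolerance are assumptions chosen for illustration; the benchmark skill itself may take a different approach.

```python
import statistics
import time


def measure(fn, repeats=5):
    """Return the median wall-clock time in seconds of calling fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to one-off slow runs than the mean.
    return statistics.median(samples)


def is_regression(baseline_s, candidate_s, tolerance=0.10):
    """Flag the candidate if it is slower than baseline by more than tolerance."""
    return candidate_s > baseline_s * (1 + tolerance)
```

In practice, the baseline would be measured on the main branch and the candidate on the PR branch, ideally on the same machine under similar load.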
skill eval
eval-harness
“How do I set up evaluation-driven development? What metrics matter?”
Formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles
Input AI-assisted workflow needing quality measurement
→ Output Eval framework: capability evals, regression evals, grader types, pass@k
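For reference, pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: given n sampled attempts of which c passed, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula; the function name `pass_at_k` is an assumption for this example, and whether eval-harness computes the metric this way is not stated in the source.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: total attempts sampled, c: attempts that passed, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made up entirely of failing attempts;
    # it is 0 when fewer than k attempts failed, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the plain pass rate c / n, which makes for a useful sanity check.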