Use Case: Evaluation & Benchmarks
Scenario 1: Setting Up Eval-Driven Development
Profile: QA Lead, establishing quality metrics for AI-assisted workflows
Step 1: Define evals
“How do I measure if our AI agent is producing good code?”
Tool: eval-harness skill
| Input | Types of tasks your agent performs |
| Output | Eval framework design: capability evals, regression evals, grader types |
Key metrics to track:
- pass@1 — First attempt success rate
- pass@3 — Success within 3 attempts (practical reliability)
- pass^3 — All 3 trials succeed (stability for release-critical code)
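The three metrics above can be computed from per-task trial outcomes. The sketch below follows the definitions given in this list (not the unbiased pass@k estimator used in some research benchmarks) and assumes each task has been run at least three times, with outcomes recorded in attempt order. The function names and the task names in the sample data are hypothetical.

```python
from typing import Dict, List


def _first_k(trials: List[bool], k: int) -> List[bool]:
    """Return the first k trial outcomes, refusing to guess if fewer exist."""
    if len(trials) < k:
        raise ValueError(f"need at least {k} trials, got {len(trials)}")
    return trials[:k]


def pass_at_k(trials: List[bool], k: int) -> bool:
    """pass@k: at least one of the first k attempts succeeded."""
    return any(_first_k(trials, k))


def pass_all_k(trials: List[bool], k: int) -> bool:
    """pass^k: every one of the first k attempts succeeded."""
    return all(_first_k(trials, k))


def summarize(results: Dict[str, List[bool]], k: int = 3) -> Dict[str, float]:
    """Aggregate per-task outcomes into suite-level rates (0.0 to 1.0)."""
    if not results:
        raise ValueError("no eval results to summarize")
    n = len(results)
    return {
        "pass@1": sum(pass_at_k(t, 1) for t in results.values()) / n,
        f"pass@{k}": sum(pass_at_k(t, k) for t in results.values()) / n,
        f"pass^{k}": sum(pass_all_k(t, k) for t in results.values()) / n,
    }


# Made-up outcomes for illustration: task name -> result of each attempt.
results = {
    "fix-login-bug": [True, True, True],
    "add-endpoint": [False, True, True],
    "refactor-db-layer": [False, False, False],
    "rename-module": [True, False, True],
}
summary = summarize(results)
```

With this sample data, two of four tasks pass on the first attempt, three pass within three attempts, and only one passes all three, which shows how far apart practical reliability and stability can sit for the same suite.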
Step 2: Create eval tasks
“Define specific eval tasks for our codebase.”
Tool: /eval (command)
| Input | Task descriptions with success criteria |
| Output | YAML eval definitions with judge criteria |
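The actual YAML schema produced by /eval is not shown here, so the following is only a sketch of the information such a definition would need to carry: a task description, a judge type, and explicit success criteria. The `EvalTask` class and its field names are hypothetical, and the judge type names are taken from the judge table later on this page.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed short names for the four judge types described on this page.
JUDGE_TYPES = {"code", "pattern", "model", "human"}


@dataclass
class EvalTask:
    """In-memory shape of one eval definition (hypothetical fields)."""

    name: str
    description: str
    judge: str
    criteria: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of reasons this definition is not ready to run."""
        issues = []
        if not self.name.strip():
            issues.append("name is empty")
        if self.judge not in JUDGE_TYPES:
            issues.append(f"unknown judge type: {self.judge!r}")
        if not self.criteria:
            issues.append("no success criteria given")
        return issues


# A well-formed task: the criterion is something a judge can check.
good = EvalTask(
    name="fix-login-bug",
    description="Login fails when the email contains uppercase letters.",
    judge="code",
    criteria=["the existing auth test suite passes"],
)

# A vague task: unknown judge and nothing measurable to grade against.
vague = EvalTask(name="make-it-better", description="Improve the code.", judge="vibes")
```

Checking definitions this way before a run catches tasks that no judge could grade, which is usually cheaper than discovering it from inconsistent results.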
Step 3: Run evals
“Run the evals and show me the results.”
Tool: /eval (command) - check mode
| Input | Defined eval tasks |
| Output | Pass/fail results with cost, time, and consistency data |
Scenario 2: Comparing AI Agents
Profile: Architect, deciding between Claude Code, Aider, and Codex for the team
Step 1: Design comparison evals
“I want to compare three agents on our actual codebase tasks.”
Tool: agent-eval skill
| Input | YAML task definitions representing real work (bug fixes, features, refactors) |
| Output | Comparison framework with judge types per task |
Judge types:
| Type | How it works | Best for |
|---|---|---|
| Code-based | Runs tests (pytest, npm test) | Functional correctness |
| Pattern-based | Grep for expected patterns | Output format validation |
| Model-based | LLM evaluates quality | Subjective quality |
| Human | Manual review | Final validation |
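The two automatable judge types in the table can be sketched in a few lines. This is an illustration rather than how the agent-eval skill implements them: `code_judge` treats exit status 0 from a test command as a pass, and `pattern_judge` requires every regular expression to match the agent's output. Both function names are hypothetical.

```python
import re
import subprocess
import sys
from typing import Sequence


def code_judge(command: Sequence[str], cwd: str = ".", timeout_s: float = 600.0) -> bool:
    """Code-based judge: run a test command and pass only on exit status 0."""
    try:
        completed = subprocess.run(
            list(command), cwd=cwd, capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # A hung or missing test runner counts as a failed eval, not a crash.
        return False
    return completed.returncode == 0


def pattern_judge(output: str, patterns: Sequence[str]) -> bool:
    """Pattern-based judge: every expected regex must appear in the output."""
    return all(re.search(p, output, flags=re.MULTILINE) for p in patterns)


# Stand-ins for a real test command such as ["pytest", "-q"] or ["npm", "test"].
passing = code_judge([sys.executable, "-c", "raise SystemExit(0)"])
failing = code_judge([sys.executable, "-c", "raise SystemExit(1)"])

generated = "def add(a, b):\n    return a + b\n"
has_signature = pattern_judge(generated, [r"^def add\(", r"return a \+ b"])
```

A pattern match says nothing about whether the code works, which is why the table reserves pattern-based judging for output format and leaves functional correctness to test runs.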
Step 2: Run and report
“Generate a report comparing the agents.”
Tool: /eval (command) - report mode
| Input | Completed eval runs from each agent |
| Output | Comparative report: pass rate, cost per task, time, consistency |
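A comparative report of this kind reduces to grouping run records by agent and aggregating. The sketch below assumes a hypothetical `Run` record and makes one definitional choice the page leaves open: consistency is taken to be the share of tasks whose repeated trials all produced the same outcome. The agent names and numbers are placeholders, not measurements.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """One trial of one task by one agent (hypothetical record shape)."""

    agent: str
    task: str
    passed: bool
    cost_usd: float
    seconds: float


def compare(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Aggregate runs into per-agent pass rate, cost, time, and consistency."""
    by_agent: Dict[str, List[Run]] = defaultdict(list)
    for run in runs:
        by_agent[run.agent].append(run)

    report: Dict[str, Dict[str, float]] = {}
    for agent, agent_runs in by_agent.items():
        outcomes: Dict[str, List[bool]] = defaultdict(list)
        for run in agent_runs:
            outcomes[run.task].append(run.passed)
        n_tasks = len(outcomes)
        # A task is "stable" when all of its trials agree (all pass or all fail).
        stable = sum(1 for results in outcomes.values() if len(set(results)) == 1)
        report[agent] = {
            "pass_rate": sum(r.passed for r in agent_runs) / len(agent_runs),
            "cost_per_task": sum(r.cost_usd for r in agent_runs) / n_tasks,
            "mean_seconds": sum(r.seconds for r in agent_runs) / len(agent_runs),
            "consistency": stable / n_tasks,
        }
    return report


# Placeholder data: two agents, two tasks, two trials each.
runs = [
    Run("agent-a", "bug-fix", True, 0.25, 10.0),
    Run("agent-a", "bug-fix", True, 0.25, 20.0),
    Run("agent-a", "refactor", True, 0.25, 30.0),
    Run("agent-a", "refactor", False, 0.25, 40.0),
    Run("agent-b", "bug-fix", False, 0.5, 5.0),
    Run("agent-b", "bug-fix", False, 0.5, 5.0),
    Run("agent-b", "refactor", True, 0.5, 5.0),
    Run("agent-b", "refactor", True, 0.5, 5.0),
]
report = compare(runs)
```

Under this definition an agent that fails a task every time still scores as consistent on it, so consistency should be read alongside pass rate rather than on its own.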
The Eval Workflow
1. Define evals (/eval + eval-harness skill)
   ↓
2. Baseline run (measure current state)
   ↓
3. Implement (/tdd command)
   ↓
4. Check evals (/eval check)
   ↓
5. pass@3 > 90% and pass^3 = 100%?
   ↓ yes                 ↓ no
6. Verify + Review       Iterate on implementation
   ↓
7. Merge
Eval Tool Comparison
| Tool | Purpose | Audience | Output |
|---|---|---|---|
| eval-harness skill | Learn how to set up evals | QA Lead, Architect | Framework design knowledge |
| agent-eval skill | Compare different AI agents | Architect | Comparison methodology |
| /eval command | Define, run, and report on evals | All | Eval results with metrics |
| benchmark skill | Performance benchmarking | Senior Dev | Performance measurement approach |
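Step 5 of the eval workflow gates the merge on pass@3 > 90% and pass^3 = 100%. Measured over the same set of tasks, the second condition would already imply the first, so the sketch below assumes the two thresholds apply to different suites: pass@3 to capability evals and pass^3 to regression evals. That split is an assumption, as are the function and parameter names.

```python
def ready_to_merge(capability_pass_at_3: float, regression_pass_all_3: float) -> bool:
    """Apply the workflow's step 5 gate to two suite-level rates (0.0 to 1.0).

    Assumption: pass@3 is measured on capability evals and pass^3 on
    regression evals; the workflow itself does not say which suite is which.
    """
    for rate in (capability_pass_at_3, regression_pass_all_3):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate out of range: {rate}")
    # Strictly greater than 90% for pass@3, and exactly 100% for pass^3.
    return capability_pass_at_3 > 0.90 and regression_pass_all_3 == 1.0


proceed = ready_to_merge(capability_pass_at_3=0.95, regression_pass_all_3=1.0)
iterate = ready_to_merge(capability_pass_at_3=0.95, regression_pass_all_3=0.98)
```

When the gate returns False, the workflow's "no" branch applies: iterate on the implementation and rerun the check rather than loosening the thresholds.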