Use Case: Evaluation & Benchmarks

Scenario 1: Setting Up Eval-Driven Development

Profile: QA Lead, establishing quality metrics for AI-assisted workflows

Step 1: Define evals

“How do I measure if our AI agent is producing good code?”

Tool: eval-harness skill


Input	Types of tasks your agent performs
Output	Eval framework design: capability evals, regression evals, grader types

Key metrics to track:

pass@1 — First attempt success rate
pass@3 — Success within 3 attempts (practical reliability)
pass^3 — All 3 trials succeed (stability for release-critical code)
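
The three metrics can be computed from recorded trial outcomes. The following is a minimal sketch, not the eval-harness implementation: it assumes each task was run three times and that results are stored as a list of booleans per task. The function names and task names are hypothetical. It uses the plain empirical rate, not a statistical estimator.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success in the first k trials."""
    return sum(any(t[:k]) for t in trials.values()) / len(trials)


def pass_all_k(trials: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where every one of the first k trials succeeds (pass^k)."""
    return sum(len(t) >= k and all(t[:k]) for t in trials.values()) / len(trials)


# Hypothetical results: one list of three trial outcomes per task.
results = {
    "fix-login-bug": [True, True, True],
    "add-csv-export": [False, True, True],
    "refactor-cache": [False, False, False],
    "rename-module": [True, True, True],
}

pass_at_1 = pass_at_k(results, 1)    # first attempt succeeded
pass_at_3 = pass_at_k(results, 3)    # any of 3 attempts succeeded
pass_pow_3 = pass_all_k(results, 3)  # all 3 attempts succeeded
```

In this sample, "add-csv-export" counts toward pass@3 but not toward pass@1 or pass^3, which is the gap between practical reliability and release-grade stability.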

Step 2: Create eval tasks

“Define specific eval tasks for our codebase.”

Tool: /eval (command)


Input	Task descriptions with success criteria
Output	YAML eval definitions with judge criteria
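
The actual YAML schema produced by /eval is not shown here, so the sketch below only illustrates the kind of information a task definition carries: a description, a judge type, and explicit success criteria. Every field name is an assumption made for illustration, and the judge names mirror the four judge types listed in Scenario 2.

```python
from dataclasses import dataclass, field
from typing import List

# Judge categories taken from the judge-type table; the short names are assumed.
JUDGE_TYPES = {"code", "pattern", "model", "human"}


@dataclass
class EvalTask:
    """Hypothetical in-memory form of one eval definition."""
    name: str
    description: str
    judge: str
    success_criteria: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of reasons this task is not ready to run."""
        issues = []
        if self.judge not in JUDGE_TYPES:
            issues.append(f"unknown judge type: {self.judge}")
        if not self.success_criteria:
            issues.append("no success criteria defined")
        return issues


good_task = EvalTask(
    name="fix-login-bug",
    description="Fix the session timeout bug in the login flow",
    judge="code",
    success_criteria=["existing auth tests pass", "new regression test passes"],
)
vague_task = EvalTask(name="improve-code", description="Make it better", judge="vibes")
```

A task that cannot state its success criteria, like `vague_task` above, is usually not ready to be an eval yet.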

Step 3: Run evals

“Run the evals and show me the results.”

Tool: /eval (command) - check mode


Input	Defined eval tasks
Output	Pass/fail results with cost, time, and consistency data

Scenario 2: Comparing AI Agents

Profile: Architect, deciding between Claude Code, Aider, and Codex for the team

Step 1: Design comparison evals

“I want to compare three agents on our actual codebase tasks.”

Tool: agent-eval skill


Input	YAML task definitions representing real work (bug fixes, features, refactors)
Output	Comparison framework with judge types per task

Judge types:

Type	How it works	Best for
Code-based	Runs tests (pytest, npm test)	Functional correctness
Pattern-based	Grep for expected patterns	Output format validation
Model-based	LLM evaluates quality	Subjective quality
Human	Manual review	Final validation
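
The two automated judge types are simple enough to sketch with the standard library. This is an illustrative sketch under assumptions, not the agent-eval implementation: `code_judge` and `pattern_judge` are hypothetical names, a code-based judge is taken to pass when the test command exits with status 0, and a pattern-based judge uses Python regular expressions in place of grep.

```python
import re
import subprocess
import sys
from typing import List


def code_judge(command: List[str], cwd: str = ".") -> bool:
    """Code-based judge: pass if the test command exits with status 0."""
    completed = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    return completed.returncode == 0


def pattern_judge(output: str, patterns: List[str]) -> bool:
    """Pattern-based judge: pass if every expected pattern appears in the output."""
    return all(re.search(p, output, re.MULTILINE) for p in patterns)


# Stand-in commands so the sketch runs anywhere; a real task would pass
# something like ["pytest", "-q"] or ["npm", "test"] instead.
ok_cmd = code_judge([sys.executable, "-c", "raise SystemExit(0)"])
bad_cmd = code_judge([sys.executable, "-c", "raise SystemExit(1)"])

agent_output = "def export_csv(rows):\n    return ''\n"
fmt_ok = pattern_judge(agent_output, [r"^def export_csv\("])
```

Model-based and human judges are left out because they depend on the LLM provider and review process in use.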

Step 2: Run and report

“Generate a report comparing the agents.”

Tool: /eval (command) - report mode


Input	Completed eval runs from each agent
Output	Comparative report: pass rate, cost per task, time, consistency
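
A comparative report boils down to grouping completed runs by agent and averaging. The sketch below assumes a flat record per run with hypothetical field names (`cost_usd`, `seconds`). The agents are placeholders ("agent-a", "agent-b") and the numbers are made-up sample data, not measurements of any real tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """One completed eval run (field names are assumptions)."""
    agent: str
    task: str
    passed: bool
    cost_usd: float
    seconds: float


def summarize(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Aggregate pass rate, average cost, and average time per agent."""
    by_agent: Dict[str, List[Run]] = defaultdict(list)
    for run in runs:
        by_agent[run.agent].append(run)
    report = {}
    for agent, agent_runs in by_agent.items():
        n = len(agent_runs)
        report[agent] = {
            "pass_rate": sum(r.passed for r in agent_runs) / n,
            "avg_cost_usd": sum(r.cost_usd for r in agent_runs) / n,
            "avg_seconds": sum(r.seconds for r in agent_runs) / n,
        }
    return report


# Made-up sample data for two placeholder agents.
runs = [
    Run("agent-a", "fix-login-bug", True, 0.25, 40.0),
    Run("agent-a", "add-csv-export", False, 0.75, 80.0),
    Run("agent-b", "fix-login-bug", True, 0.5, 30.0),
    Run("agent-b", "add-csv-export", True, 0.5, 50.0),
]
report = summarize(runs)
```

Consistency can be added to the same structure by recording several runs per agent and task and computing pass^3 over them.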

The Eval Workflow

1. Define evals (/eval + eval-harness skill)
   ↓
2. Baseline run (measure current state)
   ↓
3. Implement (/tdd command)
   ↓
4. Check evals (/eval check)
   ↓
5. pass@3 > 90% and pass^3 = 100%?
   ↓ yes                    ↓ no
6. Verify + Review       Iterate on implementation
   ↓
7. Merge
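
The decision in step 5 can be expressed as a small gate function. This is a sketch with hypothetical names. It takes the two rates as separate inputs on the assumption that they may come from different eval sets (for example capability evals and regression evals), because on a single set of tasks pass^3 = 100% already implies pass@3 = 100%.

```python
def next_step(pass_at_3: float, pass_pow_3: float) -> str:
    """Apply the step 5 thresholds: pass@3 > 90% and pass^3 = 100%."""
    if pass_at_3 > 0.90 and pass_pow_3 == 1.0:
        return "verify-and-review"
    return "iterate"


ready = next_step(pass_at_3=0.95, pass_pow_3=1.0)
flaky = next_step(pass_at_3=0.95, pass_pow_3=0.9)
borderline = next_step(pass_at_3=0.90, pass_pow_3=1.0)  # 90% is not "> 90%"
```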

Eval Tool Comparison

Tool	Purpose	Audience	Output
`eval-harness` skill	Learn how to set up evals	QA Lead, Architect	Framework design knowledge
`agent-eval` skill	Compare different AI agents	Architect	Comparison methodology
`/eval` command	Define, run, and report on evals	All	Eval results with metrics
`benchmark` skill	Performance benchmarking	Senior Dev	Performance measurement approach