Use Case: Evaluation & Benchmarks

Scenario 1: Setting Up Eval-Driven Development

Profile: QA Lead, establishing quality metrics for AI-assisted workflows

“How do I measure if our AI agent is producing good code?”

Tool: eval-harness skill

Input: Types of tasks your agent performs
Output: Eval framework design (capability evals, regression evals, grader types)

Key metrics to track:

  • pass@1 — First attempt success rate
  • pass@3 — Success within 3 attempts (practical reliability)
  • pass^3 — All 3 trials succeed (stability for release-critical code)
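
The three metrics above can be computed from per-task trial results. The following is a minimal sketch, assuming each task has been run three times and each trial is recorded as a simple pass/fail boolean. The task names and the `trials` data are hypothetical placeholders, not output from the eval-harness skill.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success in the first k attempts."""
    if not trials:
        raise ValueError("no trial results given")
    return sum(any(results[:k]) for results in trials.values()) / len(trials)


def pass_pow_k(trials: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k attempts ALL succeed (pass^k)."""
    if not trials:
        raise ValueError("no trial results given")
    stable = sum(
        len(results) >= k and all(results[:k]) for results in trials.values()
    )
    return stable / len(trials)


# Hypothetical results: three trials per task (placeholder data).
trials = {
    "fix-null-check": [True, True, True],
    "add-endpoint": [False, True, True],
    "refactor-parser": [False, False, False],
    "rename-module": [True, True, True],
}

print(f"pass@1 = {pass_at_k(trials, 1):.2f}")   # pass@1 = 0.50
print(f"pass@3 = {pass_at_k(trials, 3):.2f}")   # pass@3 = 0.75
print(f"pass^3 = {pass_pow_k(trials, 3):.2f}")  # pass^3 = 0.50
```

Note that pass^3 can never exceed pass@3 on the same set of tasks: a task that passes all three trials also passes at least one.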

“Define specific eval tasks for our codebase.”

Tool: /eval (command)

Input: Task descriptions with success criteria
Output: YAML eval definitions with judge criteria
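
To make "task descriptions with success criteria" concrete, here is a rough Python sketch of the information such a definition might carry. The `EvalTask` class and every field name in it are assumptions made for illustration; the actual YAML schema produced by /eval is not shown on this page and may differ.

```python
from dataclasses import dataclass, field
from typing import List

# Judge types taken from the judge-type table on this page; the short
# identifiers themselves are hypothetical.
JUDGE_TYPES = {"code", "pattern", "model", "human"}


@dataclass
class EvalTask:
    """Hypothetical in-memory shape of one eval definition."""
    name: str
    description: str
    judge: str
    criteria: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of reasons this definition is not yet usable."""
        issues = []
        if self.judge not in JUDGE_TYPES:
            issues.append(f"unknown judge type: {self.judge}")
        if not self.criteria:
            issues.append("no success criteria given")
        return issues


task = EvalTask(
    name="fix-null-check",
    description="Handle a missing user record without raising",
    judge="code",
    criteria=["existing test suite passes", "new regression test passes"],
)
print(task.problems())  # []
```

Checking definitions for an unknown judge type or empty criteria before running them is one way to catch tasks that could never produce a meaningful pass/fail result.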

“Run the evals and show me the results.”

Tool: /eval (command) - check mode

Input: Defined eval tasks
Output: Pass/fail results with cost, time, and consistency data

Profile: Architect, deciding between Claude Code, Aider, and Codex for the team

“I want to compare three agents on our actual codebase tasks.”

Tool: agent-eval skill

Input: YAML task definitions representing real work (bug fixes, features, refactors)
Output: Comparison framework with judge types per task

Judge types:

| Type | How it works | Best for |
|---|---|---|
| Code-based | Runs tests (pytest, npm test) | Functional correctness |
| Pattern-based | Grep for expected patterns | Output format validation |
| Model-based | LLM evaluates quality | Subjective quality |
| Human | Manual review | Final validation |
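
The two automated judge types that need no model call can be sketched in a few lines. This is an illustrative approximation, not the agent-eval skill's implementation: `code_judge` and `pattern_judge` are hypothetical helper names, and the code-based judge simply treats exit status 0 from the test command as a pass.

```python
import re
import subprocess
import sys
from typing import List


def code_judge(command: List[str], cwd: str = ".") -> bool:
    """Code-based judge: pass if the test command exits with status 0."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    return result.returncode == 0


def pattern_judge(output: str, patterns: List[str]) -> bool:
    """Pattern-based judge: pass if every expected regex occurs in the output."""
    return all(re.search(p, output, re.MULTILINE) for p in patterns)


# Stand-in for a real test command such as ["pytest", "-q"] or ["npm", "test"].
print(code_judge([sys.executable, "-c", "raise SystemExit(0)"]))  # True

# Check that generated code defines the expected function.
generated = "def parse(x):\n    return x\n"
print(pattern_judge(generated, [r"^def parse\("]))  # True
```

Model-based and human judges are left out here because they depend on an LLM API or a review process that this page does not specify.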

“Generate a report comparing the agents.”

Tool: /eval (command) - report mode

Input: Completed eval runs from each agent
Output: Comparative report: pass rate, cost per task, time, consistency
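
As a rough idea of how such a report could be aggregated, the sketch below groups run records by agent and averages pass rate, cost, and time. The record fields (`agent`, `passed`, `cost_usd`, `seconds`), the agent names, and all numbers are hypothetical placeholders, not real measurements or the /eval report format. Consistency is omitted; it would need repeated trials per task.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarize(runs: List[dict]) -> Dict[str, Dict[str, float]]:
    """Group run records by agent and average the headline metrics."""
    by_agent = defaultdict(list)
    for run in runs:
        by_agent[run["agent"]].append(run)

    report = {}
    for agent, rows in by_agent.items():
        report[agent] = {
            "pass_rate": mean([1.0 if r["passed"] else 0.0 for r in rows]),
            "cost_per_task": mean([r["cost_usd"] for r in rows]),
            "avg_seconds": mean([r["seconds"] for r in rows]),
        }
    return report


# Placeholder records for two anonymous agents (made-up values).
runs = [
    {"agent": "agent-a", "task": "fix-bug", "passed": True, "cost_usd": 0.5, "seconds": 30.0},
    {"agent": "agent-a", "task": "add-feature", "passed": False, "cost_usd": 1.5, "seconds": 50.0},
    {"agent": "agent-b", "task": "fix-bug", "passed": True, "cost_usd": 0.25, "seconds": 20.0},
    {"agent": "agent-b", "task": "add-feature", "passed": True, "cost_usd": 0.75, "seconds": 40.0},
]

for agent, metrics in summarize(runs).items():
    print(agent, metrics)
```

Running every agent on the same task set keeps the averages comparable; otherwise a lower cost per task may only reflect easier tasks.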

1. Define evals (/eval + eval-harness skill)
2. Baseline run (measure current state)
3. Implement (/tdd command)
4. Check evals (/eval check)
5. pass@3 > 90% and pass^3 = 100%?
   ↓ yes: continue to step 6
   ↓ no: iterate on implementation
6. Verify + Review
7. Merge
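
The decision in step 5 can be expressed as a small gate function. This is a sketch under assumptions: `release_gate` is a hypothetical name, both metrics are passed as fractions between 0 and 1, and pass^3 is assumed to be measured on the release-critical evals, since on an identical task set pass^3 = 100% would already imply pass@3 = 100%.

```python
def release_gate(pass_at_3: float, pass_pow_3: float) -> str:
    """Return the next workflow step for the given metrics.

    pass_at_3:  fraction of tasks solved within 3 attempts (0.0 to 1.0)
    pass_pow_3: fraction of release-critical tasks passing all 3 trials
    """
    # Thresholds from step 5: pass@3 strictly above 90%, pass^3 exactly 100%.
    if pass_at_3 > 0.90 and pass_pow_3 == 1.0:
        return "verify-and-review"
    return "iterate"


print(release_gate(0.95, 1.0))  # verify-and-review
print(release_gate(0.95, 0.8))  # iterate
print(release_gate(0.90, 1.0))  # iterate (90% is not above the threshold)
```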

| Tool | Purpose | Audience | Output |
|---|---|---|---|
| eval-harness skill | Learn how to set up evals | QA Lead, Architect | Framework design knowledge |
| agent-eval skill | Compare different AI agents | Architect | Comparison methodology |
| /eval command | Define, run, and report on evals | All | Eval results with metrics |
| benchmark skill | Performance benchmarking | Senior Dev | Performance measurement approach |