Evaluation & Benchmarks

Tools for measuring agent quality, running evaluations, and tracking pass@k metrics over time.

“How do I compare Claude Code vs Aider vs Codex on my tasks?”

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, reporting pass rate, cost, time, and consistency.
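
As a rough illustration of the four metrics named above, the sketch below aggregates per-run records into a per-agent summary. The `Run` record, its field names, and the definition of consistency used here (the share of tasks on which every repeated run had the same outcome) are assumptions made for this example and may not match how the skill itself computes them.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Run:
    """One attempt by one agent at one task (hypothetical record shape)."""
    agent: str
    task: str
    passed: bool
    cost_usd: float
    seconds: float


def summarize(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Return pass rate, mean cost, mean time, and consistency per agent."""
    by_agent = defaultdict(list)
    for run in runs:
        by_agent[run.agent].append(run)

    summary = {}
    for agent, agent_runs in by_agent.items():
        # Group pass/fail outcomes by task so repeated runs can be compared.
        outcomes_by_task = defaultdict(list)
        for run in agent_runs:
            outcomes_by_task[run.task].append(run.passed)
        # Assumed definition: a task is "consistent" if all its runs agree.
        consistent = sum(
            1 for outcomes in outcomes_by_task.values() if len(set(outcomes)) == 1
        )
        summary[agent] = {
            "pass_rate": mean(1.0 if r.passed else 0.0 for r in agent_runs),
            "mean_cost_usd": mean(r.cost_usd for r in agent_runs),
            "mean_seconds": mean(r.seconds for r in agent_runs),
            "consistency": consistent / len(outcomes_by_task),
        }
    return summary
```

Running each agent several times per task is what makes the consistency figure meaningful; with a single run per task it is trivially 1.0.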

“How do I benchmark my application's performance?”

Use this skill to measure performance baselines, detect regressions before/after PRs, and compare stack alternatives.
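
One simple way to frame regression detection is to time the same workload before and after a change and flag the change when the median slows down by more than a tolerance. The sketch below assumes that approach; the function names and the 10% default tolerance are illustrative choices, not the skill's actual behavior.

```python
import statistics
import time
from typing import Callable, List, Sequence


def time_samples(fn: Callable[[], object], runs: int = 5) -> List[float]:
    """Call fn `runs` times and return the wall-clock duration of each call."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def is_regression(
    baseline: Sequence[float],
    candidate: Sequence[float],
    tolerance: float = 0.10,
) -> bool:
    """True if the candidate median exceeds the baseline median by more than tolerance."""
    base = statistics.median(baseline)
    cand = statistics.median(candidate)
    return cand > base * (1.0 + tolerance)
```

The median is used rather than the mean so that a single slow outlier run does not trigger a false alarm. Noisy environments may need more runs or a wider tolerance.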

“How do I set up evaluation-driven development? What metrics matter?”

Formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles.
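
For the pass@k metric mentioned at the top of this section, a commonly used definition is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of attempts sampled for a task and c is the number that passed. The sketch below assumes that definition; the tools listed here may compute pass@k differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled attempts passes.

    n: total attempts generated for the task
    c: number of those attempts that passed
    k: sample size being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so any k-sample must contain a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

To track the metric over time, compute it per task and average across tasks for each evaluation run. With k = 1 it reduces to the plain pass rate c / n.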