crimes evals — the agentic harness
The 0.7.0 release introduces the eval harness at evals/. It is
the second half of the calibration story: where
crimes feedback captures Andrew’s verdicts on
real-world scans, the eval harness captures agent behaviour on a
pinned matrix of fixtures and scenarios.
It runs locally on a maintainer’s machine via the claude and
codex CLIs — both authenticate against existing subscriptions, so
no API keys, no per-call billing. CI never invokes a fresh agent
run; it only replays the structural rubric over already-committed
results.
Directory layout
    evals/
      fixtures/                    # one directory per fixture
        01-messy-ts-app/           # symlink → examples/messy-ts-app
        02-react-dashboard/        # OSS clone (gitignored body, committed meta)
        03-node-cli-tool/          # OSS clone
        04-monorepo/               # OSS clone
        05-stress-ia-drift/        # hand-crafted, committed
        06-stress-duplication/     # hand-crafted, committed
        07-stress-frontend/        # hand-crafted, committed
        08-stress-dependency/      # hand-crafted, committed
        09-clean-tiny/             # control: should produce zero findings
        10-clean-typed/            # control: well-tested strict-TS module
        fixtures.meta.json         # registry: id, path, name, kind, purpose
      scenarios/                   # one JSON file per scenario kind
        refactor.json
        bugfix.json
        review.json
        context.json
        plan.json
      results/                     # per-version pinned eval outputs
        0.7.0/
          claude/<scenario-id>.json
          codex/<scenario-id>.json
          summary.json
      runner/                      # the runner workspace package (evals-runner)
        src/index.ts               # orchestrator
        src/agents/claude.ts       # claude CLI shell-out
        src/agents/codex.ts        # codex CLI shell-out
        src/score.ts               # §5.5 structural rubric
        src/judge.ts               # opt-in --judge pass
        src/replay.ts              # evals:replay entry
        src/diff.ts                # evals:diff entry
        src/setup.ts               # evals:setup entry, clones OSS fixtures

Running
    # One-time per machine: clones OSS fixtures at their pinned SHAs.
    pnpm run evals:setup

    # Full matrix (every fixture × scenario × agent).
    pnpm run evals

    # Subset of the matrix.
    pnpm run evals -- --agent claude
    pnpm run evals -- --fixture 01
    pnpm run evals -- --scenario refactor

    # Add the opt-in judge-model pass.
    pnpm run evals -- --judge

Per fixture × scenario × agent invocation, the runner:
- Changes into evals/fixtures/<NN>-<name> and runs crimes scan -f json. That output is the scenario context.
- Composes the scenario prompt + scan JSON and shells out to the agent (claude -p ... --output-format json or codex exec --json ...).
- Captures the response and applies the §5.5 structural rubric.
- (Optional, --judge) Sends scenario + expected_artifacts + response to claude in an evaluator role; captures structured {score, reasoning} per judge_questions entry.
- Writes evals/results/<crimes-version>/<agent>/<scenario-id>.json atomically (tempdir + rename).
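The final step (tempdir + rename) could look roughly like the sketch below. It is not the runner's actual code: the helper names writeJsonAtomic and readJson are hypothetical, and it assumes the temp directory is created beside the target so the rename never crosses a filesystem boundary.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, renameSync, rmSync, writeFileSync } from "node:fs";
import { basename, dirname, join } from "node:path";

// Hypothetical helper: write JSON so a reader never sees a half-written file.
function writeJsonAtomic(target: string, data: unknown): void {
  const dir = dirname(target);
  mkdirSync(dir, { recursive: true });
  // Assumption: the temp dir lives next to the target (same filesystem),
  // which is what lets rename() replace the file in one step.
  const tmpDir = mkdtempSync(join(dir, ".tmp-"));
  const tmpFile = join(tmpDir, basename(target));
  try {
    writeFileSync(tmpFile, JSON.stringify(data, null, 2) + "\n", "utf8");
    renameSync(tmpFile, target); // replaces any existing result file
  } finally {
    // Remove the temp dir whether or not the write succeeded.
    rmSync(tmpDir, { recursive: true, force: true });
  }
}

// Hypothetical counterpart: read a committed result back, as a replay would.
function readJson(target: string): unknown {
  return JSON.parse(readFileSync(target, "utf8"));
}
```

If the process dies mid-write, only the temp file is affected; the previously committed result stays intact.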
Structural rubric
Each scenario carries an expected_artifacts block; the runner
checks the agent’s response against it:
- referenced_findings: extract every known detector id from the response (\b<id>\b); one pass per expected id.
- referenced_files: extract file-path-shaped tokens; one pass per expected path.
- forbidden_actions: pass when none of the listed regex patterns appear in the response.
- expected_priority: first detector id in the first 200 chars of the response must match.
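As a rough illustration of those four checks, here is a minimal sketch. It is not the contents of src/score.ts: the ScoreDetail shape, the knownDetectorIds parameter, and the regex used for "file-path-shaped" tokens are all assumptions made for the example.

```typescript
// Field names mirror expected_artifacts; every field is optional.
interface ExpectedArtifacts {
  referenced_findings?: string[];
  referenced_files?: string[];
  forbidden_actions?: string[];
  expected_priority?: string;
}

// Hypothetical detail shape; the real ScoreDetail may differ.
interface ScoreDetail {
  check: keyof ExpectedArtifacts;
  target: string;
  passed: boolean;
}

const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Index of the first \b<id>\b occurrence in `text`, or -1 when absent.
const indexOfId = (text: string, id: string): number => {
  const m = new RegExp(`\\b${escapeRegExp(id)}\\b`).exec(text);
  return m ? m.index : -1;
};

function scoreResponse(response: string, expected: ExpectedArtifacts, knownDetectorIds: string[]) {
  const details: ScoreDetail[] = [];

  // referenced_findings: one pass per expected id.
  for (const id of expected.referenced_findings ?? []) {
    details.push({ check: "referenced_findings", target: id, passed: indexOfId(response, id) >= 0 });
  }

  // referenced_files: assumed token shape is "segment/segment[/...]", trailing dots trimmed.
  const tokens = (response.match(/[\w.-]+(?:\/[\w.-]+)+/g) ?? []).map((t) => t.replace(/\.+$/, ""));
  for (const path of expected.referenced_files ?? []) {
    const passed = tokens.some((t) => t === path || t.endsWith(`/${path}`));
    details.push({ check: "referenced_files", target: path, passed });
  }

  // forbidden_actions: a single check that passes when no pattern matches.
  if (expected.forbidden_actions !== undefined) {
    const hit = expected.forbidden_actions.find((p) => new RegExp(p).test(response));
    details.push({ check: "forbidden_actions", target: hit ?? "(none matched)", passed: hit === undefined });
  }

  // expected_priority: earliest known detector id within the first 200 chars.
  if (expected.expected_priority !== undefined) {
    const head = response.slice(0, 200);
    const first = knownDetectorIds
      .map((id) => ({ id, index: indexOfId(head, id) }))
      .filter((hit) => hit.index >= 0)
      .sort((a, b) => a.index - b.index)[0];
    details.push({
      check: "expected_priority",
      target: expected.expected_priority,
      passed: first?.id === expected.expected_priority,
    });
  }

  const passed = details.filter((d) => d.passed).length;
  return { passed, failed: details.length - passed, details };
}
```

Because the checks are pure functions of the stored response text, re-running them later over committed results needs no agent call.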
Result shape:
    interface ScoreResult {
      scenario: string;
      agent: "claude" | "codex";
      crimes_version: string;
      timestamp: string;
      run_id: string;
      response: string; // preserved so `evals:replay` can re-score
      structural_score: {
        passed: number;
        failed: number;
        details: ScoreDetail[];
      };
      judge_score?: {
        overall: number; // 0-10, mean across per_question
        per_question: Array<{ question: string; score: number; reasoning: string }>;
        model: string;
      };
    }

A summary.json at the version root rolls up per-agent and
per-scenario-kind pass rates after every run.
Judge pass (opt-in, --judge)
--judge sends the scenario + expected_artifacts + agent’s response
back to claude (in a different role) with the scenario’s
judge_questions. Each answer must be a JSON object
{score: 0-10, reasoning: string}; malformed answers are marked
failed (score 0) rather than crashing the run.
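The "malformed answers score 0" rule can be sketched as a small parser. This is an illustration only, not the code in src/judge.ts: the function names are hypothetical, and treating an out-of-range score as malformed and an empty question list as an overall of 0 are assumptions.

```typescript
interface JudgeAnswer {
  question: string;
  score: number;
  reasoning: string;
}

// Hypothetical parser: never throws, so one bad answer cannot crash the run.
function parseJudgeAnswer(question: string, raw: string): JudgeAnswer {
  const failed = (why: string): JudgeAnswer => ({
    question,
    score: 0,
    reasoning: `malformed judge answer: ${why}`,
  });

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return failed("not valid JSON");
  }
  if (typeof parsed !== "object" || parsed === null) return failed("not a JSON object");

  const { score, reasoning } = parsed as { score?: unknown; reasoning?: unknown };
  // Assumption: a score outside 0-10 counts as malformed rather than being clamped.
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 10) {
    return failed("score must be a number between 0 and 10");
  }
  if (typeof reasoning !== "string") return failed("reasoning must be a string");
  return { question, score, reasoning };
}

// Mean across per-question scores; 0 for an empty list is an assumption.
function overallScore(answers: JudgeAnswer[]): number {
  if (answers.length === 0) return 0;
  return answers.reduce((sum, a) => sum + a.score, 0) / answers.length;
}
```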
The judge pass is opt-in and never gates anything — judge models are stochastic and we don’t want a structural-rubric-stable diff to churn on judge variance. Use it for “did the agent’s reasoning make sense, not just whether it referenced the right finding?” investigations.
CI: replay, never re-invoke
The PR workflow at .github/workflows/evals-pr.yml triggers on PRs
touching detector / scoring / language / CLI / evals code:
- Builds the PR’s crimes binary.
- Runs pnpm run evals:replay, which re-scores every committed result in evals/results/<latest-version>/ against the PR’s structural rubric. No agent calls.
- Runs pnpm run evals:diff, which compares per-agent pass rates from the replay to the pinned summary. Writes evals/diff-summary.md.
- Posts (or updates a single) PR comment with the markdown diff.
The diff is signal, not a gate. Pass-rate moves within ±10% are marked stable; moves outside that band are flagged (“improved” / “regression”). Investigate flagged regressions before merging detector changes.
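The stable / improved / regression labelling can be sketched as below. This is a guess at the logic, not the code in src/diff.ts: it assumes pass rates are fractions in [0, 1], that ±10% means ten percentage points in absolute terms, and that the boundary itself counts as stable. The function names are hypothetical.

```typescript
type Verdict = "stable" | "improved" | "regression";

// Assumption: `band` is an absolute difference in pass rate (0.10 = 10 points).
function classifyMove(pinnedRate: number, replayRate: number, band = 0.1): Verdict {
  const delta = replayRate - pinnedRate;
  // Small epsilon so floating-point noise (0.8 - 0.7 is slightly above 0.1)
  // does not push a boundary case out of the stable band.
  if (Math.abs(delta) <= band + 1e-9) return "stable";
  return delta > 0 ? "improved" : "regression";
}

// Compare per-agent pass rates; agents missing from either side are skipped.
function diffByAgent(
  pinned: Record<string, number>,
  replay: Record<string, number>,
): Record<string, Verdict> {
  const out: Record<string, Verdict> = {};
  for (const agent of Object.keys(pinned)) {
    const after = replay[agent];
    if (after !== undefined) out[agent] = classifyMove(pinned[agent], after);
  }
  return out;
}
```

For instance, a pinned rate of 0.70 replayed at 0.85 would be flagged as improved, while 0.90 replayed at 0.70 would be flagged as a regression.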
Adding a fixture
- Pick a slot number (02-04 reserved for OSS clones; 05-08 stress; 09-10 clean controls; start higher for new categories).
- Make evals/fixtures/NN-name/ and add the project files.
  - Hand-crafted: commit the body directly.
  - OSS clone: add a .crimes-eval-meta.json with {upstream, sha, license, purpose}. The body is gitignored; pnpm run evals:setup materialises it at the pinned SHA.
- Register the fixture in evals/fixtures/fixtures.meta.json.
- Add scenarios for the fixture to the relevant evals/scenarios/<kind>.json array.
Adding a scenario
Each scenario in evals/scenarios/<kind>.json is an object:

    {
      "id": "refactor-NN-name",
      "fixture": "NN",
      "kind": "refactor",
      "prompt": "...",
      "expected_artifacts": {
        "referenced_findings": ["..."],
        "referenced_files": ["..."],
        "forbidden_actions": ["..."],
        "expected_priority": "..."
      },
      "judge_questions": ["..."]
    }

expected_artifacts is what the structural rubric checks against.
judge_questions is what the opt-in judge pass asks. Both are
optional — supply only the checks that make sense for the scenario.
OSS fixture rot
OSS upstreams change. We mitigate with:
- Every clone pinned to a specific SHA in the meta file.
- .crimes-eval-meta.json records the upstream + license + purpose at vendoring time.
- pnpm run evals:setup fails loudly when a clone can’t be retrieved.
- If an upstream disappears, mark the fixture as archived: true in the meta file and the runner skips it.
Subscription CLI availability
At startup, the runner checks for both CLIs (which claude / which codex).
Missing CLIs are skipped with a setup message (“install it and
re-authenticate, then re-run”) rather than crashing mid-matrix.
--agent claude or --agent codex lets you run the matrix against
just the CLI you have available.
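The skip-rather-than-crash behaviour might be structured like this. It is a sketch under assumptions, not the runner's code: planAgents and its isInstalled callback are hypothetical names, and the lookup is injected so the planning logic stays independent of how the CLI is actually detected.

```typescript
type Agent = "claude" | "codex";

interface AgentPlan {
  run: Agent[];
  skipped: Array<{ agent: Agent; message: string }>;
}

// Decide up front which agents to run, so a missing CLI is reported once
// at startup instead of failing partway through the matrix.
function planAgents(requested: Agent[], isInstalled: (agent: Agent) => boolean): AgentPlan {
  const plan: AgentPlan = { run: [], skipped: [] };
  for (const agent of requested) {
    if (isInstalled(agent)) {
      plan.run.push(agent);
    } else {
      plan.skipped.push({
        agent,
        message: `${agent} CLI not found: install it and re-authenticate, then re-run`,
      });
    }
  }
  return plan;
}
```

In a real runner, isInstalled could wrap something like spawnSync("which", [agent]).status === 0 from node:child_process; that wiring is left out here to keep the sketch free of side effects.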
See also
- evals/README.md: the contributor-facing quick reference (same content but lives next to the harness).
- docs/scoring.md: the per-finding score model the eval rubric tests detectors against.
- §5 of .planning/archive/0.7.0-calibration-evidence-loop.md: the spec the harness implements.