crimes evals — the agentic harness

The 0.7.0 release introduces the eval harness at evals/. It is the second half of the calibration story: where crimes feedback captures Andrew’s verdicts on real-world scans, the eval harness captures agent behaviour on a pinned matrix of fixtures and scenarios.

It runs locally on a maintainer’s machine via the claude and codex CLIs — both authenticate against existing subscriptions, so no API keys, no per-call billing. CI never invokes a fresh agent run; it only replays the structural rubric over already-committed results.

Directory layout

evals/
  fixtures/                  # one directory per fixture
    01-messy-ts-app/         # symlink → examples/messy-ts-app
    02-react-dashboard/      # OSS clone (gitignored body, committed meta)
    03-node-cli-tool/        # OSS clone
    04-monorepo/             # OSS clone
    05-stress-ia-drift/      # hand-crafted, committed
    06-stress-duplication/   # hand-crafted, committed
    07-stress-frontend/      # hand-crafted, committed
    08-stress-dependency/    # hand-crafted, committed
    09-clean-tiny/           # control: should produce zero findings
    10-clean-typed/          # control: well-tested strict-TS module
    fixtures.meta.json       # registry: id, path, name, kind, purpose
  scenarios/                 # one JSON file per scenario kind
    refactor.json bugfix.json review.json context.json plan.json
  results/                   # per-version pinned eval outputs
    0.7.0/
      claude/<scenario-id>.json
      codex/<scenario-id>.json
      summary.json
  runner/                    # the runner workspace package (evals-runner)
    src/index.ts             # orchestrator
    src/agents/claude.ts     # claude CLI shell-out
    src/agents/codex.ts      # codex CLI shell-out
    src/score.ts             # §5.5 structural rubric
    src/judge.ts             # opt-in --judge pass
    src/replay.ts            # evals:replay entry
    src/diff.ts              # evals:diff entry
    src/setup.ts             # evals:setup entry — clones OSS fixtures

Running

# One-time per machine — clones OSS fixtures at their pinned SHAs.
pnpm run evals:setup

# Full matrix (every fixture × scenario × agent).
pnpm run evals

# Subset of the matrix.
pnpm run evals -- --agent claude
pnpm run evals -- --fixture 01
pnpm run evals -- --scenario refactor

# Add the opt-in judge-model pass.
pnpm run evals -- --judge

Per fixture × scenario × agent invocation, the runner:

cd evals/fixtures/<NN>-<name> and runs crimes scan -f json. That output is the scenario context.
Composes the scenario prompt + scan JSON and shells out to the agent (claude -p ... --output-format json or codex exec --json ...).
Captures the response and applies the §5.5 structural rubric.
(Optional, --judge) sends scenario + expected_artifacts + response to claude in an evaluator role; captures structured {score, reasoning} per judge_questions entry.
Writes evals/results/<crimes-version>/<agent>/<scenario-id>.json atomically (tempdir + rename).

Structural rubric

Each scenario carries an expected_artifacts block; the runner checks the agent’s response against it:

referenced_findings — extract every known detector-id from the response (\b<id>\b); one pass per expected id.
referenced_files — extract file-path-shaped tokens; one pass per expected path.
forbidden_actions — pass when none of the listed regex patterns appear in the response.
expected_priority — first detector id in the first 200 chars of the response must match.

Result shape:

interface ScoreResult {
  scenario: string;
  agent: "claude" | "codex";
  crimes_version: string;
  timestamp: string;
  run_id: string;
  response: string;            // preserved so `evals:replay` can re-score
  structural_score: {
    passed: number;
    failed: number;
    details: ScoreDetail[];
  };
  judge_score?: {
    overall: number;           // 0-10, mean across per_question
    per_question: Array<{ question: string; score: number; reasoning: string }>;
    model: string;
  };
}

A summary.json at the version root rolls up per-agent and per-scenario-kind pass rates after every run.

Judge pass (opt-in, `--judge`)

--judge sends the scenario + expected_artifacts + agent’s response back to claude (in a different role) with the scenario’s judge_questions. Each answer must be a JSON object {score: 0-10, reasoning: string}; malformed answers are marked failed (score 0) rather than crashing the run.

The judge pass is opt-in and never gates anything — judge models are stochastic and we don’t want a structural-rubric-stable diff to churn on judge variance. Use it for “did the agent’s reasoning make sense, not just whether it referenced the right finding?” investigations.

CI: replay, never re-invoke

The PR workflow at .github/workflows/evals-pr.yml triggers on PRs touching detector / scoring / language / CLI / evals code:

Builds the PR’s crimes binary.
Runs pnpm run evals:replay — re-scores every committed result in evals/results/<latest-version>/ against the PR’s structural rubric. No agent calls.
Runs pnpm run evals:diff — compares per-agent pass rates from the replay to the pinned summary. Writes evals/diff-summary.md.
Posts (or updates a single) PR comment with the markdown diff.

The diff is signal, not a gate. Pass-rate moves within ±10% are marked stable; moves outside that band are flagged (“improved” / “regression”). Investigate flagged regressions before merging detector changes.

Adding a fixture

Pick a slot number (02-04 reserved for OSS clones; 05-08 stress; 09-10 clean controls; start higher for new categories).
Make evals/fixtures/NN-name/ and add the project files.
- Hand-crafted: commit the body directly.
- OSS clone: add a .crimes-eval-meta.json with {upstream, sha, license, purpose}. The body is gitignored; pnpm run evals:setup materialises it at the pinned SHA.
Register the fixture in evals/fixtures/fixtures.meta.json.
Add scenarios for the fixture to the relevant evals/scenarios/<kind>.json array.

Adding a scenario

Each scenario in evals/scenarios/<kind>.json is an object:

{
  "id": "refactor-NN-name",
  "fixture": "NN",
  "kind": "refactor",
  "prompt": "...",
  "expected_artifacts": {
    "referenced_findings": ["..."],
    "referenced_files": ["..."],
    "forbidden_actions": ["..."],
    "expected_priority": "..."
  },
  "judge_questions": ["..."]
}

expected_artifacts is what the structural rubric checks against. judge_questions is what the opt-in judge pass asks. Both are optional — supply only the checks that make sense for the scenario.

OSS fixture rot

OSS upstreams change. We mitigate with:

Every clone pinned to a specific SHA in the meta file.
.crimes-eval-meta.json records the upstream + license + purpose at vendoring time.
pnpm run evals:setup fails loudly when a clone can’t be retrieved.
If an upstream disappears, mark the fixture as archived: true in the meta file and the runner skips it.

Subscription CLI availability

The runner does a startup which claude / which codex check. Missing CLIs are skipped with a setup message (“install it and re-authenticate, then re-run”) rather than crashing mid-matrix. --agent claude or --agent codex lets you run the matrix against just the CLI you have available.

crimes evals — the agentic harness

crimes evals — the agentic harness

Directory layout

Running

Structural rubric

Judge pass (opt-in, --judge)

CI: replay, never re-invoke

Adding a fixture

Adding a scenario

OSS fixture rot

Subscription CLI availability

See also

Judge pass (opt-in, `--judge`)