
`crimes@0.7.5` — Eval Harness Graduation & Detector Trim

Draft release notes for the GitHub Release tagged v0.7.5. The body below is what should go in the Releases page when you cut the tag — that triggers .github/workflows/release.yml and publishes to npm via Trusted Publishing.

crimes@0.7.5 is the eval-harness graduation release. 0.7.0 shipped the first cut of the agentic eval harness alongside crimes feedback; this release rolls up five accumulated calibration patches (0.7.1 → 0.7.5) that turn the harness into production-grade tooling — and finally retires the one 0.6.0 detector whose trigger turned out to be a poor proxy on real-world repos.

  • Eval harness, production-grade. Scoring rubric hardened (charge + finding-id matching, not just slug), runs parallelised, scenario↔fixture coverage verifier wired into CI, variance sampling (evals:variance), opt-in judge-model pass, per-scenario-kind baselines, end-to-end timing, and --label for repeat-run comparisons without burning a version directory.
  • Scenario coverage went from 12 / 35 detectors → 33 / 34. Thirteen new scenarios across all five scenario kinds (bugfix, context, plan, refactor, review) exercise the previously-unreferenced detectors. Fixture 05-stress-ia-drift extended with JSX components, an admin route, a Commander bin, and modified nav.ts / docs/routes.md so five IA detectors that didn’t have a fixture before now do.
  • visual_regression_review_hint removed. Its trigger — file churn ≥ 0.7 on a UI .tsx file with weak test proximity — was a poor proxy for “needs visual review.” Active development looks identical to the trigger condition from the outside; the detector either said nothing (hand-crafted fixtures had no churn) or paged on every rapidly-iterating UI file. Removing it is cleaner than tuning it.
  • Several detector calibration fixes. large_function priority window, registrar-regex tightening, version-agnostic feedback hint copy, and an import-resolver fix for NodeNext specifiers (.js in the import, .ts on disk); the unresolved imports had been understating cross-file signal.
  • Crimes-on-crimes: zero remaining structural highs. Two large files (feedback.ts, context.ts) split into per-responsibility modules; two helper refactors (classifyShape, analyseRoute) removed the last large_function highs from the bundled CLI source. Scan JSON output is byte-identical to pre-split.
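To make the NodeNext import-resolver item above concrete, here is a minimal sketch of the idea: when a relative specifier ends in a runtime extension, also try the TypeScript source extensions that could have produced it. Everything in it (resolveSpecifier, normalise, SOURCE_EXTENSIONS, the in-memory file set) is hypothetical and written for illustration; it is not the resolver that ships in crimes.

```typescript
// Hypothetical helper names throughout; this is not the crimes resolver.
// Maps a runtime extension to the TypeScript source extensions that emit it.
const SOURCE_EXTENSIONS: Record<string, string[]> = {
  ".js": [".ts", ".tsx"],
  ".mjs": [".mts"],
  ".cjs": [".cts"],
};

// Collapses "." and ".." segments in a relative POSIX-style path.
function normalise(path: string): string {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return out.join("/");
}

// Returns the on-disk path a relative specifier resolves to, or null.
// `files` stands in for the file system: a set of normalised relative paths.
function resolveSpecifier(
  importerDir: string,
  specifier: string,
  files: ReadonlySet<string>,
): string | null {
  // Bare specifiers (packages) are out of scope for this sketch.
  if (!specifier.startsWith("./") && !specifier.startsWith("../")) return null;
  const joined = normalise(`${importerDir}/${specifier}`);
  if (files.has(joined)) return joined; // a real .js file on disk wins
  for (const [runtimeExt, sourceExts] of Object.entries(SOURCE_EXTENSIONS)) {
    if (!joined.endsWith(runtimeExt)) continue;
    const stem = joined.slice(0, -runtimeExt.length);
    for (const ext of sourceExts) {
      if (files.has(stem + ext)) return stem + ext; // .js specifier, .ts source
    }
  }
  return null; // no graph edge can be recorded for this import
}
```

Without the extension mapping, an import of ./a.js next to a file named a.ts returns null, which is the kind of missing graph edge that lets cross-file detectors undercount.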

Schema: schema_version stays at "0.1.0". The type field’s documented contract is that new values may appear without a bump and consumers should treat unknown values defensively — removing a value follows the same shape. Existing scan JSON files load unchanged.
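As an illustration of what "treat unknown values defensively" can look like on the consumer side, here is a small sketch. The Finding shape is reduced to two fields, and routeFinding and the queue names are invented for this example; only the detector type strings come from these notes.

```typescript
// Assumed minimal shape of a finding in scan JSON; real findings carry more fields.
interface Finding {
  type: string;
  severity: string;
}

// Hypothetical consumer: picks a review lane for a finding.
function routeFinding(finding: Finding): string {
  switch (finding.type) {
    case "large_function":
    case "large_file":
      return "refactor-queue";
    case "weak_test_signal":
      return "test-queue";
    case "visual_regression_review_hint":
      // Never hit for scans produced by 0.7.5; still fine for older scan files.
      return "design-queue";
    default:
      // Unknown or newly added detector types land here instead of throwing.
      return "triage";
  }
}
```

A consumer written this way needs no change for this release: the removed case simply stops being hit, and a value added in a later release falls through to the default branch.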

The 0.7.0 harness was a first cut: structural rubric, two agents, no variance sampling, fragile scenario↔fixture coupling that ended up understating both agents by ~10–20pp because the rubric was checking for detectors that the fixtures didn’t actually fire (74% of “agent failures” turned out to be measurement bugs, not real agent misses).

What landed across 0.7.1–0.7.5:

  • Hardened scorer (0.7.1). referenced_findings now matches by detector type, by finding id (crime_NNNNN), or by human charge name (“God Function”), not just by detector slug. Agents that reference a finding in any of the three ways now score correctly.
  • Parallelised runs (0.7.1). Default concurrency = 4. A 50-run matrix finishes in 7–9 minutes on a single laptop.
  • Import-resolver fix for NodeNext (0.7.2). TS source files using .js extensions in import specifiers (the NodeNext convention) now resolve to their on-disk .ts counterparts. Several cross-file detectors were silently undercounting because the graph was missing edges.
  • Fixture extensions (0.7.2). Fixtures 05/06/07/08 had detectors in their meta but weren’t actually firing them at the design thresholds. All four updated so the listed detectors fire as documented (concept_alias_drift, all three duplication detectors, design/responsive/a11y at Card.tsx, deep_import via bare specifier).
  • Scenario↔fixture coverage verifier (0.7.2). pnpm --filter evals-runner evals:verify-scenarios enforces that every referenced_findings entry on a scenario produces an actual finding on the fixture’s scan output. Wired into .github/workflows/evals-pr.yml so scenario / fixture drift fails the build instead of silently understating pass rates. See evals/README.md § Scenario↔fixture coverage discipline.
  • Variance sampling (0.7.2). evals:variance ranks scenarios by per-scenario mean ± stddev across repeat samples taken with --label r2, --label r3, etc. Lets us tell agent inconsistency from real detector regressions.
  • Opt-in judge-model pass (0.7.2). pnpm run evals -- --judge adds a per-scenario qualitative scoring pass that the same claude CLI runs in a different role. Captures structured per-question scores; complements the structural rubric.
  • End-to-end duration printed on completion (0.7.2). No more guessing how long a full matrix took.
  • --label flag (0.7.2). Repeat-run variance sampling no longer burns a patch version per sample; results land in evals/results/<version>-<label>/.
  • Continuous-improvement baseline policy (0.7.2 → 0.7.3). Patch bumps for any calibration or product change that moves the baseline, no Changesets / no tags. The accumulated patches roll into the next real release — which is this one.
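The hardened scorer from the first bullet can be pictured roughly as follows. This is a sketch under stated assumptions, not the runner's code: the ExpectedFinding shape, the function names, and the case-insensitive substring match are all illustrative choices, and pairing "God Function" with large_function in the usage note is an assumption as well.

```typescript
// Illustrative shape; field names are assumptions, not the runner's schema.
interface ExpectedFinding {
  type: string;   // detector slug, e.g. "large_function"
  id: string;     // finding id in the crime_NNNNN form
  charge: string; // human charge name, e.g. "God Function"
}

// True when the agent's answer references the finding in any of the three ways.
function referencesFinding(answer: string, finding: ExpectedFinding): boolean {
  const haystack = answer.toLowerCase();
  return [finding.type, finding.id, finding.charge].some(
    (needle) => needle !== "" && haystack.includes(needle.toLowerCase()),
  );
}

// Fraction of expected findings the answer references (1 when nothing is expected).
function referencedScore(answer: string, expected: ExpectedFinding[]): number {
  if (expected.length === 0) return 1;
  const hits = expected.filter((f) => referencesFinding(answer, f)).length;
  return hits / expected.length;
}
```

Under this sketch, an answer that says "split the god function" gets credit for a finding of type large_function with that charge name even though it never mentions the slug, which is the case a slug-only matcher scores as a miss.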

Track B — detector coverage in scenarios


At 0.7.3 the eval matrix had 12 scenarios, and only 12 of 35 detectors (34%) were referenced by any scenario. 0.7.4 closed that gap:

  • 13 new scenarios across all 5 scenario kinds:
    • bugfix-04-weak-tests — weak_test_signal
    • context-04-monorepo — missing_agent_context
    • plan-04-hotspots — high_fan_in_fan_out + large_file + large_function
    • refactor-01-petty — option_bag_junk_drawer + return_shape_roulette + negative_flag_maze
    • refactor-04-monorepo — name_behavior_mismatch + option_bag_junk_drawer + return_shape_roulette
    • refactor-05-action-labels — action_label_drift + copy_ia_drift
    • refactor-08-architecture — layer_violation + deep_import
    • refactor-02-component-shape — duplicate_component_shape
    • refactor-01-large-file — large_file
    • review-01-ia-drift — route_metadata_drift + duplicated_navigation_source + docs_code_drift
    • review-02-react-dashboard — commented_out_code + magic_domain_literal_scatter + duplicated_role_status_plan_check
    • review-03-node-cli-tool — commented_out_code + logic_in_comments + weak_test_signal + magic_domain_literal_scatter
    • review-05-permission-and-parallel — permission_ia_drift + parallel_destination + command_drift_docs_code_drift
  • Fixture 05 extensions so five IA detectors that previously had no fixture now do: three JSX components (UserList.tsx, TeamList.tsx, SeatList.tsx) drifting destructive-action verbs for action_label_drift / copy_ia_drift; an admin route + role attribute in nav + manager mention in docs for permission_ia_drift; a parallel admin/billing-plans.ts mirroring the top-level billing-plans.ts for parallel_destination; a Commander bin (bin/iaq.ts) advertising only list + get while docs reference three unadvertised subcommands for command_drift_docs_code_drift.
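The coverage figures in this section are the share of detectors referenced by at least one scenario. A minimal sketch of that calculation follows, assuming each scenario's referenced_findings is a list of detector slugs; the Scenario and Coverage shapes and the detectorCoverage name are made up for the example.

```typescript
// Illustrative shapes; referenced_findings is treated as a list of detector slugs.
interface Scenario {
  name: string;
  referenced_findings: string[];
}

interface Coverage {
  covered: number;
  total: number;
  ratio: number;
  uncovered: string[];
}

// Counts detectors referenced by at least one scenario and lists the rest.
function detectorCoverage(detectors: string[], scenarios: Scenario[]): Coverage {
  const referenced = new Set<string>();
  for (const scenario of scenarios) {
    for (const slug of scenario.referenced_findings) referenced.add(slug);
  }
  const uncovered = detectors.filter((d) => !referenced.has(d));
  const covered = detectors.length - uncovered.length;
  const ratio = detectors.length === 0 ? 0 : covered / detectors.length;
  return { covered, total: detectors.length, ratio, uncovered };
}
```

With the counts quoted above, 12 of 35 gives a ratio of about 0.34 and 33 of 34 gives about 0.97; the uncovered list is what tells you which detector still needs a scenario.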

Track C — detector calibration fixes (0.7.3)

  • large_function priority window. Calibration of the priority window that determines which large_function findings rank ahead of others on the default top-N output.
  • Registrar regex tightening. cli_command_registrar shape recognition tightened to reduce false positives on adjacent builder patterns.
  • Version-agnostic feedback hint copy. The inline Give feedback: … hint under findings no longer hard-codes the current minor version.

Track D — visual_regression_review_hint removed (0.7.5)


The detector shipped in 0.6.0 with this trigger:

churn(file) >= 0.7 AND testGap(file) >= 0.7 AND responsive_complexity > 0

In practice:

  • Hand-crafted fixtures have no churn (one or two commits), so the detector said nothing on any fixture in the eval matrix.
  • On real-world repos, “file changed many times recently” is just as likely to mean “under active development” as “needs visual review.” A .tsx file getting better quickly trips the trigger as cleanly as one regressing.

There’s no clean way to tune the churn proxy to distinguish those two cases without a screenshot pipeline or LLM judgement — both of which violate the deterministic wedge. So the detector is gone.
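To see why the proxy cannot separate the two cases, it helps to write the trigger as a predicate. The sketch below assumes churn and testGap are normalised to 0..1 (the 0.7 thresholds suggest this, but the notes do not say so), and the field names and sample values are invented for illustration.

```typescript
// Inputs the removed detector looked at; field names are illustrative.
interface FileSignals {
  churn: number;                // assumed normalised to 0..1
  testGap: number;              // assumed normalised to 0..1
  responsiveComplexity: number; // > 0 when responsive styling is present
}

// The 0.6.0 trigger as quoted above.
function triggersVisualReviewHint(s: FileSignals): boolean {
  return s.churn >= 0.7 && s.testGap >= 0.7 && s.responsiveComplexity > 0;
}

// A file that is improving quickly and one that is regressing look the same
// to the trigger, because neither input records the direction of change.
const improvingFast: FileSignals = { churn: 0.9, testGap: 0.8, responsiveComplexity: 3 };
const regressing: FileSignals = { churn: 0.9, testGap: 0.8, responsiveComplexity: 3 };

// A hand-crafted fixture file with one or two commits never clears the churn bar.
const fixtureFile: FileSignals = { churn: 0.05, testGap: 0.8, responsiveComplexity: 3 };
```

Both improvingFast and regressing fire and fixtureFile never does. Moving the thresholds shifts how often the predicate fires, not which kind of file it fires on.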

The Finding.type value visual_regression_review_hint no longer appears in scan output. Consumers that explicitly switch on it should treat it like any other removed value: the case becomes unreachable. Detector count goes from 35 → 34.

Track E — crimes-on-crimes housekeeping (0.7.3)


The 0.7.0 release split two of the four legitimately-large files flagged by the dogfood signal. The remaining two cleared in 0.7.3:

  • packages/cli/src/commands/feedback.ts split into feedback/write.ts plus the four read subcommands (list.ts, summary.ts, export.ts, recheck.ts) under feedback/read/.
  • packages/cli/src/commands/context.ts split into 4 modules so the command-handler file no longer trips large_file on its own scan.
  • classifyShape in packages/language-js/src/parse/shapes.ts refactored from a 90-line switch into a chain of try* helpers.
  • analyseRoute in packages/core/src/detectors/route-metadata-drift.ts had its source / evidence / related logic extracted into helpers.

Scan JSON output is byte-identical to pre-split — these are maintainability-only changes that consume the new structure but emit the same findings.

50 runs (25 scenarios × 2 agents) in 8m 24s on the reference laptop:

| Agent | Structural pass rate |
| --- | --- |
| claude | 0.82 |
| codex | 0.81 |

Per scenario kind:

| Kind | claude | codex |
| --- | --- | --- |
| bugfix | 0.80 | 0.70 |
| context | 1.00 | 1.00 |
| plan | 0.64 | 0.45 |
| refactor | 0.87 | 0.83 |
| review | 0.79 | 0.91 |

The two soft spots — plan and parts of bugfix — both come from the new 0.7.4 scenarios (plan-04-hotspots, bugfix-04-weak-tests) which exercise detector types the agents hadn’t been tested on before. These set the floor for the 0.8.0 tuning work.

Result transcripts and rubric scores are committed at evals/results/0.7.5/ and replayable against future scoring tweaks via pnpm run evals:replay.

  • No schema bump. Still schema_version: "0.1.0". Existing scan JSON consumers keep working.
  • No new detectors. This release subtracts one. The 0.6.0 slate minus visual_regression_review_hint is what we calibrate for 0.8.0.
  • No new commands. CLI surface is byte-equivalent to 0.7.0.
  • No LLM-assisted detector modes. Evals call LLMs; detectors stay deterministic.
  • No Python. Still deferred per PRD §26.

The eval baseline at 0.7.5 is the calibration reference. 0.8.0 will:

  • Tune detector thresholds / severities against the per-scenario-kind pass rates and the variance signal from evals:variance.
  • Expose resurface trajectories from the crimes feedback recheck loop that 0.7.0 wired up; every fp → tp flip on a real-world repo is a tuning win.
  • Land the agent-behaviour regression gates the PR replay workflow was built for.
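For readers wondering what the variance signal looks like in practice, here is a rough sketch of ranking scenarios by mean ± stddev across repeat samples. The input shape, the rankByVariance name, the use of the sample (n - 1) standard deviation, and the noisiest-first ordering are all assumptions made for illustration; evals:variance may differ on each point.

```typescript
// One row per scenario in the ranked output.
interface VarianceRow {
  scenario: string;
  mean: number;
  stddev: number;
}

// scores[scenario] holds one structural score per repeat sample (e.g. r1, r2, r3).
function rankByVariance(scores: Record<string, number[]>): VarianceRow[] {
  const rows: VarianceRow[] = [];
  for (const [scenario, samples] of Object.entries(scores)) {
    if (samples.length === 0) continue; // nothing to summarise
    const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
    // Sample standard deviation (n - 1); 0 when there is a single sample.
    const sumSq = samples.reduce((a, b) => a + (b - mean) ** 2, 0);
    const stddev = samples.length > 1 ? Math.sqrt(sumSq / (samples.length - 1)) : 0;
    rows.push({ scenario, mean, stddev });
  }
  // Noisiest scenarios first.
  return rows.sort((a, b) => b.stddev - a.stddev);
}
```

A scenario with a low mean and near-zero stddev fails consistently, which points at the detector or the scenario. A scenario with a high stddev points at agent inconsistency, so threshold tuning is unlikely to help there.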

Release notes will quote the specific calibration deltas that drove each tuning change — that’s the value of the loop you’re participating in by upgrading.

npm install -g crimes@0.7.5
crimes --version # crimes@0.7.5

If you have .crimes/suppressions.json entries for visual_regression_review_hint fingerprints, they’ll quietly become no-ops (the detector that generated them is gone). You can leave them or run crimes audit-suppressions to spot them and crimes unignore to clean up.

If your CI pipeline runs crimes scan or crimes baseline check with --fail-on high, nothing changes: visual_regression_review_hint only ever emitted severity: "low".