`crimes@0.7.5` — Eval Harness Graduation & Detector Trim
Draft release notes for the GitHub Release tagged v0.7.5. The body below is what should go on the Releases page when you cut the tag; cutting the tag triggers `.github/workflows/release.yml` and publishes to npm via Trusted Publishing.
crimes@0.7.5 is the eval-harness graduation release. 0.7.0 shipped
the first cut of the agentic eval harness alongside crimes feedback;
this release rolls up five accumulated calibration patches (0.7.1 →
0.7.5) that turn the harness into production-grade tooling — and
finally retires the one 0.6.0 detector whose trigger turned out to be
a poor proxy on real-world repos.
- Eval harness, production-grade. Scoring rubric hardened (charge + finding-id matching, not just slug), runs parallelised, scenario↔fixture coverage verifier wired into CI, variance sampling (`evals:variance`), opt-in judge-model pass, per-scenario-kind baselines, end-to-end timing, and `--label` for repeat-run comparisons without burning a version directory.
- Scenario coverage went from 12 / 35 detectors → 33 / 34. Thirteen new scenarios across all five scenario kinds (`bugfix`, `context`, `plan`, `refactor`, `review`) exercise the previously-unreferenced detectors. Fixture 05-stress-ia-drift extended with JSX components, an admin route, a Commander bin, and modified `nav.ts` / `docs/routes.md` so five IA detectors that didn’t have a fixture before now do.
- `visual_regression_review_hint` removed. Its trigger (file churn ≥ 0.7 on a UI `.tsx` file with weak test proximity) was a poor proxy for “needs visual review.” Active development looks identical to the trigger condition from the outside; the detector either said nothing (hand-crafted fixtures had no churn) or paged on every rapidly-iterating UI file. Removing it is cleaner than tuning it.
- Several detector calibration fixes. `large_function` priority window, registrar-regex tightening, version-agnostic feedback hint copy, and an import-resolver fix for NodeNext `.js` → `.ts` specifiers that was understating cross-file signal.
- Crimes-on-crimes: zero remaining structural highs. Two large files (`feedback.ts`, `context.ts`) split into per-responsibility modules; two helper refactors (`classifyShape`, `analyseRoute`) removed the last `large_function` highs from the bundled CLI source. Scan JSON output is byte-identical to pre-split.
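To make the NodeNext import-resolver fix concrete, here is a minimal sketch of the mapping involved: under NodeNext, a TypeScript source imports `./graph.js` even though the file on disk is `./graph.ts`. The names `resolveNodeNextSpecifier`, `joinPosix`, and `knownFiles` are hypothetical and not taken from the crimes source; the real resolver may handle more cases.

```typescript
// Hypothetical sketch, not the crimes implementation.
// `knownFiles` stands in for the project-relative paths the scanner has seen.

// Extension pairs TypeScript itself uses when a .js-style specifier points at TS source.
const JS_TO_TS: Array<[string, string[]]> = [
  [".js", [".ts", ".tsx"]],
  [".mjs", [".mts"]],
  [".cjs", [".cts"]],
];

// Join a directory and a relative specifier using "/" separators only.
function joinPosix(dir: string, rel: string): string {
  const parts = dir === "" ? [] : dir.split("/");
  for (const seg of rel.split("/")) {
    if (seg === "." || seg === "") continue;
    if (seg === "..") parts.pop();
    else parts.push(seg);
  }
  return parts.join("/");
}

function resolveNodeNextSpecifier(
  importerPath: string,
  specifier: string,
  knownFiles: ReadonlySet<string>,
): string | undefined {
  if (!specifier.startsWith(".")) return undefined; // bare specifiers are out of scope here
  const slash = importerPath.lastIndexOf("/");
  const dir = slash === -1 ? "" : importerPath.slice(0, slash);
  const target = joinPosix(dir, specifier);
  if (knownFiles.has(target)) return target; // a real .js file exists on disk
  for (const [jsExt, tsExts] of JS_TO_TS) {
    if (!target.endsWith(jsExt)) continue;
    const stem = target.slice(0, -jsExt.length);
    for (const tsExt of tsExts) {
      if (knownFiles.has(stem + tsExt)) return stem + tsExt;
    }
  }
  return undefined; // unresolved: this is the "missing edge" case
}

const projectFiles = new Set(["src/commands/scan.ts", "src/util/graph.ts"]);
console.log(resolveNodeNextSpecifier("src/commands/scan.ts", "../util/graph.js", projectFiles));
// prints "src/util/graph.ts"
```

Without the extension swap, the lookup for `src/util/graph.js` returns nothing and the import edge is dropped, which is how cross-file signal ends up undercounted.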
Schema: schema_version stays at "0.1.0". The type field’s
documented contract is that new values may appear without a bump and
consumers should treat unknown values defensively — removing a value
follows the same shape. Existing scan JSON files load unchanged.
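As an illustration of what treating `type` defensively can look like on the consumer side, here is a small sketch. Only the `type` field and the detector slugs come from these notes; the rest of the `Finding` shape, the `Bucket` names, and `bucketFor` are assumptions made for the example.

```typescript
// Consumer-side sketch under assumed shapes; not part of the crimes API.
interface Finding {
  type: string; // open-ended: values may be added or removed without a schema bump
}

type Bucket = "size" | "frontend" | "unknown";

function bucketFor(finding: Finding): Bucket {
  switch (finding.type) {
    case "large_file":
    case "large_function":
      return "size";
    case "visual_regression_review_hint":
      // No longer emitted from 0.7.5 onwards; harmless to keep for older scan files.
      return "frontend";
    default:
      // Unknown or future detector: degrade gracefully instead of throwing.
      return "unknown";
  }
}

console.log(bucketFor({ type: "large_function" }));       // prints "size"
console.log(bucketFor({ type: "some_future_detector" })); // prints "unknown"
```

A consumer written this way needs no change for this release: the removed value simply stops arriving.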
What shipped (across 0.7.1 → 0.7.5)
Track A: eval harness graduation
The 0.7.0 harness was a first cut: structural rubric, two agents, no variance sampling, and fragile scenario↔fixture coupling. That coupling ended up understating both agents by ~10–20pp because the rubric was checking for detectors that the fixtures didn’t actually fire (74% of “agent failures” turned out to be measurement bugs, not real agent misses).
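The measurement bug described above, a rubric expecting findings that the fixture never emits, can be checked mechanically. The sketch below shows the idea under assumed shapes: `ScenarioSpec`, `FixtureScans`, and `findCoverageDrift` are hypothetical names, and `referenced_findings` is assumed to hold plain detector slugs.

```typescript
// Hypothetical sketch of a scenario-vs-fixture check; shapes are assumptions.
interface ScenarioSpec {
  id: string;
  fixture: string;
  referenced_findings: string[]; // detector slugs the rubric expects
}

// fixture name -> detector types present in that fixture's scan output
type FixtureScans = Record<string, string[]>;

function findCoverageDrift(specs: ScenarioSpec[], scans: FixtureScans): string[] {
  const problems: string[] = [];
  for (const spec of specs) {
    const fired = new Set(scans[spec.fixture] ?? []);
    for (const expected of spec.referenced_findings) {
      if (!fired.has(expected)) {
        problems.push(`${spec.id}: "${expected}" never fires on ${spec.fixture}`);
      }
    }
  }
  return problems;
}

const exampleScans: FixtureScans = { "example-fixture": ["large_file"] };
const exampleSpecs: ScenarioSpec[] = [
  {
    id: "example-scenario",
    fixture: "example-fixture",
    referenced_findings: ["large_file", "deep_import"],
  },
];
console.log(findCoverageDrift(exampleSpecs, exampleScans));
// prints [ 'example-scenario: "deep_import" never fires on example-fixture' ]
```

Any non-empty result means an agent would be marked down for a finding it could never have seen.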
What landed across 0.7.1–0.7.5:
- Hardened scorer (0.7.1). `referenced_findings` now matches by detector type, by finding id (`crime_NNNNN`), or by human charge name (“God Function”), not just by detector slug. Agents that reference a finding in any of the three ways now score correctly.
- Parallelised runs (0.7.1). Default concurrency = 4. A 50-run matrix finishes in 7–9 minutes on a single laptop.
- Import-resolver fix for NodeNext (0.7.2). TS source files using `.js` extensions in `import` specifiers (the NodeNext convention) now resolve to their on-disk `.ts` counterparts. Several cross-file detectors were silently undercounting because the graph was missing edges.
- Fixture extensions (0.7.2). Fixtures 05/06/07/08 had detectors in their meta but weren’t actually firing them at the design thresholds. All four updated so the listed detectors fire as documented (concept_alias_drift, all three duplication detectors, design/responsive/a11y at Card.tsx, deep_import via bare specifier).
- Scenario↔fixture coverage verifier (0.7.2). `pnpm --filter evals-runner evals:verify-scenarios` enforces that every `referenced_findings` entry on a scenario produces an actual finding on the fixture’s scan output. Wired into `.github/workflows/evals-pr.yml` so scenario / fixture drift fails the build instead of silently understating pass rates. See `evals/README.md` § Scenario↔fixture coverage discipline.
- Variance sampling (0.7.2). `evals:variance` ranks scenarios by per-scenario mean ± stddev across repeat samples taken with `--label r2`, `--label r3`, etc. Lets us tell agent inconsistency from real detector regressions.
- Opt-in judge-model pass (0.7.2). `pnpm run evals -- --judge` adds a per-scenario qualitative scoring pass that the same `claude` CLI runs in a different role. Captures structured per-question scores; complements the structural rubric.
- End-to-end duration printed on completion (0.7.2). No more guessing how long a full matrix took.
- `--label` flag (0.7.2). Repeat-run variance sampling no longer burns a patch version per sample; results land in `evals/results/<version>-<label>/`.
- Continuous-improvement baseline policy (0.7.2 → 0.7.3). Patch bumps for any calibration or product change that moves the baseline, no Changesets / no tags. The accumulated patches roll into the next real release, which is this one.
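The three-way matching in the hardened scorer can be sketched as follows. This is an illustration, not the harness code: the `ScanFinding` shape, the function names, and the plain substring matching are assumptions, and pairing “God Function” with `large_function` is a guess made for the example.

```typescript
// Hypothetical sketch of matching an agent's answer against expected findings.
interface ScanFinding {
  id: string;     // e.g. "crime_00042"
  type: string;   // detector slug
  charge: string; // human-readable charge name
}

// A finding counts as referenced if the output mentions its id, slug, or charge name.
function mentionsFinding(agentText: string, finding: ScanFinding): boolean {
  const haystack = agentText.toLowerCase();
  return [finding.id, finding.type, finding.charge]
    .filter((needle) => needle.length > 0)
    .some((needle) => haystack.includes(needle.toLowerCase()));
}

// Fraction of expected findings the agent referenced in any of the three ways.
function referencedFindingsScore(agentText: string, expected: ScanFinding[]): number {
  if (expected.length === 0) return 1;
  const hits = expected.filter((f) => mentionsFinding(agentText, f)).length;
  return hits / expected.length;
}

const expectedFindings: ScanFinding[] = [
  { id: "crime_00042", type: "large_function", charge: "God Function" }, // assumed pairing
  { id: "crime_00043", type: "deep_import", charge: "Example Charge" },  // placeholder charge
];
const sampleAnswer = "Split the God Function in shapes.ts before touching anything else.";
console.log(referencedFindingsScore(sampleAnswer, expectedFindings)); // prints 0.5
```

A slug-only scorer would give this answer 0, because it names the charge rather than `large_function`; that gap is the kind of undercount the hardening addresses.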
Track B: detector coverage in scenarios
At 0.7.3 the eval matrix had 12 scenarios, and only 12 of 35 detectors (34%) were referenced by any scenario. 0.7.4 closed that gap:
- 13 new scenarios across all 5 scenario kinds:
  - `bugfix-04-weak-tests`: weak_test_signal
  - `context-04-monorepo`: missing_agent_context
  - `plan-04-hotspots`: high_fan_in_fan_out + large_file + large_function
  - `refactor-01-petty`: option_bag_junk_drawer + return_shape_roulette + negative_flag_maze
  - `refactor-04-monorepo`: name_behavior_mismatch + option_bag_junk_drawer + return_shape_roulette
  - `refactor-05-action-labels`: action_label_drift + copy_ia_drift
  - `refactor-08-architecture`: layer_violation + deep_import
  - `refactor-02-component-shape`: duplicate_component_shape
  - `refactor-01-large-file`: large_file
  - `review-01-ia-drift`: route_metadata_drift + duplicated_navigation_source + docs_code_drift
  - `review-02-react-dashboard`: commented_out_code + magic_domain_literal_scatter + duplicated_role_status_plan_check
  - `review-03-node-cli-tool`: commented_out_code + logic_in_comments + weak_test_signal + magic_domain_literal_scatter
  - `review-05-permission-and-parallel`: permission_ia_drift + parallel_destination + command_drift_docs_code_drift
- Fixture 05 extensions so five IA detectors that previously had no fixture now do:
  - three JSX components (`UserList.tsx`, `TeamList.tsx`, `SeatList.tsx`) drifting destructive-action verbs for action_label_drift / copy_ia_drift;
  - an admin route + role attribute in nav + manager mention in docs for permission_ia_drift;
  - a parallel `admin/billing-plans.ts` mirroring the top-level `billing-plans.ts` for parallel_destination;
  - a Commander bin (`bin/iaq.ts`) advertising only `list` + `get` while docs reference three unadvertised subcommands for command_drift_docs_code_drift.
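The coverage figure quoted in this track (detectors referenced by at least one scenario, out of all detectors) is simple to compute. The sketch below assumes `referenced_findings` holds detector slugs; `ScenarioRefs` and `detectorCoverage` are hypothetical names, and the detector list is a small illustrative subset, not the full slate.

```typescript
// Hypothetical sketch of the "N of M detectors referenced" calculation.
interface ScenarioRefs {
  id: string;
  referenced_findings: string[]; // assumed to be detector slugs
}

function detectorCoverage(scenarioList: ScenarioRefs[], allDetectors: string[]) {
  const referenced = new Set<string>();
  for (const s of scenarioList) {
    for (const d of s.referenced_findings) referenced.add(d);
  }
  const uncovered = allDetectors.filter((d) => !referenced.has(d));
  return {
    coveredCount: allDetectors.length - uncovered.length,
    total: allDetectors.length,
    uncovered,
  };
}

// Illustrative subset only; the scenario-to-detector pairs are the ones listed above.
const detectorSubset = [
  "high_fan_in_fan_out",
  "large_file",
  "large_function",
  "weak_test_signal",
  "deep_import",
];
const sampleScenarios: ScenarioRefs[] = [
  { id: "plan-04-hotspots", referenced_findings: ["high_fan_in_fan_out", "large_file", "large_function"] },
  { id: "bugfix-04-weak-tests", referenced_findings: ["weak_test_signal"] },
];
const coverageReport = detectorCoverage(sampleScenarios, detectorSubset);
console.log(`${coverageReport.coveredCount} / ${coverageReport.total}`, coverageReport.uncovered);
// prints 4 / 5 [ 'deep_import' ]
```

The `uncovered` list is the useful part: it names the detectors that still need a scenario.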
Track C: detector calibration fixes (0.7.3)
- `large_function` priority window. Calibration of the priority window that determines which large_function findings rank ahead of others on the default top-N output.
- Registrar regex tightening. `cli_command_registrar` shape recognition tightened to reduce false positives on adjacent builder patterns.
- Version-agnostic feedback hint copy. The inline `Give feedback: …` hint under findings no longer hard-codes the current minor version.
Track D: `visual_regression_review_hint` removed (0.7.5)
The detector shipped in 0.6.0 with this trigger:

    churn(file) >= 0.7 AND testGap(file) >= 0.7 AND responsive_complexity > 0

In practice:
- Hand-crafted fixtures have no churn (one or two commits), so the detector said nothing on any fixture in the eval matrix.
- On real-world repos, “file changed many times recently” is just as likely to mean “under active development” as “needs visual review.” A `.tsx` file getting better quickly trips the trigger as cleanly as one regressing.
There’s no clean way to tune the churn proxy to distinguish those two cases without a screenshot pipeline or LLM judgement — both of which violate the deterministic wedge. So the detector is gone.
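Writing the trigger as a predicate shows the problem directly. The sketch below is illustrative only: the `FileSignals` shape, the camel-cased field names, and the sample values are assumptions, not the removed detector’s code.

```typescript
// Hypothetical restatement of the removed trigger; values below are made up.
interface FileSignals {
  churn: number;                // assumed normalised to 0..1
  testGap: number;              // assumed normalised to 0..1
  responsiveComplexity: number; // assumed count of responsive constructs
}

function wouldHaveFired(signals: FileSignals): boolean {
  return signals.churn >= 0.7 && signals.testGap >= 0.7 && signals.responsiveComplexity > 0;
}

// Two files with opposite trajectories produce identical inputs.
const activelyImproving: FileSignals = { churn: 0.9, testGap: 0.8, responsiveComplexity: 3 };
const quietlyRegressing: FileSignals = { churn: 0.9, testGap: 0.8, responsiveComplexity: 3 };
// A hand-crafted fixture with one or two commits never reaches the churn threshold.
const handCraftedFixture: FileSignals = { churn: 0.1, testGap: 0.8, responsiveComplexity: 3 };

console.log(
  wouldHaveFired(activelyImproving),
  wouldHaveFired(quietlyRegressing),
  wouldHaveFired(handCraftedFixture),
); // prints true true false
```

Nothing in the three inputs encodes the direction of change, so no threshold choice can separate the first two cases.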
The Finding.type value visual_regression_review_hint no longer
appears in scan output. Consumers that explicitly switch on it should
treat it like any other removed value: the case becomes unreachable.
Detector count goes from 35 → 34.
Track E: crimes-on-crimes housekeeping (0.7.3)
The 0.7.0 release split two of the four legitimately-large files flagged by the dogfood signal. The remaining two cleared in 0.7.3:
- `packages/cli/src/commands/feedback.ts` split into `feedback/write.ts` plus the four read subcommands (`list.ts`, `summary.ts`, `export.ts`, `recheck.ts`) under `feedback/read/`.
- `packages/cli/src/commands/context.ts` split into 4 modules so the command-handler file no longer trips `large_file` on its own scan.
- `classifyShape` in `packages/language-js/src/parse/shapes.ts` refactored from a 90-line switch into a chain of `try*` helpers.
- `analyseRoute` in `packages/core/src/detectors/route-metadata-drift.ts` had its source / evidence / related helpers extracted.
Scan JSON output is byte-identical to pre-split — these are maintainability-only changes that consume the new structure but emit the same findings.
Baseline at 0.7.5
50 runs (25 scenarios × 2 agents) in 8m 24s on the reference laptop:
| Agent | Structural pass rate |
|---|---|
| claude | 0.82 |
| codex | 0.81 |
Per scenario kind:
| Kind | claude | codex |
|---|---|---|
| bugfix | 0.80 | 0.70 |
| context | 1.00 | 1.00 |
| plan | 0.64 | 0.45 |
| refactor | 0.87 | 0.83 |
| review | 0.79 | 0.91 |
The two soft spots — plan and parts of bugfix — both come from
the new 0.7.4 scenarios (plan-04-hotspots, bugfix-04-weak-tests)
which exercise detector types the agents hadn’t been tested on
before. These set the floor for the 0.8.0 tuning work.
Result transcripts and rubric scores are committed at
evals/results/0.7.5/ and replayable
against future scoring tweaks via pnpm run evals:replay.
What’s not in 0.7.5
- No schema bump. Still `schema_version: "0.1.0"`. Existing scan JSON consumers keep working.
- No new detectors. This release subtracts one. The 0.6.0 slate minus visual_regression_review_hint is what we calibrate for 0.8.0.
- No new commands. CLI surface is byte-equivalent to 0.7.0.
- No LLM-assisted detector modes. Evals call LLMs; detectors stay deterministic.
- No Python. Still deferred per PRD §26.
What’s coming in 0.8.0
The eval baseline at 0.7.5 is the calibration reference. 0.8.0 will:
- Tune detector thresholds / severities against the per-scenario-kind pass rates and the variance signal from `evals:variance`.
- Surface resurface trajectories from the `crimes feedback recheck` loop that 0.7.0 wired up; every fp → tp flip on a real-world repo is a tuning win.
- Land the agent-behaviour regression gates the PR replay workflow was built for.
Release notes will quote the specific calibration deltas that drove each tuning change — that’s the value of the loop you’re participating in by upgrading.
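The variance signal that feeds this tuning can be pictured as a per-scenario mean and standard deviation over labelled repeat runs. The sketch below is an assumption-laden illustration, not the `evals:variance` implementation: the `RunSample` shape, the use of the sample (n − 1) standard deviation, and the scores are all made up for the example.

```typescript
// Hypothetical sketch of ranking scenarios by score spread across repeat runs.
interface RunSample {
  scenario: string;
  label: string; // e.g. "r2", "r3"
  score: number; // rubric score, assumed to lie in [0, 1]
}

interface ScenarioSpread {
  scenario: string;
  mean: number;
  stddev: number;
  n: number;
}

function rankByVariance(samples: RunSample[]): ScenarioSpread[] {
  const byScenario = new Map<string, number[]>();
  for (const s of samples) {
    const scores = byScenario.get(s.scenario) ?? [];
    scores.push(s.score);
    byScenario.set(s.scenario, scores);
  }
  const spreads: ScenarioSpread[] = [];
  byScenario.forEach((scores, scenario) => {
    const n = scores.length;
    const mean = scores.reduce((a, b) => a + b, 0) / n;
    const sumSq = scores.reduce((a, b) => a + (b - mean) * (b - mean), 0);
    // Sample standard deviation; a single run has no measurable spread.
    const stddev = n > 1 ? Math.sqrt(sumSq / (n - 1)) : 0;
    spreads.push({ scenario, mean, stddev, n });
  });
  // Noisiest scenarios first.
  return spreads.sort((a, b) => b.stddev - a.stddev);
}

// Made-up scores for two placeholder scenarios.
const repeatRuns: RunSample[] = [
  { scenario: "scenario-a", label: "r1", score: 1 },
  { scenario: "scenario-a", label: "r2", score: 0 },
  { scenario: "scenario-a", label: "r3", score: 1 },
  { scenario: "scenario-b", label: "r1", score: 1 },
  { scenario: "scenario-b", label: "r2", score: 1 },
  { scenario: "scenario-b", label: "r3", score: 1 },
];
for (const row of rankByVariance(repeatRuns)) {
  console.log(`${row.scenario}: ${row.mean.toFixed(2)} ± ${row.stddev.toFixed(2)} (n=${row.n})`);
}
```

A scenario with a high spread points at agent inconsistency; a scenario whose mean shifts with a low spread is a better candidate for a real detector regression.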
Upgrading

    npm install -g crimes@0.7.5
    crimes --version   # crimes@0.7.5

If you have `.crimes/suppressions.json` entries for `visual_regression_review_hint` fingerprints, they’ll quietly become no-ops (the detector that generated them is gone). You can leave them or run `crimes audit-suppressions` to spot them and `crimes unignore` to clean up.
If your CI pipeline does --fail-on high on crimes scan or crimes baseline check, nothing changes — visual_regression_review_hint only
ever emitted severity: "low".
Notable links
- `evals/README.md`: full harness docs including the scenario↔fixture coverage discipline.
- `docs/finding-types/frontend.md`: five surviving frontend detectors + a note on the removal.
- `docs/agent-usage.md`: updated “shipped” matrix with the post-0.7.5 detector slate.
- `docs/feedback.md`: the calibration loop that generates the inputs to 0.8.0 tuning.