`crimes@0.7.0` — Calibration & Evidence Loop

Draft release notes for the GitHub Release tagged v0.7.0. The body below is what should go in the Releases page when you cut the tag — that triggers .github/workflows/release.yml and publishes to npm via Trusted Publishing.

TL;DR

crimes@0.7.0 is the calibration release. 0.6.0 added 18 new detector types in one batch; this release ships zero new detectors and instead builds the two feedback mechanisms that turn “crimes runs on a codebase” into “crimes gets better every time it runs on a codebase”:

crimes feedback: a single new command that captures per-finding verdicts (tp / fp / known) into .crimes/feedback.jsonl. On fp, it auto-creates a suppression pinned to the current crimes minor. The suppression auto-resurfaces on the next minor bump so you can re-confirm it or mark it resolved; that re-confirmation trajectory is the highest-value calibration data the project collects.
The evals/ harness: an agentic test bench of 10 fixtures × 5 scenario kinds × claude + codex CLIs (subscription-authenticated, so no API keys and no per-call billing). A structural rubric scores agent responses; an opt-in judge-model pass adds open-ended scoring. CI replays cached results against PR builds to catch detector-tuning regressions.

Plus the housekeeping that the §20 dogfood appendix from 0.5.0 flagged but 0.6.0 didn’t ship: the direct_date test-file exemption, a split of the two legitimately large files (reporter/src/human.ts, language-js/src/parse.ts), and the §6.2 noise baseline recorded as Appendix B of the calibration plan.

Schema: schema_version stays at "0.1.0". Every new field is strictly additive and back-compat — 0.5.0 and 0.6.0 suppressions files load unchanged.

What shipped

Track A — the dogfood feedback loop

crimes feedback <fingerprint> --verdict {tp|fp|known} [--note] is the write path. fp requires --note; the note becomes the suppression reason.
crimes feedback list / summary / export / recheck are the read paths.
Inline hint under every finding in human output: Give feedback: crimes feedback <fp> --verdict {tp|fp}. The hint is hidden on piped output, with --no-color, or once 5+ feedback entries already exist for the detector (capped so the prompt doesn’t outlive its usefulness).
Auto-resurface loop: a fingerprint marked fp in 0.7.x is silent for 0.7.x scans, then resurfaces tagged previously_suppressed: true on the first 0.8.x scan with an alternate “⚠ Previously marked fp in 0.7” hint and a one-line stderr breadcrumb pointing at crimes feedback recheck.
Cross-project rollup at ~/.crimes/feedback-rollup.jsonl via crimes feedback export --append-global. Dedupes by (repo, timestamp, fingerprint); idempotent across runs.
Per-detector release-notes map — crimes feedback recheck prints a hint per resurfaced finding (“direct_date now skips test files — likely resolved.”) so you can decide without re-reading the diff.
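The auto-resurface rule above can be sketched as a minor-version comparison. This is an illustrative sketch, not the shipped implementation: only source and crimes_version_pinned are field names taken from these notes. The FeedbackSuppression shape, the minorOf and shouldResurface helpers, and the assumption that the pin is stored as a full semver string are all hypothetical.

```typescript
// Hypothetical shape: only `source` and `crimes_version_pinned` are named
// in the release notes; everything else is assumed for illustration.
interface FeedbackSuppression {
  fingerprint: string;
  source: "manual" | "feedback";
  crimes_version_pinned?: string; // assumed semver string, e.g. "0.7.0"
}

// Reduce a semver string such as "0.7.3" to its "major.minor" prefix.
function minorOf(version: string): string {
  const [major, minor] = version.split(".");
  return `${major}.${minor}`;
}

// A feedback-sourced suppression stays silent while the running crimes
// minor matches the pinned minor, and resurfaces once it differs.
function shouldResurface(s: FeedbackSuppression, runningVersion: string): boolean {
  if (s.source !== "feedback" || s.crimes_version_pinned === undefined) {
    return false; // manual suppressions never resurface
  }
  return minorOf(s.crimes_version_pinned) !== minorOf(runningVersion);
}
```

Under these assumptions, a suppression pinned at 0.7.0 stays silent on 0.7.4 and resurfaces on 0.8.0. The sketch compares for inequality rather than "greater than", so a downgrade would also resurface the finding; the real behaviour on downgrade is not stated in these notes.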

Full guide: docs/feedback.md.
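The rollup's dedupe rule, keyed on (repo, timestamp, fingerprint), is what makes repeated exports idempotent. A minimal sketch of that idea follows; the RollupEntry shape and the rollupKey and mergeIntoRollup helpers are hypothetical names, and the actual on-disk format of feedback-rollup.jsonl may differ.

```typescript
// Hypothetical entry shape; only the three key fields and the verdict
// values are taken from the release notes.
interface RollupEntry {
  repo: string;
  timestamp: string;
  fingerprint: string;
  verdict: "tp" | "fp" | "known";
}

// JSON-encode the tuple so a separator character inside one field
// cannot make two different tuples collide.
function rollupKey(e: RollupEntry): string {
  return JSON.stringify([e.repo, e.timestamp, e.fingerprint]);
}

// Append only entries whose key is not already present. Running the
// merge again with the same input adds nothing.
function mergeIntoRollup(existing: RollupEntry[], incoming: RollupEntry[]): RollupEntry[] {
  const seen = new Set(existing.map(rollupKey));
  const merged = [...existing];
  for (const entry of incoming) {
    const key = rollupKey(entry);
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(entry);
    }
  }
  return merged;
}
```

Merging the same batch twice yields the same rollup, which is the property that lets export --append-global be re-run safely.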

Track B — the eval harness

evals/ directory outside packages/. 10 fixtures (1 symlink, 3 OSS clones, 4 stress, 2 clean controls), 12 representative scenarios across 5 kinds (refactor, bugfix, review, context, plan).
pnpm run evals — the orchestrator. Per (fixture × scenario × agent) it runs crimes scan -f json against the fixture, sends the scenario prompt + scan JSON to the agent, captures the response, applies the structural rubric, writes evals/results/<version>/<agent>/<scenario-id>.json.
Opt-in --judge — sends the response back to claude in a judging role with the scenario’s judge_questions. Captures {score, reasoning} per question. Validated with zod; malformed answers become failed (score 0) rather than crashing.
pnpm run evals:replay and pnpm run evals:diff re-score committed results against the current crimes build and emit a markdown diff. .github/workflows/evals-pr.yml uses them to post a PR comment with per-agent pass-rate moves outside the ±10% tolerance band.
No agents in CI, no secrets, no API keys. Fresh runs happen on the maintainer’s machine via the locally-installed CLIs (claude -p ... --output-format json + codex exec --json ...) against existing subscriptions.
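The tolerance-band check in the replay diff can be pictured as below. This is a sketch under stated assumptions: it treats ±10% as an absolute move of 10 percentage points in pass rate (the notes do not say whether the band is absolute or relative), and the PassRates and PassRateMove types and the movesOutsideBand helper are hypothetical.

```typescript
// Agent name -> pass rate in [0, 1]. Hypothetical shape.
type PassRates = Record<string, number>;

interface PassRateMove {
  agent: string;
  before: number;
  after: number;
  delta: number;
}

// Report agents whose pass rate moved by more than `tolerance`
// (assumed to mean absolute percentage points) between two runs.
function movesOutsideBand(before: PassRates, after: PassRates, tolerance = 0.1): PassRateMove[] {
  const moves: PassRateMove[] = [];
  for (const agent of Object.keys(before)) {
    const next = after[agent];
    if (next === undefined) continue; // agent absent from the replay; nothing to compare
    const delta = next - before[agent];
    if (Math.abs(delta) > tolerance) {
      moves.push({ agent, before: before[agent], after: next, delta });
    }
  }
  return moves;
}
```

With these assumptions, an agent dropping from 0.80 to 0.60 is reported, while one moving from 0.70 to 0.75 is not. Values that land exactly on the band edge are subject to floating-point rounding, so a real implementation might compare integer percentages instead.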

Full guide: docs/evals.md.
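The "malformed answers become failed (score 0)" behaviour of the judge pass might look roughly like this. The harness validates with zod; the sketch below hand-rolls an equivalent check only to stay dependency-free. The parseJudgeAnswer helper, the status field, and the fallback reasoning text are hypothetical; only score and reasoning come from these notes.

```typescript
interface JudgeAnswer {
  score: number;
  reasoning: string;
}

// Hypothetical wrapper that records whether validation succeeded.
interface JudgedQuestion extends JudgeAnswer {
  status: "ok" | "failed";
}

// Parse one raw judge reply. Anything that is not valid JSON with a
// finite numeric `score` and a string `reasoning` is recorded as a
// failed answer with score 0 instead of throwing.
function parseJudgeAnswer(raw: string): JudgedQuestion {
  const failed: JudgedQuestion = { status: "failed", score: 0, reasoning: "malformed judge answer" };
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return failed;
  }
  if (typeof value !== "object" || value === null) return failed;
  const { score, reasoning } = value as Record<string, unknown>;
  if (typeof score !== "number" || !Number.isFinite(score) || typeof reasoning !== "string") {
    return failed;
  }
  return { status: "ok", score, reasoning };
}
```

Converting bad output into a scored failure keeps a single flaky judge reply from aborting a full evals run, at the cost of counting it against the agent.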

Housekeeping

direct_date skips test files (§20 dogfood false positive closed). New shared helper at packages/core/src/util/test-files.ts consolidates the 8 copies of the test-file regex that were scattered across detectors and the scoring / petty indices.
reporter/src/human.ts split into 10 files under human/ (scan.ts, context.ts, hotspots.ts, diff.ts, baseline.ts, verdict.ts, explain.ts, audit.ts, shared.ts, index.ts). Every file lands under 200 lines; bundled-fixture output is byte-identical to pre-split.
language-js/src/parse.ts split into 12 files under parse/ (index.ts orchestrator, types.ts, constants.ts, utils.ts, functions.ts, shapes.ts, shape-predicates.ts, shape-commander.ts, dates.ts, nav.ts, ui-strings.ts, jsx.ts). Every file under 250 lines; scan JSON output is byte-identical to pre-split.
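To illustrate what consolidating the test-file regex into one shared helper buys, here is a minimal sketch of such a helper. The pattern and the isTestFile name are hypothetical; the actual regex in packages/core/src/util/test-files.ts is not reproduced in these notes and may match a different set of paths.

```typescript
// Hypothetical pattern: a path counts as a test file if it sits under a
// test/, tests/, or __tests__/ directory, or ends in .test.* / .spec.*
// with a JS/TS extension.
const TEST_FILE_PATTERN = /(^|\/)(__tests__|tests?)\/|\.(test|spec)\.[cm]?[jt]sx?$/;

// Single shared predicate, so every detector applies the same rule.
function isTestFile(path: string): boolean {
  // Normalise Windows separators before matching.
  return TEST_FILE_PATTERN.test(path.replace(/\\/g, "/"));
}
```

A detector such as direct_date could then skip a file with one call to the shared predicate rather than carrying its own copy of the regex.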

What’s not in 0.7.0

No new detectors. Period. The 0.6.0 slate is what we calibrate for the next ~3 months. Detector changes in 0.7.0 are exemptions and split-only refactors.
No schema bump. Every addition is optional and back-compat — consumers don’t have to update anything to keep working.
No new commands beyond crimes feedback. The rest of the CLI surface is unchanged from 0.6.0.
No Python. Open question in PRD §26, deferred again.
No LLM-assisted detector modes. Evals call LLMs; detectors stay deterministic. Wedge protection.
No hosted feedback collector. Everything stays local on your machine.

What’s coming in 0.8.0

The calibration data this release collects becomes the input to 0.8.0’s detector tuning:

Threshold / severity adjustments based on crimes feedback summary --global outputs.
Resurface trajectories from crimes feedback recheck (fp → tp = win; fp → fp = candidate for further tuning).
Agent-behaviour regressions caught by the PR replay workflow.

The release notes will quote the specific calibration deltas that drove each tuning change — that’s the value of the loop you’re participating in by upgrading.

Upgrading

npm install -g crimes@0.7.0
crimes --version  # crimes@0.7.0

If you previously used crimes ignore to suppress findings, those suppressions all stay source: "manual" and never resurface. The new mechanism only applies to entries written by crimes feedback ... --verdict fp (which add source: "feedback" + crimes_version_pinned).
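As a rough picture of the difference between the two suppression sources, the sketch below shows how a verdict might map to a feedback-sourced suppression. It is illustrative only: source and crimes_version_pinned are the field names given above, while the Suppression shape, the reason field name, the suppressionForVerdict helper, and the assumption that tp and known create no suppression are hypothetical.

```typescript
type Verdict = "tp" | "fp" | "known";

// Hypothetical shape; `reason` is an assumed field name for the note.
interface Suppression {
  fingerprint: string;
  reason: string;
  source: "manual" | "feedback";
  crimes_version_pinned?: string;
}

// Map one feedback verdict to the suppression it would create, if any.
function suppressionForVerdict(
  fingerprint: string,
  verdict: Verdict,
  crimesVersion: string,
  note?: string,
): Suppression | null {
  if (verdict !== "fp") return null; // assumed: only fp creates a suppression
  if (note === undefined || note.trim() === "") {
    throw new Error("--verdict fp requires --note");
  }
  // The note becomes the suppression reason; the pin records the
  // crimes version that was running when the verdict was given.
  return { fingerprint, reason: note, source: "feedback", crimes_version_pinned: crimesVersion };
}
```

An entry written by crimes ignore would carry source: "manual" and no pin, which is why it is never a candidate for resurfacing.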

After upgrading, run crimes scan in a project you already use crimes on; every finding in the human output now carries the Give feedback: ... hint. Mark a few that surprised you and you’ve started the loop.

Notable links

docs/feedback.md — full user guide to crimes feedback including the auto-resurface lifecycle.
docs/evals.md — contributor guide to the eval harness.
docs/suppressions.md — extended with the feedback-sourced suppression shape.
docs/json-schema.md — new FeedbackReport type + optional previously_suppressed / previous_suppression fields documented.
.planning/archive/0.7.0-calibration-evidence-loop.md — the full plan this release implemented.