`crimes@0.7.0` — Calibration & Evidence Loop
Draft release notes for the GitHub Release tagged v0.7.0. The body below is what should go in the Releases page when you cut the tag; cutting the tag triggers `.github/workflows/release.yml` and publishes to npm via Trusted Publishing.
crimes@0.7.0 is the calibration release. 0.6.0 added 18 new
detector types in one batch; this release ships zero new detectors
and instead builds the two feedback mechanisms that turn “crimes
runs on a codebase” into “crimes gets better every time it runs
on a codebase”:
- `crimes feedback`: a single new command that captures per-finding verdicts (`tp` / `fp` / `known`) into `.crimes/feedback.jsonl`. On `fp`, it auto-creates a suppression pinned to the current crimes minor. The suppression auto-resurfaces on the next minor bump so you re-confirm or mark resolved; that re-confirmation trajectory is the highest-value calibration data the project collects.
- The `evals/` harness: an agentic test bench of 10 fixtures × 5 scenario kinds × `claude` + `codex` CLIs (subscription-authenticated, so no API keys / no per-call billing). A structural rubric scores agent responses; an opt-in judge-model pass adds open-ended scoring. CI replays cached results against PR builds to catch detector-tuning regressions.
Plus the housekeeping that the §20 dogfood appendix from 0.5.0 flagged
but 0.6.0 didn’t ship: the `direct_date` test-file exemption, a split of
the two legitimately-large files (`reporter/src/human.ts`,
`language-js/src/parse.ts`), and the §6.2 noise baseline recorded as
Appendix B of the calibration plan.
Schema: schema_version stays at "0.1.0". Every new field is
strictly additive and back-compat — 0.5.0 and 0.6.0 suppressions
files load unchanged.
What shipped
Track A: the dogfood feedback loop
- `crimes feedback <fingerprint> --verdict {tp|fp|known} [--note]` is the write path. `fp` requires `--note`; the note becomes the suppression reason.
- `crimes feedback list / summary / export / recheck` are the read paths.
- Inline hint under every finding in human output: `Give feedback: crimes feedback <fp> --verdict {tp|fp}`. Suppressed on piped output / `--no-color` / when 5+ entries already exist for the detector (capped so the prompt doesn’t outlive its usefulness).
- Auto-resurface loop: a fingerprint marked `fp` in `0.7.x` is silent for `0.7.x` scans, then resurfaces tagged `previously_suppressed: true` on the first `0.8.x` scan with an alternate “⚠ Previously marked fp in 0.7” hint and a one-line stderr breadcrumb pointing at `crimes feedback recheck`.
- Cross-project rollup at `~/.crimes/feedback-rollup.jsonl` via `crimes feedback export --append-global`. Dedupes by `(repo, timestamp, fingerprint)`; idempotent across runs.
- Per-detector release-notes map: `crimes feedback recheck` prints a hint per resurfaced finding (“direct_date now skips test files — likely resolved.”) so you can decide without re-reading the diff.
Full guide: docs/feedback.md.
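The rollup's dedupe-and-append behaviour can be pictured with a minimal sketch. This is not the actual crimes implementation: the `FeedbackEntry` shape and the `entryKey` / `appendToRollup` helpers are hypothetical names for illustration, and only the `(repo, timestamp, fingerprint)` key and the idempotence property come from the notes above.

```typescript
// Hypothetical record shape: only repo, timestamp, and fingerprint are
// named in the release notes as the dedupe key.
interface FeedbackEntry {
  repo: string;
  timestamp: string;
  fingerprint: string;
  verdict: "tp" | "fp" | "known";
}

// Build the composite dedupe key. JSON.stringify avoids the collisions a
// plain delimiter join could produce if a field contained the delimiter.
function entryKey(e: FeedbackEntry): string {
  return JSON.stringify([e.repo, e.timestamp, e.fingerprint]);
}

// Append only entries whose key is not already present in the rollup.
// Re-running with the same input leaves the rollup unchanged (idempotent).
function appendToRollup(
  rollup: FeedbackEntry[],
  incoming: FeedbackEntry[],
): FeedbackEntry[] {
  const seen = new Set(rollup.map(entryKey));
  const merged = [...rollup];
  for (const entry of incoming) {
    const key = entryKey(entry);
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(entry);
    }
  }
  return merged;
}
```

Under these assumptions, exporting the same project twice adds nothing the second time, which is what makes repeated `--append-global` runs safe.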
Track B: the eval harness
- `evals/` directory outside `packages/`. 10 fixtures (1 symlink, 3 OSS clones, 4 stress, 2 clean controls), 12 representative scenarios across 5 kinds (refactor, bugfix, review, context, plan).
- `pnpm run evals`: the orchestrator. Per (fixture × scenario × agent) it runs `crimes scan -f json` against the fixture, sends the scenario prompt + scan JSON to the agent, captures the response, applies the structural rubric, and writes `evals/results/<version>/<agent>/<scenario-id>.json`.
- Opt-in `--judge`: sends the response back to `claude` in a judging role with the scenario’s `judge_questions`. Captures `{score, reasoning}` per question. Validated with zod; malformed answers become `failed` (score 0) rather than crashing.
- `pnpm run evals:replay` + `pnpm run evals:diff`: re-scores committed results against the current crimes build and emits a markdown diff. Used by `.github/workflows/evals-pr.yml` to post a PR comment with per-agent pass-rate moves outside the ±10% tolerance band.
- No agents in CI, no secrets, no API keys. Fresh runs happen on the maintainer’s machine via the locally-installed CLIs (`claude -p ... --output-format json` + `codex exec --json ...`) against existing subscriptions.
Full guide: docs/evals.md.
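As a rough illustration of the tolerance-band check behind the PR comment, the sketch below flags agents whose pass rate moved by more than a threshold. The `PassRates` and `PassRateMove` types and the `movesOutsideBand` function are hypothetical, and the sketch assumes the ±10% band means absolute percentage points; the harness may define it differently.

```typescript
// Hypothetical: pass rates in percent (0 to 100), keyed by agent name.
type PassRates = Record<string, number>;

interface PassRateMove {
  agent: string;
  before: number;
  after: number;
  delta: number; // after - before, in percentage points
}

// Assumption: the tolerance is an absolute band in percentage points.
// Moves of exactly the tolerance are treated as inside the band.
function movesOutsideBand(
  baseline: PassRates,
  current: PassRates,
  tolerance = 10,
): PassRateMove[] {
  const moves: PassRateMove[] = [];
  for (const agent of Object.keys(baseline)) {
    const before = baseline[agent];
    const after = current[agent];
    // Skip agents missing from either side; there is nothing to compare.
    if (before === undefined || after === undefined) continue;
    const delta = after - before;
    if (Math.abs(delta) > tolerance) {
      moves.push({ agent, before, after, delta });
    }
  }
  return moves;
}
```

For example, with a baseline of `{ claude: 80, codex: 75 }` and a replay of `{ claude: 65, codex: 70 }`, only the `claude` move (15 points down) would be reported.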
Housekeeping
- `direct_date` skips test files (§20 dogfood false positive closed). New shared helper at `packages/core/src/util/test-files.ts` consolidates the 8 copies of the test-file regex that were scattered across detectors and the scoring / petty indices.
- `reporter/src/human.ts` split into 10 files under `human/` (`scan.ts`, `context.ts`, `hotspots.ts`, `diff.ts`, `baseline.ts`, `verdict.ts`, `explain.ts`, `audit.ts`, `shared.ts`, `index.ts`). Every file lands under 200 lines; bundled-fixture output is byte-identical to pre-split.
- `language-js/src/parse.ts` split into 12 files under `parse/` (`index.ts` orchestrator, `types.ts`, `constants.ts`, `utils.ts`, `functions.ts`, `shapes.ts`, `shape-predicates.ts`, `shape-commander.ts`, `dates.ts`, `nav.ts`, `ui-strings.ts`, `jsx.ts`). Every file is under 250 lines; scan JSON output is byte-identical to pre-split.
What’s not in 0.7.0
- No new detectors. Period. The 0.6.0 slate is what we calibrate for the next ~3 months. Detector changes in 0.7.0 are exemptions and split-only refactors.
- No schema bump. Every addition is optional and back-compat; consumers don’t have to update anything to keep working.
- No new commands beyond `crimes feedback`. The rest of the CLI surface is unchanged from 0.6.0.
- No Python. Open question in PRD §26, deferred again.
- No LLM-assisted detector modes. Evals call LLMs; detectors stay deterministic. Wedge protection.
- No hosted feedback collector. Everything stays local on your machine.
What’s coming in 0.8.0
The calibration data this release collects becomes the input to 0.8.0’s detector tuning:
- Threshold / severity adjustments based on `crimes feedback summary --global` outputs.
- Resurface trajectories from `crimes feedback recheck` (fp → tp = win; fp → fp = candidate for further tuning).
- Agent-behaviour regressions caught by the PR replay workflow.
The release notes will quote the specific calibration deltas that drove each tuning change — that’s the value of the loop you’re participating in by upgrading.
Upgrading
    npm install -g crimes@0.7.0
    crimes --version   # crimes@0.7.0

If you previously used `crimes ignore` to suppress findings, those
suppressions all stay `source: "manual"` and never resurface. The
new mechanism only kicks in for entries written by
`crimes feedback ... --verdict fp` (which add `source: "feedback"` +
`crimes_version_pinned`).
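The difference between manual and feedback-sourced suppressions can be sketched as a small predicate. This is an illustration under assumptions, not the shipped logic: the `Suppression` interface, `majorMinor`, and `shouldResurface` are hypothetical names, and the sketch assumes `crimes_version_pinned` holds a dotted version string such as `"0.7.0"`. Only the `source` and `crimes_version_pinned` field names come from these notes.

```typescript
// Hypothetical shape: the real suppressions file likely carries more fields.
interface Suppression {
  fingerprint: string;
  source: "manual" | "feedback";
  crimes_version_pinned?: string; // assumed format: "major.minor[.patch]"
}

// Parse "0.7.3" into [0, 7]. The patch component is ignored because the
// suppression is pinned to a minor, not to an exact version.
function majorMinor(version: string): [number, number] {
  const parts = version.split(".").map(Number);
  return [parts[0] ?? 0, parts[1] ?? 0];
}

// Manual suppressions never resurface. Feedback suppressions resurface once
// the running crimes version is past the pinned minor.
function shouldResurface(s: Suppression, currentVersion: string): boolean {
  if (s.source !== "feedback" || s.crimes_version_pinned === undefined) {
    return false;
  }
  const [pinnedMajor, pinnedMinor] = majorMinor(s.crimes_version_pinned);
  const [currentMajor, currentMinor] = majorMinor(currentVersion);
  return (
    currentMajor > pinnedMajor ||
    (currentMajor === pinnedMajor && currentMinor > pinnedMinor)
  );
}
```

Under these assumptions, a suppression pinned at `0.7.0` stays silent on `0.7.3` and resurfaces on `0.8.0`, while anything written by `crimes ignore` stays silent regardless of version.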
After upgrading, run crimes scan in a project you already use
crimes on; every finding in the human output now carries the
Give feedback: ... hint. Mark a few that surprised you and you’ve
started the loop.
Notable links
- `docs/feedback.md`: full user guide to `crimes feedback`, including the auto-resurface lifecycle.
- `docs/evals.md`: contributor guide to the eval harness.
- `docs/suppressions.md`: extended with the feedback-sourced suppression shape.
- `docs/json-schema.md`: new `FeedbackReport` type + optional `previously_suppressed` / `previous_suppression` fields documented.
- `.planning/archive/0.7.0-calibration-evidence-loop.md`: the full plan this release implemented.