
`crimes@0.7.0` — Calibration & Evidence Loop

Draft release notes for the GitHub Release tagged v0.7.0. The body below is what should go in the Releases page when you cut the tag — that triggers .github/workflows/release.yml and publishes to npm via Trusted Publishing.

crimes@0.7.0 is the calibration release. 0.6.0 added 18 new detector types in one batch; this release ships zero new detectors and instead builds the two feedback mechanisms that turn “crimes runs on a codebase” into “crimes gets better every time it runs on a codebase”:

  • crimes feedback: a single new command that captures per-finding verdicts (tp / fp / known) into .crimes/feedback.jsonl. On fp, it auto-creates a suppression pinned to the current crimes minor. The suppression auto-resurfaces on the next minor bump so you can re-confirm it or mark it resolved; that re-confirmation trajectory is the highest-value calibration data the project collects.
  • The evals/ harness: an agentic test bench of 10 fixtures × 5 scenario kinds × claude + codex CLIs (subscription-authenticated, so no API keys and no per-call billing). A structural rubric scores agent responses; an opt-in judge-model pass adds open-ended scoring. CI replays cached results against PR builds to catch detector-tuning regressions.

Plus the housekeeping that the §20 dogfood appendix from 0.5.0 flagged but 0.6.0 didn’t ship: a direct_date test-file exemption, splits of the two legitimately large files (reporter/src/human.ts, language-js/src/parse.ts), and the §6.2 noise baseline recorded as Appendix B of the calibration plan.

Schema: schema_version stays at "0.1.0". Every new field is strictly additive and back-compat — 0.5.0 and 0.6.0 suppressions files load unchanged.

  • crimes feedback <fingerprint> --verdict {tp|fp|known} [--note] is the write path. fp requires --note; the note becomes the suppression reason.
  • crimes feedback list / summary / export / recheck are the read paths.
  • Inline hint under every finding in human output: Give feedback: crimes feedback <fp> --verdict {tp|fp}. Suppressed on piped output / --no-color / when 5+ entries already exist for the detector (capped so the prompt doesn’t outlive its usefulness).
  • Auto-resurface loop: a fingerprint marked fp in 0.7.x is silent for 0.7.x scans, then resurfaces tagged previously_suppressed: true on the first 0.8.x scan with an alternate “⚠ Previously marked fp in 0.7” hint and a one-line stderr breadcrumb pointing at crimes feedback recheck.
  • Cross-project rollup at ~/.crimes/feedback-rollup.jsonl via crimes feedback export --append-global. Dedupes by (repo, timestamp, fingerprint); idempotent across runs.
  • Per-detector release-notes map: crimes feedback recheck prints a hint per resurfaced finding (“direct_date now skips test files — likely resolved.”) so you can decide without re-reading the diff.
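
The idempotent rollup append described in the list above can be sketched roughly as follows. This is a minimal illustration, not the shipped code: the FeedbackRecord shape, rollupKey, and appendToRollup are hypothetical names, and only the (repo, timestamp, fingerprint) key and the tp / fp / known verdicts come from these notes.

```typescript
// Hypothetical record shape; the real feedback.jsonl fields may differ.
interface FeedbackRecord {
  repo: string;
  timestamp: string; // assumed to be an ISO 8601 string
  fingerprint: string;
  verdict: "tp" | "fp" | "known";
  note?: string;
}

// Composite key over (repo, timestamp, fingerprint).
function rollupKey(record: FeedbackRecord): string {
  return JSON.stringify([record.repo, record.timestamp, record.fingerprint]);
}

// Returns the existing rollup plus any incoming records not already present.
function appendToRollup(
  existing: FeedbackRecord[],
  incoming: FeedbackRecord[],
): FeedbackRecord[] {
  const seen = new Set(existing.map(rollupKey));
  const merged = [...existing];
  for (const record of incoming) {
    const key = rollupKey(record);
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(record);
    }
  }
  return merged;
}
```

Because the key ignores verdict and note, exporting the same project twice adds nothing the second time, which is what makes the append safe to re-run.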

Full guide: docs/feedback.md.

  • evals/ directory outside packages/. 10 fixtures (1 symlink, 3 OSS clones, 4 stress, 2 clean controls), 12 representative scenarios across 5 kinds (refactor, bugfix, review, context, plan).
  • pnpm run evals — the orchestrator. Per (fixture × scenario × agent) it runs crimes scan -f json against the fixture, sends the scenario prompt + scan JSON to the agent, captures the response, applies the structural rubric, writes evals/results/<version>/<agent>/<scenario-id>.json.
  • Opt-in --judge — sends the response back to claude in a judging role with the scenario’s judge_questions. Captures {score, reasoning} per question. Validated with zod; malformed answers become failed (score 0) rather than crashing.
  • pnpm run evals:replay + pnpm run evals:diff — re-scores committed results against the current crimes build and emits a markdown diff. Used by .github/workflows/evals-pr.yml to post a PR comment with per-agent pass-rate moves outside the ±10% tolerance band.
  • No agents in CI, no secrets, no API keys. Fresh runs happen on the maintainer’s machine via the locally-installed CLIs (claude -p ... --output-format json + codex exec --json ...) against existing subscriptions.
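
The "malformed answers become failed (score 0) rather than crashing" behaviour of the judge pass might look something like the sketch below. The harness validates with zod; to stay dependency-free this illustration swaps in a hand-written guard instead. JudgeAnswer and parseJudgeAnswer are hypothetical names, and the failed flag is an assumption about how a failure is recorded.

```typescript
// Hypothetical result shape: {score, reasoning} comes from the notes,
// the `failed` flag is an assumption made for this sketch.
interface JudgeAnswer {
  score: number;
  reasoning: string;
  failed: boolean;
}

// Parses one raw judge reply; never throws, malformed input scores 0.
function parseJudgeAnswer(raw: string): JudgeAnswer {
  const fail = (why: string): JudgeAnswer => ({
    score: 0,
    reasoning: why,
    failed: true,
  });

  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return fail("judge output was not valid JSON");
  }
  if (typeof value !== "object" || value === null) {
    return fail("judge output was not an object");
  }
  const { score, reasoning } = value as Record<string, unknown>;
  if (
    typeof score !== "number" ||
    !Number.isFinite(score) ||
    typeof reasoning !== "string"
  ) {
    return fail("judge output did not match {score, reasoning}");
  }
  return { score, reasoning, failed: false };
}
```

Turning a bad reply into a scored failure keeps a single flaky judge response from aborting a whole fixture × scenario × agent run.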

Full guide: docs/evals.md.
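
As a rough illustration of the ±10% tolerance band used by the replay diff, the sketch below flags agents whose pass rate moved by more than the band. It assumes the band is measured in absolute percentage points on pass rates expressed as fractions; PassRates, PassRateMove, and movesOutsideBand are hypothetical names, not the harness API.

```typescript
// Pass rates as fractions in [0, 1], keyed by agent name (assumed shape).
type PassRates = Record<string, number>;

interface PassRateMove {
  agent: string;
  before: number;
  after: number;
  delta: number;
}

// Returns only the agents whose pass rate moved by more than `tolerance`.
function movesOutsideBand(
  before: PassRates,
  after: PassRates,
  tolerance = 0.1, // assumption: ±10% means ±0.10 absolute
): PassRateMove[] {
  const moves: PassRateMove[] = [];
  for (const agent of Object.keys(before)) {
    const b = before[agent];
    const a = after[agent];
    if (a === undefined || b === undefined) continue; // agent missing on one side
    const delta = a - b;
    if (Math.abs(delta) > tolerance) {
      moves.push({ agent, before: b, after: a, delta });
    }
  }
  return moves;
}
```

For example, a drop from 0.8 to 0.6 would be reported, while a move from 0.7 to 0.75 stays inside the band and is omitted.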

  • direct_date skips test files (§20 dogfood false positive closed). New shared helper at packages/core/src/util/test-files.ts consolidates the 8 copies of the test-file regex that were scattered across detectors and the scoring / petty indices.
  • reporter/src/human.ts split into 10 files under human/ (scan.ts, context.ts, hotspots.ts, diff.ts, baseline.ts, verdict.ts, explain.ts, audit.ts, shared.ts, index.ts). Every file lands under 200 lines; bundled-fixture output is byte-identical to pre-split.
  • language-js/src/parse.ts split into 12 files under parse/ (index.ts orchestrator, types.ts, constants.ts, utils.ts, functions.ts, shapes.ts, shape-predicates.ts, shape-commander.ts, dates.ts, nav.ts, ui-strings.ts, jsx.ts). Every file under 250 lines; scan JSON output is byte-identical to pre-split.
  • No new detectors. Period. The 0.6.0 slate is what we calibrate for the next ~3 months. Detector changes in 0.7.0 are exemptions and split-only refactors.
  • No schema bump. Every addition is optional and back-compat — consumers don’t have to update anything to keep working.
  • No new commands beyond crimes feedback. The rest of the CLI surface is unchanged from 0.6.0.
  • No Python. Open question in PRD §26, deferred again.
  • No LLM-assisted detector modes. Evals call LLMs; detectors stay deterministic. Wedge protection.
  • No hosted feedback collector. Everything stays local on your machine.
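
The shared test-file helper mentioned among the housekeeping items in this list could be sketched as below. The pattern is an assumption chosen for illustration; the actual regex in packages/core/src/util/test-files.ts is not reproduced here, and isTestFile is a hypothetical name.

```typescript
// Hypothetical pattern: matches files under test/, tests/ or __tests__/
// directories, and files named *.test.* or *.spec.* with a JS/TS extension.
const TEST_FILE_PATTERN =
  /(^|\/)(__tests__|tests?)\/|\.(test|spec)\.[cm]?[jt]sx?$/;

// One shared predicate, so detectors do not each carry their own copy.
function isTestFile(path: string): boolean {
  // Normalise Windows separators before matching.
  return TEST_FILE_PATTERN.test(path.replace(/\\/g, "/"));
}
```

A detector such as direct_date would then return early when isTestFile(file) is true instead of embedding its own regex.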

The calibration data this release collects becomes the inputs to 0.8.0’s detector tuning:

  • Threshold / severity adjustments based on crimes feedback summary --global outputs.
  • Resurface trajectories from crimes feedback recheck (fp → tp = win; fp → fp = candidate for further tuning).
  • Agent-behaviour regressions caught by the PR replay workflow.

The 0.8.0 release notes will quote the specific calibration deltas that drove each tuning change; that is the value of the loop you’re participating in by upgrading.

npm install -g crimes@0.7.0
crimes --version # crimes@0.7.0

If you previously used crimes ignore to suppress findings, those suppressions all stay source: "manual" and never resurface. The new mechanism only applies to entries written by crimes feedback ... --verdict fp (which add source: "feedback" + crimes_version_pinned).
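
The distinction between manual and feedback suppressions can be sketched as a small predicate. This is an illustration under assumptions: the source and crimes_version_pinned field names come from these notes, but the Suppression shape, the stored version format, and the minorOf / shouldResurface helpers are hypothetical. It also treats any differing minor as a bump.

```typescript
// Hypothetical suppression entry; only `source` and `crimes_version_pinned`
// are named in the release notes.
interface Suppression {
  fingerprint: string;
  source: "manual" | "feedback";
  crimes_version_pinned?: string; // assumed format: "0.7.0" or "0.7"
}

// "0.7.3" -> "0.7"
function minorOf(version: string): string {
  const [major, minor] = version.split(".");
  return `${major}.${minor}`;
}

// Manual suppressions never resurface; feedback suppressions resurface
// once the running crimes minor differs from the pinned one.
function shouldResurface(s: Suppression, currentVersion: string): boolean {
  if (s.source !== "feedback" || s.crimes_version_pinned === undefined) {
    return false;
  }
  return minorOf(s.crimes_version_pinned) !== minorOf(currentVersion);
}
```

Under these assumptions, an fp recorded on 0.7.0 stays silent through every 0.7.x patch and comes back on the first 0.8.x scan.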

After upgrading, run crimes scan in a project you already use crimes on; every finding in the human output now carries the Give feedback: ... hint. Mark a few that surprised you and you’ve started the loop.