Speccle — the spec-driven toolkit
Review the spec.
Prove the rest.
AI codegen made code review the bottleneck — reviewers can't trust generated code and tests. Speccle moves human review upstream to the spec and mechanically attests everything downstream.
Demo: math, oracle-strength heatmap
Line coverage: 75%
Oracle strength: 62.5%
Gap: 12.5%
The problem
Reviewers can't trust generated code, or the tests that ship with it. Review is the bottleneck.
A green test suite means nothing if the tests wouldn't notice the code being wrong. Speccle's answer: keep the human where their judgement matters — the spec — and let deterministic tools prove everything below it. A Speccle tool never calls an LLM. Trust comes from the tools, not the agent.
The spec → merge loop
01 / 07
Spec
Acceptance criteria with [AC-n] ids.
The human writes the acceptance criteria — each one carrying a bracketed [AC-n] token — with the agent assisting. This is the one artifact a human actually reviews.
02 / 07
spec-lint
Deterministic lint over the criteria.
A rule-based linter flags missing [AC-n] ids, weasel wording, compound criteria, and criteria with no measurable subject → outcome. Then a semantic review: the agent proposes, the human ratifies.
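As a rough illustration of what rule-based checks like these could look like, here is a minimal TypeScript sketch. The rule names, the weasel-word list, and the `lintCriterion` function are all hypothetical; they are not Speccle's actual rules or API.

```typescript
// Hypothetical sketch only: not Speccle's real rule set or API.
type Finding = { rule: string; message: string };

// Assumed word list, chosen purely for illustration.
const WEASEL_WORDS = ["should", "appropriate", "fast", "user-friendly", "as needed", "etc"];

// A criterion id is a bracketed token such as [AC-3].
const AC_TOKEN = /\[AC-\d+\]/;

function lintCriterion(line: string): Finding[] {
  const findings: Finding[] = [];
  const lower = line.toLowerCase();

  if (!AC_TOKEN.test(line)) {
    findings.push({ rule: "missing-id", message: "criterion has no [AC-n] token" });
  }

  for (const word of WEASEL_WORDS) {
    // Word-boundary match so "fast" does not fire on "breakfast".
    if (new RegExp(`\\b${word}\\b`).test(lower)) {
      findings.push({ rule: "weasel-word", message: `vague wording: "${word}"` });
    }
  }

  // Crude compound check: a conjunction suggests two outcomes in one criterion.
  if (/\b(and|or)\b/.test(lower)) {
    findings.push({ rule: "compound", message: "criterion may bundle several outcomes" });
  }

  return findings;
}
```

Every check is a plain string or regex test, so the same input always yields the same findings. Anything that needs judgement is left to the semantic review.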
03 / 07
Code
The agent writes the implementation.
Implementation is generated against the ratified spec. No human reads it line by line — that's the whole point.
04 / 07
Oracles
Tests per criterion, titles carrying the token.
The agent writes tests for each criterion. A test links to a criterion when its title contains the matching [AC-n] token — that link is what everything downstream measures.
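The title-based link can be sketched in a few lines of TypeScript. The helper names `criteriaForTitle` and `linkTests` are hypothetical; only the `[AC-n]` token convention comes from the text above.

```typescript
// Hypothetical helper: extract every distinct [AC-n] token from a test title.
function criteriaForTitle(title: string): string[] {
  const matches: string[] = title.match(/\[AC-\d+\]/g) ?? [];
  return Array.from(new Set(matches));
}

// Group test titles by the criterion tokens they name.
// A title with no token links to nothing and is simply left out.
function linkTests(titles: string[]): Map<string, string[]> {
  const links = new Map<string, string[]>();
  for (const title of titles) {
    for (const id of criteriaForTitle(title)) {
      const list = links.get(id) ?? [];
      list.push(title);
      links.set(id, list);
    }
  }
  return links;
}
```

For example, a test titled `[AC-1] adds two numbers` links to `[AC-1]`, while a test titled `handles edge cases` links to no criterion at all.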
05 / 07
Execute
Vitest coverage + Stryker mutation, diff-scoped.
A normal test run produces the three inputs Speccle reads: the spec markdown, a Stryker mutation report with per-mutant coveredBy/killedBy data, and an Istanbul coverage summary.
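To show how the per-mutant `coveredBy`/`killedBy` data might be attributed to criteria, here is a hedged TypeScript sketch. The interfaces cover only an assumed subset of the mutation report, so check them against the report schema you actually have. The `strengthByCriterion` function and its counting rule are illustrative assumptions, not Speccle's implementation.

```typescript
// Assumed subset of a mutation report; field names should be verified
// against the real report before relying on this.
interface Mutant { id: string; status: string; coveredBy?: string[]; killedBy?: string[] }
interface TestDef { id: string; name: string }
interface MutationReport {
  files: Record<string, { mutants: Mutant[] }>;
  testFiles?: Record<string, { tests: TestDef[] }>;
}

interface Tally { covered: number; killed: number }

// Hypothetical: count, per [AC-n] token, the mutants its tests cover and kill.
function strengthByCriterion(report: MutationReport): Map<string, Tally> {
  // Map each test id to the tokens found in its name.
  const tokensByTest = new Map<string, string[]>();
  for (const file of Object.values(report.testFiles ?? {})) {
    for (const t of file.tests) {
      tokensByTest.set(t.id, t.name.match(/\[AC-\d+\]/g) ?? []);
    }
  }

  // Distinct tokens named by a list of test ids.
  const tokensOf = (testIds: string[] = []): Set<string> => {
    const out = new Set<string>();
    for (const id of testIds) {
      for (const token of tokensByTest.get(id) ?? []) out.add(token);
    }
    return out;
  };

  const result = new Map<string, Tally>();
  const bump = (token: string, key: keyof Tally): void => {
    const tally = result.get(token) ?? { covered: 0, killed: 0 };
    tally[key] += 1;
    result.set(token, tally);
  };

  for (const file of Object.values(report.files)) {
    for (const m of file.mutants) {
      tokensOf(m.coveredBy).forEach((token) => bump(token, "covered"));
      tokensOf(m.killedBy).forEach((token) => bump(token, "killed"));
    }
  }
  return result;
}
```

Each mutant counts at most once per criterion, even when several tests for that criterion cover it. Mutants no test covers never reach any criterion's tally in this sketch.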
06 / 07
heatmap
Oracle strength per criterion — and the routing decision.
The heatmap joins spec, mutation report, and coverage into one ReportModel. Two exits: the machine path (a surviving mutant is a test gap — the agent writes a test and re-runs) or the human path (a weak criterion is a spec problem — the agent drafts a refinement, the human approves). This routing decision is the heart of the loop.
Machine path
A surviving mutant is a test gap. The agent writes a test, re-runs — no human needed.
Human path
A weak criterion is a spec problem. The agent drafts a refined spec; the human approves.
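The two exits can be modelled as a small discriminated union. This TypeScript sketch only encodes the mapping stated above (surviving mutant to machine path, weak criterion to human path). The type and field names are hypothetical, and how a finding gets classified in the first place is deliberately left out.

```typescript
// Hypothetical types: the names are illustrative, not Speccle's ReportModel.
type Finding =
  | { kind: "surviving-mutant"; criterion: string; mutantId: string }
  | { kind: "weak-criterion"; criterion: string };

type Route =
  | { path: "machine"; action: "write-test-and-rerun"; needsHuman: false }
  | { path: "human"; action: "draft-spec-refinement"; needsHuman: true };

function route(finding: Finding): Route {
  switch (finding.kind) {
    case "surviving-mutant":
      // A test gap: the agent writes a test and re-runs.
      return { path: "machine", action: "write-test-and-rerun", needsHuman: false };
    case "weak-criterion":
      // A spec problem: the agent drafts a refinement for human approval.
      return { path: "human", action: "draft-spec-refinement", needsHuman: true };
  }
}
```

Keeping the route a pure function of the finding means the decision stays deterministic; the agent only acts on whichever path comes back.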
07 / 07
gate
Minimum θ per touched criterion → merge.
A CI mode that requires a minimum oracle strength for every criterion the diff touches, exiting 0 or 1. The literal replacement for code review.
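A gate of this shape might look like the following TypeScript sketch. The `gate` function, its parameters, and the choice to fail a touched criterion with no score are assumptions made for illustration; the tool itself is not yet shipped.

```typescript
interface GateResult { exitCode: 0 | 1; failing: string[] }

// Hypothetical gate: every touched criterion must reach the threshold theta.
// Strengths are fractions in [0, 1], keyed by [AC-n] token.
function gate(strengths: Map<string, number>, touched: string[], theta: number): GateResult {
  const failing = touched.filter((id) => {
    const strength = strengths.get(id);
    // Assumption: a touched criterion with no measured strength fails the gate.
    return strength === undefined || strength < theta;
  });
  return { exitCode: failing.length === 0 ? 0 : 1, failing };
}
```

In CI the result would typically be passed to the process exit code, and the `failing` list printed so the author can see which criteria blocked the merge.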
The measure
Oracle strength, not coverage.
A criterion is only strong when the tests naming its [AC-n] token actually kill mutants in the code they cover. Line coverage sits alongside as the naïve baseline — the gap between the two numbers is the whole point.
Line coverage: 75% (the naïve baseline)
Oracle strength: 62.5% (killed / total mutants)
Gap: 12.5% (what coverage overstates)
8 mutants: 5 killed, 1 survived, 2 uncovered
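The arithmetic behind the demo figures can be checked directly. In this small TypeScript sketch the function name is hypothetical and returning 0 for an empty mutant set is an assumption. The inputs (5 killed out of 8 mutants, 75% line coverage) are the demo's own numbers.

```typescript
// Oracle strength as a percentage: killed mutants over all mutants,
// with uncovered mutants counted in the denominator.
function oracleStrengthPct(killed: number, total: number): number {
  // Assumption: with no mutants at all, report 0 rather than dividing by zero.
  return total === 0 ? 0 : (killed / total) * 100;
}

const lineCoveragePct = 75;                // from the coverage summary
const strengthPct = oracleStrengthPct(5, 8); // 5 killed of 8 mutants = 62.5
const gapPct = lineCoveragePct - strengthPct; // 75 - 62.5 = 12.5
```

The one survivor and the two uncovered mutants both lower the score. Covered lines that no test would notice breaking count against oracle strength, which line coverage cannot show.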
The toolkit
N deterministic tools. One thin agent.
Three packages are shipped today; the rest of the loop lands in later passes.
spec-lint (coming)
A deterministic, rule-based linter for acceptance criteria: missing ids, weasel wording, compound criteria.
gate (coming)
Minimum oracle strength per touched criterion, exit 0/1 for CI: the literal replacement for code review.
feedback agent (coming)
One thin, swappable agent driving the tools via their JSON output. It is the only place an LLM appears.
See the gap in your own suite.
Point speccle at a spec, a Stryker mutation report, and a coverage summary — three files a normal test run already produces.