Speccle

Speccle — the spec-driven toolkit

Review the spec.
Prove the rest.

AI codegen made code review the bottleneck — reviewers can't trust generated code and tests. Speccle moves human review upstream to the spec and mechanically attests everything downstream.

math — oracle-strength heatmap

open the demo →

Line coverage

75%

Oracle strength

62.5%

Gap

12.5%

AC-1 (100%): add returns the sum of its two arguments.
AC-2 (50%): sub returns the difference of its two arguments.
AC-3: pow raises the base to the exponent.

The problem

Reviewers can't trust generated code, or the tests that ship with it. Review is the bottleneck.

A green test suite means nothing if the tests wouldn't notice the code being wrong. Speccle's answer: keep the human where their judgement matters — the spec — and let deterministic tools prove everything below it. A Speccle tool never calls an LLM. Trust comes from the tools, not the agent.

The spec → merge loop

01 / 07

Spec

[human] [agent]

Acceptance criteria with [AC-n] ids.

The human writes the acceptance criteria — each one carrying a bracketed [AC-n] token — with the agent assisting. This is the one artifact a human actually reviews.
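To make the token convention concrete, here is a minimal sketch of pulling criteria out of a spec file. It assumes one criterion per line and a markdown bullet layout; the `Criterion` shape and the `parseCriteria` name are hypothetical, not part of Speccle's published API.

```typescript
// Hypothetical shape for one parsed acceptance criterion.
interface Criterion {
  id: string;   // e.g. "AC-1"
  text: string; // criterion wording with the token removed
}

// Collect every line that carries a bracketed [AC-n] token.
// Assumption: one criterion per line.
function parseCriteria(markdown: string): Criterion[] {
  const token = /\[(AC-\d+)\]/;
  const criteria: Criterion[] = [];
  for (const line of markdown.split("\n")) {
    const match = token.exec(line);
    if (!match) continue;
    const text = line
      .replace(token, "")
      .replace(/^\s*[-*]\s*/, "") // drop a leading list bullet, if any
      .trim();
    criteria.push({ id: match[1], text });
  }
  return criteria;
}

const spec = [
  "# math",
  "- [AC-1] add returns the sum of its two arguments.",
  "- [AC-2] sub returns the difference of its two arguments.",
  "- [AC-3] pow raises the base to the exponent.",
].join("\n");

console.log(parseCriteria(spec).map((c) => c.id));
```

Lines without a token are skipped, so headings and free prose can sit in the same file as the criteria.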

02 / 07

spec-lint

[tool] Coming

Deterministic lint over the criteria.

A rule-based linter flags missing [AC-n] ids, weasel wording, compound criteria, and criteria with no measurable subject → outcome. Then a semantic review: the agent proposes, the human ratifies.
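spec-lint is not shipped yet, so the following is only a sketch of what a deterministic rule pass could look like. The rule names, the word list, and the compound-criterion heuristic are all assumptions made for illustration.

```typescript
type Rule = "missing-id" | "weasel-wording" | "compound-criterion";

interface Finding {
  rule: Rule;
  message: string;
}

// Illustrative word list; a real rule set would be longer and configurable.
const WEASEL_WORDS = ["appropriately", "quickly", "reasonable", "as needed", "etc"];

// Run every rule against one criterion line and collect the findings.
function lintCriterion(line: string): Finding[] {
  const findings: Finding[] = [];
  if (!/\[AC-\d+\]/.test(line)) {
    findings.push({ rule: "missing-id", message: "no [AC-n] token found" });
  }
  const lower = line.toLowerCase();
  for (const word of WEASEL_WORDS) {
    if (new RegExp("\\b" + word + "\\b").test(lower)) {
      findings.push({ rule: "weasel-wording", message: 'vague term "' + word + '"' });
    }
  }
  // Crude heuristic: "and" or ";" often joins two separate outcomes.
  if (/\band\b|;/.test(lower)) {
    findings.push({ rule: "compound-criterion", message: "may state more than one outcome" });
  }
  return findings;
}

console.log(lintCriterion("[AC-1] add returns the sum of its two arguments."));
console.log(lintCriterion("sub handles errors appropriately and logs them."));
```

The same input always yields the same findings, which is the property that matters here; the rules themselves would need tuning against real specs, since the "and" check will also flag harmless wording.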

03 / 07

Code

[agent]

The agent writes the implementation.

Implementation is generated against the ratified spec. No human reads it line by line — that's the whole point.

04 / 07

Oracles

[agent]

Tests per criterion, titles carrying the token.

The agent writes tests for each criterion. A test links to a criterion when its title contains the matching [AC-n] token — that link is what everything downstream measures.
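The title-to-criterion link can be sketched as a plain string match. The function names below are hypothetical, and letting one title name several criteria is an assumption.

```typescript
// Every [AC-n] token that appears in a test title.
function tokensInTitle(title: string): string[] {
  const pattern = /\[(AC-\d+)\]/g;
  const ids: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(title)) !== null) {
    ids.push(match[1]);
  }
  return ids;
}

// Group test titles by the criterion they name.
function linkTests(titles: string[]): Record<string, string[]> {
  const byCriterion: Record<string, string[]> = {};
  for (const title of titles) {
    for (const id of tokensInTitle(title)) {
      byCriterion[id] = byCriterion[id] || [];
      byCriterion[id].push(title);
    }
  }
  return byCriterion;
}

const links = linkTests([
  "[AC-1] add returns the sum for positive integers",
  "[AC-1] add handles negative numbers",
  "[AC-2] sub returns the difference",
  "helper: formats output", // no token, so not linked to any criterion
]);
console.log(links);
```

A test without a token simply links to nothing, so it cannot raise any criterion's measured strength.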

05 / 07

Execute

[tool]

Vitest coverage + Stryker mutation, diff-scoped.

A normal test run produces the three inputs Speccle reads: the spec markdown, a Stryker mutation report with per-mutant coveredBy/killedBy data, and an Istanbul coverage summary.
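As a rough illustration of how per-mutant coveredBy/killedBy data can be rolled up per criterion, the sketch below tallies covered and killed mutants for each [AC-n] token. Only coveredBy and killedBy come from the text above; the other field names, the `tallyByCriterion` helper, and the sample report are assumptions for this example and should be checked against the actual Stryker report format.

```typescript
// Minimal subset of a mutation report, reduced to what this sketch needs.
interface Mutant {
  id: string;
  status: string;       // e.g. "Killed" or "Survived"
  coveredBy?: string[]; // ids of tests that execute the mutated code
  killedBy?: string[];  // ids of tests that fail against the mutant
}

interface MutationReport {
  files: Record<string, { mutants: Mutant[] }>;
  testFiles: Record<string, { tests: { id: string; name: string }[] }>;
}

interface Tally {
  covered: number;
  killed: number;
}

// Criterion ids named by any of the given tests, without duplicates.
function criteriaFor(testIds: string[] | undefined, nameById: Record<string, string>): string[] {
  const ids: string[] = [];
  for (const testId of testIds || []) {
    const pattern = /\[(AC-\d+)\]/g;
    const name = nameById[testId] || "";
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(name)) !== null) {
      if (ids.indexOf(match[1]) === -1) ids.push(match[1]);
    }
  }
  return ids;
}

// Count, per criterion, the mutants its tests cover and the mutants they kill.
function tallyByCriterion(report: MutationReport): Record<string, Tally> {
  const nameById: Record<string, string> = {};
  for (const path of Object.keys(report.testFiles)) {
    for (const test of report.testFiles[path].tests) nameById[test.id] = test.name;
  }
  const tallies: Record<string, Tally> = {};
  const bump = (id: string, key: keyof Tally) => {
    tallies[id] = tallies[id] || { covered: 0, killed: 0 };
    tallies[id][key] += 1;
  };
  for (const path of Object.keys(report.files)) {
    for (const mutant of report.files[path].mutants) {
      for (const id of criteriaFor(mutant.coveredBy, nameById)) bump(id, "covered");
      for (const id of criteriaFor(mutant.killedBy, nameById)) bump(id, "killed");
    }
  }
  return tallies;
}

// Made-up sample data, not output from a real run.
const sampleReport: MutationReport = {
  files: {
    "src/math.ts": {
      mutants: [
        { id: "m1", status: "Killed", coveredBy: ["t1"], killedBy: ["t1"] },
        { id: "m2", status: "Survived", coveredBy: ["t2"], killedBy: [] },
        { id: "m3", status: "Killed", coveredBy: ["t2"], killedBy: ["t2"] },
      ],
    },
  },
  testFiles: {
    "src/math.test.ts": {
      tests: [
        { id: "t1", name: "[AC-1] add returns the sum" },
        { id: "t2", name: "[AC-2] sub returns the difference" },
      ],
    },
  },
};

console.log(JSON.stringify(tallyByCriterion(sampleReport)));
```

In this made-up report the tests naming AC-1 kill the one mutant they cover, while the tests naming AC-2 kill one of two.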

06 / 07

heatmap

[tool] Shipped

Oracle strength per criterion — and the routing decision.

The heatmap joins spec, mutation report, and coverage into one ReportModel. Two exits: the machine path (a surviving mutant is a test gap — the agent writes a test and re-runs) or the human path (a weak criterion is a spec problem — the agent drafts a refinement, the human approves). This routing decision is the heart of the loop.

Machine path

A surviving mutant is a test gap. The agent writes a test, re-runs — no human needed.

Human path

A weak criterion is a spec problem. The agent drafts a refined spec; the human approves.
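The page does not spell out how the heatmap tells the two paths apart, so the sketch below is a placeholder rule rather than Speccle's actual logic. It assumes a per-criterion strength between 0 and 1, a count of surviving mutants, and a threshold `theta`; the `route` function and `CriterionResult` shape are hypothetical.

```typescript
type Route = "pass" | "machine" | "human";

interface CriterionResult {
  id: string;
  strength: number; // assumed scale: 0 to 1
  survived: number; // surviving mutants covered by this criterion's tests
}

// Placeholder routing rule, an assumption made for illustration only.
function route(result: CriterionResult, theta: number): Route {
  // Strong enough: nothing to do.
  if (result.strength >= theta) return "pass";
  // A surviving mutant gives the agent a concrete target for a new test.
  if (result.survived > 0) return "machine";
  // Below threshold with no survivor to chase: treat it as a spec problem.
  return "human";
}

const results: CriterionResult[] = [
  { id: "AC-1", strength: 1.0, survived: 0 },
  { id: "AC-2", strength: 0.5, survived: 1 },
  { id: "AC-3", strength: 0.0, survived: 0 },
];

for (const r of results) {
  console.log(r.id, route(r, 0.8));
}
```

Whatever the real rule is, the shape stays the same: a pure function from measured results to one of a small set of exits, with no LLM involved in the decision.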

07 / 07

gate

[tool] Coming

Minimum oracle strength θ per touched criterion → merge.

A CI mode that requires a minimum oracle strength for every criterion the diff touches, exiting 0 or 1. The literal replacement for code review.
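Since gate is not shipped yet, this is only a sketch of the check it describes: every touched criterion must reach θ, and the result is an exit code. The function signature is hypothetical, and failing closed on a touched criterion with no measurement is an assumption.

```typescript
// Returns the process exit code: 0 to allow the merge, 1 to block it.
function gate(
  strengthById: Record<string, number>, // assumed scale: 0 to 1
  touched: string[],                    // criteria the diff touches
  theta: number                         // minimum oracle strength
): 0 | 1 {
  for (const id of touched) {
    const strength = strengthById[id];
    // Assumption: a touched criterion with no measured strength fails closed.
    if (strength === undefined || strength < theta) return 1;
  }
  return 0;
}

const strengths = { "AC-1": 1.0, "AC-2": 0.5 };

console.log(gate(strengths, ["AC-1"], 0.8));
console.log(gate(strengths, ["AC-1", "AC-2"], 0.8));
console.log(gate(strengths, ["AC-3"], 0.8));
```

In a Node-based CLI the returned value could be assigned to `process.exitCode`, which is how a CI job would pick it up.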

The measure

Oracle strength, not coverage.

A criterion is only strong when the tests naming its [AC-n] token actually kill mutants in the code they cover. Line coverage sits alongside as the naïve baseline — the gap between the two numbers is the whole point.

Line coverage

75%

the naïve baseline

Oracle strength

62.5%

killed / total mutants

Gap

12.5%

what coverage overstates

8 mutants: 5 killed, 1 survived, 2 uncovered
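The arithmetic behind these figures is short enough to write out. The sketch below uses the mutant counts above and the 75% line coverage from the demo; the `oracleStrength` helper is a hypothetical name, and returning 0 when there are no mutants is an assumption.

```typescript
interface MutantCounts {
  killed: number;
  survived: number;
  uncovered: number;
}

// Oracle strength as a percentage: killed mutants over all mutants.
function oracleStrength(c: MutantCounts): number {
  const total = c.killed + c.survived + c.uncovered;
  // Assumption: with no mutants at all, report 0 rather than dividing by zero.
  return total === 0 ? 0 : (c.killed / total) * 100;
}

const counts: MutantCounts = { killed: 5, survived: 1, uncovered: 2 };
const lineCoverage = 75; // percent, as reported by the coverage summary
const strength = oracleStrength(counts);

console.log(strength);                // 62.5
console.log(lineCoverage - strength); // 12.5
```

Five of eight mutants killed gives 62.5%, and the 12.5-point difference from line coverage is the share of the suite's apparent protection that no test actually enforces.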

The toolkit

N deterministic tools. One thin agent.

Three packages are shipped today; the rest of the loop lands in later passes.

spec-lint

Coming

A deterministic, rule-based linter for acceptance criteria: missing ids, weasel wording, compound criteria.

gate

Coming

Minimum oracle strength per touched criterion, exit 0/1 for CI — the literal replacement for code review.

feedback agent

Coming

One thin, swappable agent driving the tools via their JSON output — the only place an LLM appears.

See the gap in your own suite.

Point speccle at a spec, a Stryker mutation report, and a coverage summary — three files a normal test run already produces.