P1.5.8 — Verification (Step 5 of 5)

Captures test evidence and parity matrix for the cross-model parity suite shipped in step 4.

1. Test count delta

| Anchor | Count |
| --- | --- |
| Baseline (main @ 6cfd269b) | 3353 |
| After P1.5.8 (HEAD) | 3464 |
| Delta | +111 |

The packet projected ~108 net new tests; the implementation lands at +111 (which includes 3 driver-registry sanity tests added in the implement step to defend the helper contract).

2. Gate results

| Gate | Result |
| --- | --- |
| npm run build | green (tsc exits 0; postbuild copies migrations) |
| npm run lint | green (eslint src exits 0; zero warnings) |
| npm test (full) | green (3464 / 3464 pass; 78 / 78 suites pass; 46.6 s wall-clock) |
| npm test (parity-only) | green (111 / 111 pass; 5.8 s wall-clock) |

2.1 Flake notes

One full-suite run earlier in the verification cycle hit three flakes: consensus/parity-harness.test.ts › G7.1, reputation/tokens.test.ts › ..., and db/migrations/009-model-candidates.test.ts › ... All three are pre-existing (documented in the dispatch slice as well-known and retry-clean) and unrelated to this slice. A clean retry succeeded (3464 / 3464). The parity suite ran clean on every attempt across the verification cycle (5 invocations: 1 in isolation under npx jest parity.test.ts, plus 4 inside the full suite).

3. Parity matrix — invariants × adapters

The suite asserts 8 contract blocks (26 tests in total) × 4 adapters = 104 driver-parity tests, plus 4 cross-cutting tests and 3 driver-registry sanity tests, for 111 tests overall. Each contract block contains multiple internal expect assertions.

| Invariant block | Description | Claude | Kimi | Codex | OpenAI |
| --- | --- | --- | --- | --- | --- |
| P1: shape parity | 6 spec fields + types + no extras (3 tests) | yes | yes | yes | yes |
| P2: determinism | Two identical fixtures → equal results (1 test) | yes | yes | yes | yes |
| P3: token accounting | prompt/completion tokens + missing usage degrades (3 tests) | yes | yes | yes | yes |
| P4: stop-reason mapping | success + tool-use + always-string (3 tests) | yes | yes | yes | yes |
| P5: tool-use mapping | content shape + input object + multi-tool + empty-tools (4 tests) | yes | yes | yes | yes |
| P6: error mapping | 401 + 500-retries-exhausted + network + missing-key + Error subclass (5 tests) | yes | yes | yes | yes |
| P7: latency | non-negative + finite + populated (3 tests) | yes | yes | yes | yes |
| P8: injection seams | fetchFn + logger + delayFn + apiKey (4 tests) | yes | yes | yes | yes |

Per driver = 26 tests · 4 drivers = 104 driver-parity tests.

Cross-cutting (4 tests over all 4 adapters jointly)

| # | Test | Verdict |
| --- | --- | --- |
| C1 | All 4 adapters return structurally equal CompletionResult shape | pass |
| C2 | All 4 adapters yield identical token counts | pass |
| C3 | All 4 adapters return string-typed stopReason | pass |
| C4 | All 4 adapters emit JSON-stringified tool_use[] array | pass |

Driver-registry sanity (3 tests)

| # | Test | Verdict |
| --- | --- | --- |
| R1 | Exactly 4 drivers registered | pass |
| R2 | Driver names = ['claude', 'codex', 'kimi', 'openai'] | pass |
| R3 | Every driver exposes the parity-contract surface | pass |

Total: 104 + 4 + 3 = 111 tests, all green.
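The tally above can be re-derived mechanically from the matrix. The following is a minimal sketch, not part of the shipped suite: the per-block counts are copied from the table, and the variable names are illustrative only.

```typescript
// Per-driver test counts for contract blocks P1..P8, copied from the matrix.
const blockTests: Array<[string, number]> = [
  ["P1", 3], ["P2", 1], ["P3", 3], ["P4", 3],
  ["P5", 4], ["P6", 5], ["P7", 3], ["P8", 4],
];

const adapters = ["claude", "kimi", "codex", "openai"];
const crossCutting = 4; // C1..C4
const registrySanity = 3; // R1..R3
const baseline = 3353; // main @ 6cfd269b

// Sum the block counts once, then multiply by the number of adapters.
const perDriver = blockTests.reduce((sum, [, count]) => sum + count, 0);
const driverParity = perDriver * adapters.length;
const newTests = driverParity + crossCutting + registrySanity;
const afterSlice = baseline + newTests;

console.log({ perDriver, driverParity, newTests, afterSlice });
// perDriver = 26, driverParity = 104, newTests = 111, afterSlice = 3464
```

The result agrees with the test count delta in section 1 and with the full-suite gate figure in section 2.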

4. Stop-reason value divergence (documented, asserted)

Per contract §P4.2, OpenAI’s adapter passes finish_reason through verbatim (openai.ts:441-443), while Claude, Kimi, and Codex normalize to the Anthropic vocabulary ('end_turn', 'tool_use', 'max_tokens').

The cross-cutting test C3 asserts this explicitly:

const reasonValues = reasons.map((r) => r.reason).sort();
expect(reasonValues).toEqual(['end_turn', 'end_turn', 'end_turn', 'stop']);

That is, 3 of 4 adapters produce 'end_turn'; 1 (OpenAI) produces 'stop'. The shape invariant (P1: stopReason is a string) is uniformly maintained — only the value normalization diverges, and the suite asserts that divergence explicitly.
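To make the divergence concrete, here is a small sketch of the two policies side by side. It is an illustration under assumptions, not the adapters' code: normalizeStopReason and passThroughStopReason are hypothetical names, the wire values fed to each driver are illustrative, and treating unknown values as 'end_turn' is an assumption of this sketch.

```typescript
type AnthropicStopReason = "end_turn" | "tool_use" | "max_tokens";

// Hypothetical normalizer: maps OpenAI-style finish_reason values onto the
// Anthropic vocabulary; values already in that vocabulary pass unchanged.
function normalizeStopReason(wire: string): AnthropicStopReason {
  switch (wire) {
    case "tool_calls":
    case "tool_use":
      return "tool_use";
    case "length":
    case "max_tokens":
      return "max_tokens";
    default:
      return "end_turn"; // assumption: anything else is a normal stop
  }
}

// Pass-through policy, as described for the OpenAI adapter.
function passThroughStopReason(wire: string): string {
  return wire;
}

// Illustrative wire values for a plain successful completion.
const reasons: Array<{ driver: string; reason: string }> = [
  { driver: "claude", reason: normalizeStopReason("end_turn") },
  { driver: "kimi", reason: normalizeStopReason("stop") },
  { driver: "codex", reason: normalizeStopReason("stop") },
  { driver: "openai", reason: passThroughStopReason("stop") },
];

// Sorting makes the comparison independent of driver order.
const reasonValues = reasons.map((r) => r.reason).sort();
console.log(reasonValues); // three 'end_turn' values followed by 'stop'
```

Sorting before comparing means the assertion pins how many adapters normalize without depending on registry iteration order.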

5. Sibling-race compliance

src/domains/router/fallback.ts is UNTOUCHED. Verified by:

git diff --stat HEAD -- src/domains/router/fallback.ts
(empty)

This satisfies the dispatch override’s CRITICAL constraint.

Adapter source files are also untouched: claude.ts, kimi.ts, codex.ts, openai.ts — all unchanged from base 6cfd269b.
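The same check could be automated against a changed-file list. The sketch below is hypothetical and not part of the slice: findProtectedChanges is an illustrative helper, the changed list is copied from the "Files touched" table, and in practice it might come from something like git diff --name-only 6cfd269b HEAD. Adapter files are matched by file name only, because their full paths are not stated in this document.

```typescript
// Returns every changed path that equals a protected entry or ends with
// "/<entry>" (so a bare file name matches in any directory).
function findProtectedChanges(
  changed: readonly string[],
  protectedNames: readonly string[],
): string[] {
  return changed.filter((path) =>
    protectedNames.some((entry) => path === entry || path.endsWith("/" + entry)),
  );
}

// Files listed as touched by this slice.
const changedFiles = [
  "docs/audits/p1-5-8-parity-audit.md",
  "docs/contracts/p1-5-8-parity-contract.md",
  "docs/packets/p1-5-8-parity-packet.md",
  "src/__tests__/domains/router/parity-helpers.ts",
  "src/__tests__/domains/router/parity.test.ts",
  "docs/verification/p1-5-8-parity-verification.md",
];

// fallback.ts by full path; adapter sources by file name (paths assumed unknown).
const protectedNames = [
  "src/domains/router/fallback.ts",
  "claude.ts",
  "kimi.ts",
  "codex.ts",
  "openai.ts",
];

const violations = findProtectedChanges(changedFiles, protectedNames);
console.log(violations.length === 0 ? "no protected files touched" : violations);
```

An empty result corresponds to the empty git diff output recorded above.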

6. Files touched (final)

| File | Status | LoC |
| --- | --- | --- |
| docs/audits/p1-5-8-parity-audit.md | new | 174 |
| docs/contracts/p1-5-8-parity-contract.md | new | 211 |
| docs/packets/p1-5-8-parity-packet.md | new | 288 |
| src/__tests__/domains/router/parity-helpers.ts | new | 596 |
| src/__tests__/domains/router/parity.test.ts | new | 809 |
| docs/verification/p1-5-8-parity-verification.md | new | (this file) |

Production code touched: zero files.

7. Kimi-flake decision

Per the dispatch override §"Optional clarifications":

Optional: fix the kimi.test.ts ● injection seams › 7. latency measurement: 50ms delay → latencyMs >= 50 flake introduced in P1.5.2 W3. … if the staging file’s slice doesn’t authorize touching the existing adapter test files, leave the flake and document in verification doc; the parity suite is the right successor.

Decision: deferred. Rationale:

  1. Editing src/__tests__/domains/router/adapters/kimi.test.ts is out-of-slice scope (P1.5.2 territory).
  2. The parity suite’s P7.1 test asserts result.latencyMs >= 0 instead of >= delay, so the kimi-class flake CANNOT recur in the parity suite by construction.
  3. The kimi adapter test continues to assert >= 50 against a 50 ms delay, which remains the documented brittle pattern. A future round may relax that to >= 45 or switch to deterministic clock injection.

Net effect: the parity suite is the right successor to that assertion. The historical test is left as-is per the override.
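For reference, the deterministic clock injection mentioned in the rationale could look roughly like the sketch below. Every name here (Clock, steppedClock, timeCall, isValidLatency) is hypothetical, and the helper is synchronous for brevity, whereas a real adapter call would be awaited. It is not the adapters' timing code.

```typescript
type Clock = () => number;

// Fake clock: the first call returns startMs, each later call adds stepMs.
function steppedClock(startMs: number, stepMs: number): Clock {
  let current = startMs - stepMs;
  return () => {
    current += stepMs;
    return current;
  };
}

// Hypothetical timing helper: reads the injected clock before and after
// the call instead of consulting the wall clock.
function timeCall<T>(call: () => T, now: Clock): { value: T; latencyMs: number } {
  const start = now();
  const value = call();
  return { value, latencyMs: now() - start };
}

// P7-style predicate: populated, finite, and non-negative.
function isValidLatency(latencyMs: unknown): boolean {
  return typeof latencyMs === "number" && Number.isFinite(latencyMs) && latencyMs >= 0;
}

const timed = timeCall(() => "ok", steppedClock(1000, 50));
console.log(timed.latencyMs, isValidLatency(timed.latencyMs)); // 50 true
```

With an injected clock the measured latency is exactly the step size, so an exact assertion would be stable. The >= 0 predicate holds for any clock that does not run backwards, which is why it does not depend on timer precision.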

8. Branch + commit log

Branch: feature/p1-5-8-parity
Worktree: .worktrees/claude/p1-5-8-parity
Base: origin/main @ 6cfd269b

| Step | Commit |
| --- | --- |
| 1. Audit | audit(p1-5-8-parity): inventory cross-model parity surface |
| 2. Contract | contract(p1-5-8-parity): behavioral contract for cross-model parity |
| 3. Packet | packet(p1-5-8-parity): execution plan |
| 4. Implement | feat(p1-5-8-parity): cross-model parity test suite (4 adapters) |
| 5. Verify (this commit) | verify(p1-5-8-parity): test evidence + parity matrix |

9. Acceptance — slice criteria

From the dispatch override §"Test gate":

  • npm run build && npm run lint && npm test — all gates green.
  • Baseline 3353 tests preserved; +111 new tests; 3464 total.
  • No regression on existing tests (flakes resolve on retry).
  • No mutation to src/domains/router/fallback.ts.
  • Parity matrix covers all 4 adapters × 8 invariant blocks.

10. Closing

P1.5.8 ships 111 parity tests spanning shape, determinism, tokens, stop-reason, tool-use mapping, error mapping, latency, and injection seams across all four δ adapters. The router’s CompletionFn boundary is now test-verified as interchangeable across Claude / Kimi / Codex / OpenAI at the result-shape level: adapters return shape-compatible results, error-class discriminants are uniform by shape, and tool-use translation produces identical Anthropic-shaped content from divergent wire shapes. The one documented exception is the stop-reason value, which OpenAI passes through verbatim (section 4).

The slice ships file-disjoint from sibling P1.5.10 (ζ integration modifies fallback.ts; this slice does not). Both can merge sequentially without conflict.



Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.
