P1.5.8 — Verification (Step 5 of 5)
Captures test evidence and parity matrix for the cross-model parity suite shipped in step 4.
1. Test count delta
| Anchor | Count |
|---|---|
| Baseline (main @ 6cfd269b) | 3353 |
| After P1.5.8 (HEAD) | 3464 |
| Delta | +111 |
The packet projected ~108 net new tests; the implementation lands at +111 (which includes 3 driver-registry sanity tests added in the implement step to defend the helper contract).
2. Gate results
| Gate | Result |
|---|---|
| npm run build | green: tsc exits 0; postbuild copies migrations |
| npm run lint | green: eslint src exits 0; zero warnings |
| npm test (full) | green: 3464 / 3464 pass; 78 / 78 suites pass; 46.6 s wall-clock |
| npm test (parity-only) | green: 111 / 111 pass; 5.8 s wall-clock |
2.1 Flake notes
A single full-suite run earlier in the verification cycle hit flakes in
consensus/parity-harness.test.ts › G7.1,
reputation/tokens.test.ts › ..., and
db/migrations/009-model-candidates.test.ts › .... All three are
pre-existing (documented in the dispatch slice as well-known and
retry-clean) and unrelated to this slice. A clean retry succeeded
(3464 / 3464). The parity suite ran clean on EVERY attempt across the
verification cycle (5 invocations: 1 in-isolation under
npx jest parity.test.ts + 4 inside the full suite).
3. Parity matrix — invariants × adapters
The suite asserts 8 contract blocks (26 tests per adapter) × 4 adapters + 4 cross-cutting + 3
driver-registry sanity = 111 tests. Each contract block contains
multiple internal expect assertions.
| Invariant block | Description | Claude | Kimi | Codex | OpenAI |
|---|---|---|---|---|---|
| P1 — shape parity | 6 spec fields + types + no extras (3 tests) | yes | yes | yes | yes |
| P2 — determinism | Two identical fixtures → equal results (1 test) | yes | yes | yes | yes |
| P3 — token accounting | prompt/completion tokens + missing usage degrades (3 tests) | yes | yes | yes | yes |
| P4 — stop-reason mapping | success + tool-use + always-string (3 tests) | yes | yes | yes | yes |
| P5 — tool-use mapping | content shape + input object + multi-tool + empty-tools (4 tests) | yes | yes | yes | yes |
| P6 — error mapping | 401 + 500-retries-exhausted + network + missing-key + Error subclass (5 tests) | yes | yes | yes | yes |
| P7 — latency | non-negative + finite + populated (3 tests) | yes | yes | yes | yes |
| P8 — injection seams | fetchFn + logger + delayFn + apiKey (4 tests) | yes | yes | yes | yes |
Per driver: 26 tests. Across 4 drivers: 26 × 4 = 104 driver-parity tests.
Cross-cutting (4 tests over all 4 adapters jointly)
| # | Test | Verdict |
|---|---|---|
| C1 | All 4 adapters return structurally equal CompletionResult shape | pass |
| C2 | All 4 adapters yield identical token counts | pass |
| C3 | All 4 adapters return string-typed stopReason | pass |
| C4 | All 4 adapters emit JSON-stringified tool_use[] array | pass |
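One way a structural-equality check like C1 could be expressed is to reduce each result to a value-independent signature and compare signatures. The sketch below is illustrative only: `ResultLike`, `shapeSignature`, `haveSameShape`, and the three sample fields are hypothetical and are not taken from parity-helpers.ts or from the contract's actual field list.

```typescript
// ASSUMPTION: a stand-in result type; the real CompletionResult fields are not reproduced here.
type ResultLike = Record<string, unknown>;

// Reduce a result to sorted "key:kind" pairs so that values are ignored
// and only the presence and kind of each field matters.
function shapeSignature(result: ResultLike): string[] {
  return Object.keys(result)
    .sort()
    .map((key) => {
      const value = result[key];
      const kind = Array.isArray(value) ? "array" : value === null ? "null" : typeof value;
      return `${key}:${kind}`;
    });
}

// True when every result exposes the same keys with the same kinds (no extras, no gaps).
function haveSameShape(results: ResultLike[]): boolean {
  const [first, ...rest] = results;
  if (first === undefined) return true;
  const reference = JSON.stringify(shapeSignature(first));
  return rest.every((r) => JSON.stringify(shapeSignature(r)) === reference);
}

// Hypothetical sample results: values differ, shapes match.
const claudeLike = { content: "hi", stopReason: "end_turn", latencyMs: 12 };
const openaiLike = { content: "hi", stopReason: "stop", latencyMs: 9 };
// An extra field breaks shape parity.
const withExtra = { ...openaiLike, debug: true };

console.log(haveSameShape([claudeLike, openaiLike])); // same keys and kinds
console.log(haveSameShape([claudeLike, withExtra])); // extra key detected
```

Comparing signatures rather than values is what lets a shape check pass even where adapters legitimately disagree on a value, such as the stop reason.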
Driver-registry sanity (3 tests)
| # | Test | Verdict |
|---|---|---|
| R1 | Exactly 4 drivers registered | pass |
| R2 | Driver names = ['claude', 'codex', 'kimi', 'openai'] | pass |
| R3 | Every driver exposes the parity-contract surface | pass |
Total: 104 + 4 + 3 = 111 tests, all green.
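The tally can be re-derived from the per-block counts in the matrix. The following is a small arithmetic sketch; the variable names are illustrative and the numbers are copied from the tables in this document.

```typescript
// Tests per contract block, as listed in the parity matrix (P1..P8).
const testsPerBlock: Record<string, number> = {
  P1: 3, P2: 1, P3: 3, P4: 3, P5: 4, P6: 5, P7: 3, P8: 4,
};
const adapterCount = 4;   // Claude, Kimi, Codex, OpenAI
const crossCutting = 4;   // C1..C4
const registrySanity = 3; // R1..R3
const baseline = 3353;    // main @ 6cfd269b

// Sum the per-block counts to get the per-driver figure.
const perDriver = Object.values(testsPerBlock).reduce((sum, n) => sum + n, 0);
const driverParity = perDriver * adapterCount;
const total = driverParity + crossCutting + registrySanity;

console.log({ perDriver, driverParity, total, after: baseline + total });
```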
4. Stop-reason value divergence (documented, asserted)
Per contract §P4.2, OpenAI’s adapter passes finish_reason through
verbatim (openai.ts:441-443), while Claude, Kimi, and Codex normalize
to the Anthropic vocabulary ('end_turn', 'tool_use', 'max_tokens').
The cross-cutting test C3 asserts this explicitly:
const reasonValues = reasons.map((r) => r.reason).sort();
expect(reasonValues).toEqual(['end_turn', 'end_turn', 'end_turn', 'stop']);
That is, 3 of 4 adapters produce 'end_turn'; 1 (OpenAI) produces
'stop'. The shape invariant (P1: stopReason is a string) is
uniformly maintained — only the value normalization diverges, and the
suite asserts that divergence explicitly.
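To make the divergence concrete, here is a minimal sketch of the two behaviors side by side. Everything in it is an assumption for illustration: the mapping table, the fallback, and the function names are hypothetical and do not reproduce the code in openai.ts or in the other adapters.

```typescript
// Vocabulary the normalizing adapters map onto, as named in the text above.
type AnthropicStopReason = "end_turn" | "tool_use" | "max_tokens";

// ASSUMPTION: a hypothetical mapping from OpenAI-style finish reasons.
const HYPOTHETICAL_MAP: Record<string, AnthropicStopReason> = {
  stop: "end_turn",
  tool_calls: "tool_use",
  length: "max_tokens",
};

// Normalizing behavior (Claude / Kimi / Codex in this sketch).
// ASSUMPTION: unknown values fall back to "end_turn"; the real adapters may differ.
function normalizeStopReason(finishReason: string): AnthropicStopReason {
  return HYPOTHETICAL_MAP[finishReason] ?? "end_turn";
}

// Pass-through behavior (OpenAI in this sketch): the wire value is returned verbatim.
function passThroughStopReason(finishReason: string): string {
  return finishReason;
}

// Simulate all four adapters receiving the same upstream "stop" signal.
const reasons = [
  { adapter: "claude", reason: normalizeStopReason("stop") },
  { adapter: "kimi", reason: normalizeStopReason("stop") },
  { adapter: "codex", reason: normalizeStopReason("stop") },
  { adapter: "openai", reason: passThroughStopReason("stop") },
];

const reasonValues = reasons.map((r) => r.reason).sort();
console.log(reasonValues);
```

Every value is still a string, so the shape invariant holds; only the vocabulary differs, which is why the assertion pins the exact sorted list instead of requiring equality across adapters.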
5. Sibling-race compliance
src/domains/router/fallback.ts is UNTOUCHED. Verified by:
git diff --stat HEAD -- src/domains/router/fallback.ts
(empty)
This satisfies the dispatch override’s CRITICAL constraint.
Adapter source files are also untouched: claude.ts, kimi.ts,
codex.ts, openai.ts — all unchanged from base 6cfd269b.
6. Files touched (final)
| File | Status | LoC |
|---|---|---|
| docs/audits/p1-5-8-parity-audit.md | new | 174 |
| docs/contracts/p1-5-8-parity-contract.md | new | 211 |
| docs/packets/p1-5-8-parity-packet.md | new | 288 |
| src/__tests__/domains/router/parity-helpers.ts | new | 596 |
| src/__tests__/domains/router/parity.test.ts | new | 809 |
| docs/verification/p1-5-8-parity-verification.md | new | (this file) |
Production code touched: zero files.
7. Kimi-flake decision
Per the dispatch override §“Optional clarifications”:
> Optional: fix the kimi.test.ts ● injection seams › 7. latency measurement: 50ms delay → latencyMs >= 50 flake introduced in P1.5.2 W3. … if the staging file’s slice doesn’t authorize touching the existing adapter test files, leave the flake and document in verification doc; the parity suite is the right successor.
Decision: deferred. Rationale:
- Editing src/__tests__/domains/router/adapters/kimi.test.ts is out-of-slice scope (P1.5.2 territory).
- The parity suite’s P7.1 test asserts result.latencyMs >= 0 instead of >= delay, so the kimi-class flake CANNOT recur in the parity suite by construction.
- The kimi adapter test continues to assert >= 50 against a 50 ms delay, which remains the documented brittle pattern. A future round may relax that to >= 45 or use a deterministic clock injection.
Net effect: the parity suite is the right successor to that assertion. The historical test is left as-is per the override.
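The deterministic clock injection mentioned in the rationale could look roughly like the sketch below. It is a hedged illustration, not the adapters' actual seam: `Clock`, `DelayFn`, `timedCall`, and `makeFakeTime` are hypothetical names. The fake delay advances a fake clock, so no real timer is involved and the measured latency cannot undershoot the delay.

```typescript
// ASSUMPTION: hypothetical seam types; the real adapters expose delayFn but
// their exact signatures are not reproduced here.
type Clock = () => number;
type DelayFn = (ms: number) => Promise<void>;

// Measure how long `work` takes according to the injected clock.
async function timedCall(work: () => Promise<void>, now: Clock): Promise<number> {
  const start = now();
  await work();
  return now() - start;
}

// A fake clock that only moves when the fake delay is awaited.
function makeFakeTime(): { now: Clock; delay: DelayFn } {
  let current = 0;
  return {
    now: () => current,
    delay: async (ms) => {
      current += ms;
    },
  };
}

// Simulated 50 ms delay measured against the fake clock.
async function demo(): Promise<number> {
  const fake = makeFakeTime();
  return timedCall(() => fake.delay(50), fake.now);
}

demo().then((latencyMs) => console.log({ latencyMs }));
```

With a real timer, a 50 ms sleep can be observed as slightly under 50 ms depending on timer and clock granularity, which is the brittle pattern described above. Asserting >= 0, as P7.1 does, or injecting the clock both remove that dependency.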
8. Branch + commit log
Branch: feature/p1-5-8-parity
Worktree: .worktrees/claude/p1-5-8-parity
Base: origin/main @ 6cfd269b
| Step | Commit |
|---|---|
| 1. Audit | audit(p1-5-8-parity): inventory cross-model parity surface |
| 2. Contract | contract(p1-5-8-parity): behavioral contract for cross-model parity |
| 3. Packet | packet(p1-5-8-parity): execution plan |
| 4. Implement | feat(p1-5-8-parity): cross-model parity test suite (4 adapters) |
| 5. Verify | (this commit) — verify(p1-5-8-parity): test evidence + parity matrix |
9. Acceptance — slice criteria
From the dispatch override §“Test gate”:
- npm run build && npm run lint && npm test: all gates green.
- Baseline 3353 tests preserved; +111 new tests; 3464 total.
- No regression on existing tests (flakes resolve on retry).
- No mutation to src/domains/router/fallback.ts.
- Parity matrix covers all 4 adapters × 8 invariant blocks.
10. Closing
P1.5.8 ships 111 parity tests spanning shape, determinism, tokens,
stop-reason, tool-use mapping, error mapping, latency, and injection
seams across all four δ adapters. The router’s CompletionFn boundary
is now verified by test to be interchangeable across Claude / Kimi / Codex / OpenAI
at the shape level: adapters return shape-compatible results, error-class
discriminants are uniform-by-shape, and tool-use translation produces
identical Anthropic-shaped content from divergent wire shapes.
The slice ships file-disjoint from sibling P1.5.10 (ζ integration
modifies fallback.ts; this slice does not). Both can merge
sequentially without conflict.