P1.5.8 — Contract (Step 2 of 5)
Behavioral contract for the cross-model parity suite. Pins each parity invariant to a planned test name. Step 3 (packet) translates these into an execution plan; Step 4 (implement) wires the file.
1. Parity invariants
For each adapter A ∈ {claude, kimi, codex, openai}:
P1 — Shape parity (golden path)
| Invariant | Description |
|---|---|
| P1.1 | (adapter, prompt) → CompletionResult returns an object with the 6 fields from claude.ts:134-141: content, model, promptTokens, completionTokens, latencyMs, stopReason. |
| P1.2 | Each field has the spec-defined type: content string, model string, promptTokens non-negative integer, completionTokens non-negative integer, latencyMs non-negative finite number, stopReason string. |
| P1.3 | The returned object has NO extra enumerable own properties beyond the 6 specified. |
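The three shape invariants above can be folded into one checker. The following is a minimal sketch, assuming the six field names listed in P1.1; the helper name `shapeViolations` is hypothetical and is not part of the adapters or the planned suite.

```typescript
// Field list taken from P1.1; the real CompletionResult type lives in claude.ts.
const EXPECTED_KEYS: readonly string[] = [
  'content', 'model', 'promptTokens', 'completionTokens', 'latencyMs', 'stopReason',
];

// Hypothetical helper: returns one message per violated invariant (empty = valid).
function shapeViolations(value: object): string[] {
  const rec = value as Record<string, unknown>;
  const problems: string[] = [];

  // P1.1 + P1.3: exactly the six enumerable own keys, no more and no fewer.
  const actual = Object.keys(rec).sort().join(',');
  const expected = [...EXPECTED_KEYS].sort().join(',');
  if (actual !== expected) problems.push(`keys were [${actual}]`);

  // P1.2: per-field types.
  for (const key of ['content', 'model', 'stopReason']) {
    if (typeof rec[key] !== 'string') problems.push(`${key} is not a string`);
  }
  for (const key of ['promptTokens', 'completionTokens']) {
    const v = rec[key];
    if (typeof v !== 'number' || !Number.isInteger(v) || v < 0) {
      problems.push(`${key} is not a non-negative integer`);
    }
  }
  const latency = rec['latencyMs'];
  if (typeof latency !== 'number' || !Number.isFinite(latency) || latency < 0) {
    problems.push('latencyMs is not a non-negative finite number');
  }
  return problems;
}
```

Returning a list of violations rather than throwing lets a single test report every broken invariant for an adapter at once.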
P2 — Determinism parity
| Invariant | Description |
|---|---|
| P2.1 | Calling adapter twice with identical (prompt, mocked HTTP response) yields two CompletionResults with structurally equal content, model, promptTokens, completionTokens, stopReason (NOT latencyMs which is wall-clock-derived). |
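One way to express P2.1 is to project each result onto its stable fields before comparing. This is a sketch only: the `CompletionResult` interface is reconstructed from P1.1, and `stableFields` / `equalIgnoringLatency` are hypothetical helper names.

```typescript
// Assumed shape, reconstructed from the six fields named in P1.1.
interface CompletionResult {
  content: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  stopReason: string;
}

// Projects a result onto the five fields that P2.1 requires to be stable.
function stableFields(r: CompletionResult): Omit<CompletionResult, 'latencyMs'> {
  return {
    content: r.content,
    model: r.model,
    promptTokens: r.promptTokens,
    completionTokens: r.completionTokens,
    stopReason: r.stopReason,
  };
}

// Key order is fixed by stableFields, so comparing the serialized forms
// is a valid structural-equality check for these flat objects.
function equalIgnoringLatency(a: CompletionResult, b: CompletionResult): boolean {
  return JSON.stringify(stableFields(a)) === JSON.stringify(stableFields(b));
}
```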
P3 — Token-accounting parity
| Invariant | Description |
|---|---|
| P3.1 | promptTokens is populated from the wire’s native location (Anthropic usage.input_tokens; Kimi/Codex/OpenAI usage.prompt_tokens); given matching mocked counts, all 4 adapters yield numerically equal promptTokens. |
| P3.2 | Same for completionTokens (Anthropic output_tokens; OpenAI-shape completion_tokens). |
| P3.3 | Missing usage block degrades to 0/0 (not NaN, not throw). |
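The token-accounting rules can be illustrated with a small reader over the two wire shapes. The usage key names come from P3.1 and P3.2; everything else (`readUsage`, `toCount`, the `Wire` tag) is hypothetical and only sketches behaviour the adapters are expected to have.

```typescript
type Wire = 'anthropic' | 'openai-shape';

interface TokenCounts {
  promptTokens: number;
  completionTokens: number;
}

// Coerces anything that is not a non-negative integer to 0, so NaN never escapes (P3.3).
function toCount(value: unknown): number {
  return typeof value === 'number' && Number.isInteger(value) && value >= 0 ? value : 0;
}

// Hypothetical helper: reads token counts from the wire's native usage keys.
function readUsage(body: unknown, wire: Wire): TokenCounts {
  const usage =
    typeof body === 'object' && body !== null
      ? (body as Record<string, unknown>)['usage']
      : undefined;
  const u = (typeof usage === 'object' && usage !== null ? usage : {}) as Record<string, unknown>;
  return wire === 'anthropic'
    ? { promptTokens: toCount(u['input_tokens']), completionTokens: toCount(u['output_tokens']) }
    : { promptTokens: toCount(u['prompt_tokens']), completionTokens: toCount(u['completion_tokens']) };
}
```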
P4 — Stop-reason normalization parity
| Invariant | Description |
|---|---|
| P4.1 | Claude stop_reason: 'end_turn' → 'end_turn'. |
| P4.2 | OpenAI-shape finish_reason: 'stop' → normalized to 'end_turn' by Kimi + Codex adapters. (OpenAI adapter passes 'stop' through unchanged — see openai.ts:441-443.) |
| P4.3 | OpenAI-shape finish_reason: 'tool_calls' → 'tool_use' (Kimi + Codex). (OpenAI passes through unchanged.) |
| P4.4 | OpenAI-shape finish_reason: 'length' → 'max_tokens' (Kimi + Codex); OpenAI passes through. |
| P4.5 | Missing finish_reason → adapter-specific default ('unknown'). All 4 adapters tolerate this and return a string, never throw. |
Note: The OpenAI adapter at openai.ts:441-443 passes finish_reason
through verbatim, so P4.2, P4.3, and P4.4 do not yield the same
normalized vocabulary across all four adapters. The parity suite tests
this as a documented, expected divergence: Kimi + Codex normalize to
Anthropic vocab; OpenAI passes through. The shape invariant (P1:
stopReason is a string) holds across all four; only the value mapping
diverges.
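The expected values in P4.1 to P4.5, including the OpenAI divergence, can be captured as a small test oracle. This is a hedged sketch: `expectedStopReason` is a hypothetical name, and the pass-through for unmapped values on Kimi and Codex is an assumption the contract does not state.

```typescript
type AdapterName = 'claude' | 'kimi' | 'codex' | 'openai';

// OpenAI-shape finish_reason -> Anthropic vocabulary, per P4.2 to P4.4.
const OPENAI_TO_ANTHROPIC = new Map<string, string>([
  ['stop', 'end_turn'],
  ['tool_calls', 'tool_use'],
  ['length', 'max_tokens'],
]);

// Hypothetical oracle: the stopReason the suite would expect for a given wire value.
function expectedStopReason(adapter: AdapterName, wireReason: string | undefined): string {
  if (wireReason === undefined) return 'unknown'; // P4.5
  if (adapter === 'kimi' || adapter === 'codex') {
    // Assumption: values outside the table pass through unchanged.
    return OPENAI_TO_ANTHROPIC.get(wireReason) ?? wireReason;
  }
  return wireReason; // claude (P4.1) and openai (documented divergence)
}
```

Keeping the divergence inside the oracle means the per-driver test body stays identical for all four adapters.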
P5 — Tool-use mapping parity
| Invariant | Description |
|---|---|
| P5.1 | When mocked with a tool-use response, every adapter returns CompletionResult whose content is a JSON-stringified array of Anthropic-shape tool_use blocks: {type: 'tool_use', id, name, input}. |
| P5.2 | The input field is a parsed object (not a JSON string). |
| P5.3 | Multiple tool calls in one response produce multiple tool_use blocks in content, preserving order. |
| P5.4 | An empty tools array degrades to a plain (non-tool-use) call: tools key omitted from the request body. |
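For the OpenAI-shape adapters, P5.1 to P5.3 amount to translating chat-completions `tool_calls` entries into Anthropic-shape blocks. The sketch below assumes the standard OpenAI wire shape, where `function.arguments` is a JSON string; `toToolUseContent` and `buildBody` are hypothetical names, and the `{ prompt }` request body is a placeholder rather than any adapter's real payload.

```typescript
// OpenAI chat-completions tool call; `arguments` arrives as a JSON string.
interface OpenAiToolCall {
  id: string;
  type: 'function';
  function: { name: string; arguments: string };
}

// Anthropic-shape block named in P5.1.
interface ToolUseBlock {
  type: 'tool_use';
  id: string;
  name: string;
  input: unknown;
}

// Hypothetical mapper: produces the JSON-stringified content described in P5.1.
function toToolUseContent(calls: readonly OpenAiToolCall[]): string {
  const blocks = calls.map((call): ToolUseBlock => ({
    type: 'tool_use',
    id: call.id,
    name: call.function.name,
    input: JSON.parse(call.function.arguments), // P5.2: parsed object, not a string
  }));
  return JSON.stringify(blocks); // map() preserves call order (P5.3)
}

// P5.4: an empty tools array means the tools key is omitted entirely.
function buildBody(prompt: string, tools: readonly unknown[]): Record<string, unknown> {
  const body: Record<string, unknown> = { prompt }; // placeholder body shape
  if (tools.length > 0) body['tools'] = tools;
  return body;
}
```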
P6 — Error mapping parity
| Invariant | Description |
|---|---|
| P6.1 | HTTP 401 → adapter-specific error class with terminal code (ANTHROPIC_API_ERROR / KIMI_API_ERROR / CODEX_API_ERROR / OPENAI_API_ERROR). |
| P6.2 | HTTP 500 → after retries, adapter-specific error class with _RETRIES_EXHAUSTED code. |
| P6.3 | Network-level fetch error → adapter-specific error class with _API_ERROR code, status: undefined. |
| P6.4 | Missing API key → adapter-specific *ConfigError (ANTHROPIC_CONFIG_ERROR / KIMI_CONFIG_ERROR / CODEX_CONFIG_ERROR / OPENAI_CONFIG_ERROR). |
| P6.5 | All four adapter error classes extend Error and have a code discriminant property. |
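P6.5 suggests a structural check that works for any of the four error classes. The sketch below uses a stand-in class, `StubAdapterError`, because the real classes live in the adapters; `isCodedError` and `matchesCode` are hypothetical helpers.

```typescript
// Structural check for P6.5: an Error that carries a string `code` discriminant.
function isCodedError(e: unknown): e is Error & { code: string } {
  return (
    e instanceof Error &&
    'code' in e &&
    typeof (e as Error & { code: unknown }).code === 'string'
  );
}

// Stand-in for an adapter error class (not one of the real classes).
class StubAdapterError extends Error {
  readonly code: string;
  readonly status: number | undefined;

  constructor(message: string, code: string, status?: number) {
    super(message);
    this.name = 'StubAdapterError';
    this.code = code;
    this.status = status; // stays undefined for network-level failures (P6.3)
  }
}

// Hypothetical assertion helper used by each P6 case.
function matchesCode(e: unknown, expectedCode: string): boolean {
  return isCodedError(e) && e.code === expectedCode;
}
```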
P7 — Latency parity
| Invariant | Description |
|---|---|
| P7.1 | Every successful result has latencyMs >= 0 (no stricter lower bound is asserted, because the kimi flake demonstrates timer imprecision). |
| P7.2 | latencyMs is a finite number (Number.isFinite(result.latencyMs)). |
| P7.3 | latencyMs is populated even on the success path with zero-delay mocks (i.e. the adapter records latency from request-start to response-parse, not from request-start to mock-resolve). |
P8 — Injection seam parity
| Invariant | Description |
|---|---|
| P8.1 | All four adapters accept fetchFn and use it exclusively (no global fetch calls during tests). |
| P8.2 | All four adapters accept logger and route their log lines through it (success path logs an info line with adapter tag, tokens, latency). |
| P8.3 | All four adapters accept delayFn and call it for retry sleeps (verifiable by counting delayFn calls under a 500 → 500 → 500 → 200 fixture). |
| P8.4 | All four adapters accept apiKey and prefer it over the env var. |
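The delayFn check in P8.3 can be sketched with a toy retry loop standing in for an adapter. The retry budget of 3 and the exponential backoff values are assumptions made for illustration only; the real adapters' retry policy is not specified here, and `callWithRetry` / `runFixture` are hypothetical names.

```typescript
interface MockResponse {
  status: number;
}
type FetchFn = () => Promise<MockResponse>;
type DelayFn = (ms: number) => Promise<void>;

// Hypothetical stand-in for an adapter's retry loop (budget and backoff are assumed).
async function callWithRetry(
  fetchFn: FetchFn,
  delayFn: DelayFn,
  maxRetries = 3,
): Promise<MockResponse> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetchFn();
    if (res.status < 500) return res;
    if (attempt >= maxRetries) throw new Error('retries exhausted');
    await delayFn(100 * 2 ** attempt); // sleep goes through the injected seam only
  }
}

// Runs the 500 → 500 → 500 → 200 fixture and reports what the seams observed.
async function runFixture(): Promise<{ status: number; delays: number[] }> {
  const statuses = [500, 500, 500, 200];
  const delays: number[] = [];
  const fetchFn: FetchFn = async () => ({ status: statuses.shift() ?? 200 });
  const delayFn: DelayFn = async (ms) => {
    delays.push(ms); // instant no-op: records the request, never sleeps
  };
  const res = await callWithRetry(fetchFn, delayFn);
  return { status: res.status, delays };
}
```

Under these assumptions the fixture resolves with status 200 after exactly three recorded delays, which is the count a P8.3 assertion would pin.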
2. Test surface (planned)
The parity suite ships as ONE file:
src/__tests__/domains/router/parity.test.ts.
A second helper file may be needed for shared fixtures:
src/__tests__/domains/router/parity-helpers.ts.
2.1 Adapter-driver abstraction
Each adapter is exercised through a ParityDriver interface that hides
the per-adapter signature divergence:
```typescript
interface ParityDriver {
  readonly name: 'claude' | 'kimi' | 'codex' | 'openai';
  callPlain: (prompt: string, opts: ParityCallOptions) => Promise<CompletionResult>;
  callWithTools: (prompt: string, tools: AnthropicTool[], opts: ParityCallOptions) => Promise<CompletionResult>;
  makeSuccessResponse: () => unknown;       // adapter-native HTTP body
  makeToolUseResponse: () => unknown;       // tool-call wire shape
  makeMissingUsageResponse: () => unknown;  // missing usage block
  parseErrorClass: ErrorConstructor;        // adapter-specific error class
  configErrorClass: ErrorConstructor;       // adapter-specific config error
  apiErrorCode: string;                     // 'ANTHROPIC_API_ERROR' / etc.
  retriesExhaustedCode: string;             // '_RETRIES_EXHAUSTED' variant
  configErrorCode: string;                  // '_CONFIG_ERROR' variant
}
```
The shared mock-fetch helper is reused from the existing per-adapter
suites’ pattern (makeMockFetch(responses)).
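The existing makeMockFetch helper is not reproduced in this contract, so the following is only a guess at the pattern: a queue of scripted responses consumed in order, with every call recorded. The `MockResponse`, `MockFetch`, and `Scripted` types, the `calls` log, and the use of `Error` entries to simulate network failures are all assumptions.

```typescript
interface MockResponse {
  status: number;
  ok: boolean;
  json(): Promise<unknown>;
}

interface MockFetch {
  (url: string, init?: unknown): Promise<MockResponse>;
  calls: Array<{ url: string; init: unknown }>;
}

// A scripted entry is either an HTTP response or an Error to reject with.
type Scripted = { status: number; body: unknown } | Error;

// Assumed behaviour: responses are consumed in order, one per call.
function makeMockFetch(responses: Scripted[]): MockFetch {
  const queue = [...responses];
  const calls: MockFetch['calls'] = [];
  const fn = async (url: string, init?: unknown): Promise<MockResponse> => {
    calls.push({ url, init });
    const next = queue.shift();
    if (next === undefined) throw new Error('mock fetch: no scripted response left');
    if (next instanceof Error) throw next; // simulates a network-level failure (P6.3)
    return {
      status: next.status,
      ok: next.status >= 200 && next.status < 300,
      json: async () => next.body,
    };
  };
  return Object.assign(fn, { calls });
}
```

Throwing when the queue is empty makes an unexpected extra request fail loudly, which supports the "fetchFn is used exclusively" check in P8.1.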
2.2 Test matrix
The suite runs describe.each(drivers) over the 4 drivers, with these
test blocks per driver:
| # | Block | Invariants asserted |
|---|---|---|
| 1 | P1 — shape parity (golden path) | P1.1, P1.2, P1.3 |
| 2 | P2 — determinism | P2.1 |
| 3 | P3 — token accounting | P3.1, P3.2, P3.3 |
| 4 | P4 — stop-reason parity | P4.1, P4.2, P4.3, P4.4, P4.5 |
| 5 | P5 — tool-use mapping | P5.1, P5.2, P5.3, P5.4 |
| 6 | P6 — error mapping (401/500/net) | P6.1, P6.2, P6.3, P6.4, P6.5 |
| 7 | P7 — latency | P7.1, P7.2, P7.3 |
| 8 | P8 — injection seams (fetch/log/delay/key) | P8.1, P8.2, P8.3, P8.4 |
4 adapters × 8 blocks = 32 driver-parity test cases. Some blocks contain multiple expect-assertions internally; the test count surfaces in the verification doc.
2.3 Cross-cutting tests (single-block, all 4 adapters in one test)
| # | Test name | Invariant span |
|---|---|---|
| C1 | all 4 adapters return structurally equal CompletionResult shape on success | P1.1, P1.2, P1.3 jointly |
| C2 | all 4 adapters yield identical token counts given equivalent mocked usage | P3.1, P3.2 jointly |
| C3 | all 4 adapters return a string-typed stopReason given a 'stop' / 'end_turn' fixture | P4 envelope |
| C4 | all 4 adapters emit JSON-stringified tool_use array on tool-use response | P5.1 jointly |
That makes 4 cross-cutting tests; each runs its joint assertions over all
4 adapters inside a single test() block.
2.4 Helper function — adapter-driver registry
```typescript
const DRIVERS: ParityDriver[] = [
  makeClaudeDriver(),
  makeKimiDriver(),
  makeCodexDriver(),
  makeOpenAiDriver(),
];
```
Each make*Driver returns a frozen object satisfying the interface above.
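As an illustration of the frozen-object requirement, the sketch below builds a reduced registry that carries only the name and two error codes. `DriverCodes` and `makeDriverCodes` are hypothetical; the code strings follow P6.1 and P6.4, and a real driver would also carry the call functions and fixture builders from the interface.

```typescript
type AdapterName = 'claude' | 'kimi' | 'codex' | 'openai';

// Reduced, hypothetical subset of ParityDriver used only for this sketch.
interface DriverCodes {
  readonly name: AdapterName;
  readonly apiErrorCode: string;
  readonly configErrorCode: string;
}

// Builds one frozen entry; the prefix supplies the adapter-specific code family.
function makeDriverCodes(name: AdapterName, prefix: string): DriverCodes {
  return Object.freeze({
    name,
    apiErrorCode: `${prefix}_API_ERROR`,
    configErrorCode: `${prefix}_CONFIG_ERROR`,
  });
}

// Freezing the array as well prevents a test from adding or reordering drivers.
const DRIVER_CODES: readonly DriverCodes[] = Object.freeze([
  makeDriverCodes('claude', 'ANTHROPIC'),
  makeDriverCodes('kimi', 'KIMI'),
  makeDriverCodes('codex', 'CODEX'),
  makeDriverCodes('openai', 'OPENAI'),
]);
```

Freezing guards against one test block mutating a shared driver and silently changing the behaviour of later blocks.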
3. Determinism + isolation
- Every fetch call is mocked via injected fetchFn; no global stubbing.
- delayFn is an instant no-op; retry tests never sleep.
- logger is silent (records into an in-memory array for assertions that demand it).
- apiKey is a fixed dummy string per adapter; no env var reads (the *_API_KEY env is overridden via the options bag).
- Tests do NOT call resetRouterStats() because they do not exercise the router’s cost layer; they call adapters directly.
4. Untouched files (CRITICAL)
- src/domains/router/fallback.ts: sibling P1.5.10 territory.
- src/domains/router/adapters/*.ts: Wave 3 closed.
- src/domains/integrations/claude.ts: Phase 0 closed.
- src/domains/router/tools.ts: P1.5.7 closed; sibling P1.5.10 may touch.
5. Files created
| File | Role |
|---|---|
| src/__tests__/domains/router/parity.test.ts | Main parity test suite |
| src/__tests__/domains/router/parity-helpers.ts | Shared fixtures + driver impls |
No production code is modified.
6. Acceptance gate
npm run build && npm run lint && npm test MUST pass. Pre-existing
known flakes (kimi latency, consensus G7.1, reputation race, server
startup) must pass on retry, matching the baseline of 3353 tests at 6cfd269b.
Expected test count delta: +32 driver-parity + 4 cross-cutting + a few N-of-1 fixture-shape tests = approximately +40 tests, yielding 3393. Final count recorded in Step 5 (verification).