P1.5.5 — N-member Fallback Chain + Circuit Breaker — Behavioral Contract
1. Module surface
1.1. src/domains/router/circuit.ts (new)
export const CIRCUIT_FAILURE_THRESHOLD: 3;
export const CIRCUIT_COOLDOWN_MS: 60_000;
export interface CircuitState {
  readonly failures: number;
  readonly openedAt: number | null;
}
export interface CircuitBreakerOptions {
  readonly nowFn?: () => number; // injectable clock; default Date.now
}
export function recordFailure(modelId: ModelId, options?: CircuitBreakerOptions): void;
export function recordSuccess(modelId: ModelId): void;
export function isOpen(modelId: ModelId, options?: CircuitBreakerOptions): boolean;
export function resetIfElapsed(modelId: ModelId, options?: CircuitBreakerOptions): void;
export function resetCircuitBreaker(modelId?: ModelId): void;
export function snapshot(): ReadonlyMap<ModelId, CircuitState>;
export function getCircuitBreakerState(): ReadonlyMap<ModelId, CircuitState>; // alias of snapshot()
The module owns a private Map<ModelId, CircuitState> initialised lazily on first access. All mutations are confined to the module — callers never write to the map directly.
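A minimal sketch of that encapsulation, assuming a string-typed ModelId and a copy-on-read snapshot(). The helper names store and write are hypothetical; the real circuit.ts may organise this differently.

```typescript
// Hypothetical stand-in for the project's ModelId union.
type ModelId = string;

interface CircuitState {
  readonly failures: number;
  readonly openedAt: number | null;
}

// Module-private store, created lazily on first access (assumption: a
// `let` plus an accessor is one way to get the lazy initialisation).
let stateMap: Map<ModelId, CircuitState> | undefined;

function store(): Map<ModelId, CircuitState> {
  if (stateMap === undefined) stateMap = new Map();
  return stateMap;
}

// Internal write path; in the real module only recordFailure and its
// siblings would reach this.
function write(modelId: ModelId, state: CircuitState): void {
  store().set(modelId, Object.freeze({ ...state }));
}

// Read path: hand out a copy so callers cannot reach the live map.
function snapshot(): ReadonlyMap<ModelId, CircuitState> {
  return new Map(store());
}

write('claude', { failures: 2, openedAt: null });
const view = snapshot();
write('claude', { failures: 3, openedAt: 1_000 });
console.log(view.get('claude')?.failures); // 2: the earlier snapshot is unaffected
```

Handing out a copy keeps earlier snapshots stable while the breaker continues to mutate its own map.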
1.2. src/domains/router/fallback.ts (modified body)
All existing exports preserved byte-identical in name/signature:
- routeRequest(prompt, options?) → Promise<RouteResult>
- FallbackChainExhaustedError
- RouteOptions, RouteResult, FallbackAttempt, CompletionFn, CompletionFnOptions, ScoringFn
- ROUTER_PHASE_0_SHAPE (the name; the literal values change)
New exports (P1.5.5 additions; consumed by P1.5.7 + tests):
// Re-exports of circuit.ts members at the fallback module level so
// the public δ surface stays a single import path.
export {
  CIRCUIT_FAILURE_THRESHOLD,
  CIRCUIT_COOLDOWN_MS,
  getCircuitBreakerState,
  resetCircuitBreaker,
} from './circuit.js';
export type { CircuitState } from './circuit.js';
// New error types raised by the chain walk (NOT raised on top-level
// success path — callers see them only when inspecting attempts[]).
export class RouterTimeoutError extends Error { readonly code = 'ROUTER_TIMEOUT' as const; ... }
export class CircuitOpenError extends Error { readonly code = 'CIRCUIT_OPEN' as const; ... }
export class NoAdapterError extends Error { readonly code = 'NO_ADAPTER' as const; ... }
These three internal-failure-mode errors join AnthropicApiError, AnthropicConfigError, KimiApiError, etc. as legal values inside FallbackAttempt.error. They are not part of the top-level thrown surface: that surface remains FallbackChainExhaustedError (and Error for non-Error throwables, normalized at attempt time).
1.3. src/domains/router/index.ts (Wave 3 fold-in)
Added at the end of the barrel, in alphabetical order:
export * from './adapters/codex.js';
export * from './adapters/kimi.js';
export * from './adapters/openai.js';
The pre-existing export * from './scoring.js' and export * from './fallback.js' lines remain unchanged. Order matters only for symbol-conflict resolution; alphabetical was specified by the dispatch packet.
Symbol-conflict note: All four adapter modules (claude / kimi / codex / openai) re-export CompletionResult from ../integrations/claude.ts (or, in openai’s case, also CompletionResult directly). With export * from multiple modules that all re-export the same symbol from the same upstream, TypeScript de-duplicates the re-export and the symbol resolves identically — no conflict. The same is true for AnthropicTool (kimi/codex re-export it; openai does not). The CHANGED literals in ROUTER_PHASE_0_SHAPE re-exported via fallback.js are picked up unconditionally; downstream consumers see the Phase 1.5 shape.
2. ROUTER_PHASE_0_SHAPE flip
Phase 1.5 literal:
export const ROUTER_PHASE_0_SHAPE: {
  readonly members: 6;
  readonly hasCircuitBreaker: true;
  readonly modelsSupported: readonly [
    'claude',
    'claude-haiku-3-5',
    'claude-sonnet-3-5',
    'gpt-4o',
    'gpt-4o-mini',
    'kimi-k2',
  ];
} = Object.freeze({
  members: 6,
  hasCircuitBreaker: true,
  modelsSupported: Object.freeze([
    'claude',
    'claude-haiku-3-5',
    'claude-sonnet-3-5',
    'gpt-4o',
    'gpt-4o-mini',
    'kimi-k2',
  ] as const),
} as const);
The modelsSupported array is the set of ModelIds for which the default adapter registry has a concrete CompletionFn. members equals modelsSupported.length by construction.
Six entries, not nine: three ModelId values ('gemini-1-5-pro', 'llama-3-3-70b', 'mixtral-8x22b') lack adapters, and Codex is wired into the registry but does not yet correspond to a ModelId value (the registry is keyed by abstract router ID). The flipped marker tracks “what the chain can actually call”, not “every ModelId in the union”.
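One way to guard the “members equals modelsSupported.length” property is a small consistency check, sketched below. checkShape and the registryKeys list are illustrative names, not part of the contract.

```typescript
// Phase 1.5 literal, copied from the contract above.
const SHAPE = Object.freeze({
  members: 6,
  hasCircuitBreaker: true,
  modelsSupported: Object.freeze([
    'claude',
    'claude-haiku-3-5',
    'claude-sonnet-3-5',
    'gpt-4o',
    'gpt-4o-mini',
    'kimi-k2',
  ] as const),
} as const);

// Hypothetical: the keys of the default adapter registry (see 4.3).
const registryKeys: readonly string[] = [
  'claude',
  'claude-sonnet-3-5',
  'claude-haiku-3-5',
  'kimi-k2',
  'gpt-4o',
  'gpt-4o-mini',
];

// Returns a list of human-readable problems; an empty list means consistent.
function checkShape(
  shape: { readonly members: number; readonly modelsSupported: readonly string[] },
  keys: readonly string[],
): string[] {
  const problems: string[] = [];
  if (shape.members !== shape.modelsSupported.length) {
    problems.push(
      `members=${shape.members} but modelsSupported has ${shape.modelsSupported.length} entries`,
    );
  }
  const known = new Set(keys);
  for (const id of shape.modelsSupported) {
    if (!known.has(id)) problems.push(`'${id}' has no registered adapter`);
  }
  return problems;
}

console.log(checkShape(SHAPE, registryKeys)); // []
```

A check of this kind would flag a modelsSupported entry that is added without a matching registry entry, or a members count that drifts from the array.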
3. Circuit breaker FSM
┌──────────┐  recordFailure ×3   ┌─────────────┐  resetIfElapsed (60s passed)  ┌──────────┐
│  CLOSED  │ ──────────────────▶ │    OPEN     │ ─────────────────────────────▶│  CLOSED  │
│ (fail<3) │                     │ (opened60s) │                               │ (fail=0) │
└──────────┘                     └─────────────┘                               └──────────┘
     ▲                                  │
     │                                  │ isOpen()                             ┌──────────┐
     │                                  └──────────────────────────────────────│   SKIP   │
     │                                                                         └──────────┘
     │ recordSuccess                                                                │
     └──────────────────────────────────────────────────────────────────────────────┘
       (zero failures; openedAt unchanged)
3.1. State transitions
| Trigger | Pre-state | Post-state | Notes |
|---|---|---|---|
| recordFailure from {failures: 0, openedAt: null} | CLOSED-0 | {failures: 1, openedAt: null} | Counter increment only. |
| recordFailure from {failures: 1} | CLOSED-1 | {failures: 2, openedAt: null} | Counter increment only. |
| recordFailure from {failures: 2} | CLOSED-2 | {failures: 3, openedAt: now} | Counter reaches threshold ⇒ trip. |
| recordFailure from {failures: 3, openedAt: t0} | OPEN | {failures: 4, openedAt: t0} | openedAt is NOT updated on further failures during open. The cooldown anchors to the FIRST trip. |
| recordSuccess from anything | * | {failures: 0, openedAt: unchanged} | Counter reset only. openedAt is preserved, so the open state remains time-bound. |
| resetIfElapsed(t1) with t1 - openedAt >= 60_000 | OPEN | {failures: 0, openedAt: null} | Time-bound clear. Always called BEFORE isOpen in the chain walk. |
| resetIfElapsed(t1) with t1 - openedAt < 60_000 | OPEN | (unchanged) | No-op. |
| resetCircuitBreaker(modelId) | * | {failures: 0, openedAt: null} | Manual clear, single model. |
| resetCircuitBreaker() | * (all) | (empty map) | Manual clear, all models. |
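The table can be read as one pure transition function per trigger. The sketch below is one such reading, assuming state is an immutable value and the clock reading is passed in explicitly; onFailure, onSuccess and onResetIfElapsed are illustrative names rather than the module's exports, and the timestamps are made up.

```typescript
const CIRCUIT_FAILURE_THRESHOLD = 3;
const CIRCUIT_COOLDOWN_MS = 60_000;

interface CircuitState {
  readonly failures: number;
  readonly openedAt: number | null;
}

const INITIAL: CircuitState = Object.freeze({ failures: 0, openedAt: null });

// recordFailure rows: increment; stamp openedAt only when the counter
// reaches the threshold, and never move an existing anchor.
function onFailure(s: CircuitState, now: number): CircuitState {
  const failures = s.failures + 1;
  const trips = s.openedAt === null && failures === CIRCUIT_FAILURE_THRESHOLD;
  return { failures, openedAt: trips ? now : s.openedAt };
}

// recordSuccess row: counter reset only; openedAt is left alone.
function onSuccess(s: CircuitState): CircuitState {
  return { failures: 0, openedAt: s.openedAt };
}

// resetIfElapsed rows: clear once the cooldown has fully elapsed, else no-op.
function onResetIfElapsed(s: CircuitState, now: number): CircuitState {
  const elapsed = s.openedAt !== null && now - s.openedAt >= CIRCUIT_COOLDOWN_MS;
  return elapsed ? INITIAL : s;
}

// Walk the table top to bottom.
let s = INITIAL;
s = onFailure(s, 0);  // failures 1, openedAt null
s = onFailure(s, 10); // failures 2, openedAt null
s = onFailure(s, 20); // failures 3, openedAt 20 (trip)
s = onFailure(s, 30); // failures 4, openedAt still 20
const stillOpen = onResetIfElapsed(s, 20 + 59_999); // no-op
const cleared = onResetIfElapsed(s, 20 + 60_000);   // back to INITIAL
console.log(s, stillOpen === s, cleared);
```

Keeping the transitions pure makes each row of the table directly testable without a fake timer; the module-level functions would then only look up and store the state.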
3.2. isOpen predicate
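function isOpen(modelId: ModelId, { nowFn = Date.now }: CircuitBreakerOptions = {}): boolean {
  const state = stateMap.get(modelId);
  // No entry, or never tripped: the circuit is closed.
  if (!state || state.openedAt === null) return false;
  // Open only while the cooldown window is still running.
  return (nowFn() - state.openedAt) < CIRCUIT_COOLDOWN_MS;
}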
When the cooldown has elapsed, resetIfElapsed clears state.openedAt to null, so the isOpen call that follows returns false. The chain walk therefore always calls resetIfElapsed first, then isOpen, then the adapter attempt.
3.3. Invariants
- I-CB-1 (Initialization): Before any call, snapshot() is an empty map.
- I-CB-2 (Threshold): A model trips iff failures reaches exactly CIRCUIT_FAILURE_THRESHOLD. Two failures + one success + one failure does NOT trip.
- I-CB-3 (Cooldown anchor): openedAt is set on the trip and not advanced by subsequent failures during the open window.
- I-CB-4 (Time-bound reset): An elapsed cooldown clears state. A successful call during open does NOT clear openedAt because isOpen blocked the attempt; recordSuccess is never called for a skipped model.
- I-CB-5 (Per-model): State is keyed by ModelId. Two models share no state.
- I-CB-6 (Memory-only): No DB write, no file write, no process-shared state. Process exit clears.
- I-CB-7 (Clock injection): nowFn is the only clock seam. All clock reads route through it; Date.now is used only when nowFn is absent.
4. Chain-walk semantics
4.1. Algorithm
async routeRequest(prompt, options = {}):
  1. scoring = options.scoringFn ?? scoreIntent
  2. decision = scoring(prompt, options)
  3. chainOrder = orderedChain(decision.scores)
  4. attempts: FallbackAttempt[] = []
  5. timeoutMs = readTimeoutEnv()
  6. perModelFn = options.completionFnRegistry ?? {}
  7. globalFn = options.completionFn // legacy / Phase 0 test compat
  8. for modelId of chainOrder:
     a. resetIfElapsed(modelId)
     b. if isOpen(modelId):
          attempts.push({ model: modelId, error: new CircuitOpenError(modelId) })
          continue
     c. adapter = perModelFn[modelId] ?? globalFn ?? defaultAdapterFor(modelId, options.tools)
     d. if adapter is undefined:
          attempts.push({ model: modelId, error: new NoAdapterError(modelId) })
          continue
     e. try:
          upstream = await raceWithTimeout(
            adapter(prompt, projectUpstreamOptions(options, modelId)),
            timeoutMs,
          )
          recordSuccess(modelId)
          return freeze({
            model: modelId,
            content: upstream.content,
            finishReason: upstream.stopReason,
            promptTokens: upstream.promptTokens,
            completionTokens: upstream.completionTokens,
            latencyMs: upstream.latencyMs,
          })
        catch err:
          recordFailure(modelId)
          attempts.push({ model: modelId, error: normalize(err) })
  9. throw new FallbackChainExhaustedError(attempts)
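The loop in step 8 can be sketched as runnable TypeScript. This is a reduced illustration, not the contract's implementation: scoring, the timeout race and upstream option projection are left out, the breaker is passed in as an interface, plain Error instances stand in for CircuitOpenError and NoAdapterError, and the names walkChain, Breaker and ChainExhausted are hypothetical.

```typescript
type ModelId = string; // hypothetical stand-in for the project's union
type Adapter = (prompt: string) => Promise<{ content: string }>;
interface Attempt { readonly model: ModelId; readonly error: Error }

// Stand-in for FallbackChainExhaustedError.
class ChainExhausted extends Error {
  readonly attempts: readonly Attempt[];
  constructor(attempts: readonly Attempt[]) {
    super(`all ${attempts.length} chain members failed`);
    this.name = 'ChainExhausted';
    this.attempts = attempts;
  }
}

interface Breaker {
  resetIfElapsed(id: ModelId): void;
  isOpen(id: ModelId): boolean;
  recordSuccess(id: ModelId): void;
  recordFailure(id: ModelId): void;
}

async function walkChain(
  prompt: string,
  chainOrder: readonly ModelId[],
  adapters: Readonly<Record<ModelId, Adapter | undefined>>,
  breaker: Breaker,
): Promise<{ model: ModelId; content: string }> {
  const attempts: Attempt[] = [];
  for (const modelId of chainOrder) {
    breaker.resetIfElapsed(modelId); // 8a: always before isOpen
    if (breaker.isOpen(modelId)) {   // 8b: skip, no recordFailure
      attempts.push({ model: modelId, error: new Error(`circuit open: ${modelId}`) });
      continue;
    }
    const adapter = adapters[modelId]; // 8c: registry lookup only
    if (adapter === undefined) {       // 8d: not counted by the breaker
      attempts.push({ model: modelId, error: new Error(`no adapter: ${modelId}`) });
      continue;
    }
    try {                              // 8e: timeout race omitted here
      const upstream = await adapter(prompt);
      breaker.recordSuccess(modelId);
      return Object.freeze({ model: modelId, content: upstream.content });
    } catch (err) {
      breaker.recordFailure(modelId);
      attempts.push({ model: modelId, error: err instanceof Error ? err : new Error(String(err)) });
    }
  }
  throw new ChainExhausted(attempts);  // 9
}

// Cascade: 'a' fails, 'b' succeeds, 'c' is never reached.
const calls: string[] = [];
const recordingBreaker: Breaker = {
  resetIfElapsed: () => {},
  isOpen: () => false,
  recordSuccess: (id) => { calls.push(`ok:${id}`); },
  recordFailure: (id) => { calls.push(`fail:${id}`); },
};
const demo = walkChain('hi', ['a', 'b', 'c'], {
  a: async () => { throw new Error('boom'); },
  b: async (p) => ({ content: `echo ${p}` }),
}, recordingBreaker);
demo.then((r) => console.log(r.model, calls)); // resolves with model 'b'
```

Passing the breaker in as an interface is a choice made for this sketch so the walk can be exercised in isolation; the contract itself calls the circuit.ts functions directly.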
4.2. orderedChain(scores)
Sort entries by score descending. Ties broken by ASCII ascending on model_id (the scoring layer already enforces this for the winner; we re-apply the comparator for the full ordering). Returns a ReadonlyArray<ModelId> whose first element matches decision.winner.
function orderedChain(scores: Readonly<Record<ModelId, number>>): readonly ModelId[] {
  return (Object.keys(scores) as ModelId[])
    .sort((a, b) => {
      const da = scores[a] ?? 0;
      const db = scores[b] ?? 0;
      if (da !== db) return db - da; // descending
      return a < b ? -1 : a > b ? 1 : 0; // ASCII asc tie-break
    });
}
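A short usage sketch of the comparator, with made-up scores and a string-typed ModelId as a stand-in for the real union. It shows both rules at once: higher scores come first, and equal scores fall back to ascending ID order.

```typescript
type ModelId = string; // hypothetical stand-in for the project's union

function orderedChain(scores: Readonly<Record<ModelId, number>>): readonly ModelId[] {
  return Object.keys(scores).sort((a, b) => {
    const da = scores[a] ?? 0;
    const db = scores[b] ?? 0;
    if (da !== db) return db - da;     // higher score first
    return a < b ? -1 : a > b ? 1 : 0; // tie: ascending by ID
  });
}

// Illustrative scores only; 'claude' and 'gpt-4o' tie at 0.5.
const chain = orderedChain({
  'gpt-4o-mini': 0.1,
  'gpt-4o': 0.5,
  'kimi-k2': 0.9,
  'claude': 0.5,
});
console.log(chain); // [ 'kimi-k2', 'claude', 'gpt-4o', 'gpt-4o-mini' ]
```

The `<` operator compares strings by UTF-16 code unit, which coincides with ASCII order for the all-ASCII model IDs used here.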
4.3. defaultAdapterFor(modelId, tools)
Static adapter registry keyed by ModelId. Built once at module load.
const REGISTRY: Partial<Record<ModelId, CompletionFn>> = {
  'claude':            (p, o) => createCompletion(p, o),
  'claude-sonnet-3-5': (p, o) => createCompletion(p, o),
  'claude-haiku-3-5':  (p, o) => createCompletion(p, o),
  'kimi-k2':           (p, o) => createKimiCompletion(p, o),
  'gpt-4o':            (p, o) => createOpenAiCompletion(p, o),
  'gpt-4o-mini':       (p, o) => createOpenAiCompletion(p, o),
};
When tools is non-empty AND the resolved modelId is a Claude variant (claude / claude-sonnet-3-5 / claude-haiku-3-5), dispatches to createCompletionWithTools instead of createCompletion. For non-Claude modelIds with tools, the registry returns the plain entry (tools forwarding across non-Claude adapters is P1.5.6+ scope per audit §4); a small logged warning is emitted so an operator knows the tools were dropped — but the call still proceeds.
Returns undefined for ModelIds without a registered adapter. The chain walk records that as a NoAdapterError attempt and moves on.
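The dispatch rules above might look like the following. The adapters are stubs, because the real signatures of createCompletion, createCompletionWithTools and createKimiCompletion are not shown in this contract; the Tool shape, the CLAUDE_IDS set and the injectable warn parameter are assumptions made for the sketch.

```typescript
type ModelId = string; // hypothetical stand-in for the project's union
interface Tool { readonly name: string } // hypothetical tool shape
type CompletionFn = (prompt: string) => Promise<{ content: string }>;

// Stub adapters; they only echo which path was taken.
const plainClaude: CompletionFn = async (p) => ({ content: `claude:${p}` });
const kimi: CompletionFn = async (p) => ({ content: `kimi:${p}` });
const claudeWithTools = (tools: readonly Tool[]): CompletionFn =>
  async (p) => ({ content: `claude+${tools.length}tools:${p}` });

const CLAUDE_IDS: ReadonlySet<ModelId> = new Set([
  'claude',
  'claude-sonnet-3-5',
  'claude-haiku-3-5',
]);

const REGISTRY: Partial<Record<ModelId, CompletionFn>> = {
  'claude': plainClaude,
  'claude-sonnet-3-5': plainClaude,
  'claude-haiku-3-5': plainClaude,
  'kimi-k2': kimi,
};

function defaultAdapterFor(
  modelId: ModelId,
  tools: readonly Tool[] = [],
  warn: (msg: string) => void = console.warn,
): CompletionFn | undefined {
  const entry = REGISTRY[modelId];
  if (entry === undefined) return undefined; // chain walk records NoAdapterError
  if (tools.length === 0) return entry;
  if (CLAUDE_IDS.has(modelId)) return claudeWithTools(tools);
  warn(`tools dropped for non-Claude model '${modelId}'`);
  return entry; // the call still proceeds, without tools
}

const warnings: string[] = [];
const collect = (m: string): void => { warnings.push(m); };
const searchTool: Tool = { name: 'search' };

console.log(defaultAdapterFor('claude', [searchTool], collect) === plainClaude); // false
console.log(defaultAdapterFor('kimi-k2', [searchTool], collect) === kimi);       // true
console.log(defaultAdapterFor('gemini-1-5-pro', [], collect));                   // undefined
console.log(warnings.length);                                                    // 1
```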
4.4. projectUpstreamOptions(options, modelId)
Same as Phase 0, with one conditional change: when options.model is absent AND the resolved modelId is the abstract 'claude' (or a Claude variant), no model field is emitted (adapter default applies). For non-Claude adapters, the abstract modelId is emitted as options.model so the adapter has a hint about which provider variant to use. This is conservative — it preserves Phase 0 behavior for the Claude path while letting non-Claude adapters select a sensible model from the abstract ID.
Other fields (maxTokens, systemPrompt, apiKey, fetchFn, logger, delayFn) are forwarded byte-identically to Phase 0.
4.5. raceWithTimeout(promise, timeoutMs)
async function raceWithTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timerId: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timerId = setTimeout(() => reject(new RouterTimeoutError(ms)), ms);
  });
  try {
    return await Promise.race([p, timeoutPromise]);
  } finally {
    if (timerId !== undefined) clearTimeout(timerId);
  }
}
setTimeout is inside the Promise.race guard — the only setTimeout in fallback.ts. The dispatch packet forbids any setTimeout outside this guard.
AbortController note: the dispatch packet's PR-body bullet calls for an AbortController to cancel the adapter call on timeout, but the current adapters (createCompletion, createKimiCompletion, createCodexCompletion, createOpenAiCompletion) do NOT take a signal option. Wiring AbortController through would be a W3+ change that modifies every adapter, and the dispatch packet's forbiddens (§8) prohibit adapter edits. P1.5.5 therefore ships with Promise.race semantics only: the router stops waiting at the timeout, and the slow adapter promise's eventual result is ignored because the race has already settled. This satisfies “no leaks at the router boundary” (clearTimeout is always called in finally) while leaving the upstream socket lifecycle to the adapter, which already manages it through its own retry / fetch lifecycle. This is a conscious deferral; a follow-up task (P1.5.6 or later) may extend adapters to accept a signal.
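The race-only semantic can be observed with a fake slow adapter. In the sketch below, TimeoutError is a local stand-in for RouterTimeoutError, and the 10 ms / 200 ms durations are arbitrary values chosen to keep the example fast.

```typescript
// Local stand-in for RouterTimeoutError (section 5.1).
class TimeoutError extends Error {
  readonly code = 'ROUTER_TIMEOUT' as const;
  readonly timeoutMs: number;
  constructor(timeoutMs: number) {
    super(`timed out after ${timeoutMs} ms`);
    this.name = 'TimeoutError';
    this.timeoutMs = timeoutMs;
  }
}

async function raceWithTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timerId: ReturnType<typeof setTimeout> | undefined;
  const timeoutPromise = new Promise<never>((_, reject) => {
    timerId = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  try {
    return await Promise.race([p, timeoutPromise]);
  } finally {
    if (timerId !== undefined) clearTimeout(timerId);
  }
}

// A fake adapter call that takes 200 ms; nothing can cancel it.
let adapterSettled = false;
const slowCall = new Promise<string>((resolve) => {
  setTimeout(() => { adapterSettled = true; resolve('late result'); }, 200);
});

// The router gives up after 10 ms.
const outcome = raceWithTimeout(slowCall, 10).then(
  (value) => `resolved: ${value}`,
  (err: { code?: string }) => (err.code === 'ROUTER_TIMEOUT' ? 'timed out' : 'failed'),
);

outcome.then((label) => {
  // The adapter promise is still pending here; it settles later and the
  // router never observes its value.
  console.log(label, adapterSettled); // timed out false
});
```

The adapter's own work keeps running for the remaining time, which is the cost of deferring AbortController support: the router is unblocked, but upstream resources are released only when the adapter finishes on its own.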
4.6. readTimeoutEnv()
function readTimeoutEnv(): number {
  const raw = process.env['COLIBRI_MODEL_TIMEOUT'];
  if (raw === undefined || raw === '') return 30_000;
  const parsed = Number.parseInt(raw, 10);
  if (!Number.isFinite(parsed) || parsed <= 0) return 30_000;
  return parsed;
}
Read on every routeRequest call. The env var is per-process, but Jest tests mutate process.env per test, so a fresh read avoids stale values.
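The parsing rules are easiest to see with the environment passed in as a plain object. The variant below is a sketch for illustration only: readTimeout and DEFAULT_TIMEOUT_MS are hypothetical names, and the contract's readTimeoutEnv reads process.env directly.

```typescript
const DEFAULT_TIMEOUT_MS = 30_000;

// Same logic as readTimeoutEnv, with the environment injected.
function readTimeout(env: Readonly<Record<string, string | undefined>>): number {
  const raw = env['COLIBRI_MODEL_TIMEOUT'];
  if (raw === undefined || raw === '') return DEFAULT_TIMEOUT_MS;
  const parsed = Number.parseInt(raw, 10);
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_TIMEOUT_MS;
  return parsed;
}

console.log(readTimeout({}));                                  // 30000 (unset)
console.log(readTimeout({ COLIBRI_MODEL_TIMEOUT: '' }));       // 30000 (empty)
console.log(readTimeout({ COLIBRI_MODEL_TIMEOUT: '5000' }));   // 5000
console.log(readTimeout({ COLIBRI_MODEL_TIMEOUT: '-1' }));     // 30000 (non-positive)
console.log(readTimeout({ COLIBRI_MODEL_TIMEOUT: 'abc' }));    // 30000 (NaN)
console.log(readTimeout({ COLIBRI_MODEL_TIMEOUT: '1500ms' })); // 1500
```

The last case is worth noting: Number.parseInt stops at the first non-digit character, so a value with a trailing unit suffix is accepted as its numeric prefix rather than rejected.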
4.7. normalize(err)
function normalize(err: unknown): Error {
  return err instanceof Error ? err : new Error(String(err));
}
Same as Phase 0 — preserves AC17.
5. New error classes (internal to chain walk)
5.1. RouterTimeoutError
export class RouterTimeoutError extends Error {
  readonly code = 'ROUTER_TIMEOUT' as const;
  readonly timeoutMs: number;
  constructor(timeoutMs: number) {
    super(`δ router attempt timed out after ${timeoutMs} ms`);
    this.name = 'RouterTimeoutError';
    this.timeoutMs = timeoutMs;
  }
}
Raised when Promise.race settles via the timer branch. Wrapped into FallbackAttempt.error. Treated as a failure (counts toward CB threshold).
5.2. CircuitOpenError
export class CircuitOpenError extends Error {
  readonly code = 'CIRCUIT_OPEN' as const;
  readonly modelId: ModelId;
  constructor(modelId: ModelId) {
    super(`δ circuit open for model='${modelId}'`);
    this.name = 'CircuitOpenError';
    this.modelId = modelId;
  }
}
Raised inline when isOpen(modelId) === true. Appended to attempts so the all-tripped path still produces a non-empty array. Does NOT call recordFailure (the breaker is already tripped).
5.3. NoAdapterError
export class NoAdapterError extends Error {
  readonly code = 'NO_ADAPTER' as const;
  readonly modelId: ModelId;
  constructor(modelId: ModelId) {
    super(`δ no adapter registered for model='${modelId}'`);
    this.name = 'NoAdapterError';
    this.modelId = modelId;
  }
}
Raised inline when the registry returns undefined for a chain member. Appended to attempts; not counted toward CB (the breaker tracks adapter failures, not registry absence).
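Because all three classes carry a literal code field, a caller inspecting attempts[] can branch on code without importing the classes. The helpers below are a hedged illustration of that: tallyByCode and allCircuitsOpen are hypothetical names, and the attempt shape is reduced to the two fields used here.

```typescript
// Reduced shapes, assumed from the contract: chain-walk errors carry a
// string `code`; provider errors may not.
interface AttemptError extends Error { readonly code?: string }
interface FallbackAttempt { readonly model: string; readonly error: AttemptError }

// Hypothetical helper: tally attempts by error code, falling back to error.name.
function tallyByCode(attempts: readonly FallbackAttempt[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const { error } of attempts) {
    const key = error.code ?? error.name;
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Hypothetical helper: detect the all-tripped path (AC8) from attempts alone.
function allCircuitsOpen(attempts: readonly FallbackAttempt[]): boolean {
  return attempts.length > 0 && attempts.every((a) => a.error.code === 'CIRCUIT_OPEN');
}

// Builds a plain Error tagged with a code, standing in for the real classes.
const coded = (code: string, msg: string): AttemptError =>
  Object.assign(new Error(msg), { code });

const attempts: FallbackAttempt[] = [
  { model: 'claude', error: coded('CIRCUIT_OPEN', 'circuit open') },
  { model: 'gpt-4o', error: coded('ROUTER_TIMEOUT', 'timed out') },
  { model: 'gemini-1-5-pro', error: coded('NO_ADAPTER', 'no adapter') },
  { model: 'kimi-k2', error: new TypeError('provider failure') },
];

console.log(tallyByCode(attempts));
console.log(allCircuitsOpen(attempts)); // false
```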
6. Invariants (top-level, P1.5.5)
| ID | Invariant | Replaces |
|---|---|---|
| I1 | routeRequest signature byte-identical to Phase 0. Return shape unchanged. | (preserved from P0.5.2) |
| I2 | Chain order derived from scoreIntent descending; ASCII tie-break. | (new) |
| I3 | Per-attempt timeout: 30s default, configurable via COLIBRI_MODEL_TIMEOUT. | (new) |
| I4 | 3 consecutive failures on a modelId open a 60s cooldown. | replaces P0 I8 (no CB) |
| I5 | Cooldown is time-bound. resetIfElapsed clears state when 60s passed. | (new) |
| I6 | A model that fails but is NOT tripped is retried in the next request. | (new) |
| I7 | All chain members tripped or failed ⇒ FallbackChainExhaustedError(attempts) with attempts.length === N. | replaces P0 I5 |
| I8 | ROUTER_PHASE_0_SHAPE.members = N, .hasCircuitBreaker = true, .modelsSupported = readonly [...]. | flips P0 I11 |
| I9 | getCircuitBreakerState() returns a frozen snapshot for observability. | (new) |
| I10 | resetCircuitBreaker(modelId?) clears state for one or all models. | (new) |
| I11 | No DB persistence of CB state; in-memory only. | (new) |
| I12 | No setTimeout outside Promise.race guard. | (new) |
| I13 | No new MCP tool registered (P1.5.7 scope). | (preserved from P0 I9) |
| I14 | Tools passthrough preserved for Claude path. | (preserved from P0 I13) |
| I15 | Non-Error thrown values normalized via new Error(String(err)). | (preserved from P0 AC17) |
| I16 | RouteResult is frozen, with same field shape as Phase 0. | (preserved from P0 AC4) |
| I17 | FallbackChainExhaustedError.cause points to last attempt’s error. | (preserved from P0) |
| I18 | attempts[i].model matches the attempt order (chain order). | (new; Phase 0 had only 1 attempt) |
| I19 | Wave 3 fold-in: src/domains/router/index.ts re-exports ./adapters/{codex,kimi,openai}.js. | (new) |
7. Acceptance criteria
| AC | Description | Test target |
|---|---|---|
| AC1 | Happy path: scoring puts claude first, adapter succeeds → RouteResult{model:'claude'}. | preserved |
| AC2 | scoreIntent consulted exactly once per routeRequest. | preserved |
| AC3 | Cascade: A fails, B succeeds → RouteResult{model:'B'}. Both adapters called. | NEW |
| AC4 | Chain exhaustion: every adapter fails → FallbackChainExhaustedError with attempts.length === N. | NEW (replaces P0 AC8) |
| AC5 | attempts[i].model reflects walk order. | NEW |
| AC6 | CB trip: 3 consecutive failures on model X → 4th call skips X. | NEW |
| AC7 | CB time-bound reset: after 60s elapsed (via injected nowFn), model X reattempted. | NEW |
| AC8 | All-tripped path: every model open → FallbackChainExhaustedError with attempts[i].error instanceof CircuitOpenError. | NEW |
| AC9 | Per-attempt timeout: adapter hangs > 30s → RouterTimeoutError recorded, next chain member tried. | NEW |
| AC10 | COLIBRI_MODEL_TIMEOUT env var override → custom timeout applied. | NEW |
| AC11 | getCircuitBreakerState() returns frozen snapshot. | NEW |
| AC12 | resetCircuitBreaker(modelId) clears single-model state. | NEW |
| AC13 | resetCircuitBreaker() (no arg) clears all state. | NEW |
| AC14 | ROUTER_PHASE_0_SHAPE.members === N (≥4), .hasCircuitBreaker === true, .modelsSupported.length === N. | flipped |
| AC15 | Wave 3 fold-in: index.ts re-exports createKimiCompletion, createCodexCompletion, createOpenAiCompletion (smoke import). | NEW |
| AC16 | Tools passthrough preserved for Claude. | preserved |
| AC17 | Non-Error thrown values wrapped (preserved AC17). | preserved |
| AC18 | RouteResult is frozen (preserved). | preserved |
| AC19 | FallbackChainExhaustedError.message mentions count and last attempt error message. | preserved/extended |
8. Forbiddens
- No MCP tool registration.
- No DB persistence of CB state.
- No setTimeout outside Promise.race.
- No adapter file edits.
- No costUsd / modelsAttempted fields appended to RouteResult (P1.5.6 scope).
- No AMS_* env var reads. COLIBRI_MODEL_TIMEOUT only.
- No main-checkout edits.
9. Contract close
Behavioral contract complete. Ready to write the execution packet (Step 3).