δ Model Router — Algorithm Extraction

Algorithmic content extracted from AMS src/claude/router/smart.js for Colibri implementation reference.

8-Model Matrix

Models registered in src/config/claude.js (as of R45):

| Model ID | Generation | Class | Alias |
|---|---|---|---|
| claude-opus-4-6 | Claude 4 | Most capable, complex reasoning | opus |
| claude-sonnet-4-6 | Claude 4 | Balanced speed/capability | sonnet (default) |
| claude-haiku-4-5-20251001 | Claude 4 | Fastest | haiku |
| claude-3-5-sonnet-20241022 | Claude 3.5 | Previous-gen balanced | |
| claude-3-5-haiku-20241022 | Claude 3.5 | Previous-gen fast | |
| claude-3-opus-20240229 | Claude 3 | Previous-gen capable | |
| claude-3-sonnet-20240229 | Claude 3 | Previous-gen balanced | |
| claude-3-haiku-20240307 | Claude 3 | Previous-gen compact | |

The costPer1kTokens field is a routing weight (lower = preferred for cost-sensitive routing), not a live billing rate.
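A minimal sketch of what such a registry might look like. Field names `costPer1kTokens` and `contextWindow` come from this document; the specific weight values, `strengths` arrays, and the `resolveAlias` helper are illustrative assumptions, not the actual `src/config/claude.js` contents.

```javascript
// Hypothetical registry sketch. costPer1kTokens is a relative
// routing weight (lower = preferred when priority is "cost"),
// not a live billing rate. Weight values here are made up.
const MODELS = [
  { id: "claude-opus-4-6",           alias: "opus",   costPer1kTokens: 15, contextWindow: 200000, strengths: ["reasoning"] },
  { id: "claude-sonnet-4-6",         alias: "sonnet", costPer1kTokens: 3,  contextWindow: 200000, strengths: ["coding"] },
  { id: "claude-haiku-4-5-20251001", alias: "haiku",  costPer1kTokens: 1,  contextWindow: 200000, strengths: ["speed"] },
];

// Resolve a short alias ("opus", "sonnet", "haiku") to a full model ID.
function resolveAlias(alias) {
  const entry = MODELS.find((m) => m.alias === alias);
  return entry ? entry.id : null;
}
```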

Scoring Heuristic (calculateModelScore)

The router uses a heuristic accumulator in src/claude/router/smart.js. Each candidate model is scored and the highest wins.

Important: The simplified quality*0.40 + cost*0.30 + latency*0.20 + load*0.10 formula appears in design docs but is NOT the implementation. The actual algorithm:

score = 0.5   (base score for all candidates)

+ 0.30   if model is the PRIMARY pick for the detected intent
+ 0.15   if model is in the FALLBACK list for the detected intent

+ 0.10   per matched model strength (coding, speed, reasoning)
         e.g., task has code → model has "coding" strength → +0.10

+ 0.20   if task complexity == "very_high" AND model supports extended thinking

+ 0.15   if request contains code AND model excels at coding

+ cost_rank_bonus   if priority == "cost"
                    additive bonus based on cost weight rank (lower cost = higher bonus)

+ 0.20   if priority == "speed" AND model has "speed" strength

- 0.50   if estimated_token_count > 0.90 * model.contextWindow
          (hard penalty for near-overflow)

score = clamp(score, 0.0, 1.0)

Default fallback: If no model scores above 0.0, select claude-3-5-sonnet-20241022.
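The accumulator above can be sketched in JavaScript. The bonus values and clamp come from this document; the shapes of the `model`, `task`, and `intentTable` objects are illustrative assumptions, not the actual structures in `src/claude/router/smart.js`.

```javascript
// Sketch of the scoring accumulator described above.
function calculateModelScore(model, task, intentTable) {
  let score = 0.5; // base score for every candidate

  // Intent bonuses: primary pick beats fallback-list membership.
  const intent = intentTable[task.intent] || { primary: null, fallback: [] };
  if (model.id === intent.primary) score += 0.30;
  else if (intent.fallback.includes(model.id)) score += 0.15;

  // +0.10 per model strength that matches a task signal.
  for (const s of model.strengths) {
    if (task.signals.includes(s)) score += 0.10;
  }

  if (task.complexity === "very_high" && model.extendedThinking) score += 0.20;
  if (task.hasCode && model.strengths.includes("coding")) score += 0.15;

  // Rank-based cost bonus (precomputed per model; lower cost = larger bonus).
  if (task.priority === "cost") score += model.costRankBonus;
  if (task.priority === "speed" && model.strengths.includes("speed")) score += 0.20;

  // Hard penalty for near-overflow of the context window.
  if (task.estimatedTokens > 0.9 * model.contextWindow) score -= 0.50;

  return Math.min(1, Math.max(0, score)); // clamp to [0, 1]
}
```

For example, a primary-intent model with one matching strength would score 0.5 + 0.30 + 0.10 = 0.90 before clamping.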

Weighted Scoring Summary Table

| Factor | Addend | Condition |
|---|---|---|
| Base | +0.50 | Always |
| Intent: primary | +0.30 | Model is top pick for detected intent |
| Intent: fallback | +0.15 | Model is secondary pick for intent |
| Strength match | +0.10 per match | Model strength matches task signal |
| Complexity bonus | +0.20 | Complexity = very_high AND extended thinking supported |
| Code bonus | +0.15 | Request has code AND model excels at coding |
| Cost preference | +variable | priority = "cost"; rank-based additive |
| Speed preference | +0.20 | priority = "speed" AND model has speed strength |
| Context overflow | −0.50 | Estimated tokens > 90% of model context window |

Fallback Chain

When the selected (highest-scoring) model fails:

Step 1: RETRY
  Same model, exponential backoff
  Max 2 retry attempts
  Backoff: 100ms, 200ms

Step 2: ESCALATE (intra-provider)
  Next-highest-scoring model from the SAME provider (Anthropic)
  Re-run scoring, exclude failed model, pick top result

Step 3: CROSS-PROVIDER
  Highest-scoring model from a DIFFERENT provider
  (Currently Anthropic-only; cross-provider is designed for future providers)

Step 4: MANUAL RETRY
  If all candidates fail:
  Queue request for manual retry
  Log failure with model list and error reasons
  Return error to caller with retry_after hint
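Steps 1 and 2 can be sketched as a single loop over score-ranked candidates: retry each model with exponential backoff, then escalate to the next candidate. The `callModel` callback and `retry_after` value are illustrative assumptions; the real engine also distinguishes cross-provider and manual-retry stages.

```javascript
// Sketch of RETRY + ESCALATE, assuming `ranked` is already sorted by
// score (highest first) with failed models excluded on re-entry.
async function dispatchWithFallback(ranked, callModel) {
  const delays = [100, 200]; // backoff before each retry (max 2 retries)
  for (const model of ranked) {
    for (let attempt = 0; attempt <= delays.length; attempt++) {
      try {
        return await callModel(model); // success: return first good response
      } catch (err) {
        if (attempt < delays.length) {
          await new Promise((r) => setTimeout(r, delays[attempt]));
        }
        // Retries exhausted: fall through to the next candidate (ESCALATE).
      }
    }
  }
  // All candidates failed: surface an error with a retry hint (Step 4).
  const e = new Error("all candidates failed");
  e.retry_after = 30; // seconds; illustrative value
  throw e;
}
```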

Intent Mapping

The router maintains an intent-to-model preference table. Detected intent determines which models receive the +0.30 primary bonus:

| Task Type / Intent | Preferred Model | Rationale |
|---|---|---|
| Complex reasoning, long-form analysis | Opus class | Extended thinking support |
| Balanced task, code writing | Sonnet class | Speed/capability tradeoff |
| Simple retrieval, fast response | Haiku class | Lowest latency |
| Previous-gen compatibility | 3.5 Sonnet/Haiku | Stable outputs |
| Cost-sensitive batch work | Haiku class | Lowest cost weight |

Intent is detected from request context: task type field, presence of code blocks, token count estimate, complexity flag.
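A detector over those signals, plus the preference table that feeds the +0.30/+0.15 bonuses, might look like the following. The intent names, request fields, and primary/fallback picks are illustrative assumptions consistent with the table above, not the actual router tables.

```javascript
// Illustrative intent detection from the request-context signals
// listed above: complexity flag, code presence, priority, token estimate.
function detectIntent(request) {
  if (request.complexity === "very_high") return "complex_reasoning";
  if (request.hasCode) return "code_writing";
  if (request.priority === "cost") return "cost_sensitive";
  if ((request.estimatedTokens || 0) < 500) return "fast_response";
  return "balanced";
}

// Intent -> model preference table (primary gets +0.30, fallback +0.15).
const INTENT_PREFERENCES = {
  complex_reasoning: { primary: "claude-opus-4-6", fallback: ["claude-sonnet-4-6"] },
  code_writing:      { primary: "claude-sonnet-4-6", fallback: ["claude-opus-4-6"] },
  fast_response:     { primary: "claude-haiku-4-5-20251001", fallback: ["claude-3-5-haiku-20241022"] },
  cost_sensitive:    { primary: "claude-haiku-4-5-20251001", fallback: ["claude-3-haiku-20240307"] },
  balanced:          { primary: "claude-sonnet-4-6", fallback: ["claude-3-5-sonnet-20241022"] },
};
```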

Context Window Check

Hard penalty applied before final selection:

estimated_tokens = len(prompt) / 4   (heuristic: ~4 chars per token)
                 + len(context_snapshot) / 4
                 + expected_output_tokens (default: 4096)

if estimated_tokens > 0.90 * model.contextWindow:
    score -= 0.50
    # This effectively eliminates the model from selection
    # unless ALL candidates would exceed their windows

Context window limits by model class:

  • Claude 4 Opus/Sonnet: 200K tokens
  • Claude 4 Haiku: 200K tokens
  • Claude 3.5 Sonnet/Haiku: 200K tokens
  • Claude 3 Opus: 200K tokens
  • Claude 3 Sonnet/Haiku: 200K tokens
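The estimate and penalty above translate directly to code. This is a sketch of the same arithmetic; the function names are assumptions.

```javascript
// ~4 chars per token heuristic, plus a default output reservation.
const DEFAULT_OUTPUT_TOKENS = 4096;

function estimateTokens(prompt, contextSnapshot, expectedOutput = DEFAULT_OUTPUT_TOKENS) {
  return Math.ceil(prompt.length / 4) +
         Math.ceil(contextSnapshot.length / 4) +
         expectedOutput;
}

// Returns the score adjustment: -0.5 when the estimate exceeds
// 90% of the model's context window, otherwise 0.
function overflowPenalty(estimatedTokens, contextWindow) {
  return estimatedTokens > 0.9 * contextWindow ? -0.5 : 0;
}
```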

Execution Mode Vocabulary (6 Modes)

Design vocabulary only — NOT implemented as runtime constants.

The tool-chain engine (src/claude/tool-chain/engine.js) uses its own separate ChainMode enum (SEQUENTIAL | PARALLEL | CONDITIONAL | LOOP | DAG) for internal pipeline composition. The 6 modes below are design-level dispatch concepts, not runtime code:

| Mode | Planned behavior |
|---|---|
| SINGLE | Route to one model, wait for complete response |
| PARALLEL | Send to N models simultaneously, return first or best response |
| CHAINED | Model A output feeds into Model B (e.g., draft → review) |
| SWARM | Multiple agents work on decomposed sub-tasks |
| PLAN | Model generates plan, human/PM approves, then execute |
| COWORK | Two models collaborate on shared context |
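Since these are design concepts rather than runtime code, the following is a speculative sketch of how the first three modes could dispatch; the `ExecutionMode` constant and `call` callback are assumptions, and this is deliberately not the tool-chain engine's `ChainMode` enum.

```javascript
// Design-level sketch only: not implemented in the current runtime.
const ExecutionMode = Object.freeze({
  SINGLE: "single", PARALLEL: "parallel", CHAINED: "chained",
  SWARM: "swarm", PLAN: "plan", COWORK: "cowork",
});

async function dispatch(mode, models, call) {
  switch (mode) {
    case ExecutionMode.SINGLE:
      return call(models[0]);               // one model, full response
    case ExecutionMode.PARALLEL:
      return Promise.any(models.map((m) => call(m))); // first success wins
    case ExecutionMode.CHAINED: {
      let out = null;
      for (const m of models) out = await call(m, out); // A's output feeds B
      return out;
    }
    default:
      throw new Error("mode not implemented in this sketch: " + mode);
  }
}
```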

Context Sharing (Multi-Model Routes)

When multiple models participate (CHAINED, PARALLEL, COWORK):

shared_context = {
  conversation_history: [...serialized messages],
  task_metadata: { id, phase, priority, type },
  memory_snapshots: [...relevant memory_short_term entries],
  model_formatting: { per-model instruction prefixes }
}

Before dispatch to each model:
  trim shared_context to fit model.contextWindow
  apply model-specific formatting from surface adapters in src/claude/
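The trim step can be sketched as dropping the oldest conversation messages until the serialized context fits. The budget math (4 chars/token, an output reservation) reuses the heuristics from the context-window section; the function name and drop-oldest-first policy are assumptions.

```javascript
// Illustrative trim: shrink conversation_history until the serialized
// shared context fits under the model's window, keeping newest messages.
function trimSharedContext(sharedContext, contextWindow, reserveTokens = 4096) {
  const budget = contextWindow - reserveTokens; // leave room for output
  const history = [...sharedContext.conversation_history];
  const size = () =>
    Math.ceil(JSON.stringify({ ...sharedContext, conversation_history: history }).length / 4);
  while (history.length > 1 && size() > budget) {
    history.shift(); // drop the oldest message first
  }
  return { ...sharedContext, conversation_history: history };
}
```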

See Also

  • [[concepts/δ-model-router δ Model Router]] — concept overview
  • [[architecture/router Router Architecture]] — implementation details
  • [[extractions/nu-integrations-extraction ν Integrations Extraction]] — Claude API sub-modules
  • [[reference/intelligence-layer Intelligence Layer]] — full intelligence component map
  • [[reference/claude-core Claude Core]] — core sub-module architecture


Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.
