How Does the System Scale Behavior Through Agents and Contracts?


⚠ Phase 0 reality stamp (R74.5). Phase 0 scales through task decomposition + Task-tool sub-agents, not via an MCP-level agent registry. Sub-agents are spawned from a parent Claude session using the Task tool into isolated .worktrees/claude/<task-slug> feature worktrees; each sub-agent runs the 5-step chain (audit → contract → packet → implement → verify) and must writeback via task_update { progress: 100 } + task_transition { to: "DONE" } + thought_record { thought_type: "reflection" } before merkle_finalize. The donor agent_spawn, agent_status, agent_list tools and the entire src/domains/agents/ target are deferred to Phase 1.5 per ADR-005. Phase 0 is single-writer SQLite — one node per deployment. Any reference below to an agent pool, pool strategies, agent_spawn, or skill_get hot-reload describes the donor AMS runtime, not Colibri Phase 0. Canonical values live in colibri-system.md §2.

The Fundamental Scaling Challenge

Colibri is a single-process, single-writer MCP server. A single process cannot execute multiple complex tasks in parallel without distributing work somewhere. But distributing work — spinning up sub-processes, delegating to external agents, managing their lifecycles — creates new problems:

  1. Work loss: If an agent crashes, what happens to its results? How do we know it was working?
  2. Progress loss: If we restart the server, how do we resume a half-finished workflow without losing state?
  3. Resource exhaustion: If we spawn too many agents at once, memory explodes or task latency tanks.
  4. Verification: How do we trust that a sub-agent did the work correctly and reported honestly?

Colibri solves these problems through three mechanisms:

  1. Skills — the unit of reusable capability
  2. Phases — the unit of sequential work
  3. Contracts — the unit of mutual obligation between agent and workflow

Together, these mechanisms transform Colibri from a single-task executor into a multi-agent task orchestrator that scales correctly.


1. The Unit of Scale: The Skill

A skill is a reusable, versioned tool-call sequence defined in a SKILL.md file. It is the atomic unit of capability.

Why Skills Matter

Without skills, scaling fails:

  • You’d need to write the same tool-call sequence repeatedly in different agents
  • Testing and verification would explode in complexity
  • Agents couldn’t be interchangeable — each would have its own unique instruction set
  • Knowledge couldn’t accumulate — every new agent reinvents the wheel

With skills:

  • A sequence like “check code quality → run tests → lint → report results” is defined once
  • Any agent (research, planning, implementation) can execute it deterministically
  • The skill’s verification step is identical across all executions
  • Skills compose — a skill can call other skills

Skill Structure

Each skill lives in its own directory:

.agents/skills/
  ├── colibri-audit-proof/
  │   ├── SKILL.md         — skill definition
  │   └── references/      — supporting docs (tool lists, templates)
  ├── colibri-gsd-execution/
  │   ├── SKILL.md
  │   └── references/
  └── ... (22 total skills)

A SKILL.md defines:

---
skill_name: audit-proof
triggers:
  - task.type == "audit"
  - task.complexity > "medium"
required_tools:
  - thought_record
  - audit_verify
  - memory_pack
workflow:
  - step 1: call thought_record(...)
  - step 2: call audit_verify(...)
  - step 3: call memory_pack(...)
verification:
  - proof_chain_valid == true
  - all_hashes_match == true
---
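
The triggers block is declarative, so the registry only needs a small matcher to decide whether a skill applies to a task. A minimal sketch (the Task shape, the complexity ordering, and skillMatches are illustrative assumptions, not spec):

type Complexity = "low" | "medium" | "high";
const COMPLEXITY_ORDER: Complexity[] = ["low", "medium", "high"];

interface Task {
  type: string;
  complexity: Complexity;
}

interface SkillTrigger {
  task_type?: string;            // matches: task.type == "<value>"
  complexity_above?: Complexity; // matches: task.complexity > "<value>"
}

function skillMatches(task: Task, trigger: SkillTrigger): boolean {
  if (trigger.task_type && task.type !== trigger.task_type) return false;
  if (
    trigger.complexity_above &&
    COMPLEXITY_ORDER.indexOf(task.complexity) <=
      COMPLEXITY_ORDER.indexOf(trigger.complexity_above)
  ) {
    return false; // trigger requires complexity strictly above the threshold
  }
  return true;
}

// skillMatches({ type: "audit", complexity: "high" },
//              { task_type: "audit", complexity_above: "medium" }) === true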

The 22 Skills (6 Tiers)

| Tier | Count | Skills | Purpose |
|---|---|---|---|
| PM & Orchestration | 2 | project-manager, tier1-chains | Coordinate workflows, hand off between phases |
| Task & Roadmap | 2 | task-management, roadmap-progress | CRUD on tasks, track milestones |
| GSD & Execution | 2 | gsd-execution, autonomous | Run workflows, execute phases |
| Audit & Proof | 3 | audit-proof, memory-context, verification | Build audit trails, generate proofs |
| Infrastructure | 2 | mcp-server, observability | Server ops, monitoring |
| Integration | 1 | obsidian-integration | External system sync |

Note: This is the target design. Phase 0 has not yet begun; zero TypeScript code exists.

Why This Tier Structure Works

  • Lower tiers (PM, Task, GSD) are high-frequency, high-visibility — executed most often
  • Middle tiers (Audit, Infrastructure) are support layers — called by lower tiers or on demand
  • Upper tiers (Integration) are specialized — called rarely, only when external sync is needed

This hierarchy ensures that the most-tested, most-stable skills run the core loop, while specialized skills are isolated and can be upgraded independently.


2. The Unit of Work: The Phase

A phase is a sequential stage in a multi-phase workflow. Phase N+1 does NOT start until phase N completes.

Why Phases Matter

Without phases, scaling collapses:

  • All agents would run in parallel, and you’d have no way to enforce dependencies
  • Earlier results wouldn’t be available to later tasks
  • Rollback would be impossible — later tasks might depend on earlier results
  • Testing each stage independently would be impossible

With phases:

  • Each phase has a clear input (results from phase N-1) and output (results for phase N+1)
  • You can verify each phase independently
  • If phase 3 fails, you can restart from phase 3 without re-running phases 1 and 2
  • Each phase can use a different agent type, optimized for that kind of work

The 5-Phase Workflow

A typical multi-agent workflow in Colibri looks like:

Workflow ID: wf-2024-001
├── Phase 1: audit (research agent)
│   Input: task description, dependencies
│   Output: audit report, risk assessment
│   Agent type: RESEARCH
│
├── Phase 2: contract (roadmap agent)
│   Input: audit report
│   Output: execution plan, phase breakdown
│   Agent type: ROADMAP
│
├── Phase 3: execution packet (planning agent)
│   Input: execution plan
│   Output: task assignments, resource budget
│   Agent type: PLANNING
│
├── Phase 4: implementation (coder agent)
│   Input: task assignments
│   Output: code, tests, documentation
│   Agent type: IMPLEMENTATION
│
└── Phase 5: verification (reviewer agent)
    Input: code, tests, documentation
    Output: verification report, Merkle proof
    Agent type: REVIEWER

Each phase is represented in the database:

interface WorkflowPhase {
  id: string;                          // 'phase-5-001'
  workflow_id: string;                 // 'wf-2024-001'
  phase_num: number;                   // 1, 2, 3, 4, 5
  agent_type: "RESEARCH" | "ROADMAP" | "PLANNING" | "IMPLEMENTATION" | "REVIEWER";
  status: "PENDING" | "RUNNING" | "COMPLETED" | "FAILED";
  result: {
    output_hash: string;               // SHA256 of phase results
    intermediate_results: Record<string, any>;
    elapsed_time_ms: number;
  };
  created_at: number;                  // epoch ms
  completed_at?: number;               // epoch ms (set when status changes to COMPLETED)
}

Why Phases Are Sequential

The design enforces: phase N+1 starts only after phase N reaches a terminal state, and only a COMPLETED phase lets the workflow advance; a FAILED phase halts it and triggers escalation.

This is not a limitation — it’s a feature. Here’s why:

  1. Determinism: If phases run in parallel, different interleavings could produce different results. By running sequentially, you get one canonical ordering.
  2. Dependency clarity: Phase 4 (implementation) depends on phase 3 (execution packet). If phase 3 fails, phase 4 doesn’t start — no wasted work.
  3. Rollback safety: If phase 5 (verification) fails, you can mark the workflow as failed and restart from phase 1. The database is clean.
  4. Testing isolation: You can stub out phases 2–5 and test phase 1 in isolation. Then stub out phase 1 and test phase 2, etc.
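
A sketch of what this guard could look like in code (the Phase shape and nextRunnablePhase are illustrative names, not the actual β implementation):

type PhaseStatus = "PENDING" | "RUNNING" | "COMPLETED" | "FAILED";

interface Phase { phase_num: number; status: PhaseStatus; }

// Returns the next phase allowed to start, or null if the workflow
// is blocked (a phase failed), still busy, or finished.
function nextRunnablePhase(phases: Phase[]): Phase | null {
  const ordered = [...phases].sort((a, b) => a.phase_num - b.phase_num);
  for (const p of ordered) {
    if (p.status === "FAILED") return null;   // halt: escalate, don't advance
    if (p.status === "COMPLETED") continue;   // done: look at the next phase
    if (p.status === "PENDING") return p;     // first pending phase may start
    return null;                              // RUNNING: wait for it to finish
  }
  return null; // all phases completed
}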

Checkpointing: Restart-Safety

At each phase boundary, Colibri writes a checkpoint to the database:

INSERT INTO workflow_checkpoints (workflow_id, phase_num, checkpoint_data)
VALUES (
  'wf-2024-001',
  3,
  JSON_OBJECT(
    'phase_results', <JSON>,
    'elapsed_time_ms', 180000,
    'agent_id', 'agent-planning-001',
    'memory_pack', <compressed memory>
  )
);

If the server crashes at any point, recovery looks up the latest checkpoint and resumes from there. No work is lost; no phase runs twice.
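
A recovery sketch, assuming the checkpoint table above is queried via the better-sqlite3 driver; resumeFromPhase is a stand-in for the real phase executor:

import Database from "better-sqlite3";

interface CheckpointRow { phase_num: number; checkpoint_data: string; }

function recoverWorkflow(
  db: Database.Database,
  workflowId: string,
  resumeFromPhase: (phaseNum: number, data: unknown) => void
): void {
  const row = db
    .prepare(
      `SELECT phase_num, checkpoint_data
         FROM workflow_checkpoints
        WHERE workflow_id = ?
        ORDER BY phase_num DESC
        LIMIT 1`
    )
    .get(workflowId) as CheckpointRow | undefined;

  if (!row) {
    resumeFromPhase(1, null); // no checkpoint yet: start from phase 1
    return;
  }
  // A checkpoint at phase N means phases 1..N are done; resume at N+1.
  resumeFromPhase(row.phase_num + 1, JSON.parse(row.checkpoint_data));
}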


3. The Contract: Writeback

The writeback contract is the binding agreement between a parent workflow and its child agent:

“Before you terminate, you must produce task_update + thought_record. If you don’t, you will be flagged as orphaned and the workflow will escalate.”

The 3-Item Writeback Contract

Every agent MUST produce these three outputs before termination:

  1. task_update — status (done/failed/blocked), progress (0–100%), summary
  2. thought_record — task_id, branch name, commit SHA, tests run, blockers
  3. memory_pack (optional but recommended) — compress working memory into long-term store

Why This Contract Exists

Without this contract, you have no way to know if an agent:

  • Actually completed its work or just exited gracefully
  • Produced results or failed silently
  • Left the worktree in a valid state or crashed mid-operation

With the contract, you have a proof that the agent did real work:

  • task_update proves the agent knows the outcome
  • thought_record proves the agent wrote its reasoning to the audit trail
  • The combination lets you verify every intermediate step

Data Shapes

// task_update (MCP tool call)
interface TaskUpdateParams {
  task_id: string;
  status: "done" | "failed" | "blocked";
  progress: number;  // 0–100
  summary: string;   // one-line summary
}

// thought_record (MCP tool call)
interface ThoughtRecordParams {
  task_id: string;
  branch: string;           // git branch name
  commit_sha: string;       // git commit SHA
  tests_run: number;        // count of tests
  tests_passed: number;     // count passing
  blockers: string[];       // array of blocking issues
}

// memory_pack (MCP tool call)
interface MemoryPackParams {
  memory_json: string;  // compressed JSON of working memory
  retention_level: "short_term" | "medium_term" | "long_term";
  ttl_epochs?: number;  // expiry (optional)
}
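
Reusing the three interfaces above, a sketch of how a sub-agent could fulfill the contract in order; callTool is a placeholder for whatever MCP client binding the agent uses:

type CallTool = (name: string, params: unknown) => Promise<void>;

async function fulfillWriteback(
  callTool: CallTool,
  update: TaskUpdateParams,
  thought: ThoughtRecordParams,
  memory?: MemoryPackParams
): Promise<void> {
  await callTool("task_update", update);     // 1. outcome + progress
  await callTool("thought_record", thought); // 2. reasoning into the audit trail
  if (memory) {
    await callTool("memory_pack", memory);   // 3. optional memory compression
  }
}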

Enforcement: Convention vs. Hard Block

The writeback contract is enforced at convention level, not runtime. This means:

  • Agents that skip writeback are not blocked by the system
  • Instead, they are flagged as orphaned in a periodic scan
  • Warnings are logged: WARN [recovery] Agent #agent-impl-042 terminated without writeback
  • The parent workflow escalates: marks the phase as FAILED, triggers fallback

Why convention, not hard block?

Because the system cannot force an agent to call task_update if the agent’s process crashes, network connection dies, or the code is buggy. A hard block would only cause deadlocks, not prevent orphaned agents. Flagging + escalation is more honest: “We detected you didn’t report back; we’re treating this as failure.”

Orphan Detection Recovery

Every 60 seconds, a recovery process:

  1. Scans the agents table for agents in state BUSY whose last heartbeat > 5 minutes ago
  2. Checks the mcp_thought table for recent thought_record entries from those agents
  3. If no recent thought_record: marks agent as FAILED, logs warning, triggers workflow escalation

-- Recovery scan pseudocode
SELECT a.id, a.task_id, a.last_heartbeat
FROM agents a
WHERE a.state = 'BUSY'
  AND a.last_heartbeat < (NOW() - INTERVAL '5 minutes')
  AND NOT EXISTS (
    SELECT 1 FROM mcp_thought
    WHERE agent_id = a.id
      AND created_at > a.last_heartbeat
  );
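
A sketch of the 60-second loop driving that scan; findOrphans would run the SQL above, while markFailed and escalate are stand-ins for the real state transition and workflow escalation hooks:

const HEARTBEAT_TIMEOUT_MS = 5 * 60 * 1000;

interface OrphanRow { id: string; task_id: string; last_heartbeat: number; }

function startOrphanScan(
  findOrphans: (cutoff: number) => OrphanRow[],
  markFailed: (agentId: string, reason: string) => void,
  escalate: (taskId: string) => void
): NodeJS.Timeout {
  return setInterval(() => {
    const cutoff = Date.now() - HEARTBEAT_TIMEOUT_MS;
    for (const orphan of findOrphans(cutoff)) {
      console.warn(`WARN [recovery] Agent #${orphan.id} terminated without writeback`);
      markFailed(orphan.id, "orphaned_no_writeback");
      escalate(orphan.task_id); // parent workflow marks the phase FAILED
    }
  }, 60_000);
}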

4. The Pool: Agent Distribution

An agent pool is a group of agents assigned to handle a particular phase. The pool distributes incoming work according to a configurable strategy.

The 5 Pool Strategies

| Strategy | When to use | Example |
|---|---|---|
| FIFO | Low variance, predictable load | Research agents reading documents in order |
| PRIORITY_QUEUE | Mixed priorities (urgent vs backlog) | Implementation queue with hot fixes first |
| ROUND_ROBIN | Load balancing across identical agents | Multiple code-review agents |
| LEAST_LOADED | Minimize agent idle time | Verification agents, each with different speeds |
| CAPACITY_AWARE | Agents have different capability levels | Mix of senior and junior planners |

Pool Configuration

Each phase specifies:

interface AgentPoolConfig {
  workflow_id: string;
  phase_num: number;
  strategy: "FIFO" | "PRIORITY_QUEUE" | "ROUND_ROBIN" | "LEAST_LOADED" | "CAPACITY_AWARE";
  min_size: number;              // minimum agents to keep alive
  max_size: number;              // maximum agents to spawn
  agent_type: "RESEARCH" | "ROADMAP" | "PLANNING" | "IMPLEMENTATION" | "REVIEWER";
}

Auto-Scaling

The pool size adjusts dynamically based on:

interface AutoScalingMetrics {
  queue_depth: number;           // tasks waiting
  current_load: number;          // active tasks / pool_size
  throughput: number;            // tasks/second (last 60s)
  p99_latency: number;           // 99th percentile latency
}

// Scaling decision:
// if queue_depth > (current_load * 1.5):
//   new_size = min(max_size, current_size + 1)
// elif queue_depth < (current_load * 0.5) && current_size > min_size:
//   new_size = max(min_size, current_size - 1)

If tasks are backing up in the queue, spawn more agents. If agents are idle, shrink the pool.
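
The commented rule above, written out as a function against the AutoScalingMetrics interface (a sketch of the donor design, per the Phase 0 stamp):

function decidePoolSize(
  metrics: AutoScalingMetrics,
  current_size: number,
  min_size: number,
  max_size: number
): number {
  if (metrics.queue_depth > metrics.current_load * 1.5) {
    return Math.min(max_size, current_size + 1); // backlog growing: spawn
  }
  if (metrics.queue_depth < metrics.current_load * 0.5 && current_size > min_size) {
    return Math.max(min_size, current_size - 1); // agents idle: shrink
  }
  return current_size; // steady state
}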

CV: Capability Profile

Each agent has a CV (capability profile) that describes what it can do:

interface AgentCV {
  id: string;                           // 'agent-impl-042'
  type: "RESEARCH" | "ROADMAP" | "PLANNING" | "IMPLEMENTATION" | "REVIEWER";
  skills: string[];                     // ['gsd-execution', 'code-review']
  permissions: string[];                // which tools it can call
  limits: {
    max_concurrent_tasks: number;       // usually 1
    token_budget: number;               // per-task token limit
    timeout_seconds: number;            // max execution time
  };
  history: {
    success_rate: number;               // 0.0–1.0
    avg_duration_ms: number;
    total_tasks_completed: number;
  };
}

The task router (β) uses the CV to decide: “Which agent should handle this task?” An urgent_important code review goes to the highest-success-rate REVIEWER agent.
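
A sketch of that routing decision: pick the best REVIEWER by success rate, with average duration as an illustrative tie-breaker (not a rule from the spec):

function pickReviewer(agents: AgentCV[]): AgentCV | null {
  const reviewers = agents.filter((a) => a.type === "REVIEWER");
  if (reviewers.length === 0) return null;
  return reviewers.reduce((best, a) =>
    a.history.success_rate > best.history.success_rate ||
    (a.history.success_rate === best.history.success_rate &&
      a.history.avg_duration_ms < best.history.avg_duration_ms)
      ? a
      : best
  );
}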

The 6 Agent Lifecycle States

PENDING → INITIALIZING → READY → BUSY → TERMINATED
   ↓           ↓          ↓       ↓
 FAILED     FAILED      FAILED  FAILED

| State | Meaning | Transitions |
|---|---|---|
| PENDING | Agent ID allocated; process not yet started | → INITIALIZING (spawned) or → TERMINATED (cancelled) |
| INITIALIZING | Loading skills, setting up worktree, verifying permissions | → READY (success) or → FAILED (setup error) |
| READY | Idle, waiting for assignment | → BUSY (task assigned) or → TERMINATED (pool shrinks) |
| BUSY | Executing a task | → READY (completion) or → FAILED (error) |
| FAILED | Error state; may be retried or escalated | → PENDING (retry) or → TERMINATED (max retries exceeded) |
| TERMINATED | Execution complete; resources released; agent ID recycled | (final state) |

Transitions are atomic and logged:

INSERT INTO agent_state_transitions (agent_id, from_state, to_state, reason, timestamp)
VALUES ('agent-impl-042', 'BUSY', 'READY', 'task_completed', CURRENT_TIMESTAMP);
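
The table above defines a complete state machine, so illegal moves can be rejected before the transition row is written. A sketch (ALLOWED and assertTransition are illustrative names):

type AgentState =
  | "PENDING" | "INITIALIZING" | "READY" | "BUSY" | "FAILED" | "TERMINATED";

const ALLOWED: Record<AgentState, AgentState[]> = {
  PENDING: ["INITIALIZING", "TERMINATED"],
  INITIALIZING: ["READY", "FAILED"],
  READY: ["BUSY", "TERMINATED"],
  BUSY: ["READY", "FAILED"],
  FAILED: ["PENDING", "TERMINATED"], // retry or give up
  TERMINATED: [],                    // final state
};

function assertTransition(from: AgentState, to: AgentState): void {
  if (!ALLOWED[from].includes(to)) {
    throw new Error(`Illegal agent transition: ${from} -> ${to}`);
  }
}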

5. The Admission Gate: Rate Limiting at Scale

Before a task can enter the system, it must pass through the admission layer implemented by κ (Rule Engine). This is where Colibri limits throughput and prevents resource exhaustion.

Token Bucket Per Event Type

Each event type (task_create, task_update, thought_record, etc.) gets a token bucket:

interface TokenBucket {
  event_type: string;              // 'task_create', 'thought_record', etc.
  capacity: number;                // max tokens
  refill_rate: number;             // tokens per second
  current_tokens: number;          // tokens available now
}

// Example: task_create bucket
// capacity: 100
// refill_rate: 5 per second
// If you call task_create 100 times instantly, the 101st call is rate-limited
// But after 20 seconds, you have 100 tokens again

This prevents a single caller from monopolizing the system.
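
A minimal lazy-refill implementation of that bucket (a sketch; the real κ admission layer may differ):

class Bucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number,
    private readonly refillRate: number // tokens per second
  ) {
    this.tokens = capacity;
  }

  tryConsume(n = 1): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens < n) return false; // rate-limited
    this.tokens -= n;
    return true;
  }
}

// const taskCreate = new Bucket(100, 5); // capacity 100, refill 5/s
// The first 100 instant calls succeed; the 101st returns false until refill.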

Reputation as Backpressure

Callers with higher reputation get higher token bucket capacity:

// Pseudocode
if (caller_reputation < MIN_REPUTATION_FOR_ACTION) {
  // Reject: not enough reputation
  throw new Error("Insufficient reputation to create tasks");
}

// Allow, but use their reputation as the token bucket multiplier
const bucket_capacity = BASE_CAPACITY * (caller_reputation / MAX_REPUTATION);

A brand-new user (reputation = 0) gets a tiny bucket. A trusted system (reputation = 10,000) gets a large bucket. This is natural backpressure: the system trusts high-reputation callers and limits untrusted ones.

VRF Audit: 5% Sampling

Not every event is verified in full. That would be too expensive. Instead:

// On event admission:
const vrf_score = compute_vrf(event_id, epoch);
if (vrf_score % 100 < 5) {  // 5% chance
  // Full verification: check all constraints, hashes, signatures
  audit_verify(event);
} else {
  // Quick check only: reputation + token bucket
  admit_quick(event);
}

This ensures that:

  • 95% of events are admitted quickly
  • 5% are audited deeply
  • An attacker cannot predict which events will be audited (VRF is unpredictable)
  • Statistically, any sustained attack will be caught
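
For illustration only, a keyed hash can stand in for the elided compute_vrf: it gives deterministic, hard-to-predict scores, though unlike a real VRF it produces no publicly verifiable proof:

import { createHmac } from "node:crypto";

function sampleForDeepAudit(eventId: string, epoch: number, key: Buffer): boolean {
  const digest = createHmac("sha256", key)
    .update(`${eventId}:${epoch}`)
    .digest();
  const score = digest.readUInt32BE(0);
  return score % 100 < 5; // ~5% of events get the full audit_verify path
}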

Stake Freeze at Admission

When a high-stakes task is admitted, a portion of the caller’s stake is frozen:

// On task_create with stake_required = 1000
const caller_stake = get_stake(caller);
if (caller_stake.available < 1000) {
  throw new Error("Insufficient stake");
}

// Freeze the stake
freeze_stake(caller, 1000);
// stake.available -= 1000
// stake.frozen += 1000

// On task_done, release the stake
release_stake(caller, 1000);
// stake.frozen -= 1000
// stake.available += 1000

This ensures the caller has “skin in the game” — they lose real resources if they abuse the system.
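
A sketch of the freeze/release bookkeeping implied by the pseudocode above (the Stake shape is illustrative); note the invariant that available + frozen never changes except by explicit slashing:

interface Stake { available: number; frozen: number; }

function freezeStake(s: Stake, amount: number): void {
  if (s.available < amount) throw new Error("Insufficient stake");
  s.available -= amount;
  s.frozen += amount;
}

function releaseStake(s: Stake, amount: number): void {
  if (s.frozen < amount) throw new Error("Cannot release more than frozen");
  s.frozen -= amount;
  s.available += amount;
}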


6. Intelligence Scaling: The Model Router (δ)

Colibri is not a single AI model. It is a router that distributes tasks across 8 AI model candidates and selects the best fit for each job.

The 8 Model Candidates

  • Claude 3.5 Sonnet (best general reasoning)
  • Claude 3 Opus (complex reasoning, longer context)
  • Claude 3 Haiku (fast, cheap, limited tasks)
  • GPT-4 Turbo (alternative vendor, mitigates lock-in)
  • GPT-4o (vision tasks)
  • Gemini Pro (cost optimization)
  • Llama 2 (on-prem, compliance)
  • Mixtral (specialized domains)

Intent-Driven Scoring

When a task arrives, the model router scores each candidate on:

| Dimension | Example | Scoring |
|---|---|---|
| Task complexity | “Summarize this doc” vs “Prove this theorem” | Low complexity → cheap model; high → expensive model |
| Domain expertise | “code review” vs “legal analysis” | Legal tasks → GPT-4 Turbo (better training); code → Claude |
| Token budget | Budget = 10K tokens total | Models with lower cost-per-token win |
| Latency tolerance | “Return results in 5 seconds” vs “return in 1 hour” | Tight deadline → fast model; loose → slower model |

interface ModelRoutingScore {
  model: string;
  complexity_score: number;        // 0–100
  expertise_score: number;         // 0–100
  cost_efficiency: number;         // 0–100
  latency_fit: number;             // 0–100
  composite_score: number;         // weighted average
}

The router selects the model with the highest composite_score.
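
A sketch of the scoring and selection step; the weights here are illustrative placeholders, not values from the spec:

const WEIGHTS = {
  complexity_score: 0.35,
  expertise_score: 0.30,
  cost_efficiency: 0.20,
  latency_fit: 0.15,
};

function composite(s: Omit<ModelRoutingScore, "composite_score" | "model">): number {
  return (
    s.complexity_score * WEIGHTS.complexity_score +
    s.expertise_score * WEIGHTS.expertise_score +
    s.cost_efficiency * WEIGHTS.cost_efficiency +
    s.latency_fit * WEIGHTS.latency_fit
  );
}

function pickModel(candidates: ModelRoutingScore[]): ModelRoutingScore {
  return candidates.reduce((best, c) =>
    c.composite_score > best.composite_score ? c : best
  );
}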

Feedback Loop: Improving Over Time

Every execution is logged:

interface RoutingDecision {
  task_id: string;
  selected_model: string;
  routing_score: ModelRoutingScore;
  actual_latency_ms: number;
  actual_cost_tokens: number;
  result_quality: number;          // 0–100 (from verification step)
  created_at: number;
}

Periodically, the system analyzes this log:

For each (task_type, selected_model) pair:
  feedback = (quality - expected_quality) / expected_quality
  if feedback > 0.1:
    // This model did better than expected
    model_weights[model] *= (1 + 0.05)
  elif feedback < -0.1:
    // This model underperformed
    model_weights[model] *= (1 - 0.05)

Over time, the router learns which models work best for which tasks, and its routing decisions improve.
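
The same update rule in TypeScript; the ±10% dead band and 5% step come directly from the pseudocode above:

function updateWeight(
  weights: Map<string, number>,
  model: string,
  quality: number,
  expectedQuality: number
): void {
  const current = weights.get(model) ?? 1.0;
  const feedback = (quality - expectedQuality) / expectedQuality;
  if (feedback > 0.1) {
    weights.set(model, current * 1.05); // outperformed: boost
  } else if (feedback < -0.1) {
    weights.set(model, current * 0.95); // underperformed: penalize
  }
}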


7. A Multi-Agent Workflow: Traced

Let’s walk through a real 5-phase workflow to see how all these mechanisms work together.

Setup

Task: “Review this pull request and approve if it meets standards”

Workflow:

wf-2024-pr-review
├── Phase 1: audit (security review)
├── Phase 2: contract (compliance check)
├── Phase 3: execution (code quality review)
├── Phase 4: implementation (feature validation)
└── Phase 5: verification (final sign-off)

Phase 1: Audit (T+0s)

  1. Task arrives → task_create called → task ID allocated: task-pr-001
  2. β (Task Pipeline) creates workflow record: wf-2024-pr-review
  3. Phase 1 pool (RESEARCH agents) is allocated: min_size=1, max_size=2, strategy=PRIORITY_QUEUE
  4. Agent spawning: ε (Skill Registry) calls gsd_agent_spawn:
    Returns agent ID: agent-research-1001
    
  5. Agent startup:
    • State: PENDING → INITIALIZING
    • Load skill: ‘audit-proof’
    • Verify permissions: can call thought_record, audit_verify, memory_pack
    • Set up worktree: git checkout -b audit/pr-001
    • State: INITIALIZING → READY → BUSY
  6. Skill execution: audit-proof workflow runs:
    step 1: thought_record(task_id, description)
    step 2: audit_verify(pr_diff, security_checks)
    step 3: memory_pack(findings)
    
  7. Checkpointing: At phase boundary, write checkpoint:
    INSERT INTO workflow_checkpoints
    VALUES ('wf-2024-pr-review', 1, { audit_findings: {...} })
    
  8. Writeback contract:
    • Agent calls: task_update(status=done, progress=100)
    • Agent calls: thought_record(task_id, branch='audit/pr-001', commit_sha='abc123', tests_run=5, tests_passed=5, blockers=[])
    • Agent calls: memory_pack(…)
  9. Phase 1 completes (T+30s) → agent state: BUSY → READY

Phase 2: Contract (T+30s)

  1. β checks phase 1 result: status = COMPLETED ✓
  2. Phase 2 pool (ROADMAP agents) is allocated: min_size=1, max_size=1, strategy=FIFO
  3. Agent spawning: ε calls gsd_agent_spawn:
    Returns agent ID: agent-roadmap-2001
    
  4. Agent startup: State flow PENDING → INITIALIZING → READY → BUSY
  5. Skill execution: roadmap-progress workflow runs with phase 1 results as input
  6. Checkpointing: checkpoint(wf-2024-pr-review, 2, {…})
  7. Writeback contract fulfilled
  8. Phase 2 completes (T+50s)

Phase 3: Execution (T+50s)

Same pattern — agent-planning-3001 executes gsd-execution skill, produces results, writes checkpoint.

Phase 4: Implementation (T+70s)

agent-impl-4001 executes the code-review skill. This is the longest phase (20 seconds).

Result: approval_status = “approved”, quality_score = 95.

Phase 5: Verification (T+90s)

agent-reviewer-5001 executes the verification skill:

step 1: thought_record(summary of all phases)
step 2: audit_verify(merkle proof of entire workflow)
step 3: memory_pack(complete execution trace)
step 4: thought_record(final sign-off)

Writeback contract:

task_update(status=done, progress=100, summary="PR approved: all phases passed")
thought_record(
  task_id=task-pr-001,
  branch=feature/pr-review,
  commit_sha=def456,
  tests_run=50,
  tests_passed=50,
  blockers=[]
)

Workflow Complete (T+100s)

UPDATE gsd_workflows SET status='COMPLETED', result_hash='...' WHERE id='wf-2024-pr-review';
INSERT INTO mcp_merkle (workflow_id, hash) VALUES ('wf-2024-pr-review', '...');

Summary:

  • 5 agents, 5 phases, 100 seconds total
  • Each agent executed a skill independently
  • Each phase’s output became the next phase’s input
  • Every agent produced task_update + thought_record
  • All results are auditable via the Merkle tree
  • If phase 4 had failed, we’d restart from phase 4 (checkpoint at phase 3), not from phase 1

8. Scale Limits in the Design

Single-Writer SQLite: The Bottleneck

Colibri uses SQLite with single-writer access. This means:

  • One process owns the database file at any given time
  • Read concurrency is possible (WAL mode)
  • Write operations are serialized by a tool-level lock (tool-lock middleware)

When does this matter?

  • Up to ~100 concurrent tasks: SQLite is fine. Write latency is ~1–5ms per operation.
  • 100–1000 concurrent tasks: SQLite becomes a bottleneck. Write latency climbs to 10–50ms.
  • 1000+ concurrent tasks: Effectively infeasible. WAL contention, checkpoint blocking, memory pressure.

Phase 0 scope: 50–100 concurrent tasks. Single-writer SQLite is sufficient.

Phase 3 scope (P3 includes θ Consensus): Multi-node P2P network, each node with its own SQLite instance. Consensus mechanisms coordinate state across nodes, eliminating the single-writer bottleneck.

Why Horizontal Scaling Is Not in Phase 0

To scale beyond single-writer SQLite, you would need:

  1. Distributed consensus — Multiple nodes agree on state without a central authority
  2. Eventual consistency — Nodes may temporarily diverge, then converge
  3. Conflict resolution — If two nodes produce different results, a tiebreaker decides

All of this is specified in θ (Consensus) but not implemented in Phase 0. It requires:

  • VRF randomness for fairness
  • Byzantine fault tolerance for security
  • Merkle proofs for audit
  • Reputation stakes for incentive alignment

Colibri Phase 0 is therefore a single-node system; Phase 3 (P3.0–P3.4) will implement θ and enable true multi-node operation.

What Scales Anyway

Despite the single-writer limit, these aspects scale:

  1. Skill reuse — 22 skills × any number of agents
  2. Phase decomposition — tasks broken into 5 phases, each parallelizable within constraints
  3. Pool strategies — auto-scaling adjusts agent count based on queue depth
  4. Model routing — work distributed across 8 AI models, not just one
  5. VRF sampling — 95% of audit work deferred, only 5% done immediately
  6. Stake multiplier — reputation → larger token buckets → more throughput for trusted callers

A single phase with 10 agents can keep roughly 10 tasks in flight before hitting SQLite contention. The serialized write still happens; the agents simply work in parallel on different tasks while their writes are queued.


Summary Table: Scaling Mechanisms

| Mechanism | What It Is | Enforced By | Scales What | Scale Limit |
|---|---|---|---|---|
| Skill | Reusable tool-call sequence | Convention (SKILL.md files) | Agent capability reuse, knowledge accumulation | 22 skills × infinite agents |
| Phase | Sequential workflow stage | β state machine | Dependency ordering, checkpoint safety | 5 phases per workflow (design allows N) |
| Writeback Contract | Agent output guarantee (task_update + thought_record) | Convention (orphan flagging) | Workflow verification, result durability | Every agent must fulfill it |
| Agent Pool | Group of agents handling one phase | Pool strategy configuration | Distribution across agents, load balancing | min_size to max_size |
| Auto-scaling | Dynamic pool size adjustment | Throughput metrics (queue depth, latency) | Resource utilization | Queue depth > load triggers spawn |
| CV Registry | Agent capability profile | Task router (β) | Task-to-agent matching, capability visibility | One CV per agent type |
| Token Bucket | Rate limiting per event type | κ admission layer | Throughput fairness, spam prevention | Capacity = BASE × (reputation / MAX) |
| Reputation | Caller credibility score | History + behavior | Token bucket multiplier, stake multiplier | 0–10,000 basis points |
| VRF Audit | 5% random verification sampling | κ rule engine | Audit cost reduction (95% quick admit) | 5% of events audited deeply |
| Stake Freeze | Lock caller’s tokens at admission | κ rule engine | High-stakes task guarantee | Caller must have stake ≥ required |
| Model Router | Intent-driven AI selection | δ (Intelligence layer) | AI model diversity, cost optimization | 8 model candidates |
| Routing Feedback | Learning loop over routing decisions | δ feedback mechanism | Routing accuracy improvement | model_weights updated per feedback |

Why This Design Scales Correctly

  1. Skills reduce cognitive load — agents don’t reinvent the wheel; they reuse proven workflows
  2. Phases enforce dependencies — later work doesn’t start until earlier work is verified
  3. Contracts guarantee auditability — every agent must report results, or they’re flagged
  4. Pools distribute load — multiple agents can work in parallel within a phase
  5. Auto-scaling prevents exhaustion — pool size adjusts to queue depth, not fixed
  6. Token buckets prevent spam — high-reputation callers get higher throughput
  7. VRF sampling is efficient — 95% of events admit fast; 5% audit deep
  8. Model routing reduces cost — task-appropriate models minimize token spend
  9. Checkpointing ensures restart-safety — server crashes don’t lose progress

The system scales not by removing constraints, but by distributing work smartly within constraints.


See Also

[[concepts/index Concept Index]] · [[concepts/β-task-pipeline β Task Pipeline]] · [[concepts/ε-skill-registry ε Skill Registry]] · [[concepts/κ-rule-engine κ Rule Engine]] · [[concepts/δ-model-router δ Model Router]] · [[architecture/data-model Data Model]] · [[spec/s15-gsd-contract S15 GSD Contract]]
