How Does the System Scale Behavior Through Agents and Contracts?
⚠ Phase 0 reality stamp (R74.5). Phase 0 scales through task decomposition + Task-tool sub-agents, not via an MCP-level agent registry. Sub-agents are spawned from a parent Claude session using the Task tool into isolated `.worktrees/claude/<task-slug>` feature worktrees; each sub-agent runs the 5-step chain (audit → contract → packet → implement → verify) and must write back via `task_update { progress: 100 }` + `task_transition { to: "DONE" }` + `thought_record { thought_type: "reflection" }` before `merkle_finalize`. The donor `agent_spawn`, `agent_status`, `agent_list` tools and the entire `src/domains/agents/` target are deferred to Phase 1.5 per ADR-005. Phase 0 is single-writer SQLite — one node per deployment. Any reference below to an agent pool, pool strategies, `agent_spawn`, or `skill_get` hot-reload describes the donor AMS runtime, not Colibri Phase 0. Canonical values live in `colibri-system.md` §2.
The Fundamental Scaling Challenge
Colibri is a single-process, single-writer MCP server. A single process cannot execute multiple complex tasks in parallel without distributing work somewhere. But distributing work — spinning up sub-processes, delegating to external agents, managing their lifecycles — creates new problems:
- Work loss: If an agent crashes, what happens to its results? How do we know it was working?
- Progress loss: If we restart the server, how do we resume a half-finished workflow without losing state?
- Resource exhaustion: If we spawn too many agents at once, memory explodes or task latency tanks.
- Verification: How do we trust that a sub-agent did the work correctly and reported honestly?
Colibri solves these problems through three mechanisms:
- Skills — the unit of reusable capability
- Phases — the unit of sequential work
- Contracts — the unit of mutual obligation between agent and workflow
Together, these mechanisms transform Colibri from a single-task executor into a multi-agent task orchestrator that scales correctly.
1. The Unit of Scale: The Skill
A skill is a reusable, versioned tool-call sequence defined in a SKILL.md file. It is the atomic unit of capability.
Why Skills Matter
Without skills, scaling fails:
- You’d need to write the same tool-call sequence repeatedly in different agents
- Testing and verification would explode in complexity
- Agents couldn’t be interchangeable — each would have its own unique instruction set
- Knowledge couldn’t accumulate — every new agent reinvents the wheel
With skills:
- A sequence like “check code quality → run tests → lint → report results” is defined once
- Any agent (research, planning, implementation) can execute it deterministically
- The skill’s verification step is identical across all executions
- Skills compose — a skill can call other skills
Skill Structure
Each skill lives in its own directory:
.agents/skills/
├── colibri-audit-proof/
│ ├── SKILL.md — skill definition
│ └── references/ — supporting docs (tool lists, templates)
├── colibri-gsd-execution/
│ ├── SKILL.md
│ └── references/
└── ... (22 total skills)
A SKILL.md defines:
---
skill_name: audit-proof
triggers:
- task.type == "audit"
- task.complexity > "medium"
required_tools:
- thought_record
- audit_verify
- memory_pack
workflow:
- step 1: call thought_record(...)
- step 2: call audit_verify(...)
- step 3: call memory_pack(...)
verification:
- proof_chain_valid == true
- all_hashes_match == true
---
The 22 Skills (6 Tiers)
| Tier | Count | Skills | Purpose |
|---|---|---|---|
| PM & Orchestration | 2 | project-manager, tier1-chains | Coordinate workflows, hand off between phases |
| Task & Roadmap | 2 | task-management, roadmap-progress | CRUD on tasks, track milestones |
| GSD & Execution | 2 | gsd-execution, autonomous | Run workflows, execute phases |
| Audit & Proof | 3 | audit-proof, memory-context, verification | Build audit trails, generate proofs |
| Infrastructure | 2 | mcp-server, observability | Server ops, monitoring |
| Integration | 1 | obsidian-integration | External system sync |
Note: This is the target design. Phase 0 has not yet begun; zero TypeScript code exists.
Why This Tier Structure Works
- Lower tiers (PM, Task, GSD) are high-frequency, high-visibility — executed most often
- Middle tiers (Audit, Infrastructure) are support layers — called by lower tiers or on demand
- Upper tiers (Integration) are specialized — called rarely, only when external sync is needed
This hierarchy ensures that the most-tested, most-stable skills run the core loop, while specialized skills are isolated and can be upgraded independently.
2. The Unit of Work: The Phase
A phase is a sequential stage in a multi-phase workflow. Phase N+1 does NOT start until phase N completes.
Why Phases Matter
Without phases, scaling collapses:
- All agents would run in parallel, and you’d have no way to enforce dependencies
- Earlier results wouldn’t be available to later tasks
- Rollback would be impossible — later tasks might depend on earlier results
- Testing each stage independently would be impossible
With phases:
- Each phase has a clear input (results from phase N-1) and output (results for phase N+1)
- You can verify each phase independently
- If phase 3 fails, you can restart from phase 3 without re-running phases 1 and 2
- Each phase can use a different agent type, optimized for that kind of work
The 5-Phase Workflow
A typical multi-agent workflow in Colibri looks like:
Workflow ID: wf-2024-001
├── Phase 1: audit (research agent)
│ Input: task description, dependencies
│ Output: audit report, risk assessment
│ Agent type: RESEARCH
│
├── Phase 2: contract (roadmap agent)
│ Input: audit report
│ Output: execution plan, phase breakdown
│ Agent type: ROADMAP
│
├── Phase 3: execution packet (planning agent)
│ Input: execution plan
│ Output: task assignments, resource budget
│ Agent type: PLANNING
│
├── Phase 4: implementation (coder agent)
│ Input: task assignments
│ Output: code, tests, documentation
│ Agent type: IMPLEMENTATION
│
└── Phase 5: verification (reviewer agent)
Input: code, tests, documentation
Output: verification report, Merkle proof
Agent type: REVIEWER
Each phase is represented in the database:
interface WorkflowPhase {
id: string; // 'phase-5-001'
workflow_id: string; // 'wf-2024-001'
phase_num: number; // 1, 2, 3, 4, 5
agent_type: "RESEARCH" | "ROADMAP" | "PLANNING" | "IMPLEMENTATION" | "REVIEWER";
status: "PENDING" | "RUNNING" | "COMPLETED" | "FAILED";
result: {
output_hash: string; // SHA256 of phase results
intermediate_results: Record<string, any>;
elapsed_time_ms: number;
};
created_at: number; // epoch ms
completed_at?: number; // epoch ms (set when status changes to COMPLETED)
}
Why Phases Are Sequential
The design enforces: phase N+1 starts only after phase N reaches a terminal state (COMPLETED or FAILED), and it actually runs only if phase N COMPLETED; a FAILED phase halts the workflow.
This is not a limitation — it’s a feature. Here’s why:
- Determinism: If phases run in parallel, different interleavings could produce different results. By running sequentially, you get one canonical ordering.
- Dependency clarity: Phase 4 (implementation) depends on phase 3 (execution packet). If phase 3 fails, phase 4 doesn’t start — no wasted work.
- Rollback safety: If phase 5 (verification) fails, you can mark the workflow as failed and restart from phase 1. The database is clean.
- Testing isolation: You can stub out phases 2–5 and test phase 1 in isolation. Then stub out phase 1 and test phase 2, etc.
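The sequential rule can be sketched as a small runner (illustrative names, not from the Colibri codebase): each phase's output becomes the next phase's input, and a FAILED phase short-circuits everything after it.

```typescript
// Illustrative sequential phase runner. Phase N+1 runs only if phase N
// returned COMPLETED; a FAILED phase stops the workflow immediately.
type PhaseStatus = "COMPLETED" | "FAILED";

interface PhaseResult {
  phaseNum: number;
  status: PhaseStatus;
  output: unknown;
}

type PhaseFn = (input: unknown) => PhaseResult;

function runSequential(phases: PhaseFn[], initialInput: unknown): PhaseResult[] {
  const results: PhaseResult[] = [];
  let input = initialInput;
  for (const phase of phases) {
    const result = phase(input);
    results.push(result);
    if (result.status === "FAILED") break; // later phases never start
    input = result.output; // phase N output feeds phase N+1
  }
  return results;
}
```

Because the loop breaks on FAILED, no work is wasted downstream of a failure, which is exactly the dependency-clarity property described above.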
Checkpointing: Restart-Safety
At each phase boundary, Colibri writes a checkpoint to the database:
INSERT INTO workflow_checkpoints (workflow_id, phase_num, checkpoint_data)
VALUES (
'wf-2024-001',
3,
JSON_OBJECT(
'phase_results', <JSON>,
'elapsed_time_ms', 180000,
'agent_id', 'agent-planning-001',
'memory_pack', <compressed memory>
)
);
If the server crashes at any point, recovery looks up the latest checkpoint and resumes from there. No work is lost; no phase runs twice.
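The resume step can be sketched as follows, assuming rows shaped like the `workflow_checkpoints` insert above (the helper itself is hypothetical): recovery finds the highest checkpointed phase and resumes at the next one.

```typescript
// Illustrative resume logic: completed phases never re-run because recovery
// starts at the phase AFTER the latest checkpoint.
interface Checkpoint {
  workflow_id: string;
  phase_num: number;
  checkpoint_data: Record<string, unknown>;
}

function resumePhase(checkpoints: Checkpoint[], workflowId: string): number {
  const latest = checkpoints
    .filter((c) => c.workflow_id === workflowId)
    .reduce((max, c) => Math.max(max, c.phase_num), 0);
  return latest + 1; // no checkpoint yet → start at phase 1
}
```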
3. The Contract: Writeback
The writeback contract is the binding agreement between a parent workflow and its child agent:
“Before you terminate, you must produce task_update + thought_record. If you don’t, you will be flagged as orphaned and the workflow will escalate.”
The 3-Item Writeback Contract
Every agent MUST produce these three outputs before termination:
- `task_update` — status (done/failed/blocked), progress (0–100%), summary
- `thought_record` — task_id, branch name, commit SHA, tests run, blockers
- `memory_pack` (optional but recommended) — compress working memory into long-term store
Why This Contract Exists
Without this contract, you have no way to know if an agent:
- Actually completed its work or just exited gracefully
- Produced results or failed silently
- Left the worktree in a valid state or crashed mid-operation
With the contract, you have a proof that the agent did real work:
- `task_update` proves the agent knows the outcome
- `thought_record` proves the agent wrote its reasoning to the audit trail
- The combination lets you verify every intermediate step
Data Shapes
// task_update (MCP tool call)
interface TaskUpdateParams {
task_id: string;
status: "done" | "failed" | "blocked";
progress: number; // 0–100
summary: string; // one-line summary
}
// thought_record (MCP tool call)
interface ThoughtRecordParams {
task_id: string;
branch: string; // git branch name
commit_sha: string; // git commit SHA
tests_run: number; // count of tests
tests_passed: number; // count passing
blockers: string[]; // array of blocking issues
}
// memory_pack (MCP tool call)
interface MemoryPackParams {
memory_json: string; // compressed JSON of working memory
retention_level: "short_term" | "medium_term" | "long_term";
ttl_epochs?: number; // expiry (optional)
}
Enforcement: Convention vs. Hard Block
The writeback contract is enforced at convention level, not runtime. This means:
- Agents that skip writeback are not blocked by the system
- Instead, they are flagged as orphaned in a periodic scan
- Warnings are logged: `WARN [recovery] Agent #agent-impl-042 terminated without writeback`
- The parent workflow escalates: marks the phase as FAILED, triggers fallback
Why convention, not hard block?
Because the system cannot force an agent to call task_update if the agent’s process crashes, network connection dies, or the code is buggy. A hard block would only cause deadlocks, not prevent orphaned agents. Flagging + escalation is more honest: “We detected you didn’t report back; we’re treating this as failure.”
Orphan Detection Recovery
Every 60 seconds, a recovery process:
- Scans the `agents` table for agents in state BUSY whose last heartbeat was > 5 minutes ago
- Checks the `mcp_thought` table for recent thought_record entries from those agents
- If no recent thought_record: marks the agent as FAILED, logs a warning, triggers workflow escalation
-- Recovery scan pseudocode
SELECT a.id, a.task_id, a.last_heartbeat
FROM agents a
WHERE a.state = 'BUSY'
AND a.last_heartbeat < (NOW() - INTERVAL '5 minutes')
AND NOT EXISTS (
SELECT 1 FROM mcp_thought
WHERE agent_id = a.id
AND created_at > a.last_heartbeat
);
4. The Pool: Agent Distribution
An agent pool is a group of agents assigned to handle a particular phase. The pool distributes incoming work according to a configurable strategy.
The 5 Pool Strategies
| Strategy | When to use | Example |
|---|---|---|
| FIFO | Low variance, predictable load | Research agents reading documents in order |
| PRIORITY_QUEUE | Mixed priorities (urgent vs backlog) | Implementation queue with hot fixes first |
| ROUND_ROBIN | Load balancing across identical agents | Multiple code-review agents |
| LEAST_LOADED | Minimize agent idle time | Verification agents, each with different speeds |
| CAPACITY_AWARE | Agents have different capability levels | Mix of senior and junior planners |
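Two of these strategies are simple enough to sketch directly (hypothetical helper names; the donor AMS implementation may differ):

```typescript
// Illustrative dispatch for ROUND_ROBIN and LEAST_LOADED strategies.
interface PoolAgent {
  id: string;
  activeTasks: number;
}

// ROUND_ROBIN: cycle through identical agents using a monotonic counter.
function pickRoundRobin(agents: PoolAgent[], counter: number): PoolAgent {
  return agents[counter % agents.length];
}

// LEAST_LOADED: pick the agent with the fewest active tasks.
function pickLeastLoaded(agents: PoolAgent[]): PoolAgent {
  return agents.reduce((best, a) => (a.activeTasks < best.activeTasks ? a : best));
}
```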
Pool Configuration
Each phase specifies:
interface AgentPoolConfig {
workflow_id: string;
phase_num: number;
strategy: "FIFO" | "PRIORITY_QUEUE" | "ROUND_ROBIN" | "LEAST_LOADED" | "CAPACITY_AWARE";
min_size: number; // minimum agents to keep alive
max_size: number; // maximum agents to spawn
agent_type: "RESEARCH" | "ROADMAP" | "PLANNING" | "IMPLEMENTATION" | "REVIEWER";
}
Auto-Scaling
The pool size adjusts dynamically based on:
interface AutoScalingMetrics {
queue_depth: number; // tasks waiting
current_load: number; // active tasks / pool_size
throughput: number; // tasks/second (last 60s)
p99_latency: number; // 99th percentile latency
}
// Scaling decision:
// if queue_depth > (current_load * 1.5):
// new_size = min(max_size, current_size + 1)
// elif queue_depth < (current_load * 0.5) && current_size > min_size:
// new_size = max(min_size, current_size - 1)
If tasks are backing up in the queue, spawn more agents. If agents are idle, shrink the pool.
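The scaling rule in the comment above can be written as a runnable function (the 1.5 / 0.5 thresholds come from the pseudocode; the function name is an illustration):

```typescript
// Illustrative auto-scaling decision, one step at a time, clamped to
// [minSize, maxSize] per the pool configuration.
interface ScaleInput {
  queueDepth: number;   // tasks waiting
  currentLoad: number;  // active tasks / pool size
  currentSize: number;
  minSize: number;
  maxSize: number;
}

function nextPoolSize(m: ScaleInput): number {
  if (m.queueDepth > m.currentLoad * 1.5) {
    return Math.min(m.maxSize, m.currentSize + 1); // backlog → grow
  }
  if (m.queueDepth < m.currentLoad * 0.5 && m.currentSize > m.minSize) {
    return Math.max(m.minSize, m.currentSize - 1); // idle → shrink
  }
  return m.currentSize; // steady state
}
```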
CV: Capability Profile
Each agent has a CV (capability profile) that describes what it can do:
interface AgentCV {
id: string; // 'agent-impl-042'
type: "RESEARCH" | "ROADMAP" | "PLANNING" | "IMPLEMENTATION" | "REVIEWER";
skills: string[]; // ['gsd-execution', 'code-review']
permissions: string[]; // which tools it can call
limits: {
max_concurrent_tasks: number; // usually 1
token_budget: number; // per-task token limit
timeout_seconds: number; // max execution time
};
history: {
success_rate: number; // 0.0–1.0
avg_duration_ms: number;
total_tasks_completed: number;
};
}
The task router (β) uses the CV to decide: “Which agent should handle this task?” An urgent_important code review goes to the highest-success-rate REVIEWER agent.
The 6 Agent Lifecycle States
PENDING → INITIALIZING → READY → BUSY → TERMINATED
   ↓            ↓           ↓       ↓
FAILED       FAILED      FAILED  FAILED
| State | Meaning | Transitions |
|---|---|---|
| PENDING | Agent ID allocated; process not yet started | → INITIALIZING (if spawned) or → TERMINATED (if cancelled) |
| INITIALIZING | Loading skills, setting up worktree, verifying permissions | → READY (on success) or → FAILED (if setup error) |
| READY | Idle, waiting for assignment | → BUSY (task assigned) or → TERMINATED (if pool shrinks) |
| BUSY | Executing a task | → READY (on completion) or → FAILED (if error) |
| FAILED | Error state; may be retried or escalated | → PENDING (if retry) or → TERMINATED (if max retries exceeded) |
| TERMINATED | Execution complete; resources released; agent ID recycled | (final state) |
Transitions are atomic and logged:
INSERT INTO agent_state_transitions (agent_id, from_state, to_state, reason, timestamp)
VALUES ('agent-impl-042', 'BUSY', 'READY', 'task_completed', NOW());
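A guard like the following could validate a transition before it is logged; this is an illustration that follows the transition table above, not code from the spec:

```typescript
// Illustrative transition guard derived from the lifecycle table.
type AgentState =
  | "PENDING" | "INITIALIZING" | "READY" | "BUSY" | "FAILED" | "TERMINATED";

const ALLOWED: Record<AgentState, AgentState[]> = {
  PENDING: ["INITIALIZING", "TERMINATED"],
  INITIALIZING: ["READY", "FAILED"],
  READY: ["BUSY", "TERMINATED"],
  BUSY: ["READY", "FAILED"],
  FAILED: ["PENDING", "TERMINATED"],
  TERMINATED: [], // final state: no outgoing transitions
};

function canTransition(from: AgentState, to: AgentState): boolean {
  return ALLOWED[from].includes(to);
}
```

Rejecting an illegal transition before the `INSERT` keeps the `agent_state_transitions` log consistent with the state machine.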
5. The Admission Gate: Rate Limiting at Scale
Before a task can enter the system, it must pass through the admission layer implemented by κ (Rule Engine). This is where Colibri limits throughput and prevents resource exhaustion.
Token Bucket Per Event Type
Each event type (task_create, task_update, thought_record, etc.) gets a token bucket:
interface TokenBucket {
event_type: string; // 'task_create', 'thought_record', etc.
capacity: number; // max tokens
refill_rate: number; // tokens per second
current_tokens: number; // tokens available now
}
// Example: task_create bucket
// capacity: 100
// refill_rate: 5 per second
// If you call task_create 100 times instantly, the 101st call is rate-limited
// But after 20 seconds, you have 100 tokens again
This prevents a single caller from monopolizing the system.
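A minimal token bucket matching the example numbers (capacity 100, refill 5/s) can be sketched as follows; this is illustrative, not the κ admission code:

```typescript
// Illustrative token bucket with lazy refill: tokens accrue based on elapsed
// time at each call, capped at capacity.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,
    private refillRate: number, // tokens per second
    nowMs: number,
  ) {
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  tryConsume(nowMs: number): boolean {
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // admitted
    }
    return false; // rate-limited
  }
}
```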
Reputation as Backpressure
Callers with higher reputation get higher token bucket capacity:
// Pseudocode
if (caller_reputation < MIN_REPUTATION_FOR_ACTION) {
// Reject: not enough reputation
throw new Error("Insufficient reputation to create tasks");
}
// Allow, but use their reputation as the token bucket multiplier
const bucket_capacity = BASE_CAPACITY * (caller_reputation / MAX_REPUTATION);
A brand-new user (reputation = 0) gets a tiny bucket. A trusted system (reputation = 10,000) gets a large bucket. This is natural backpressure: the system trusts high-reputation callers and limits untrusted ones.
VRF Audit: 5% Sampling
Not every event is verified in full. That would be too expensive. Instead:
// On event admission:
const vrf_score = compute_vrf(event_id, epoch);
if (vrf_score % 100 < 5) { // 5% chance
// Full verification: check all constraints, hashes, signatures
audit_verify(event);
} else {
// Quick check only: reputation + token bucket
admit_quick(event);
}
This ensures that:
- 95% of events are admitted quickly
- 5% are audited deeply
- An attacker cannot predict which events will be audited (VRF is unpredictable)
- Statistically, any sustained attack will be caught
Stake Freeze at Admission
When a high-stakes task is admitted, a portion of the caller’s stake is frozen:
// On task_create with stake_required = 1000
const caller_stake = get_stake(caller);
if (caller_stake.available < 1000) {
throw new Error("Insufficient stake");
}
// Freeze the stake
freeze_stake(caller, 1000);
// stake.available -= 1000
// stake.frozen += 1000
// On task_done, release the stake
release_stake(caller, 1000);
// stake.frozen -= 1000
// stake.available += 1000
This ensures the caller has “skin in the game” — they lose real resources if they abuse the system.
6. Intelligence Scaling: The Model Router (δ)
Colibri is not a single AI model. It is a router that distributes tasks across 8 AI model candidates and selects the best fit for each job.
The 8 Model Candidates
- Claude 3.5 Sonnet (best general reasoning)
- Claude 3 Opus (complex reasoning, longer context)
- Claude 3 Haiku (fast, cheap, limited tasks)
- GPT-4 Turbo (alternative for vendor lock-in mitigation)
- GPT-4o (vision tasks)
- Gemini Pro (cost optimization)
- Llama 2 (on-prem, compliance)
- Mixtral (specialized domains)
Intent-Driven Scoring
When a task arrives, the model router scores each candidate on:
| Dimension | Example | Scoring |
|---|---|---|
| Task complexity | “Summarize this doc” vs “Prove this theorem” | Low complexity → cheap model; high → expensive model |
| Domain expertise | “code review” vs “legal analysis” | Legal tasks → GPT-4 Turbo (better training); code → Claude |
| Token budget | Budget = 10K tokens total | Models with lower cost-per-token win |
| Latency tolerance | “Return results in 5 seconds” vs “return in 1 hour” | Tight deadline → fast model; loose → slower model |
interface ModelRoutingScore {
model: string;
complexity_score: number; // 0–100
expertise_score: number; // 0–100
cost_efficiency: number; // 0–100
latency_fit: number; // 0–100
composite_score: number; // weighted average
}
The router selects the model with the highest composite_score.
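The selection step might look like this sketch; equal weights across the four dimensions are an assumption here, since the text only says "weighted average":

```typescript
// Illustrative argmax over candidate scores. Weighting is assumed uniform.
interface ModelRoutingScore {
  model: string;
  complexity_score: number;  // 0–100
  expertise_score: number;   // 0–100
  cost_efficiency: number;   // 0–100
  latency_fit: number;       // 0–100
}

function selectModel(scores: ModelRoutingScore[]): string {
  const composite = (s: ModelRoutingScore) =>
    (s.complexity_score + s.expertise_score + s.cost_efficiency + s.latency_fit) / 4;
  return scores.reduce((best, s) => (composite(s) > composite(best) ? s : best)).model;
}
```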
Feedback Loop: Improving Over Time
Every execution is logged:
interface RoutingDecision {
task_id: string;
selected_model: string;
routing_score: ModelRoutingScore;
actual_latency_ms: number;
actual_cost_tokens: number;
result_quality: number; // 0–100 (from verification step)
created_at: number;
}
Periodically, the system analyzes this log:
For each (task_type, selected_model) pair:
feedback = (quality - expected_quality) / expected_quality
if feedback > 0.1:
// This model did better than expected
model_weights[model] *= (1 + 0.05)
elif feedback < -0.1:
// This model underperformed
model_weights[model] *= (1 - 0.05)
Over time, the router learns which models work best for which tasks, and its routing decisions improve.
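The weight update above can be written as a runnable function (±10% dead band and ±5% step, per the pseudocode; the function name is an illustration):

```typescript
// Illustrative multiplicative weight update from observed quality.
function updateWeight(weight: number, quality: number, expectedQuality: number): number {
  const feedback = (quality - expectedQuality) / expectedQuality;
  if (feedback > 0.1) return weight * 1.05;  // outperformed expectations
  if (feedback < -0.1) return weight * 0.95; // underperformed
  return weight; // within tolerance: no change
}
```

Multiplicative updates keep adjustments proportional, so a model's weight drifts gradually rather than swinging on a single noisy result.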
7. A Multi-Agent Workflow: Traced
Let’s walk through a real 5-phase workflow to see how all these mechanisms work together.
Setup
Task: “Review this pull request and approve if it meets standards”
Workflow:
wf-2024-pr-review
├── Phase 1: audit (security review)
├── Phase 2: contract (compliance check)
├── Phase 3: execution (code quality review)
├── Phase 4: implementation (feature validation)
└── Phase 5: verification (final sign-off)
Phase 1: Audit (T+0s)
- Task arrives → task_create called → task ID allocated: task-pr-001
- β (Task Pipeline) creates workflow record: wf-2024-pr-review
- Phase 1 pool (RESEARCH agents) is allocated: min_size=1, max_size=2, strategy=PRIORITY_QUEUE
- Agent spawning: ε (Skill Registry) calls `gsd_agent_spawn`; returns agent ID: agent-research-1001
- Agent startup:
  - State: PENDING → INITIALIZING
  - Load skill: 'audit-proof'
  - Verify permissions: can call thought_record, audit_verify, memory_pack
  - Set up worktree: `git checkout -b audit/pr-001`
  - State: INITIALIZING → READY → BUSY
- Skill execution: audit-proof workflow runs:
  - step 1: thought_record(task_id, description)
  - step 2: audit_verify(pr_diff, security_checks)
  - step 3: memory_pack(findings)
- Checkpointing: at the phase boundary, write a checkpoint:
  `INSERT INTO workflow_checkpoints VALUES ('wf-2024-pr-review', 1, { audit_findings: {...} })`
- Writeback contract:
  - Agent calls: task_update(status=done, progress=100)
  - Agent calls: thought_record(task_id, branch='audit/pr-001', commit_sha='abc123', tests_run=5, tests_passed=5, blockers=[])
  - Agent calls: memory_pack(…)
- Phase 1 completes (T+30s) → agent state: BUSY → READY
Phase 2: Contract (T+30s)
- β checks phase 1 result: status = COMPLETED ✓
- Phase 2 pool (ROADMAP agents) is allocated: min_size=1, max_size=1, strategy=FIFO
- Agent spawning: ε calls `gsd_agent_spawn`; returns agent ID: agent-roadmap-2001
- Agent startup: state flow PENDING → INITIALIZING → READY → BUSY
- Skill execution: roadmap-progress workflow runs with phase 1 results as input
- Checkpointing: checkpoint(wf-2024-pr-review, 2, {…})
- Writeback contract fulfilled
- Phase 2 completes (T+50s)
Phase 3: Execution (T+50s)
Same pattern — agent-planning-3001 executes gsd-execution skill, produces results, writes checkpoint.
Phase 4: Implementation (T+70s)
agent-impl-4001 executes the code-review skill. This is the longest phase (20 seconds).
Result: approval_status = “approved”, quality_score = 95.
Phase 5: Verification (T+90s)
agent-reviewer-5001 executes the verification skill:
step 1: thought_record(summary of all phases)
step 2: audit_verify(merkle proof of entire workflow)
step 3: memory_pack(complete execution trace)
step 4: thought_record(final sign-off)
Writeback contract:
task_update(status=done, progress=100, summary="PR approved: all phases passed")
thought_record(
task_id=task-pr-001,
branch=feature/pr-review,
commit_sha=def456,
tests_run=50,
tests_passed=50,
blockers=[]
)
Workflow Complete (T+100s)
UPDATE gsd_workflows SET status='COMPLETED', result_hash='...' WHERE id='wf-2024-pr-review';
INSERT INTO mcp_merkle (workflow_id, hash) VALUES ('wf-2024-pr-review', '...');
Summary:
- 5 agents, 5 phases, 100 seconds total
- Each agent executed a skill independently
- Each phase’s output became the next phase’s input
- Every agent produced task_update + thought_record
- All results are auditable via the Merkle tree
- If phase 4 had failed, we’d restart from phase 4 (checkpoint at phase 3), not from phase 1
8. Scale Limits in the Design
Single-Writer SQLite: The Bottleneck
Colibri uses SQLite with single-writer access. This means:
- One process owns the database file at any given time
- Read concurrency is possible (WAL mode)
- Write operations are serialized by a tool-level lock (`tool-lock` middleware)
When does this matter?
- Up to ~100 concurrent tasks: SQLite is fine. Write latency is ~1–5ms per operation.
- 100–1000 concurrent tasks: SQLite becomes a bottleneck. Write latency climbs to 10–50ms.
- 1000+ concurrent tasks: Effectively infeasible. WAL contention, checkpoint blocking, memory pressure.
Phase 0 scope: 50–100 concurrent tasks. Single-writer SQLite is sufficient.
Phase 3 scope (P3 includes θ Consensus): Multi-node P2P network, each node with its own SQLite instance. Consensus mechanisms coordinate state across nodes, eliminating the single-writer bottleneck.
Why Horizontal Scaling Is Not in Phase 0
To scale beyond single-writer SQLite, you would need:
- Distributed consensus — Multiple nodes agree on state without a central authority
- Eventual consistency — Nodes may temporarily diverge, then converge
- Conflict resolution — If two nodes produce different results, a tiebreaker decides
All of this is specified in θ (Consensus) but not implemented in Phase 0. It requires:
- VRF randomness for fairness
- Byzantine fault tolerance for security
- Merkle proofs for audit
- Reputation stakes for incentive alignment
θ is not implemented in Phase 0. Colibri Phase 0 is a single-node system. Phase 3 (P3.0–P3.4) will implement θ and enable true multi-node operation.
What Scales Anyway
Despite the single-writer limit, these aspects scale:
- Skill reuse — 22 skills × any number of agents
- Phase decomposition — tasks broken into 5 phases, each parallelizable within constraints
- Pool strategies — auto-scaling adjusts agent count based on queue depth
- Model routing — work distributed across 8 AI models, not just one
- VRF sampling — 95% of audit work deferred, only 5% done immediately
- Stake multiplier — reputation → larger token buckets → more throughput for trusted callers
A single phase with 10 agents can saturate to ~10 tasks in flight before hitting SQLite contention. But the serialized write still happens — it’s just that the agents are working in parallel on different tasks while the writes are batched/queued.
Summary Table: Scaling Mechanisms
| Mechanism | What It Is | Enforced By | Scales What | Scale Limit |
|---|---|---|---|---|
| Skill | Reusable tool-call sequence | Convention (SKILL.md files) | Agent capability reuse, knowledge accumulation | 22 skills × infinite agents |
| Phase | Sequential workflow stage | β state machine | Dependency ordering, checkpoint safety | 5 phases per workflow (design allows N) |
| Writeback Contract | Agent output guarantee (task_update + thought_record) | Convention (orphan flagging) | Workflow verification, result durability | Every agent must fulfill it |
| Agent Pool | Group of agents handling one phase | Pool strategy configuration | Distribution across agents, load balancing | min_size to max_size |
| Auto-scaling | Dynamic pool size adjustment | Throughput metrics (queue depth, latency) | Resource utilization | Queue depth > load triggers spawn |
| CV Registry | Agent capability profile | Task router (β) | Task-to-agent matching, capability visibility | One CV per agent type |
| Token Bucket | Rate limiting per event type | κ admission layer | Throughput fairness, spam prevention | Capacity = BASE × (reputation / MAX) |
| Reputation | Caller credibility score | history + behavior | Token bucket multiplier, stake multiplier | 0–10,000 basis points |
| VRF Audit | 5% random verification sampling | κ rule engine | Audit cost reduction (95% quick admit) | 5% of events audited deeply |
| Stake Freeze | Lock caller’s tokens at admission | κ rule engine | High-stakes task guarantee | Caller must have stake ≥ required |
| Model Router | Intent-driven AI selection | δ (Intelligence layer) | AI model diversity, cost optimization | 8 model candidates |
| Routing Feedback | Learning loop over routing decisions | δ feedback mechanism | Routing accuracy improvement | model_weights updated per feedback |
Why This Design Scales Correctly
- Skills reduce cognitive load — agents don’t reinvent the wheel; they reuse proven workflows
- Phases enforce dependencies — later work doesn’t start until earlier work is verified
- Contracts guarantee auditability — every agent must report results, or they’re flagged
- Pools distribute load — multiple agents can work in parallel within a phase
- Auto-scaling prevents exhaustion — pool size adjusts to queue depth, not fixed
- Token buckets prevent spam — high-reputation callers get higher throughput
- VRF sampling is efficient — 95% of events admit fast; 5% audit deep
- Model routing reduces cost — task-appropriate models minimize token spend
- Checkpointing ensures restart-safety — server crashes don’t lose progress
The system scales not by removing constraints, but by distributing work smartly within constraints.
See Also
- β Task Pipeline — phase definitions and state machine
- ε Skill Registry — skill definitions and agent lifecycle
- κ Rule Engine — admission rules and deterministic evaluation
- δ Model Router — intent-driven AI model selection
- ζ Decision Trail — thought records and audit logging
- Database schema — agents, pools, workflows, phases tables
- S15 GSD Contract — formal phase state machine specification
- S16 Skill Taxonomy — skill classification and composition
- Glossary — definitions for Colibri-specific terms
Links
| [[concepts/index | Concept Index]] · [[concepts/β-task-pipeline | β Task Pipeline]] · [[concepts/ε-skill-registry | ε Skill Registry]] · [[concepts/κ-rule-engine | κ Rule Engine]] · [[concepts/δ-model-router | δ Model Router]] · [[architecture/data-model | Data Model]] · [[spec/s15-gsd-contract | S15 GSD Contract]] |