Troubleshooting

Phase 0 Status: This document describes target behavior. No Colibri TypeScript code exists yet. Procedures below assume the Phase 0 server (dist/server.js from P0.2.1), database (data/colibri.db from P0.2.2), 5-stage α chain (P0.2.4), and 19 MCP tools (per ADR-004) are in place.

Phase 0 Colibri is a single-node stdio MCP server. Symptoms that involve network ports, file watchers, peer gossip, or background agents do not apply here — see What is NOT in Phase 0 for the explicit list.

Symptom table

Jump to the section that matches what you observe:

Symptom → Section
Server exits immediately at launch → Boot failures
SQLITE_CANTOPEN / DB file missing → Boot failures › DB file missing
Schema version mismatch on boot → Boot failures › Schema version mismatch
better-sqlite3 native build failure → Boot failures › Native build failure
MCP client times out on first connect → Boot failures › Stdio handshake timeout
Tool call never returns → Runtime failures › Tool timeout
audit_verify_chain reports break_at → Runtime failures › Audit chain break
merkle_finalize fails or hangs → Runtime failures › Merkle finalize fails
Zod validation error in tool response → Runtime failures › Tool validation error
DB write unusually slow / wal file huge → DB issues › WAL bloat
PRAGMA integrity_check; not ok → DB issues › Integrity check fails
SQLITE_BUSY on a write → DB issues › Lock contention
“Session not found” on tool call → Session issues › Session not found
Task done but no matching thought_record → Session issues › Orphan task

Boot failures

DB file missing

Symptom: Boot fails with SQLITE_CANTOPEN: unable to open database file.

Cause: The path in COLIBRI_DB_PATH (default ./data/colibri.db) does not exist and the parent directory may also be missing.

Fix:

  1. Ensure the parent directory exists: mkdir -p data.
  2. Initialize an empty DB from the schema:
    sqlite3 data/colibri.db < src/db/schema.sql
    
  3. Verify:
    sqlite3 data/colibri.db "PRAGMA integrity_check;"
    

    Expected output: ok.

  4. Restart the server.
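
The steps above can be run as one sketch. The scratch path and the one-table stand-in schema below are illustrative only; in production the path comes from COLIBRI_DB_PATH and the schema from src/db/schema.sql:

```shell
# Illustrative recovery sketch on a scratch path.
# Real usage: mkdir -p data && sqlite3 data/colibri.db < src/db/schema.sql
dbdir=$(mktemp -d)
db="$dbdir/colibri.db"
# Stand-in for the real schema (assumed table name; the real one ships
# in src/db/schema.sql).
sqlite3 "$db" "CREATE TABLE IF NOT EXISTS sessions (session_id TEXT PRIMARY KEY);"
# Verify the fresh DB before pointing the server at it.
check=$(sqlite3 "$db" "PRAGMA integrity_check;")
echo "$check"
```

If the final check prints anything other than ok, the file is already damaged and should not be used.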

Schema version mismatch

Symptom: Boot fails at step 3 of the boot sequence with a schema-version error. Server refuses to start.

Cause: Phase 0 does not silently migrate. A DB on disk whose schema version does not match the running code is treated as a hard stop.

Fix:

  • If the DB is newer than the code: update the server binary.
  • If the code is newer than the DB: either migrate manually (apply the delta SQL by hand) or restore from a backup taken against the current code version — see docs/guides/backup.md restore runbook.
  • Do not hand-edit sqlite_master to fake a version match.
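
To see which version a DB on disk carries, a quick check is SQLite's user_version pragma. Storing the schema version there is a common SQLite convention; whether Colibri uses exactly this mechanism is an assumption here, not something the Phase 0 schema has confirmed:

```shell
# Sketch: read a schema version stamp from a scratch DB.
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "PRAGMA user_version = 3;"   # simulate a DB stamped at version 3
on_disk=$(sqlite3 "$db" "PRAGMA user_version;")
echo "schema version on disk: $on_disk"
```

Comparing that number against the version the running code expects tells you which of the two bullets above applies.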

Native build failure

Symptom: npm install or first boot fails with a better-sqlite3 node-gyp / native build error.

Cause: better-sqlite3 compiles a native binding against the exact Node.js ABI. Changing the Node version, or moving the node_modules/ directory between machines, breaks the binary.

Fix:

npm rebuild better-sqlite3

If that fails, nuke and reinstall:

rm -rf node_modules
npm install

On Windows, ensure windows-build-tools (or Visual Studio C++ build tools) are installed.

Stdio handshake timeout

Symptom: The MCP client reports “server did not respond” or “timed out waiting for handshake” within a few seconds of launching the server.

Cause: Boot steps 1–5 (config load, DB open, schema validate, middleware register, tools list) take longer than the client’s handshake deadline, so the MCP stdio handshake at step 6 arrives too late.

Fix:

  1. Confirm dist/server.js exists and is the compiled output of the current src/server.ts. A stale or partial build can fail or hang at import time.
  2. Check stderr logs for a step that is hanging (enable COLIBRI_LOG_LEVEL=DEBUG). Slow DB open usually means COLIBRI_DB_PATH points at network storage — move it to local disk.
  3. Increase the client-side handshake timeout if it exposes one. On the server side, the boot is designed to hit stdio before heavy work; stalls imply an environment problem, not a code problem.
  4. Never write anything to stdout from server code. Stdout is the JSON-RPC transport; any stray console.log poisons the handshake. Use console.error (stderr) for logs.
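
The stdout/stderr split in step 4 can be demonstrated without any Colibri code: only what a process writes to stdout reaches the JSON-RPC transport, so routing logs to stderr keeps the frame clean.

```shell
# Capture stdout only, discarding stderr — this is the client's view of
# the transport. The log line never reaches the captured frame.
frame=$( { echo "debug: booting" >&2; echo '{"jsonrpc":"2.0","id":1}'; } 2>/dev/null )
echo "$frame"
```

A stray log line on stdout would instead land inside the frame and break the handshake parse.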

Runtime failures

Tool timeout

Symptom: A tool call is dispatched and never returns; the client eventually gives up.

Causes to check, in order:

  1. A write lock held by another process. Phase 0 is single-writer. If an external SQLite client (DB Browser, another Colibri instance) holds a write lock, every tool that writes hangs. Close the other writer.
  2. A stuck middleware stage. With COLIBRI_LOG_LEVEL=DEBUG, each of the 5 α stages (tool-lock → schema-validate → audit-enter → dispatch → audit-exit) logs its entry and exit. The stage that entered but did not exit is where the call is stuck.
  3. A long-running query. Check sqlite_stat1 or enable SQLite query-plan logging for the stuck tool.

Audit chain break

Symptom: audit_verify_chain returns { ok: false, break_at: <n> }.

Cause: The thought_records table is HMAC-linked — each record’s prev_hash must equal the previous record’s hash. A break means either:

  • A record was inserted or modified out-of-band (e.g. manual SQL edit).
  • A record was deleted, leaving a hash gap.
  • A write was partially applied before a crash and a stale buffer survived.

Fix:

  • The chain cannot self-heal. Phase 0 treats a chain break as a legitimacy incident that must be escalated to the T0 human owner (see Escalation).
  • Preserve the broken DB (copy to data/colibri.db.chainbreak-…) so the break can be diagnosed forensically.
  • Restore from the most recent backup that audit_verify_chain validates clean — see backup.md restore runbook.
  • Log the gap in the next session’s first thought_record so the break is visible in the new chain.
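
A forensic query over the two records straddling break_at can look like the sketch below. The table and column names (thought_records, id, hash, prev_hash) follow this document's description and should be treated as assumptions until the Phase 0 schema ships:

```shell
# Build a scratch DB with a deliberate chain break at id 3:
# record 3's prev_hash ('zzz') does not match record 2's hash ('bbb').
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "CREATE TABLE thought_records (id INTEGER PRIMARY KEY, hash TEXT, prev_hash TEXT);
INSERT INTO thought_records VALUES (1,'aaa',NULL),(2,'bbb','aaa'),(3,'ccc','zzz');"

break_at=3
# Pull the record at the break and its predecessor for side-by-side review.
rows=$(sqlite3 "$db" "SELECT id, hash, prev_hash FROM thought_records
  WHERE id IN ($break_at - 1, $break_at);")
echo "$rows"
```

Run this against the preserved copy (data/colibri.db.chainbreak-…), never the live DB, and attach the output to the escalation.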

Merkle finalize fails

Symptom: merkle_finalize throws mid-call, or the client sees a partial result.

Cause: Finalization runs inside a single DB transaction. A failure — disk full, DB locked, canonicalization error — rolls the whole transaction back, so the merkle_nodes table is left in its pre-call state.

Fix:

  1. Confirm disk space (df -h / dir on the data/ partition).
  2. Confirm no external writer holds a lock (see Lock contention).
  3. Retry merkle_finalize. Because the previous attempt rolled back, a retry starts clean and either succeeds or reproduces the same error — which now has a clear diagnosis (disk, lock, bug).
  4. If the retry fails in a way that suggests a code bug, capture the stderr trace and escalate. Do not hand-edit merkle_nodes.

Tool validation error

Symptom: Tool call returns a Zod error like Expected string, received number at path "session_id".

Cause: The client sent arguments that do not match the tool’s declared schema. Phase 0 validates every inbound tool call against its Zod schema in the schema-validate α stage; mismatches are rejected before dispatch runs.

Fix:

  1. Look up the tool’s canonical signature in docs/reference/mcp-tools-phase-0.md.
  2. Compare against the arguments your client is sending.
  3. Fix the caller — do not loosen the Zod schema server-side. The rejection is the feature, not a bug.

DB issues

WAL bloat

Symptom: data/colibri.db-wal is very large (hundreds of MB or more), disk is filling up, writes feel slow.

Cause: Checkpoint is not running (the WAL only merges back into the main DB on checkpoint). Phase 0 normally checkpoints at graceful shutdown; if the server has been up for a very long time, or was terminated with SIGKILL and so skipped the shutdown checkpoint, the WAL can accumulate.

Fix:

With the server idle (no in-flight tool calls):

sqlite3 data/colibri.db "PRAGMA wal_checkpoint(TRUNCATE);"

This merges the WAL into the main DB and truncates it to zero length. It is safe against a live DB, though the TRUNCATE checkpoint will report busy if readers or writers are active; that is why the server should be idle.
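
The checkpoint can be demonstrated on a scratch WAL-mode DB. The wal_checkpoint pragma reports three numbers (busy flag, WAL frames, frames checkpointed); a leading 0 means the checkpoint was not blocked:

```shell
# Create a WAL-mode DB, write to it, and checkpoint with TRUNCATE,
# all in one connection so the -wal file is live when we checkpoint.
db=$(mktemp -d)/colibri.db
out=$(sqlite3 "$db" "PRAGMA journal_mode=WAL;
CREATE TABLE t(x);
INSERT INTO t VALUES (1);
PRAGMA wal_checkpoint(TRUNCATE);")
# Last output line is 'busy|log|checkpointed'; busy=0 means success.
busy=$(echo "$out" | tail -n1 | cut -d'|' -f1)
echo "checkpoint busy flag: $busy"
```

A non-zero busy flag means another reader or writer blocked the truncate; retry with the server idle.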

Integrity check fails

Symptom: sqlite3 data/colibri.db "PRAGMA integrity_check;" returns anything other than ok — messages about malformed pages, missing indexes, checksum mismatches.

Cause: Disk corruption, partially-flushed writes after power loss, or a hardware fault. SQLite does not self-repair.

Fix: Follow the restore runbook in docs/guides/backup.md. Do not try to repair in place — Phase 0 has no repair tool. Restore from the last hot-tier snapshot that passes PRAGMA integrity_check;. Keep the corrupt file for forensics.

Lock contention

Symptom: Writes fail with SQLITE_BUSY or database is locked.

Cause: In Phase 0’s single-writer model this should not happen — only one Colibri process should hold the DB. If it does happen, the cause is one of:

  • A second Colibri instance launched against the same COLIBRI_DB_PATH.
  • An external SQLite tool (DB Browser, sqlite3 CLI) with a write transaction open.
  • A previous Colibri process that did not exit cleanly and whose lock is still held by the OS.

Fix:

  1. List processes touching the file. On Windows: handle64 data\colibri.db (Sysinternals). On Linux/macOS: lsof data/colibri.db.
  2. Close the second writer.
  3. If no process holds the file but the lock persists, open the DB once with the sqlite3 CLI (sqlite3 data/colibri.db "SELECT 1;") so SQLite can recover the WAL, then restart the server. Deleting the .db-wal / .db-shm companions is a last resort (server fully stopped only): it discards any committed writes not yet checkpointed.
  4. If a second Colibri instance was actually running, that’s a deployment bug — Phase 0 assumes one process per DB. Fix the launcher.

Session issues

Session not found

Symptom: Tool call returns “session not found” or similar.

Cause: The session_id the tool was given does not exist in the sessions table. Usually means COLIBRI_SESSION_ID was set to a value that no prior session_start registered, or the client cached a stale session id from a previous DB.

Fix:

  1. Check COLIBRI_SESSION_ID in the client’s MCP launcher config. If set, make sure it matches a row in sessions.
  2. If you want Colibri to generate one at boot, unset COLIBRI_SESSION_ID — boot will mint a fresh session and emit its id in the stderr log.
  3. Call session_start explicitly to register the id before retrying the tool that failed.
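
Step 1's check can be done directly against the DB. The sessions table and session_id column follow this document's description and are assumptions until the Phase 0 schema ships:

```shell
# Scratch DB with one registered session, standing in for data/colibri.db.
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "CREATE TABLE sessions (session_id TEXT PRIMARY KEY);
INSERT INTO sessions VALUES ('sess-001');"

# Does the id the client is sending actually exist?
found=$(sqlite3 "$db" "SELECT COUNT(*) FROM sessions WHERE session_id = 'sess-001';")
echo "$found"
```

A count of 0 for the id in your launcher config confirms the stale-id diagnosis.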

Orphan task

Symptom: A task row is marked status = "done" but has no thought_record within 60 seconds of the task_update.

Cause: The executor updated the task but did not run the required reflection. Per the writeback protocol the final thought_record must come before merkle_finalize — a missing thought_record means the task’s reasoning is not anchored.

Phase 0 policy: This is flagged by a convention-level writeback audit (per spec/s15-… and agents/writeback-protocol.md) but is not runtime-blocking in Phase 0. The task stays done; the audit surfaces the gap.

Fix:

  1. Identify the orphan via the writeback audit report.
  2. Append a late thought_record explaining why the reflection was missed (the gap is now in the chain).
  3. Educate the executor (or the skill that dispatched it) so the convention holds next time. Phase 0 ships enforcement at the audit layer, not the middleware layer — tightening to hard-block is a Phase 1+ change.
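
The audit's core query can be sketched as a LEFT JOIN that surfaces done tasks with no anchoring thought_record. The table and column names (tasks.task_id, tasks.status, thought_records.task_id) are assumptions pending the real schema:

```shell
# Scratch DB: t1 is done and reflected on; t2 is done but orphaned.
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE thought_records (id INTEGER PRIMARY KEY, task_id TEXT);
INSERT INTO tasks VALUES ('t1','done'),('t2','done');
INSERT INTO thought_records VALUES (1,'t1');"

# Done tasks with no matching thought_record are the orphans.
orphans=$(sqlite3 "$db" "SELECT t.task_id FROM tasks t
  LEFT JOIN thought_records r ON r.task_id = t.task_id
  WHERE t.status = 'done' AND r.id IS NULL;")
echo "$orphans"
```

Each id this returns gets a late thought_record per step 2 above.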

What is NOT in Phase 0

The following failure classes do not exist in Phase 0 because the feature does not ship. If a runbook or a web search result mentions them, it is donor-era AMS material — not applicable.

Non-symptom → Why
EADDRINUSE, port-in-use errors → Phase 0 is stdio-only. There is no port.
File watcher crashes, inotify / ReadDirectoryChangesW errors, >200-entry limits, symlink issues → Phase 0 ships no file watcher per S17 §2.
P2P peer sync failures, no peers found, gossip rejected, IHAVE mismatches, fork mismatch, clock drift warnings → Phase 0 has no peers. P2P is deferred to Phase 3+.
Agent spawn failures, orphaned agent processes, agent_spawn / agent_status / agent_list errors → Phase 0 has no agent spawning. Sub-agents are dispatched by the host client’s Task tool, not by Colibri. Agent tools are deferred to Phase 1.5 per ADR-005.
Auth / JWT / ACL / rate-limit errors → Phase 0 has no auth and no rate limiter. Deferred to Phase 2+ per spec/s13-hardening.md.
Horizontal scaling / replica sync / leader election → Phase 0 is single-node single-writer. See docs/3-world/execution/scale.md.

These capabilities return, one axis at a time, across Phase 1+. See docs/5-time/roadmap.md for the phase map.

Escalation

If none of the sections above covers the failure, or the fix fails to stick:

  1. Capture stderr logs (with COLIBRI_LOG_LEVEL=DEBUG), the integrity_check output, and the last clean backup id.
  2. File a PM handoff per docs/agents/pm-contract.md. PM (T2) decides whether to sub-agent the investigation or escalate to T0 (human owner).
  3. For any symptom that involves the audit chain or the Merkle tree — escalate to T0 directly, do not attempt self-repair. Legitimacy-axis corruption is out of scope for T2 and T3.

Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.