# Troubleshooting
Phase 0 Status: This document describes target behavior. No Colibri TypeScript code exists yet. Procedures below assume the Phase 0 server (`dist/server.js` from P0.2.1), database (`data/colibri.db` from P0.2.2), 5-stage α chain (P0.2.4), and 19 MCP tools (per ADR-004) are in place.
Phase 0 Colibri is a single-node stdio MCP server. Symptoms that involve network ports, file watchers, peer gossip, or background agents do not apply here — see What is NOT in Phase 0 for the explicit list.
## Symptom table
Jump to the section that matches what you observe:
| Symptom | Section |
|---|---|
| Server exits immediately at launch | Boot failures |
| `SQLITE_CANTOPEN` / DB file missing | Boot failures › DB file missing |
| Schema version mismatch on boot | Boot failures › Schema version mismatch |
| `better-sqlite3` native build failure | Boot failures › Native build failure |
| MCP client times out on first connect | Boot failures › Stdio handshake timeout |
| Tool call never returns | Runtime failures › Tool timeout |
| `audit_verify_chain` reports `break_at` | Runtime failures › Audit chain break |
| `merkle_finalize` fails or hangs | Runtime failures › Merkle finalize fails |
| Zod validation error in tool response | Runtime failures › Tool validation error |
| DB write unusually slow / WAL file huge | DB issues › WAL bloat |
| `PRAGMA integrity_check;` not `ok` | DB issues › Integrity check fails |
| `SQLITE_BUSY` on a write | DB issues › Lock contention |
| “Session not found” on tool call | Session issues › Session not found |
| Task done but no matching `thought_record` | Session issues › Orphan task |
## Boot failures

### DB file missing
Symptom: Boot fails with `SQLITE_CANTOPEN: unable to open database file`.
Cause: The path in `COLIBRI_DB_PATH` (default `./data/colibri.db`) does not exist, and the parent directory may also be missing.
Fix:
- Ensure the parent directory exists: `mkdir -p data`
- Initialize an empty DB from the schema: `sqlite3 data/colibri.db < src/db/schema.sql`
- Verify: `sqlite3 data/colibri.db "PRAGMA integrity_check;"` — expected output: `ok`.
- Restart the server.
### Schema version mismatch
Symptom: Boot fails at step 3 of the boot sequence with a schema-version error. Server refuses to start.
Cause: Phase 0 does not silently migrate. A DB on disk with a schema version that does not match the running code is treated as a hard stop.
Fix:
- If the DB is newer than the code: update the server binary.
- If the code is newer than the DB: either migrate manually (apply the delta SQL by hand) or restore from a backup taken against the current code version — see the `docs/guides/backup.md` restore runbook.
- Do not hand-edit `sqlite_master` to fake a version match.
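If you need to read the on-disk schema version without booting the server, here is a stdlib-Python sketch. It assumes Phase 0 records the version in SQLite's `PRAGMA user_version`; the real storage location may be a meta table instead, so check the schema before relying on it:

```python
import sqlite3

# Hypothetical: the version the running code expects; not a real Phase 0 constant.
EXPECTED_SCHEMA_VERSION = 3

def check_schema_version(db_path: str) -> int:
    """Read the on-disk schema version without starting the server.

    Assumes the version lives in PRAGMA user_version (an assumption,
    not confirmed by the Phase 0 spec)."""
    conn = sqlite3.connect(db_path)
    try:
        (version,) = conn.execute("PRAGMA user_version;").fetchone()
    finally:
        conn.close()
    return version
```

Compare the returned value against what the current code expects, then decide between migrate-forward and restore per the fix list above.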
### Native build failure
Symptom: `npm install` or first boot fails with a `better-sqlite3` node-gyp / native build error.
Cause: `better-sqlite3` compiles a native binding against the exact Node.js ABI. Changing the Node version, or moving the `node_modules/` directory between machines, breaks the binary.
Fix:

```shell
npm rebuild better-sqlite3
```

If that fails, nuke and reinstall:

```shell
rm -rf node_modules
npm install
```

On Windows, ensure `windows-build-tools` (or the Visual Studio C++ build tools) are installed.
### Stdio handshake timeout
Symptom: The MCP client reports “server did not respond” or “timed out waiting for handshake” within a few seconds of launching the server.
Cause: Boot steps 1–5 (config load, DB open, schema validate, middleware register, tools list) take longer than the client’s handshake deadline, so the MCP stdio handshake at step 6 arrives too late.
Fix:
- Confirm `dist/server.js` exists and is the compiled output of the current `src/server.ts`. A mismatched compile can stall imports.
- Check stderr logs for a step that is hanging (enable `COLIBRI_LOG_LEVEL=DEBUG`). Slow DB open usually means `COLIBRI_DB_PATH` points at network storage — move it to local disk.
- Increase the client-side handshake timeout if it exposes one. On the server side, the boot is designed to hit stdio before heavy work; stalls imply an environment problem, not a code problem.
- Never write anything to stdout from server code. Stdout is the JSON-RPC transport; any stray `console.log` poisons the handshake. Use `console.error` (stderr) for logs.
## Runtime failures

### Tool timeout
Symptom: A tool call is dispatched and never returns; the client eventually gives up.
Causes to check, in order:
- DB locked by another process. Phase 0 is single-writer: if an external SQLite client (DB Browser, another Colibri instance) holds a write lock, every tool that writes hangs. Close the other writer.
- A stuck middleware stage. With `COLIBRI_LOG_LEVEL=DEBUG`, each of the 5 α stages (tool-lock → schema-validate → audit-enter → dispatch → audit-exit) logs its entry and exit. The stage that entered but did not exit is where the call is stuck.
- A long-running query. Check `sqlite_stat1` or enable SQLite query-plan logging for the stuck tool.
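To confirm or rule out the first cause quickly, a stdlib-Python probe can try to take the write lock itself. This is a hypothetical diagnostic helper, not a shipped Colibri tool:

```python
import sqlite3

def write_lock_available(db_path: str, timeout_s: float = 1.0) -> bool:
    """Try to take (and immediately release) SQLite's write lock.

    Returns False if another process holds a write transaction,
    which is exactly the condition that makes every writing tool hang."""
    conn = sqlite3.connect(db_path, timeout=timeout_s)
    try:
        # BEGIN IMMEDIATE acquires the write lock up front; it raises
        # OperationalError ("database is locked") if a writer holds it.
        conn.execute("BEGIN IMMEDIATE;")
        conn.execute("ROLLBACK;")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()
```

If this returns `False`, find and close the other writer before retrying the tool call.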
### Audit chain break
Symptom: `audit_verify_chain` returns `{ ok: false, break_at: <n> }`.
Cause: The `thought_records` table is HMAC-linked — each record’s `prev_hash` must equal the previous record’s hash. A break means either:
- A record was inserted or modified out-of-band (e.g. manual SQL edit).
- A record was deleted, leaving a hash gap.
- A write was partially applied before a crash and a stale buffer survived.
Fix:
- The chain cannot self-heal. Phase 0 treats a chain break as a legitimacy incident that must be escalated to the T0 human owner (see Escalation).
- Preserve the broken DB (copy to `data/colibri.db.chainbreak-…`) so the break can be diagnosed forensically.
- Restore from the most recent backup that `audit_verify_chain` validates clean — see the `backup.md` restore runbook.
- Log the gap in the next session’s first `thought_record` so the break is visible in the new chain.
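Before escalating, it can help to pinpoint where the linkage fails in the preserved copy. A read-only stdlib-Python sketch follows; the column names `id`, `hash`, and `prev_hash` are assumptions beyond what the chain description above guarantees, so adjust them to the real schema:

```python
import sqlite3

def find_chain_break(db_path: str):
    """Return the id of the first record whose prev_hash does not match
    the previous record's hash, or None if the walk is consistent.

    Assumes thought_records(id, hash, prev_hash) — illustrative names."""
    # Open read-only: never write to a DB you are preserving for forensics.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT id, hash, prev_hash FROM thought_records ORDER BY id"
        ).fetchall()
    finally:
        conn.close()
    prev_hash = None
    for rec_id, rec_hash, rec_prev in rows:
        if prev_hash is not None and rec_prev != prev_hash:
            return rec_id
        prev_hash = rec_hash
    return None
```

The returned id should agree with the `break_at` that `audit_verify_chain` reported; a disagreement is itself worth noting in the escalation.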
### Merkle finalize fails
Symptom: `merkle_finalize` throws mid-call, or the client sees a partial result.
Cause: Finalization runs inside a single DB transaction. A failure — disk full, DB locked, canonicalization error — rolls the whole transaction back, so the `merkle_nodes` table is left in its pre-call state.
Fix:
- Confirm disk space (`df -h` / `dir` on the `data/` partition).
- Confirm no external writer holds a lock (see Lock contention).
- Retry `merkle_finalize`. Because the previous attempt rolled back, a retry starts clean and either succeeds or reproduces the same error — which now has a clear diagnosis (disk, lock, bug).
- If the retry fails in a way that suggests a code bug, capture the stderr trace and escalate. Do not hand-edit `merkle_nodes`.
### Tool validation error
Symptom: Tool call returns a Zod error like `Expected string, received number at path "session_id"`.
Cause: The client sent arguments that do not match the tool’s declared schema. Phase 0 validates every inbound tool call against its Zod schema in the schema-validate α stage; mismatches are rejected before dispatch runs.
Fix:
- Look up the tool’s canonical signature in `docs/reference/mcp-tools-phase-0.md`.
- Compare against the arguments your client is sending.
- Fix the caller — do not loosen the Zod schema server-side. The rejection is the feature, not a bug.
## DB issues

### WAL bloat
Symptom: `data/colibri.db-wal` is very large (hundreds of MB or more), disk is filling up, writes feel slow.
Cause: Checkpoint is not running (the WAL only merges back into the main DB on checkpoint). Phase 0 normally checkpoints at graceful shutdown; if the server has been up for a very long time or restarted via SIGKILL, the WAL can accumulate.
Fix: With the server idle (no in-flight tool calls):

```shell
sqlite3 data/colibri.db "PRAGMA wal_checkpoint(TRUNCATE);"
```

This merges the WAL into the main DB and truncates it to zero. Safe to run against a live DB.
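The same checkpoint can be scripted, for example from a periodic health check. A stdlib-Python sketch that only checkpoints once the WAL crosses a size threshold (the threshold value is illustrative, not a Phase 0 constant):

```python
import os
import sqlite3

def checkpoint_if_bloated(db_path: str, max_wal_bytes: int = 64 * 1024 * 1024) -> bool:
    """TRUNCATE-checkpoint the WAL when it exceeds max_wal_bytes.

    Returns True if a checkpoint was run, False if the WAL was
    absent or still under the threshold."""
    wal_path = db_path + "-wal"
    if not os.path.exists(wal_path) or os.path.getsize(wal_path) <= max_wal_bytes:
        return False
    conn = sqlite3.connect(db_path)
    try:
        # Merges WAL frames into the main DB and truncates the WAL to zero.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
    finally:
        conn.close()
    return True
```

Run it only when no tool calls are in flight, matching the idle requirement above.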
### Integrity check fails
Symptom: `sqlite3 data/colibri.db "PRAGMA integrity_check;"` returns anything other than `ok` — messages about malformed pages, missing indexes, checksum mismatches.
Cause: Disk corruption, partially-flushed writes after power loss, or a hardware fault. SQLite does not self-repair.
Fix: Follow the restore runbook in `docs/guides/backup.md`. Do not try to repair in place — Phase 0 has no repair tool. Restore from the last hot-tier snapshot that passes `PRAGMA integrity_check;`. Keep the corrupt file for forensics.
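Picking the snapshot can be scripted by running the same integrity gate against each backup candidate. A stdlib-Python sketch (a hypothetical helper, not part of the restore runbook):

```python
import sqlite3

def snapshot_is_clean(snapshot_path: str) -> bool:
    """Run PRAGMA integrity_check against a backup candidate.

    Returns True only for a single plain 'ok' row; any other output
    means the snapshot is damaged and must not be restored."""
    # Read-only open: a candidate snapshot should never be written to.
    conn = sqlite3.connect(f"file:{snapshot_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check;").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```

Walk the snapshots newest-first and restore the first one for which this returns `True`.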
### Lock contention
Symptom: Writes fail with `SQLITE_BUSY` or `database is locked`.
Cause: In Phase 0’s single-writer model this should not happen — only one Colibri process should hold the DB. If it does happen, the cause is one of:
- A second Colibri instance launched against the same `COLIBRI_DB_PATH`.
- An external SQLite tool (DB Browser, the `sqlite3` CLI) with a write transaction open.
- A previous Colibri process that did not exit cleanly and whose lock is still held by the OS.
Fix:
- List processes touching the file. On Windows: `handle64 data\colibri.db` (Sysinternals). On Linux/macOS: `lsof data/colibri.db`.
- Close the second writer.
- If no process holds the file but the lock persists, remove the `.db-wal` / `.db-shm` companions (only with the server fully stopped) and restart the server. Note that deleting a `-wal` file discards any writes not yet checkpointed into the main DB, so verify the DB state afterwards.
- If a second Colibri instance was actually running, that’s a deployment bug — Phase 0 assumes one process per DB. Fix the launcher.
## Session issues

### Session not found
Symptom: Tool call returns “session not found” or similar.
Cause: The `session_id` the tool was given does not exist in the `sessions` table. Usually means `COLIBRI_SESSION_ID` was set to a value that no prior `session_start` registered, or the client cached a stale session id from a previous DB.
Fix:
- Check `COLIBRI_SESSION_ID` in the client’s MCP launcher config. If set, make sure it matches a row in `sessions`.
- If you want Colibri to generate one at boot, unset `COLIBRI_SESSION_ID` — boot will mint a fresh session and emit its id in the stderr log.
- Call the session-start tool explicitly to register the id before the tool that failed.
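The first check can be scripted. A stdlib-Python sketch that assumes the `sessions` table keys on a `session_id` column (an assumption; verify against the real schema):

```python
import sqlite3

def session_exists(db_path: str, session_id: str) -> bool:
    """True if session_id is registered in the sessions table.

    Column name session_id is assumed, not confirmed by the spec."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = conn.execute(
            "SELECT 1 FROM sessions WHERE session_id = ? LIMIT 1",
            (session_id,),
        ).fetchone()
    finally:
        conn.close()
    return row is not None
```

If this returns `False` for the id in your launcher config, the id is stale: unset it or register it via the session-start tool.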
### Orphan task
Symptom: A task row is marked `status = "done"` but has no `thought_record` within 60 seconds of the `task_update`.
Cause: The executor updated the task but did not run the required reflection. Per the writeback protocol the final `thought_record` must come before `merkle_finalize` — a missing `thought_record` means the task’s reasoning is not anchored.
Phase 0 policy: This is flagged by a convention-level writeback audit (per `spec/s15-…` and `agents/writeback-protocol.md`) but is not runtime-blocking in Phase 0. The task stays done; the audit surfaces the gap.
Fix:
- Identify the orphan via the writeback audit report.
- Append a late `thought_record` explaining why the reflection was missed (the gap is now in the chain).
- Educate the executor (or the skill that dispatched it) so the convention holds next time. Phase 0 ships enforcement at the audit layer, not the middleware layer — tightening to hard-block is a Phase 1+ change.
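The audit query behind the first step can be sketched in stdlib Python. Every column name here is illustrative, not the real Phase 0 schema: `tasks(id, status, updated_at)` and `thought_records(task_id, created_at)` with epoch-second timestamps are assumptions:

```python
import sqlite3

def find_orphan_tasks(db_path: str, window_s: int = 60):
    """Return ids of tasks marked done with no thought_record within
    window_s seconds of the task update.

    Schema names are illustrative; adjust to the real tables."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            """
            SELECT t.id FROM tasks t
            WHERE t.status = 'done'
              AND NOT EXISTS (
                SELECT 1 FROM thought_records r
                WHERE r.task_id = t.id
                  AND r.created_at BETWEEN t.updated_at AND t.updated_at + ?
              )
            ORDER BY t.id
            """,
            (window_s,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Each returned id is a candidate for a late `thought_record` per the fix list above.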
## What is NOT in Phase 0
The following failure classes do not exist in Phase 0 because the feature does not ship. If a runbook or a web search result mentions them, it is donor-era AMS material — not applicable.
| Non-symptom | Why |
|---|---|
| `EADDRINUSE`, port-in-use errors | Phase 0 is stdio-only. There is no port. |
| File watcher crashes, inotify / ReadDirectoryChangesW errors, >200-entry limits, symlink issues | Phase 0 ships no file watcher per S17 §2. |
| P2P peer sync failures, no peers found, gossip rejected, IHAVE mismatches, fork mismatch, clock drift warnings | Phase 0 has no peers. P2P is deferred to Phase 3+. |
| Agent spawn failures, orphaned agent processes, `agent_spawn` / `agent_status` / `agent_list` errors | Phase 0 has no agent spawning. Sub-agents are dispatched by the host client’s Task tool, not by Colibri. Agent tools are deferred to Phase 1.5 per ADR-005. |
| Auth / JWT / ACL / rate-limit errors | Phase 0 has no auth and no rate limiter. Deferred to Phase 2+ per `spec/s13-hardening.md`. |
| Horizontal scaling / replica sync / leader election | Phase 0 is single-node single-writer. See `docs/3-world/execution/scale.md`. |
These capabilities return, one axis at a time, across Phase 1+. See `docs/5-time/roadmap.md` for the phase map.
## Escalation
If none of the sections above covers the failure, or the fix fails to stick:
- Capture stderr logs (with `COLIBRI_LOG_LEVEL=DEBUG`), the `integrity_check` output, and the last clean backup id.
- File a PM handoff per `docs/agents/pm-contract.md`. PM (T2) decides whether to sub-agent the investigation or escalate to T0 (human owner).
- For any symptom that involves the audit chain or the Merkle tree — escalate to T0 directly; do not attempt self-repair. Legitimacy-axis corruption is out of scope for T2 and T3.
## Cross-links
- `docs/guides/backup.md` — backup cadence, restore runbook, corruption recovery.
- `docs/2-plugin/boot.md` — 6-step boot sequence referenced by the boot-failure section.
- `docs/2-plugin/health.md` — `server_health` tool surface for runtime diagnostics.
- `docs/3-world/execution/decision-trail.md` — `audit_verify_chain` semantics and HMAC linkage.
- `docs/agents/pm-contract.md` — escalation protocol to T2.
- `docs/reference/mcp-tools-phase-0.md` — the 19 tools and their signatures.