Troubleshooting

Phase 0 Status: This document describes target behavior. No Colibri TypeScript code exists yet. Procedures below assume the Phase 0 server (dist/server.js from P0.2.1), database (data/colibri.db from P0.2.2), 5-stage α chain (P0.2.4), and 19 MCP tools (per ADR-004) are in place.

Phase 0 Colibri is a single-node stdio MCP server. Symptoms that involve network ports, file watchers, peer gossip, or background agents do not apply here — see What is NOT in Phase 0 for the explicit list.

Symptom table

Jump to the section that matches what you observe:

Symptom → Section
Server exits immediately at launch → Boot failures
SQLITE_CANTOPEN / DB file missing → Boot failures › DB file missing
Schema version mismatch on boot → Boot failures › Schema version mismatch
better-sqlite3 native build failure → Boot failures › Native build failure
MCP client times out on first connect → Boot failures › Stdio handshake timeout
Tool call never returns → Runtime failures › Tool timeout
audit_verify_chain reports break_at → Runtime failures › Audit chain break
merkle_finalize fails or hangs → Runtime failures › Merkle finalize fails
Zod validation error in tool response → Runtime failures › Tool validation error
DB write unusually slow / wal file huge → DB issues › WAL bloat
PRAGMA integrity_check; not ok → DB issues › Integrity check fails
SQLITE_BUSY on a write → DB issues › Lock contention
“Session not found” on tool call → Session issues › Session not found
Task done but no matching thought_record → Session issues › Orphan task

Boot failures

DB file missing

Symptom: Boot fails with SQLITE_CANTOPEN: unable to open database file.

Cause: The path in COLIBRI_DB_PATH (default ./data/colibri.db) does not exist and the parent directory may also be missing.

Fix:

  1. Ensure the parent directory exists: mkdir -p data.
  2. Initialize an empty DB from the schema:
    sqlite3 data/colibri.db < src/db/schema.sql
    
  3. Verify:
    sqlite3 data/colibri.db "PRAGMA integrity_check;"
    

    Expected output: ok.

  4. Restart the server.
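
The steps above can be run as one sketch. The scratch path and the one-table stand-in schema below are illustrative only; in production the path comes from COLIBRI_DB_PATH and the schema from src/db/schema.sql:

```shell
# Illustrative recovery sketch on a scratch path.
# Real usage: mkdir -p data && sqlite3 data/colibri.db < src/db/schema.sql
dbdir=$(mktemp -d)
db="$dbdir/colibri.db"
# Stand-in for the real schema (assumed table name; the real one ships
# in src/db/schema.sql).
sqlite3 "$db" "CREATE TABLE IF NOT EXISTS sessions (session_id TEXT PRIMARY KEY);"
# Verify the fresh DB before pointing the server at it.
check=$(sqlite3 "$db" "PRAGMA integrity_check;")
echo "$check"
```

If the final check prints anything other than ok, the file is already damaged and should not be used.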

Schema version mismatch

Symptom: Boot fails at step 3 of the boot sequence with a schema-version error. Server refuses to start.

Cause: Phase 0 does not silently migrate. A DB on disk whose schema version does not match the running code is treated as a hard stop.

Fix:

  • If the DB is newer than the code: update the server binary.
  • If the code is newer than the DB: either migrate manually (apply the delta SQL by hand) or restore from a backup taken against the current code version — see docs/guides/backup.md restore runbook.
  • Do not hand-edit sqlite_master to fake a version match.
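
To see which version a DB on disk carries, a quick check is SQLite's user_version pragma. Storing the schema version there is a common SQLite convention; whether Colibri uses exactly this mechanism is an assumption here, not something the Phase 0 schema has confirmed:

```shell
# Sketch: read a schema version stamp from a scratch DB.
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "PRAGMA user_version = 3;"   # simulate a DB stamped at version 3
on_disk=$(sqlite3 "$db" "PRAGMA user_version;")
echo "schema version on disk: $on_disk"
```

Comparing that number against the version the running code expects tells you which of the two bullets above applies.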

Native build failure

Symptom: npm install or first boot fails with a better-sqlite3 node-gyp / native build error.

Cause: better-sqlite3 compiles a native binding against the exact Node.js ABI. Changing the Node version, or moving the node_modules/ directory between machines, breaks the binary.

Fix:

npm rebuild better-sqlite3

If that fails, nuke and reinstall:

rm -rf node_modules
npm install

On Windows, ensure windows-build-tools (or Visual Studio C++ build tools) are installed.

Stdio handshake timeout

Symptom: The MCP client reports “server did not respond” or “timed out waiting for handshake” within a few seconds of launching the server.

Cause: Boot steps 1–5 (config load, DB open, schema validate, middleware register, tools list) take longer than the client’s handshake deadline, so the MCP stdio handshake at step 6 arrives too late.

Fix:

  1. Confirm dist/server.js exists and is the compiled output of the current src/server.ts. A stale or partial build can fail or hang at import time.
  2. Check stderr logs for a step that is hanging (enable COLIBRI_LOG_LEVEL=DEBUG). Slow DB open usually means COLIBRI_DB_PATH points at network storage — move it to local disk.
  3. Increase the client-side handshake timeout if it exposes one. On the server side, the boot is designed to hit stdio before heavy work; stalls imply an environment problem, not a code problem.
  4. Never write anything to stdout from server code. Stdout is the JSON-RPC transport; any stray console.log poisons the handshake. Use console.error (stderr) for logs.
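
The stdout/stderr split in step 4 can be demonstrated without any Colibri code: only what a process writes to stdout reaches the JSON-RPC transport, so routing logs to stderr keeps the frame clean.

```shell
# Capture stdout only, discarding stderr — this is the client's view of
# the transport. The log line never reaches the captured frame.
frame=$( { echo "debug: booting" >&2; echo '{"jsonrpc":"2.0","id":1}'; } 2>/dev/null )
echo "$frame"
```

A stray log line on stdout would instead land inside the frame and break the handshake parse.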

Runtime failures

Tool timeout

Symptom: A tool call is dispatched and never returns; the client eventually gives up.

Causes to check, in order:

  1. A write lock held by another process. Phase 0 is single-writer. If an external SQLite client (DB Browser, another Colibri instance) holds a write lock, every tool that writes hangs. Close the other writer.
  2. A stuck middleware stage. With COLIBRI_LOG_LEVEL=DEBUG, each of the 5 α stages (tool-lock → schema-validate → audit-enter → dispatch → audit-exit) logs its entry and exit. The stage that entered but did not exit is where the call is stuck.
  3. A long-running query. Check sqlite_stat1 or enable SQLite query-plan logging for the stuck tool.

Audit chain break

Symptom: audit_verify_chain returns { ok: false, break_at: <n> }.

Cause: The thought_records table is HMAC-linked — each record’s prev_hash must equal the previous record’s hash. A break means either:

  • A record was inserted or modified out-of-band (e.g. manual SQL edit).
  • A record was deleted, leaving a hash gap.
  • A write was partially applied before a crash and a stale buffer survived.

Fix:

  • The chain cannot self-heal. Phase 0 treats a chain break as a legitimacy incident that must be escalated to the T0 human owner (see Escalation).
  • Preserve the broken DB (copy to data/colibri.db.chainbreak-…) so the break can be diagnosed forensically.
  • Restore from the most recent backup that audit_verify_chain validates clean — see backup.md restore runbook.
  • Log the gap in the next session’s first thought_record so the break is visible in the new chain.
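
A forensic query over the two records straddling break_at can look like the sketch below. The table and column names (thought_records, id, hash, prev_hash) follow this document's description and should be treated as assumptions until the Phase 0 schema ships:

```shell
# Build a scratch DB with a deliberate chain break at id 3:
# record 3's prev_hash ('zzz') does not match record 2's hash ('bbb').
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "CREATE TABLE thought_records (id INTEGER PRIMARY KEY, hash TEXT, prev_hash TEXT);
INSERT INTO thought_records VALUES (1,'aaa',NULL),(2,'bbb','aaa'),(3,'ccc','zzz');"

break_at=3
# Pull the record at the break and its predecessor for side-by-side review.
rows=$(sqlite3 "$db" "SELECT id, hash, prev_hash FROM thought_records
  WHERE id IN ($break_at - 1, $break_at);")
echo "$rows"
```

Run this against the preserved copy (data/colibri.db.chainbreak-…), never the live DB, and attach the output to the escalation.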

Merkle finalize fails

Symptom: merkle_finalize throws mid-call, or the client sees a partial result.

Cause: Finalization runs inside a single DB transaction. A failure — disk full, DB locked, canonicalization error — rolls the whole transaction back, so the merkle_nodes table is left in its pre-call state.

Fix:

  1. Confirm disk space (df -h / dir on the data/ partition).
  2. Confirm no external writer holds a lock (see Lock contention).
  3. Retry merkle_finalize. Because the previous attempt rolled back, a retry starts clean and either succeeds or reproduces the same error — which now has a clear diagnosis (disk, lock, bug).
  4. If the retry fails in a way that suggests a code bug, capture the stderr trace and escalate. Do not hand-edit merkle_nodes.

Tool validation error

Symptom: Tool call returns a Zod error like Expected string, received number at path "session_id".

Cause: The client sent arguments that do not match the tool’s declared schema. Phase 0 validates every inbound tool call against its Zod schema in the schema-validate α stage; mismatches are rejected before dispatch runs.

Fix:

  1. Look up the tool’s canonical signature in docs/reference/mcp-tools-phase-0.md.
  2. Compare against the arguments your client is sending.
  3. Fix the caller — do not loosen the Zod schema server-side. The rejection is the feature, not a bug.

DB issues

WAL bloat

Symptom: data/colibri.db-wal is very large (hundreds of MB or more), disk is filling up, writes feel slow.

Cause: Checkpoint is not running (the WAL only merges back into the main DB on checkpoint). Phase 0 normally checkpoints at graceful shutdown; if the server has been up for a very long time, or was terminated with SIGKILL and so skipped the shutdown checkpoint, the WAL can accumulate.

Fix:

With the server idle (no in-flight tool calls):

sqlite3 data/colibri.db "PRAGMA wal_checkpoint(TRUNCATE);"

This merges the WAL into the main DB and truncates it to zero length. It is safe against a live DB, though the TRUNCATE checkpoint will report busy if readers or writers are active; that is why the server should be idle.
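
The checkpoint can be demonstrated on a scratch WAL-mode DB. The wal_checkpoint pragma reports three numbers (busy flag, WAL frames, frames checkpointed); a leading 0 means the checkpoint was not blocked:

```shell
# Create a WAL-mode DB, write to it, and checkpoint with TRUNCATE,
# all in one connection so the -wal file is live when we checkpoint.
db=$(mktemp -d)/colibri.db
out=$(sqlite3 "$db" "PRAGMA journal_mode=WAL;
CREATE TABLE t(x);
INSERT INTO t VALUES (1);
PRAGMA wal_checkpoint(TRUNCATE);")
# Last output line is 'busy|log|checkpointed'; busy=0 means success.
busy=$(echo "$out" | tail -n1 | cut -d'|' -f1)
echo "checkpoint busy flag: $busy"
```

A non-zero busy flag means another reader or writer blocked the truncate; retry with the server idle.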

Integrity check fails

Symptom: sqlite3 data/colibri.db "PRAGMA integrity_check;" returns anything other than ok — messages about malformed pages, missing indexes, checksum mismatches.

Cause: Disk corruption, partially-flushed writes after power loss, or a hardware fault. SQLite does not self-repair.

Fix: Follow the restore runbook in docs/guides/backup.md. Do not try to repair in place — Phase 0 has no repair tool. Restore from the last hot-tier snapshot that passes PRAGMA integrity_check;. Keep the corrupt file for forensics.

Lock contention

Symptom: Writes fail with SQLITE_BUSY or database is locked.

Cause: In Phase 0’s single-writer model this should not happen — only one Colibri process should hold the DB. If it does happen, the cause is one of:

  • A second Colibri instance launched against the same COLIBRI_DB_PATH.
  • An external SQLite tool (DB Browser, sqlite3 CLI) with a write transaction open.
  • A previous Colibri process that did not exit cleanly and whose lock is still held by the OS.

Fix:

  1. List processes touching the file. On Windows: handle64 data\colibri.db (Sysinternals). On Linux/macOS: lsof data/colibri.db.
  2. Close the second writer.
  3. If no process holds the file but the lock persists, open the DB once with the sqlite3 CLI (sqlite3 data/colibri.db "SELECT 1;") so SQLite can recover the WAL, then restart the server. Deleting the .db-wal / .db-shm companions is a last resort (server fully stopped only): it discards any committed writes not yet checkpointed.
  4. If a second Colibri instance was actually running, that’s a deployment bug — Phase 0 assumes one process per DB. Fix the launcher.

Session issues

Session not found

Symptom: Tool call returns “session not found” or similar.

Cause: The session_id the tool was given does not exist in the sessions table. Usually means COLIBRI_SESSION_ID was set to a value that no prior session_start registered, or the client cached a stale session id from a previous DB.

Fix:

  1. Check COLIBRI_SESSION_ID in the client’s MCP launcher config. If set, make sure it matches a row in sessions.
  2. If you want Colibri to generate one at boot, unset COLIBRI_SESSION_ID — boot will mint a fresh session and emit its id in the stderr log.
  3. Call session_start explicitly to register the id before retrying the tool that failed.
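
Step 1's check can be done directly against the DB. The sessions table and session_id column follow this document's description and are assumptions until the Phase 0 schema ships:

```shell
# Scratch DB with one registered session, standing in for data/colibri.db.
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "CREATE TABLE sessions (session_id TEXT PRIMARY KEY);
INSERT INTO sessions VALUES ('sess-001');"

# Does the id the client is sending actually exist?
found=$(sqlite3 "$db" "SELECT COUNT(*) FROM sessions WHERE session_id = 'sess-001';")
echo "$found"
```

A count of 0 for the id in your launcher config confirms the stale-id diagnosis.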

Orphan task

Symptom: A task row is marked status = "done" but has no thought_record within 60 seconds of the task_update.

Cause: The executor updated the task but did not run the required reflection. Per the writeback protocol the final thought_record must come before merkle_finalize — a missing thought_record means the task’s reasoning is not anchored.

Phase 0 policy: This is flagged by a convention-level writeback audit (per spec/s15-… and agents/writeback-protocol.md) but is not runtime-blocking in Phase 0. The task stays done; the audit surfaces the gap.

Fix:

  1. Identify the orphan via the writeback audit report.
  2. Append a late thought_record explaining why the reflection was missed (the gap is now in the chain).
  3. Educate the executor (or the skill that dispatched it) so the convention holds next time. Phase 0 ships enforcement at the audit layer, not the middleware layer — tightening to hard-block is a Phase 1+ change.
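
The audit's core query can be sketched as a LEFT JOIN that surfaces done tasks with no anchoring thought_record. The table and column names (tasks.task_id, tasks.status, thought_records.task_id) are assumptions pending the real schema:

```shell
# Scratch DB: t1 is done and reflected on; t2 is done but orphaned.
db=$(mktemp -d)/colibri.db
sqlite3 "$db" "CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE thought_records (id INTEGER PRIMARY KEY, task_id TEXT);
INSERT INTO tasks VALUES ('t1','done'),('t2','done');
INSERT INTO thought_records VALUES (1,'t1');"

# Done tasks with no matching thought_record are the orphans.
orphans=$(sqlite3 "$db" "SELECT t.task_id FROM tasks t
  LEFT JOIN thought_records r ON r.task_id = t.task_id
  WHERE t.status = 'done' AND r.id IS NULL;")
echo "$orphans"
```

Each id this returns gets a late thought_record per step 2 above.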

What is NOT in Phase 0

The following failure classes do not exist in Phase 0 because the feature does not ship. If a runbook or a web search result mentions them, it is donor-era AMS material — not applicable.

Non-symptom → Why
EADDRINUSE, port-in-use errors → Phase 0 is stdio-only. There is no port.
File watcher crashes, inotify / ReadDirectoryChangesW errors, >200-entry limits, symlink issues → Phase 0 ships no file watcher per S17 §2.
P2P peer sync failures, no peers found, gossip rejected, IHAVE mismatches, fork mismatch, clock drift warnings → Phase 0 has no peers. P2P is deferred to Phase 3+.
Agent spawn failures, orphaned agent processes, agent_spawn / agent_status / agent_list errors → Phase 0 has no agent spawning. Sub-agents are dispatched by the host client’s Task tool, not by Colibri. Agent tools are deferred to Phase 1.5 per ADR-005.
Auth / JWT / ACL / rate-limit errors → Phase 0 has no auth and no rate limiter. Deferred to Phase 2+ per spec/s13-hardening.md.
Horizontal scaling / replica sync / leader election → Phase 0 is single-node single-writer. See docs/3-world/execution/scale.md.

These capabilities return, one axis at a time, across Phase 1+. See docs/5-time/roadmap.md for the phase map.

Escalation

If none of the sections above covers the failure, or the fix fails to stick:

  1. Capture stderr logs (with COLIBRI_LOG_LEVEL=DEBUG), the integrity_check output, and the last clean backup id.
  2. File a PM handoff per docs/agents/pm-contract.md. PM (T2) decides whether to sub-agent the investigation or escalate to T0 (human owner).
  3. For any symptom that involves the audit chain or the Merkle tree — escalate to T0 directly, do not attempt self-repair. Legitimacy-axis corruption is out of scope for T2 and T3.

Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.