P0.6.2 Skills-Loader ESM Deadlock — Audit

Task: 109b31c3-6419-4019-a14a-2fb1fb26f355
Branch: feature/p0-6-2-skills-deadlock-fix
Worktree: .worktrees/claude/p0-6-2-skills-deadlock-fix
Base: origin/main @ 6345ba7a
ζ thought (plan): 4b95bb8b80989ed8fd634c29646812ef119641a1ba285b7bd337b83e4cec3117

1. Symptom

After the first live wire-up of the Phase 0 MCP server to Claude Desktop, three system-axis tools return valid JSON:

| Tool call | Result |
| --- | --- |
| server_ping | {ok:true, data:{version:"0.0.1", mode:"FULL", uptime_ms:482017}} |
| server_health | {status:"ok", phase:"phase2", db_tables:5, skill_count:0, mode:"FULL"} |
| skill_list | {ok:true, data:{skills:[], total_count:0}} |

server_health reports skill_count: 0, and skill_list returns an empty list, despite the 23 canonical colibri-* SKILL.md files under .agents/skills/. The disk corpus is intact; the in-memory registry is empty.

2. Surface inventory

Three source files participate in the deadlock cycle. All paths are relative to the repo root.

| File | Line | Role |
| --- | --- | --- |
| src/server.ts | 50 | Static import of registerSkillTools from ./domains/skills/repository.js |
| src/server.ts | 596–607 | if (isInvokedAsScript()) block: script-invocation IIFE with top-level await import('./startup.js') then await startup({...}) |
| src/server.ts | 601–604 | Comment WARNING that “re-entry during this IIFE’s evaluation would deadlock Node’s ES-module loader and trigger an ‘unsettled top-level await’ diagnostic” |
| src/server.ts | 606 | Forwards bootstrapFn: bootstrap, stopFn: stop to startup options, but not loadSkillsFn |
| src/startup.ts | 109–116 | Defines options.skillsRoot and options.loadSkillsFn test seams |
| src/startup.ts | 265 | logger('[Startup] Phase 2: heavy-init...'), the last log line that fires in production |
| src/startup.ts | 277 | ctx.phase = 'phase2', confirmed reached (server_health reports phase: "phase2") |
| src/startup.ts | 281–282 | Comment ASSERTING that “Dynamic import avoids the server.ts → startup.ts → repository.ts → server.ts circular-static-import deadlock” (factually wrong) |
| src/startup.ts | 296–298 | const loadSkillsFn = options.loadSkillsFn ?? (await import('./domains/skills/repository.js')).loadSkillsFromDisk; (the hang point) |
| src/startup.ts | 299 | loadSkillsFn(db, skillsRootPath, logger), never reached in production |
| src/startup.ts | 300–302 | logger('[Startup] Skills: loaded=…'), never fires in production |
| src/domains/skills/repository.ts | 36 | import { registerColibriTool } from '../../server.js'; closes the static cycle back to server.ts |

The cycle, rendered:

server.ts (top-level await on line 605)
  ├─ static: → repository.ts (for registerSkillTools, line 50)
  │     └─ static: → server.ts (for registerColibriTool, repository.ts:36) [CYCLE]
  └─ dynamic: → startup.ts (line 605)
        └─ inside startup() Phase 2: dynamic → repository.ts (startup.ts:298) [DEADLOCK]

When await import('./domains/skills/repository.js') fires at startup.ts:298, Node’s ESM loader must hand back a fully evaluated repository.ts. Because repository.ts sits in a static cycle with server.ts (via the registerColibriTool import at repository.ts:36), the loader treats the two as one unit, and the import() promise settles only once server.ts has finished evaluating. server.ts cannot finish, because its top-level await startup({...}) is suspended inside Phase 2, which is waiting on that same import(). The result is a mutual wait.
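The mutual wait can be checked mechanically on a toy model of the import graph. The sketch below is illustrative only and is not project code: the graph literal mirrors the static edges listed in this audit, the names ModuleGraph, reaches and dynamicImportDeadlocks are hypothetical, and startup.ts is assumed to have no static import path back to server.ts (which is consistent with Phase 1 completing).

```typescript
// Hypothetical model of the static import graph described in this audit.
type ModuleGraph = Record<string, string[]>;

const staticImports: ModuleGraph = {
  "server.ts": ["repository.ts"], // server.ts:50
  "repository.ts": ["server.ts"], // repository.ts:36
  "startup.ts": [],               // assumption: no static edge back to server.ts
};

// True if `to` is reachable from `from` by following static imports only.
function reaches(graph: ModuleGraph, from: string, to: string): boolean {
  const seen = new Set<string>([from]);
  const stack: string[] = [from];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    for (const next of graph[current] ?? []) {
      if (next === to) return true;
      if (!seen.has(next)) {
        seen.add(next);
        stack.push(next);
      }
    }
  }
  return false;
}

// A dynamic import issued while `suspendedRoot` is parked on a top-level await
// cannot settle if the target is that root or statically depends on it.
function dynamicImportDeadlocks(
  graph: ModuleGraph,
  target: string,
  suspendedRoot: string,
): boolean {
  return target === suspendedRoot || reaches(graph, target, suspendedRoot);
}
```

Under these assumptions, dynamicImportDeadlocks(staticImports, "repository.ts", "server.ts") is true while the same check for "startup.ts" is false, which matches the observed behavior: the import at server.ts:605 succeeds and the one at startup.ts:298 hangs.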

3. Evidence

3.1. Claude Desktop spawn log

%APPDATA%\Claude\logs\mcp-server-colibri.log shows three boot cycles (cwd-attempt, env-attempt, COLIBRI_SKILLS_ROOT-attempt). Each prints Phase 1 logs and [Startup] Phase 2: heavy-init..., then never prints [Startup] Skills: loaded=N, despite the server remaining responsive to all subsequent tools/list and per-tool calls. Sample:

2026-05-06T12:15:18.642Z [colibri] [info] Server started and connected successfully
[Startup] Phase 1: transport...
[colibri] starting in mode= FULL version= 0.0.1
[colibri] ready
[Startup] Phase 1 ready
[Startup] Phase 2: heavy-init...
2026-05-06T12:15:21.156Z [colibri] [info] Message from client: {"method":"initialize",...}
... tools/list responds with all 14 tools registered in Phase 1 ...
(no further server stderr; transport stays open; Phase 2 never completes)

3.2. Standalone reproduction

A 30-second smoke test from the repo root reproduces the same hang independent of Claude Desktop:

$ COLIBRI_DB_PATH=... COLIBRI_SKILLS_ROOT=... COLIBRI_MODE=FULL NODE_ENV=production \
    timeout 30 node dist/server.js < /dev/null
[Startup] Phase 1: transport...
[colibri] starting in mode= FULL version= 0.0.1
[colibri] ready
[Startup] Phase 1 ready
[Startup] Phase 2: heavy-init...
Warning: Detected unsettled top-level await at file:///E:/AMS/dist/server.js:430
    await startup({ bootstrapFn: bootstrap, stopFn: stop });
[exit code 13]

Node’s own diagnostic (“unsettled top-level await”) names server.ts:606 (compiled to dist/server.js:430), and exit code 13 is the code Node reserves for an unsettled top-level await. Both confirm the deadlock attribution.

3.3. Isolated function works

Calling loadSkillsFromDisk directly from a fresh node process (no server.ts in the dependency graph) loads all 23 skills in 230 ms:

[test] initDb done 181 ms
[test] loadSkillsFromDisk...
[skill-load] [colibri] skills loaded: 23, skipped: 0, pruned: 0
[test] result: {"loaded":23,"skipped":0,"pruned":0,"total_on_disk":23}
[test] elapsed: 230 ms

This confirms that the loader logic itself is correct; the bug is exclusively in the production wiring.

4. How tests missed this

Phase 0 ships 1409 passing tests. Two factors masked the deadlock:

  1. Jest never enters the script-invocation IIFE. server.ts:596 gates the await startup(...) block behind isInvokedAsScript(); under Jest, that returns false, and startup() is invoked directly from test code — not from inside a top-level await of the same module.
  2. The startup-suite tests inject options.loadSkillsFn as a mock to bypass disk I/O. Per startup.ts:111 (docstring): “Override loadSkillsFromDisk — used by tests to bypass disk I/O.” When options.loadSkillsFn is non-null, the ?? chain short-circuits before await import fires (startup.ts:296–298). The test seam was, ironically, the very mechanism that hid the deadlock for two months.
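The masking effect of the seam comes down to how the ?? operator evaluates its right-hand side. The following is a minimal sketch with hypothetical names (LoadSkillsFn, dynamicFallback, pickLoader, getFallbackCalls); the fallback here is a synchronous stand-in for the await import(...) branch, not the real loader.

```typescript
// Hypothetical, simplified loader type for illustration.
type LoadSkillsFn = (skillsRoot: string) => number;

let fallbackCalls = 0;

// Stand-in for the dynamic-import branch. In production this is the
// expression that never settles; here it only counts invocations.
function dynamicFallback(): LoadSkillsFn {
  fallbackCalls += 1;
  return () => 23;
}

function getFallbackCalls(): number {
  return fallbackCalls;
}

// Mirrors the shape of `options.loadSkillsFn ?? <dynamic import>`:
// the right-hand side runs only when the left is null or undefined.
function pickLoader(injected?: LoadSkillsFn | null): LoadSkillsFn {
  return injected ?? dynamicFallback();
}
```

Calling pickLoader with a mock never touches dynamicFallback, so a suite that always injects a mock never evaluates the branch that hangs in production.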

The C3 post-R83 review added COLIBRI_SKILLS_ROOT env var as an escape hatch for the cwd-relative default — but the review never executed the production script-invocation path, so the deadlock was not caught.

5. Constraints on the fix

  1. The cycle cannot be restructured within hotfix scope. Breaking the repository.ts:36 → server.ts import would touch every domain (tasks, trail, proof, skills).
  2. Must keep options.loadSkillsFn test seam. Removing it would force tests to do real disk I/O and pull in .agents/skills/ fixtures.
  3. Must work via static-only resolution at the script-invocation site. Any dynamic import of a module that statically depends on server.ts, issued while the top-level await is suspended, deadlocks.
  4. Regression test must exercise the production path, not the test seam. A test that injects loadSkillsFn would simply re-mask the bug.
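One way to satisfy constraint 4 is to run the built entry point as a child process and classify what it prints. The sketch below covers only the classification step and is an assumption about how such a test could be shaped: BootVerdict and classifyBoot are hypothetical names, and the matched strings are the ones quoted in the evidence section of this audit.

```typescript
// Hypothetical verdicts for a production-path smoke run.
type BootVerdict = "ok" | "deadlock" | "skills-not-loaded";

// Node exits with code 13 when a top-level await never settles.
const UNSETTLED_TLA_EXIT_CODE = 13;

// exitCode is null when the process was still running when it was sampled.
function classifyBoot(exitCode: number | null, stderr: string): BootVerdict {
  if (
    exitCode === UNSETTLED_TLA_EXIT_CODE ||
    stderr.includes("unsettled top-level await")
  ) {
    return "deadlock";
  }
  // Prefix of the Phase 2 completion line; the rest of the line is not assumed.
  if (!stderr.includes("[Startup] Skills: loaded=")) {
    return "skills-not-loaded";
  }
  return "ok";
}
```

A regression test could feed this function the exit status and stderr captured from a run of node dist/server.js with stdin closed (for example via spawnSync from node:child_process) and require the verdict "ok", without injecting loadSkillsFn anywhere.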

6. Hypothesis to validate in Contract step

The intended behavior, supported by the 230 ms isolated test, is that loadSkillsFromDisk is a self-contained synchronous function that is safe to invoke once the SQLite handle is open. The fix should make the script-invocation site resolve loadSkillsFromDisk statically (alongside the existing static import of registerSkillTools at server.ts:50) and forward it through options.loadSkillsFn at server.ts:606. This is exactly the seam the tests already use, repurposed for production. The dynamic-import branch at startup.ts:298 is then never taken on the script-invocation path, but it remains as a defense-in-depth fallback for callers that do not supply loadSkillsFn.
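The shape of the proposed change can be sketched in isolation. Everything below is a stand-in written for illustration, not the project's code: StartupOptions is reduced to two fields, loadSkillsFromDisk is a stub that returns a fixed count, the dynamic fallback is passed in as a parameter so that it can be replaced by a promise that never settles, and the log strings follow the prefixes quoted earlier in this audit.

```typescript
type Logger = (msg: string) => void;
type LoadSkillsFn = (db: unknown, skillsRoot: string, logger: Logger) => { loaded: number };

// Reduced, hypothetical version of the startup options.
interface StartupOptions {
  loadSkillsFn?: LoadSkillsFn;
  skillsRoot?: string;
}

// Stub standing in for the statically imported loader.
const loadSkillsFromDisk: LoadSkillsFn = (_db, skillsRoot, logger) => {
  logger(`[skill-load] scanning ${skillsRoot}`);
  return { loaded: 23 };
};

// Simplified Phase 2: prefer the forwarded loader, fall back to a dynamic one.
async function startup(
  options: StartupOptions,
  logger: Logger,
  dynamicFallback: () => Promise<LoadSkillsFn>,
): Promise<void> {
  logger("[Startup] Phase 2: heavy-init...");
  const loadSkillsFn = options.loadSkillsFn ?? (await dynamicFallback());
  const result = loadSkillsFn(null, options.skillsRoot ?? ".agents/skills", logger);
  logger(`[Startup] Skills: loaded=${result.loaded}`);
}

// Models the deadlocked import(): a promise that never settles.
const neverSettles = (): Promise<LoadSkillsFn> => new Promise<LoadSkillsFn>(() => {});

// Script-invocation site after the fix: the loader is forwarded explicitly,
// so the fallback is never consulted.
const lines: string[] = [];
void startup({ loadSkillsFn: loadSkillsFromDisk }, (m) => lines.push(m), neverSettles);
```

With the loader forwarded, the skills line is logged even though the fallback would hang; omitting loadSkillsFn in the same call leaves startup parked on the fallback, which is the current production behavior in miniature.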

Goes to Contract step (ANALYZE).



Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.
