P0.6.2 Skills-Loader ESM Deadlock — Audit
Task: 109b31c3-6419-4019-a14a-2fb1fb26f355
Branch: feature/p0-6-2-skills-deadlock-fix
Worktree: .worktrees/claude/p0-6-2-skills-deadlock-fix
Base: origin/main @ 6345ba7a
ζ thought (plan): 4b95bb8b80989ed8fd634c29646812ef119641a1ba285b7bd337b83e4cec3117
1. Symptom
After first live wire-up of the Phase 0 MCP server to Claude Desktop, three system-axis tools return valid JSON:
| Tool call | Result |
|---|---|
| `server_ping` | `{ok:true, data:{version:"0.0.1", mode:"FULL", uptime_ms:482017}}` |
| `server_health` | `{status:"ok", phase:"phase2", db_tables:5, skill_count:0, mode:"FULL"}` |
| `skill_list` | `{ok:true, data:{skills:[], total_count:0}}` |
skill_count: 0 despite the 23 canonical colibri-* SKILL.md files at .agents/skills/. The disk corpus is intact; the in-memory registry is empty.
2. Surface inventory
Three source files participate in the deadlock cycle. All paths relative to repo root.
| File | Line | Role |
|---|---|---|
| `src/server.ts` | 50 | Static import of `registerSkillTools` from `./domains/skills/repository.js` |
| `src/server.ts` | 596–607 | `if (isInvokedAsScript())` block: script-invocation IIFE with top-level `await import('./startup.js')` then `await startup({...})` |
| `src/server.ts` | 601–604 | Comment WARNING that “re-entry during this IIFE’s evaluation would deadlock Node’s ES-module loader and trigger an ‘unsettled top-level await’ diagnostic” |
| `src/server.ts` | 606 | Forwards `bootstrapFn: bootstrap, stopFn: stop` to startup options, but not `loadSkillsFn` |
| `src/startup.ts` | 109–116 | Defines `options.skillsRoot` and `options.loadSkillsFn` test seams |
| `src/startup.ts` | 265 | `logger('[Startup] Phase 2: heavy-init...')`: last log line that fires in production |
| `src/startup.ts` | 277 | `ctx.phase = 'phase2'`: confirmed reached (`server_health` reports `phase: "phase2"`) |
| `src/startup.ts` | 281–282 | Comment ASSERTING that “Dynamic import avoids the server.ts → startup.ts → repository.ts → server.ts circular-static-import deadlock” (factually wrong) |
| `src/startup.ts` | 296–298 | `const loadSkillsFn = options.loadSkillsFn ?? (await import('./domains/skills/repository.js')).loadSkillsFromDisk;` (the hang point) |
| `src/startup.ts` | 299 | `loadSkillsFn(db, skillsRootPath, logger)`: never reached in production |
| `src/startup.ts` | 300–302 | `logger('[Startup] Skills: loaded=…')`: never fires in production |
| `src/domains/skills/repository.ts` | 36 | `import { registerColibriTool } from '../../server.js';` closes the static cycle back to server.ts |
The cycle, rendered:
server.ts (top-level await on line 605)
├─ static: → repository.ts (for registerSkillTools, line 50)
│   └─ static: → server.ts (for registerColibriTool, repository.ts:36) [CYCLE]
└─ dynamic: → startup.ts (line 605)
    └─ inside startup() Phase 2: dynamic → repository.ts (startup.ts:298) [DEADLOCK]
When await import('./domains/skills/repository.js') fires at startup.ts:298, the returned promise cannot resolve until repository.ts has finished evaluating. repository.ts is in a circular dependency with server.ts (via line 36’s import of registerColibriTool), so the ESM loader treats the two modules as one unit whose evaluation completes only when server.ts completes. server.ts cannot finish evaluating because its top-level await startup({...}) is currently suspended inside Phase 2, which is waiting for the very await import that needs server.ts to finish. The result is a mutual wait.
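The mutual wait can be sketched as a wait-for graph, where each entry lists what a step is blocked on. This is only an illustrative model, not Node's loader algorithm; the node labels and the `findCycle` helper are hypothetical names introduced for this sketch.

```typescript
// Illustrative model only: each key cannot settle until the listed nodes settle.
// The labels and findCycle are hypothetical; this is not Node's loader code.
type WaitGraph = Record<string, string[]>;

const waitsOn: WaitGraph = {
  "server.ts evaluation": ["await startup() (server.ts:606)"],
  "await startup() (server.ts:606)": ["await import(repository.js) (startup.ts:298)"],
  // repository.ts statically imports server.ts, so its import promise
  // cannot resolve until server.ts has finished evaluating.
  "await import(repository.js) (startup.ts:298)": ["server.ts evaluation"],
};

// Depth-first search that returns the first cycle found, or null if none exists.
function findCycle(graph: WaitGraph): string[] | null {
  const visiting = new Set<string>();
  const done = new Set<string>();
  const path: string[] = [];

  function visit(node: string): string[] | null {
    if (done.has(node)) return null;
    if (visiting.has(node)) {
      // Close the loop by repeating the node where the cycle started.
      return [...path.slice(path.indexOf(node)), node];
    }
    visiting.add(node);
    path.push(node);
    for (const next of graph[node] ?? []) {
      const found = visit(next);
      if (found) return found;
    }
    path.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  }

  for (const node of Object.keys(graph)) {
    const found = visit(node);
    if (found) return found;
  }
  return null;
}

const deadlockCycle = findCycle(waitsOn);
console.log(deadlockCycle ? deadlockCycle.join("\n  -> ") : "no cycle");
```

Removing any one edge breaks the cycle in this model. The fix direction discussed in this audit corresponds to removing the edge from `await startup()` to the dynamic import.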
3. Evidence
3.1. Claude Desktop spawn log
%APPDATA%\Claude\logs\mcp-server-colibri.log shows three boot cycles (cwd-attempt, env-attempt, COLIBRI_SKILLS_ROOT-attempt). Each prints Phase 1 logs and [Startup] Phase 2: heavy-init..., then never prints [Startup] Skills: loaded=N, despite the server remaining responsive to all subsequent tools/list and per-tool calls. Sample:
2026-05-06T12:15:18.642Z [colibri] [info] Server started and connected successfully
[Startup] Phase 1: transport...
[colibri] starting in mode= FULL version= 0.0.1
[colibri] ready
[Startup] Phase 1 ready
[Startup] Phase 2: heavy-init...
2026-05-06T12:15:21.156Z [colibri] [info] Message from client: {"method":"initialize",...}
... tools/list responds with all 14 tools registered in Phase 1 ...
(no further server stderr; transport stays open; Phase 2 never completes)
3.2. Standalone reproduction
A 30-second smoke test from the repo root reproduces the same hang independent of Claude Desktop:
$ COLIBRI_DB_PATH=... COLIBRI_SKILLS_ROOT=... COLIBRI_MODE=FULL NODE_ENV=production \
timeout 30 node dist/server.js < /dev/null
[Startup] Phase 1: transport...
[colibri] starting in mode= FULL version= 0.0.1
[colibri] ready
[Startup] Phase 1 ready
[Startup] Phase 2: heavy-init...
Warning: Detected unsettled top-level await at file:///E:/AMS/dist/server.js:430
await startup({ bootstrapFn: bootstrap, stopFn: stop });
[exit code 13]
Node’s own diagnostic (“unsettled top-level await”) names the exact line server.ts:606 (compiled dist/server.js:430) — confirming the deadlock attribution.
3.3. Isolated function works
Calling loadSkillsFromDisk directly from a fresh node process (no server.ts in the dependency graph) loads all 23 skills in 230 ms:
[test] initDb done 181 ms
[test] loadSkillsFromDisk...
[skill-load] [colibri] skills loaded: 23, skipped: 0, pruned: 0
[test] result: {"loaded":23,"skipped":0,"pruned":0,"total_on_disk":23}
[test] elapsed: 230 ms
Confirms the loader logic itself is correct; the bug is exclusively in the production wiring.
4. How tests missed this
Phase 0 ships 1409 passing tests. Two factors masked the deadlock:
- Jest never enters the script-invocation IIFE. server.ts:596 gates the `await startup(...)` block behind `isInvokedAsScript()`; under Jest, that returns false, and `startup()` is invoked directly from test code, not from inside a top-level await of the same module.
- The startup-suite tests inject `options.loadSkillsFn` as a mock to bypass disk I/O. Per startup.ts:111 (docstring): “Override `loadSkillsFromDisk` — used by tests to bypass disk I/O.” When `options.loadSkillsFn` is set (neither null nor undefined), the `??` chain short-circuits before `await import` fires (startup.ts:296–298). The test seam was, ironically, the very mechanism that hid the deadlock for two months.
The C3 post-R83 review added COLIBRI_SKILLS_ROOT env var as an escape hatch for the cwd-relative default — but the review never executed the production script-invocation path, so the deadlock was not caught.
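The masking effect of the `??` seam can be shown with a minimal model. Only the shape of the expression mirrors startup.ts:296–298. The types, `importRepository`, and `resolveLoader` are hypothetical stand-ins introduced for this sketch, and the stand-in import resolves immediately rather than hanging.

```typescript
// Minimal model of the seam at startup.ts:296–298. All types and stand-ins
// here are hypothetical; only the `??` shape mirrors the audited line.
type LoadSkillsFn = (db: unknown, skillsRoot: string, log: (msg: string) => void) => void;

interface RepositoryModule {
  loadSkillsFromDisk: LoadSkillsFn;
}

interface StartupOptions {
  loadSkillsFn?: LoadSkillsFn;
}

let dynamicImportCalls = 0;

// Stand-in for `import('./domains/skills/repository.js')`. In production this
// is the call that never settles; here it only counts how often it is reached.
function importRepository(): Promise<RepositoryModule> {
  dynamicImportCalls += 1;
  return Promise.resolve({ loadSkillsFromDisk: () => {} });
}

async function resolveLoader(options: StartupOptions): Promise<LoadSkillsFn> {
  // Same shape as the audited line: the right-hand side, including the
  // dynamic import, is evaluated only when loadSkillsFn is null or undefined.
  return options.loadSkillsFn ?? (await importRepository()).loadSkillsFromDisk;
}

// Test-style call: a mock is injected, so the import is never attempted.
const mockLoader: LoadSkillsFn = () => {};
void resolveLoader({ loadSkillsFn: mockLoader });
const callsWithMock = dynamicImportCalls; // 0

// Production-style call: nothing injected, so the import is attempted.
void resolveLoader({});
const callsWithoutMock = dynamicImportCalls; // 1

console.log({ callsWithMock, callsWithoutMock });
```

Any suite that always supplies `loadSkillsFn` therefore never exercises the right-hand side, which is the only place the hang can occur.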
5. Constraints on the fix
- Cannot restructure the cycle within hotfix scope. Breaking `repository.ts:36 → server.ts` would touch every domain (tasks, trail, proof, skills), which is out of scope for a hotfix.
- Must keep the `options.loadSkillsFn` test seam. Removing it would force tests to do real disk I/O and pull in `.agents/skills/` fixtures.
- Must work via static-only resolution at the script-invocation site. Any path that re-enters the ESM loader for a module in the server.ts cycle during the suspended top-level await deadlocks.
- Regression test must exercise the production path, not the test seam. A test that injects `loadSkillsFn` would simply re-mask the bug.
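One way a production-path regression test could judge the outcome is by classifying the stderr of a spawned server process. The sketch below covers only that classification step. `classifyPhase2` and the `Phase2Outcome` type are hypothetical names, and the two marker strings are taken from the log lines quoted in this audit.

```typescript
// Hypothetical helper for a regression test that runs the built server the
// same way production does and inspects its stderr. The two marker strings
// come from logs quoted in this audit; everything else is an assumption.
type Phase2Outcome = "completed" | "deadlocked" | "unknown";

const SKILLS_LOADED_MARKER = "[Startup] Skills: loaded=";
const UNSETTLED_TLA_MARKER = "Detected unsettled top-level await";

function classifyPhase2(stderr: string): Phase2Outcome {
  // Node's own diagnostic is the strongest signal, so check it first.
  if (stderr.includes(UNSETTLED_TLA_MARKER)) return "deadlocked";
  if (stderr.includes(SKILLS_LOADED_MARKER)) return "completed";
  return "unknown";
}

// stderr captured in section 3.2 (abridged).
const observedStderr = [
  "[Startup] Phase 1 ready",
  "[Startup] Phase 2: heavy-init...",
  "Warning: Detected unsettled top-level await at file:///E:/AMS/dist/server.js:430",
].join("\n");

console.log(classifyPhase2(observedStderr)); // prints: deadlocked
```

In a real test the input would presumably come from spawning `node dist/server.js` with no `loadSkillsFn` injected, as in the standalone reproduction, and the test would require the `"completed"` outcome.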
6. Hypothesis to validate in Contract step
The intended behavior, demonstrated by the 230 ms isolated test, is that loadSkillsFromDisk is a synchronous function that is safe to invoke once the SQLite handle is open. The fix should make the script-invocation site resolve loadSkillsFromDisk statically (alongside the existing static import of registerSkillTools at server.ts:50) and forward it through options.loadSkillsFn at server.ts:606, which is exactly the seam tests already use, repurposed for production. The dynamic-import branch at startup.ts:298 then becomes dead code in script-invocation paths but remains as a defense-in-depth fallback for callers that do not supply loadSkillsFn.
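A rough sketch of the call-site change this hypothesis implies follows. The local declarations are hypothetical stand-ins for the real exports so the snippet runs on its own. Only the option names (`bootstrapFn`, `stopFn`, `loadSkillsFn`) and the loader's argument order come from the audit, and the final shape is left to the Contract step.

```typescript
// Sketch of the proposed call-site change. Types and stand-in values are
// hypothetical; only the option names are taken from the audited code.
type Logger = (msg: string) => void;
type SkillsLoader = (db: unknown, skillsRoot: string, log: Logger) => void;

interface ScriptStartupOptions {
  bootstrapFn: () => void;
  stopFn: () => void;
  loadSkillsFn?: SkillsLoader;
}

// In server.ts this would be a static import next to registerSkillTools, e.g.
//   import { registerSkillTools, loadSkillsFromDisk } from './domains/skills/repository.js';
const loadSkillsFromDisk: SkillsLoader = (_db, skillsRoot, log) => {
  log(`stand-in loader called for ${skillsRoot}`);
};
const bootstrap = (): void => {};
const stop = (): void => {};

// Current call site (server.ts:606): the loader is not forwarded, so
// startup() falls through to the dynamic import.
const currentOptions: ScriptStartupOptions = { bootstrapFn: bootstrap, stopFn: stop };

// Proposed call site: forward the statically imported loader through the seam.
const proposedOptions: ScriptStartupOptions = {
  bootstrapFn: bootstrap,
  stopFn: stop,
  loadSkillsFn: loadSkillsFromDisk,
};

console.log(currentOptions.loadSkillsFn === undefined);           // true
console.log(proposedOptions.loadSkillsFn === loadSkillsFromDisk); // true
```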
Goes to Contract step (ANALYZE).