P0.6.2 Skills-Loader ESM Deadlock — Contract

Task: 109b31c3-6419-4019-a14a-2fb1fb26f355
Audit: docs/audits/p0-6-2-skills-deadlock-audit.md
ζ thought (analysis): captured after audit; chained off 4b95bb8b…cec3117 (plan).

1. Scope

This contract governs the patch that closes the deadlock surfaced by the audit. In scope: changes to src/server.ts (script-invocation site) and one new regression test. Out of scope: restructuring the server.ts ↔ repository.ts static cycle; deletion of startup.ts’s dynamic-import fallback; refactoring of any other domain’s tool-registration shape; doc-loop or skill-pack changes.

2. Invariants the fix MUST hold

I-1. Production-test parity for the skill-load code path

The production startup path (script invocation of dist/server.js) MUST execute the same loadSkillsFromDisk function the test suite already exercises. No production-only branch may stand between the script-invocation site and the loader function.

Why: the deadlock survived two months because production took a different code path (dynamic import) than tests (option injection). Re-introducing any production-only branch reopens that gap.

I-2. No ESM dynamic imports during script-invocation top-level await

startup() is called from a top-level await at server.ts:606. While that await is suspended, import.meta’s module evaluation is still in progress. No await import(...) inside the call graph reachable from that await is permitted to target a module that participates in the server.ts ↔ <domain>/repository.ts static cycle.

The fix MAY leave dynamic imports of side-cycle modules (e.g. modules with no static back-edge to server.ts) — those resolve from cache without re-entering the suspended loader.

I-3. The options.loadSkillsFn seam stays first-class

startup.ts:111 (options.loadSkillsFn) MUST remain the canonical injection point — for tests AND for the production script-invocation site. Tests already use it; the fix repurposes it for production. Removing or renaming the seam breaks both.

I-4. The dynamic-import fallback at startup.ts:298 MAY remain as defense-in-depth

The fallback (?? (await import(...)).loadSkillsFromDisk) becomes unreachable from the script-invocation site once the fix lands. It MAY be left in place to support hypothetical future callers of startup() that don’t pre-resolve the loader — at the cost of one dead branch in coverage. The fix MUST NOT depend on the fallback for production correctness.

I-5. No new public API surface

The fix adds zero new exported names from src/server.ts, src/startup.ts, src/domains/skills/repository.ts. It does not add new env vars, new MCP tools, new config keys, new error codes. The options.loadSkillsFn shape (signature, error handling) is unchanged.

I-6. Idempotent on repeat boots

Running the fixed dist/server.js twice against the same data/colibri.db MUST produce the same skill_count both times (no duplicate inserts; loadSkillsFromDisk already handles upsert + prune semantics — the fix MUST NOT alter that).

3. Acceptance criteria

A correct patch satisfies ALL of:

  • A-1. tsc clean against the patched source.
  • A-2. ESLint clean (npm run lint).
  • A-3. Full Jest suite passes — npm test. No new failures, no regressions.
  • A-4. Standalone smoke test from repo root:

    COLIBRI_DB_PATH=... COLIBRI_MODE=FULL NODE_ENV=production \
      timeout 5 node dist/server.js < /dev/null 2>&1 | grep -E '\[Startup\] Skills:'
    

    prints [Startup] Skills: loaded=23, skipped=0, pruned=0 (or whatever count matches the on-disk corpus at the time) before timeout. Currently this line never appears.

  • A-5. Live MCP mcp__colibri__server_health returns skill_count > 0 after a Claude Desktop relaunch with the patched dist/.
  • A-6. Live MCP mcp__colibri__skill_list returns total_count > 0 and a non-empty skills[] array.
  • A-7. New regression test src/__tests__/scripts/script-invocation-skills-load.test.ts (name negotiable) FAILS on the pre-fix code and PASSES on the fixed code. The test MUST exercise the production injection path — i.e. it MUST verify that the script-invocation site forwards a real (not undefined) loadSkillsFn to startup. A test that injects loadSkillsFn itself via the seam does NOT satisfy this — that’s the same masking that let the bug ship.

4. Non-acceptance — these do NOT count

  • N-1. A test that simply asserts loadSkillsFromDisk works in isolation. Such a test would have passed before the fix; it doesn’t characterize the bug.
  • N-2. A patch that removes startup.ts:298’s dynamic import without forwarding via options. That would break the test-seam contract (I-3) — tests inject mocks via options.loadSkillsFn and expect the dynamic-import branch to never fire when the option is set, which is already the case; the issue is the production path, not the option-set path.
  • N-3. A patch that adds a new env var like COLIBRI_PRELOAD_SKILLS=1. Adds public API surface without need (violates I-5) and re-introduces a production-only branch (violates I-1).
  • N-4. A patch that deletes or rewrites src/domains/skills/repository.ts:36’s import of registerColibriTool. That breaks the static cycle by a different vector — out of scope and would require parallel changes to all 6 domain modules.

5. Risk register for the implementation

Risk Severity Mitigation
Patch regresses Jest test that asserts await import fires when options.loadSkillsFn === undefined Medium Inspect existing startup-suite tests; if such an assertion exists, update its mocking story to align with the fix.
Regression test asserts on log strings, which are flaky / timing-sensitive Low Test asserts on the script-invocation site’s static reference to loadSkillsFromDisk, not on logger output.
Future contributor reintroduces dynamic import without realizing the deadlock Medium Add an inline comment at server.ts:606 next to the loadSkillsFn: argument explaining why static resolution at this site is load-bearing. Cite this contract by path.
Live deploy: replacing dist/ while Claude Desktop holds it open on Windows Low dist/server.js is loaded by node at process start, then the file handle is released. Tray-quit + relaunch ensures clean swap. Contract verification step covers this.

6. Goes to Packet step

The packet defines: which exact lines change, in which order, plus the regression-test shape. Implementation begins after the packet is approved.


Back to top

Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.

This site uses Just the Docs, a documentation theme for Jekyll.