P0.6.2 Skills-Loader ESM Deadlock — Packet

Task: 109b31c3-6419-4019-a14a-2fb1fb26f355
Audit: docs/audits/p0-6-2-skills-deadlock-audit.md
Contract: docs/contracts/p0-6-2-skills-deadlock-contract.md
ζ thought (decision): 189f598a…5adbf (chained off analysis 2e7e56f9…c47d)

1. Patch ledger

Three files modified, one file created.

Action Path Lines Purpose
modify src/server.ts 50 Add loadSkillsFromDisk to the existing static import from ./domains/skills/repository.js
modify src/server.ts 600–606 Forward loadSkillsFn: loadSkillsFromDisk through startup() options + add anchor comment citing this packet
modify src/startup.ts 281–291 Replace incorrect “Dynamic import avoids the deadlock” comment with truth (deadlock is real; production must pre-resolve)
create src/__tests__/scripts/script-invocation-skills.test.ts new Subprocess-based regression test

No edits to: src/domains/skills/repository.ts, src/config.ts, any other domain, env-var schema, MCP tool schemas, package.json, ESLint/TS config, migrations.

2. Exact text changes

2.1. src/server.ts:50

- import { registerSkillTools } from './domains/skills/repository.js';
+ import { loadSkillsFromDisk, registerSkillTools } from './domains/skills/repository.js';

Alphabetic order preserved (l < r).

2.2. src/server.ts:600–606

   //
   // We explicitly forward `bootstrap` + `stop` through startup's options
   // bag so `startup()` does not need to re-import `./server.js` at
   // runtime; re-entry during this IIFE's evaluation would deadlock Node's
   // ES-module loader and trigger an "unsettled top-level await" diagnostic.
+  //
+  // P0.6.2 hotfix (post-R83): we ALSO forward `loadSkillsFromDisk` via
+  // `loadSkillsFn` for the same reason. The dynamic `await import` inside
+  // `startup()` Phase 2 (startup.ts:296-298) deadlocks against this very
+  // top-level await because `repository.js` participates in the
+  // `server.ts ↔ repository.ts` static cycle. Pre-resolving the loader
+  // here, while server.ts is still evaluating but `repository.js`'s
+  // exports are already populated (it was statically imported on line 50),
+  // avoids the loader re-entry. See:
+  //   docs/audits/p0-6-2-skills-deadlock-audit.md
+  //   docs/contracts/p0-6-2-skills-deadlock-contract.md
+  // DO NOT remove the `loadSkillsFn:` argument without reading those.
   const { startup } = await import('./startup.js');
-  await startup({ bootstrapFn: bootstrap, stopFn: stop });
+  await startup({
+    bootstrapFn: bootstrap,
+    stopFn: stop,
+    loadSkillsFn: loadSkillsFromDisk,
+  });
 }

2.3. src/startup.ts:278–291 (comment correction)

     // P0.6.2: load ε Skill Registry from disk into the `skills` table.
     // Runs synchronously after initDb so the table is populated before any
     // MCP tool call arrives. Parse errors are logged + skipped (not fatal).
-    // Dynamic import avoids the server.ts → startup.ts → repository.ts →
-    // server.ts circular-static-import deadlock (same pattern as loadServerModule).
+    // Resolution priority:
+    //   1. options.loadSkillsFn — production AND tests both inject here.
+    //      Production path: server.ts script-invocation site forwards
+    //      `loadSkillsFromDisk` (statically imported there) per the
+    //      P0.6.2 hotfix; see docs/contracts/p0-6-2-skills-deadlock-contract.md.
+    //      Test path: jest mocks bypass disk I/O.
+    //   2. config.COLIBRI_SKILLS_ROOT-rooted dynamic import — defense-in-
+    //      depth fallback ONLY. This branch is unreachable from the
+    //      Phase 0 script-invocation path; if reached, it deadlocks against
+    //      the suspended top-level await on server.ts:606. Kept alive for
+    //      future non-script-invocation callers (none exist today).
     // Resolution chain (priority order):
     //   1. options.skillsRoot — test seam, wins when explicitly passed.
     //   2. config.COLIBRI_SKILLS_ROOT — operator override for non-cwd deploys.

(The comment above is for skillsRootPath, which is unrelated to the loadSkillsFn resolution. Both comment blocks stay; only the loadSkillsFn block is corrected.)

2.4. New file: src/__tests__/scripts/script-invocation-skills.test.ts

/**
 * P0.6.2 hotfix regression test — guards against re-introduction of the
 * server.ts ↔ repository.ts ESM deadlock that prevented production
 * `loadSkillsFromDisk` from running. See:
 *   docs/audits/p0-6-2-skills-deadlock-audit.md
 *   docs/contracts/p0-6-2-skills-deadlock-contract.md
 *
 * Test strategy: spawn the compiled `dist/server.js` as a subprocess
 * (production path), drive a minimal MCP handshake over stdio, call
 * `skill_list`, and assert `total_count > 0`.
 *
 * MUST FAIL on the pre-fix code (skills loader deadlocks; total_count = 0).
 * MUST PASS on the post-fix code (loader runs; total_count = on-disk count).
 *
 * Acceptance criterion A-7 from the contract: this test exercises the
 * production injection path, NOT the `options.loadSkillsFn` test seam.
 */

import { spawn } from 'node:child_process';
import { existsSync } from 'node:fs';
import { resolve } from 'node:path';

const REPO_ROOT = resolve(__dirname, '..', '..', '..');
const DIST_SERVER = resolve(REPO_ROOT, 'dist', 'server.js');
const SKILLS_ROOT = resolve(REPO_ROOT, '.agents', 'skills');

const HANDSHAKE_TIMEOUT_MS = 8000;

function jsonRpc(id: number, method: string, params: object = {}): string {
  return JSON.stringify({ jsonrpc: '2.0', id, method, params });
}

function notification(method: string, params: object = {}): string {
  return JSON.stringify({ jsonrpc: '2.0', method, params });
}

interface SkillListResult {
  ok: boolean;
  data?: { skills: unknown[]; total_count: number };
}

describe('script-invocation skills load (P0.6.2 hotfix regression)', () => {
  // Skip the test if the build artifact is absent — clearer failure than
  // a confusing spawn error. Run `npm run build` first.
  const distExists = existsSync(DIST_SERVER);
  const skillsExist = existsSync(SKILLS_ROOT);

  (distExists && skillsExist ? it : it.skip)(
    'spawns dist/server.js, calls skill_list, sees total_count > 0',
    async () => {
      const child = spawn('node', [DIST_SERVER], {
        env: {
          ...process.env,
          COLIBRI_DB_PATH: ':memory:',
          COLIBRI_SKILLS_ROOT: SKILLS_ROOT,
          COLIBRI_MODE: 'FULL',
          NODE_ENV: 'production',
        },
        stdio: ['pipe', 'pipe', 'pipe'],
      });

      let stdoutBuf = '';
      const responses: Record<number, unknown> = {};

      child.stdout.on('data', (chunk: Buffer) => {
        stdoutBuf += chunk.toString('utf8');
        // Parse complete JSON-RPC frames separated by newlines.
        let nl: number;
        while ((nl = stdoutBuf.indexOf('\n')) >= 0) {
          const line = stdoutBuf.slice(0, nl).trim();
          stdoutBuf = stdoutBuf.slice(nl + 1);
          if (line.length === 0) continue;
          try {
            const msg = JSON.parse(line) as { id?: number };
            if (typeof msg.id === 'number') {
              responses[msg.id] = msg;
            }
          } catch {
            /* not a complete JSON-RPC frame — keep accumulating */
          }
        }
      });

      // Send initialize → wait for response → send initialized notif → call skill_list.
      child.stdin.write(
        jsonRpc(1, 'initialize', {
          protocolVersion: '2024-11-05',
          capabilities: {},
          clientInfo: { name: 'p0-6-2-regression', version: '1.0' },
        }) + '\n',
      );

      const start = Date.now();
      while (responses[1] === undefined) {
        if (Date.now() - start > HANDSHAKE_TIMEOUT_MS) {
          child.kill('SIGKILL');
          throw new Error('initialize handshake timed out');
        }
        await new Promise((r) => setTimeout(r, 50));
      }

      child.stdin.write(notification('notifications/initialized') + '\n');
      child.stdin.write(jsonRpc(2, 'tools/call', { name: 'skill_list', arguments: {} }) + '\n');

      const tStart = Date.now();
      while (responses[2] === undefined) {
        if (Date.now() - tStart > HANDSHAKE_TIMEOUT_MS) {
          child.kill('SIGKILL');
          throw new Error('skill_list call timed out');
        }
        await new Promise((r) => setTimeout(r, 50));
      }

      child.stdin.end();
      child.kill('SIGTERM');

      const skillListResp = responses[2] as {
        result?: { structuredContent?: SkillListResult };
      };
      const structured = skillListResp.result?.structuredContent;
      expect(structured).toBeDefined();
      expect(structured?.ok).toBe(true);
      expect(structured?.data).toBeDefined();
      // The exact count is corpus-dependent; what matters is that production
      // path actually ran the loader. Pre-fix: 0. Post-fix: 23 today.
      expect(structured!.data!.total_count).toBeGreaterThan(0);
      expect(Array.isArray(structured!.data!.skills)).toBe(true);
      expect(structured!.data!.skills.length).toBe(structured!.data!.total_count);
    },
    HANDSHAKE_TIMEOUT_MS * 2 + 5000,
  );
});

2.5. Test directory creation

src/__tests__/scripts/ does not exist yet. The new test file’s path implicitly creates it. No package.json, tsconfig.json, or jest config changes are required — the existing Jest config picks up src/__tests__/**/*.test.ts.

3. Pre-flight checklist (before editing)

  • cd .worktrees/claude/p0-6-2-skills-deadlock-fix
  • Confirm worktree HEAD = origin/main 6345ba7a
  • npm ci (worktree may not have node_modules)
  • npm run build baseline — confirm clean tsc + 6 migrations copied
  • npm test -- --listTests | grep -c '\.test\.' — record current test-file count for delta confirmation

4. Execution order

  1. Edit src/server.ts:50 (import line)
  2. Edit src/server.ts:600–606 (script-invocation block)
  3. Edit src/startup.ts:278–291 (comment correction)
  4. Create src/__tests__/scripts/script-invocation-skills.test.ts
  5. npm run build — must be clean
  6. npm run lint — must be clean
  7. npm test — must be 1409 + 1 = 1410 passing (no regressions)
  8. Standalone smoke from worktree root with patched dist/:
    COLIBRI_DB_PATH=":memory:" COLIBRI_SKILLS_ROOT="$(pwd)/.agents/skills" \
      COLIBRI_MODE=FULL NODE_ENV=production \
      timeout 5 node dist/server.js < /dev/null 2>&1 | grep -E '\[Startup\] Skills:'
    

    MUST print [Startup] Skills: loaded=23, skipped=0, pruned=0.

  9. Commit on feature/p0-6-2-skills-deadlock-fix.

5. Commit message (draft)

fix(skills): resolve loadSkillsFromDisk statically to avoid Phase 2 ESM deadlock

The startup.ts:298 dynamic `await import('./domains/skills/repository.js')`
deadlocks against the script-invocation top-level await on server.ts:606
because repository.ts is in a static cycle with server.ts (via line 36's
import of registerColibriTool). Production servers booted via Claude Desktop
(or any direct `node dist/server.js` invocation) hit Phase 2: heavy-init...
then hang silently — Phase 1 stays responsive (server_ping, server_health
work) but skill_count stays 0 forever.

Tests passed at 1409/1409 because Jest never enters the script-invocation
IIFE and the startup-suite injects options.loadSkillsFn as a mock — so
the dynamic-import branch never fired in either path that ran in CI.

Fix: pre-resolve loadSkillsFromDisk statically at the script-invocation
site (alongside the existing static import of registerSkillTools on
server.ts:50), and forward it through startup() options. The
options.loadSkillsFn ?? (await import...) chain in startup.ts:296-298
already supports this — production now uses the same seam tests do.

Adds src/__tests__/scripts/script-invocation-skills.test.ts: spawns the
compiled dist/server.js as a subprocess, drives the real MCP handshake,
calls skill_list, asserts total_count > 0. Fails on pre-fix code, passes
on post-fix code. Per contract A-7, this test exercises the production
injection path, not the test seam — exactly the gap that let the bug
ship.

Also corrects the misleading comment at startup.ts:281-282 that claimed
the dynamic import avoids the deadlock; it doesn't. The fallback branch
stays as defense-in-depth for hypothetical future non-script-invocation
callers.

task: 109b31c3-6419-4019-a14a-2fb1fb26f355
audit: docs/audits/p0-6-2-skills-deadlock-audit.md
contract: docs/contracts/p0-6-2-skills-deadlock-contract.md
packet: docs/packets/p0-6-2-skills-deadlock-packet.md

6. Deploy + final verify (after PR merge — recorded for completeness)

After this packet’s verification step lands and the PR is merged to main:

  1. Pull main, npm run build from main checkout.
  2. Tray-quit Claude Desktop.
  3. Relaunch.
  4. mcp__colibri__server_healthskill_count: 23.
  5. mcp__colibri__skill_listtotal_count: 23, populated skills[].

7. Rollback

If the patch breaks something we didn’t foresee:

git revert <commit-sha>
npm run build
# tray-quit + relaunch

startup.ts:296-298 retains the dynamic-import fallback, so reverting the server.ts change leaves a working (deadlocking) server — same as pre-fix state. No data loss; colibri.db schema is untouched.

8. Goes to Implement step (APPLY)

Section 2 above is the literal patch. Section 4 is the order. Step 4 implements; Step 5 verifies against contract A-1 through A-7.


Back to top

Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.

This site uses Just the Docs, a documentation theme for Jekyll.