P0.6.2 Skills-Loader ESM Deadlock — Packet
Task: 109b31c3-6419-4019-a14a-2fb1fb26f355
Audit: docs/audits/p0-6-2-skills-deadlock-audit.md
Contract: docs/contracts/p0-6-2-skills-deadlock-contract.md
ζ thought (decision): 189f598a…5adbf (chained off analysis 2e7e56f9…c47d)
1. Patch ledger
Three files modified, one file created.
| Action | Path | Lines | Purpose |
|---|---|---|---|
| modify | src/server.ts |
50 | Add loadSkillsFromDisk to the existing static import from ./domains/skills/repository.js |
| modify | src/server.ts |
600–606 | Forward loadSkillsFn: loadSkillsFromDisk through startup() options + add anchor comment citing this packet |
| modify | src/startup.ts |
281–291 | Replace incorrect “Dynamic import avoids the deadlock” comment with truth (deadlock is real; production must pre-resolve) |
| create | src/__tests__/scripts/script-invocation-skills.test.ts |
new | Subprocess-based regression test |
No edits to: src/domains/skills/repository.ts, src/config.ts, any other domain, env-var schema, MCP tool schemas, package.json, ESLint/TS config, migrations.
2. Exact text changes
2.1. src/server.ts:50
- import { registerSkillTools } from './domains/skills/repository.js';
+ import { loadSkillsFromDisk, registerSkillTools } from './domains/skills/repository.js';
Alphabetic order preserved (l < r).
2.2. src/server.ts:600–606
//
// We explicitly forward `bootstrap` + `stop` through startup's options
// bag so `startup()` does not need to re-import `./server.js` at
// runtime; re-entry during this IIFE's evaluation would deadlock Node's
// ES-module loader and trigger an "unsettled top-level await" diagnostic.
+ //
+ // P0.6.2 hotfix (post-R83): we ALSO forward `loadSkillsFromDisk` via
+ // `loadSkillsFn` for the same reason. The dynamic `await import` inside
+ // `startup()` Phase 2 (startup.ts:296-298) deadlocks against this very
+ // top-level await because `repository.js` participates in the
+ // `server.ts ↔ repository.ts` static cycle. Pre-resolving the loader
+ // here, while server.ts is still evaluating but `repository.js`'s
+ // exports are already populated (it was statically imported on line 50),
+ // avoids the loader re-entry. See:
+ // docs/audits/p0-6-2-skills-deadlock-audit.md
+ // docs/contracts/p0-6-2-skills-deadlock-contract.md
+ // DO NOT remove the `loadSkillsFn:` argument without reading those.
const { startup } = await import('./startup.js');
- await startup({ bootstrapFn: bootstrap, stopFn: stop });
+ await startup({
+ bootstrapFn: bootstrap,
+ stopFn: stop,
+ loadSkillsFn: loadSkillsFromDisk,
+ });
}
2.3. src/startup.ts:278–291 (comment correction)
// P0.6.2: load ε Skill Registry from disk into the `skills` table.
// Runs synchronously after initDb so the table is populated before any
// MCP tool call arrives. Parse errors are logged + skipped (not fatal).
- // Dynamic import avoids the server.ts → startup.ts → repository.ts →
- // server.ts circular-static-import deadlock (same pattern as loadServerModule).
+ // Resolution priority:
+ // 1. options.loadSkillsFn — production AND tests both inject here.
+ // Production path: server.ts script-invocation site forwards
+ // `loadSkillsFromDisk` (statically imported there) per the
+ // P0.6.2 hotfix; see docs/contracts/p0-6-2-skills-deadlock-contract.md.
+ // Test path: jest mocks bypass disk I/O.
+ // 2. config.COLIBRI_SKILLS_ROOT-rooted dynamic import — defense-in-
+ // depth fallback ONLY. This branch is unreachable from the
+ // Phase 0 script-invocation path; if reached, it deadlocks against
+ // the suspended top-level await on server.ts:606. Kept alive for
+ // future non-script-invocation callers (none exist today).
// Resolution chain (priority order):
// 1. options.skillsRoot — test seam, wins when explicitly passed.
// 2. config.COLIBRI_SKILLS_ROOT — operator override for non-cwd deploys.
(The comment above is for skillsRootPath, which is unrelated to the loadSkillsFn resolution. Both comment blocks stay; only the loadSkillsFn block is corrected.)
2.4. New file: src/__tests__/scripts/script-invocation-skills.test.ts
/**
* P0.6.2 hotfix regression test — guards against re-introduction of the
* server.ts ↔ repository.ts ESM deadlock that prevented production
* `loadSkillsFromDisk` from running. See:
* docs/audits/p0-6-2-skills-deadlock-audit.md
* docs/contracts/p0-6-2-skills-deadlock-contract.md
*
* Test strategy: spawn the compiled `dist/server.js` as a subprocess
* (production path), drive a minimal MCP handshake over stdio, call
* `skill_list`, and assert `total_count > 0`.
*
* MUST FAIL on the pre-fix code (skills loader deadlocks; total_count = 0).
* MUST PASS on the post-fix code (loader runs; total_count = on-disk count).
*
* Acceptance criterion A-7 from the contract: this test exercises the
* production injection path, NOT the `options.loadSkillsFn` test seam.
*/
import { spawn } from 'node:child_process';
import { existsSync } from 'node:fs';
import { resolve } from 'node:path';
const REPO_ROOT = resolve(__dirname, '..', '..', '..');
const DIST_SERVER = resolve(REPO_ROOT, 'dist', 'server.js');
const SKILLS_ROOT = resolve(REPO_ROOT, '.agents', 'skills');
const HANDSHAKE_TIMEOUT_MS = 8000;
function jsonRpc(id: number, method: string, params: object = {}): string {
return JSON.stringify({ jsonrpc: '2.0', id, method, params });
}
function notification(method: string, params: object = {}): string {
return JSON.stringify({ jsonrpc: '2.0', method, params });
}
interface SkillListResult {
ok: boolean;
data?: { skills: unknown[]; total_count: number };
}
describe('script-invocation skills load (P0.6.2 hotfix regression)', () => {
// Skip the test if the build artifact is absent — clearer failure than
// a confusing spawn error. Run `npm run build` first.
const distExists = existsSync(DIST_SERVER);
const skillsExist = existsSync(SKILLS_ROOT);
(distExists && skillsExist ? it : it.skip)(
'spawns dist/server.js, calls skill_list, sees total_count > 0',
async () => {
const child = spawn('node', [DIST_SERVER], {
env: {
...process.env,
COLIBRI_DB_PATH: ':memory:',
COLIBRI_SKILLS_ROOT: SKILLS_ROOT,
COLIBRI_MODE: 'FULL',
NODE_ENV: 'production',
},
stdio: ['pipe', 'pipe', 'pipe'],
});
let stdoutBuf = '';
const responses: Record<number, unknown> = {};
child.stdout.on('data', (chunk: Buffer) => {
stdoutBuf += chunk.toString('utf8');
// Parse complete JSON-RPC frames separated by newlines.
let nl: number;
while ((nl = stdoutBuf.indexOf('\n')) >= 0) {
const line = stdoutBuf.slice(0, nl).trim();
stdoutBuf = stdoutBuf.slice(nl + 1);
if (line.length === 0) continue;
try {
const msg = JSON.parse(line) as { id?: number };
if (typeof msg.id === 'number') {
responses[msg.id] = msg;
}
} catch {
/* not a complete JSON-RPC frame — keep accumulating */
}
}
});
// Send initialize → wait for response → send initialized notif → call skill_list.
child.stdin.write(
jsonRpc(1, 'initialize', {
protocolVersion: '2024-11-05',
capabilities: {},
clientInfo: { name: 'p0-6-2-regression', version: '1.0' },
}) + '\n',
);
const start = Date.now();
while (responses[1] === undefined) {
if (Date.now() - start > HANDSHAKE_TIMEOUT_MS) {
child.kill('SIGKILL');
throw new Error('initialize handshake timed out');
}
await new Promise((r) => setTimeout(r, 50));
}
child.stdin.write(notification('notifications/initialized') + '\n');
child.stdin.write(jsonRpc(2, 'tools/call', { name: 'skill_list', arguments: {} }) + '\n');
const tStart = Date.now();
while (responses[2] === undefined) {
if (Date.now() - tStart > HANDSHAKE_TIMEOUT_MS) {
child.kill('SIGKILL');
throw new Error('skill_list call timed out');
}
await new Promise((r) => setTimeout(r, 50));
}
child.stdin.end();
child.kill('SIGTERM');
const skillListResp = responses[2] as {
result?: { structuredContent?: SkillListResult };
};
const structured = skillListResp.result?.structuredContent;
expect(structured).toBeDefined();
expect(structured?.ok).toBe(true);
expect(structured?.data).toBeDefined();
// The exact count is corpus-dependent; what matters is that production
// path actually ran the loader. Pre-fix: 0. Post-fix: 23 today.
expect(structured!.data!.total_count).toBeGreaterThan(0);
expect(Array.isArray(structured!.data!.skills)).toBe(true);
expect(structured!.data!.skills.length).toBe(structured!.data!.total_count);
},
HANDSHAKE_TIMEOUT_MS * 2 + 5000,
);
});
2.5. Test directory creation
src/__tests__/scripts/ does not exist yet. The new test file’s path implicitly creates it. No package.json, tsconfig.json, or jest config changes are required — the existing Jest config picks up src/__tests__/**/*.test.ts.
3. Pre-flight checklist (before editing)
cd .worktrees/claude/p0-6-2-skills-deadlock-fix- Confirm worktree HEAD = origin/main
6345ba7a npm ci(worktree may not have node_modules)npm run buildbaseline — confirm clean tsc + 6 migrations copiednpm test -- --listTests | grep -c '\.test\.'— record current test-file count for delta confirmation
4. Execution order
- Edit
src/server.ts:50(import line) - Edit
src/server.ts:600–606(script-invocation block) - Edit
src/startup.ts:278–291(comment correction) - Create
src/__tests__/scripts/script-invocation-skills.test.ts npm run build— must be cleannpm run lint— must be cleannpm test— must be 1409 + 1 = 1410 passing (no regressions)- Standalone smoke from worktree root with patched
dist/:COLIBRI_DB_PATH=":memory:" COLIBRI_SKILLS_ROOT="$(pwd)/.agents/skills" \ COLIBRI_MODE=FULL NODE_ENV=production \ timeout 5 node dist/server.js < /dev/null 2>&1 | grep -E '\[Startup\] Skills:'MUST print
[Startup] Skills: loaded=23, skipped=0, pruned=0. - Commit on
feature/p0-6-2-skills-deadlock-fix.
5. Commit message (draft)
fix(skills): resolve loadSkillsFromDisk statically to avoid Phase 2 ESM deadlock
The startup.ts:298 dynamic `await import('./domains/skills/repository.js')`
deadlocks against the script-invocation top-level await on server.ts:606
because repository.ts is in a static cycle with server.ts (via line 36's
import of registerColibriTool). Production servers booted via Claude Desktop
(or any direct `node dist/server.js` invocation) hit Phase 2: heavy-init...
then hang silently — Phase 1 stays responsive (server_ping, server_health
work) but skill_count stays 0 forever.
Tests passed at 1409/1409 because Jest never enters the script-invocation
IIFE and the startup-suite injects options.loadSkillsFn as a mock — so
the dynamic-import branch never fired in either path that ran in CI.
Fix: pre-resolve loadSkillsFromDisk statically at the script-invocation
site (alongside the existing static import of registerSkillTools on
server.ts:50), and forward it through startup() options. The
options.loadSkillsFn ?? (await import...) chain in startup.ts:296-298
already supports this — production now uses the same seam tests do.
Adds src/__tests__/scripts/script-invocation-skills.test.ts: spawns the
compiled dist/server.js as a subprocess, drives the real MCP handshake,
calls skill_list, asserts total_count > 0. Fails on pre-fix code, passes
on post-fix code. Per contract A-7, this test exercises the production
injection path, not the test seam — exactly the gap that let the bug
ship.
Also corrects the misleading comment at startup.ts:281-282 that claimed
the dynamic import avoids the deadlock; it doesn't. The fallback branch
stays as defense-in-depth for hypothetical future non-script-invocation
callers.
task: 109b31c3-6419-4019-a14a-2fb1fb26f355
audit: docs/audits/p0-6-2-skills-deadlock-audit.md
contract: docs/contracts/p0-6-2-skills-deadlock-contract.md
packet: docs/packets/p0-6-2-skills-deadlock-packet.md
6. Deploy + final verify (after PR merge — recorded for completeness)
After this packet’s verification step lands and the PR is merged to main:
- Pull main,
npm run buildfrom main checkout. - Tray-quit Claude Desktop.
- Relaunch.
mcp__colibri__server_health→skill_count: 23.mcp__colibri__skill_list→total_count: 23, populatedskills[].
7. Rollback
If the patch breaks something we didn’t foresee:
git revert <commit-sha>
npm run build
# tray-quit + relaunch
startup.ts:296-298 retains the dynamic-import fallback, so reverting the server.ts change leaves a working (deadlocking) server — same as pre-fix state. No data loss; colibri.db schema is untouched.
8. Goes to Implement step (APPLY)
Section 2 above is the literal patch. Section 4 is the order. Step 4 implements; Step 5 verifies against contract A-1 through A-7.