R83.C / P1.2.1 — Lexer retry — Fix Packet

Continuation of the R81.B 5-step chain. This packet gates the R83.C fix commit. See the root-cause analysis in docs/audits/r83-c-lexer-retry-debug-audit.md.

§FP1. Plan at a glance

Exactly one surgical edit to src/domains/rules/lexer.ts plus two small test additions — total diff ≈ 40 lines. The fix replaces the Identifier TokenType’s regex pattern with a custom pattern function to bypass Chevrotain 11.0.3’s broken Unicode-u-flag optimiser. Every other token definition, every export, every doc-comment remains untouched.

§FP2. Target files

  • src/domains/rules/lexer.ts: replace the Identifier token's pattern and add a module-level sticky regex constant (~20 lines around §2).
  • src/__tests__/domains/rules/lexer.test.ts: add 2 regression tests, a long ASCII identifier and a CJK identifier boundary (~10 lines, Groups C and I).

No changes to package.json, package-lock.json, or any other file. The chevrotain dep is already pinned at 11.0.3.

§FP3. The Identifier token — before and after

§FP3.1. Current (broken)

// src/domains/rules/lexer.ts §2 — Identifier + Variable

const Identifier: TokenType = createToken({
  name: 'Identifier',
  pattern: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
});

Chevrotain’s regex optimiser does not honour the u flag: \p{XID_Start} degrades to a literal ASCII character class (with \p read as the identity escape for the letter p), which poisons the dispatch table and causes 21 tests to fail.

§FP3.2. Replacement (fix)

// src/domains/rules/lexer.ts §2 — Identifier + Variable

/**
 * Module-level sticky Unicode regex for Identifier matching.
 *
 * Chevrotain 11.0.3's internal regex-to-ast optimiser does NOT support the `u`
 * flag, which silently corrupts the dispatch table for any TokenType whose
 * PATTERN uses Unicode property escapes. See the debug audit
 * (`docs/audits/r83-c-lexer-retry-debug-audit.md`) for the full root-cause
 * analysis and the K2-test "AcceptCommitment" → "ptCo" smoking-gun diagnostic.
 *
 * The fix: express Identifier via a custom pattern function (Chevrotain's
 * documented escape hatch for Unicode patterns). The function delegates to
 * this sticky regex which re-anchors at `startOffset` on every call and is
 * fully Unicode-aware.
 */
const IDENTIFIER_REGEX = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

const Identifier: TokenType = createToken({
  name: 'Identifier',
  pattern: (text: string, startOffset: number): [string] | null => {
    IDENTIFIER_REGEX.lastIndex = startOffset;
    const match = IDENTIFIER_REGEX.exec(text);
    return match === null ? null : [match[0]];
  },
  // Identifier patterns never contain a line break, so Chevrotain can skip
  // the (optimiser-invoking) line-break probe and use the simple length delta
  // for position tracking. This is the idiomatic setting for custom matchers.
  line_breaks: false,
});

Three shape changes:

  1. Add a module-level constant IDENTIFIER_REGEX with the uy (Unicode + sticky) flags.
  2. Change Identifier’s pattern from regex literal to a function.
  3. Set line_breaks: false on the Identifier TokenType.

Everything else — the token’s name, its position in allTokens, its export via Literals, its use as longer_alt on the 18 keywords — is byte-identical.

§FP4. New regression tests

Two tests in total: a long-ASCII regression appended after C5 in Group C, and a CJK-boundary regression appended after I3 at the end of describe('I. Unicode identifiers', …):

// Group C — after C5
test('C6 — long ASCII identifier "AcceptCommitment" tokenizes as one Identifier (regression for K2)', () => {
  const r = tokenize('AcceptCommitment');
  expect(r.errors).toEqual([]);
  expect(r.tokens).toHaveLength(1);
  expect(r.tokens[0]!.tokenType).toBe(Identifier);
  expect(r.tokens[0]!.image).toBe('AcceptCommitment');
});

// Group I — after I3
test('I4 — 18-char CJK identifier tokenizes as one Identifier without breaks', () => {
  const r = tokenize('日本語日本語日本語日本語日本語日本語');
  expect(r.errors).toEqual([]);
  expect(r.tokens).toHaveLength(1);
  expect(r.tokens[0]!.tokenType).toBe(Identifier);
  expect(r.tokens[0]!.image.length).toBe(18);
});

These two tests pin the fix — they specifically exercise the 16-char ASCII and long-Unicode cases that the broken optimiser previously mangled. If a future Chevrotain upgrade or a lexer refactor regresses the fix, C6 and I4 catch it.
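The character counts behind the two test inputs can be checked directly — both strings are BMP-only, so String.prototype.length equals the visible character count:

```typescript
// Code-unit counts pinned by C6 and I4. No surrogate pairs are involved, so
// .length is the character count.
console.log('AcceptCommitment'.length); // 16 -- the C6 regression input
console.log('日本語'.repeat(6).length);   // 18 -- the I4 identifier (6 × 3 chars)
```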

§FP5. Post-fix test count

Current baseline on main @ 657d4ef4: 1123 tests passing (post-R81 Wave 1 at #174).

Current lexer suite on feature/r81-b-p1-2-1-lexer @ 1eec2b7c: 82 total, 21 fail, 61 pass.

After the fix: 82 → 84 lexer tests (all passing) thanks to the +2 regressions.

Expected full-suite count post-R83.C: 1123 + 84 = 1207 tests. (The R81.B feat commit itself brings 82 new lexer tests on top of the 1123 baseline; 2 more are added by R83.C.)

Note on R81 memory’s claim of “1123 baseline, lexer tests target 40+”: the R81.B feat commit already added 82 tests — above the 40+ target — of which 61 pass and 21 fail, in the pattern described above. The “1123” figure in the R81 memory refers to the pre-lexer-suite baseline.

§FP6. Order of operations for the fix commit

  1. Edit src/domains/rules/lexer.ts per §FP3.2 — swap Identifier pattern, add IDENTIFIER_REGEX const.
  2. Edit src/__tests__/domains/rules/lexer.test.ts per §FP4 — add C6 and I4.
  3. npm run build — confirm TypeScript compiles with the function-pattern signature.
  4. npm run lint — confirm no eslint complaints; the pattern-function signature uses explicit string and number parameters, no any.
  5. npm test -- --testPathPattern=rules/lexer — confirm 84/84 lexer tests pass.
  6. npm test — confirm the full 1207-test suite passes (pre-existing subprocess-smoke flake may need one re-run; unrelated).
  7. Single commit: fix(r83-c-lexer-retry): swap Identifier to custom pattern fn (Chevrotain 11.0.3 /u flag workaround).

Do not amend the R81.B feat commit. The fix is a fresh commit on top — cleaner history, preserves the 5-step chain, and keeps blame on the R83.C layer.

§FP7. Acceptance criteria (gating Step 5)

  • npm run build green.
  • npm run lint green — no any, no eslint-disable comments.
  • npm test -- --testPathPattern=rules/lexer → 84 passed, 0 failed.
  • Full npm test → 1207 passed. (Pre-existing subprocess-smoke flake permits one re-run.)
  • Identifier token’s pattern is a function that delegates to the IDENTIFIER_REGEX sticky constant.
  • line_breaks: false is set on Identifier.
  • No file other than src/domains/rules/lexer.ts and src/__tests__/domains/rules/lexer.test.ts is touched in the fix commit.
  • Public API of the lexer module (tokenize, Keywords, Operators, Delimiters, Literals, allTokens, FLOAT_REJECTED_MESSAGE, UNDERSCORE_INT_REJECTED_MESSAGE) is unchanged.

§FP8. Risk register (fix-specific, supersedes §P7 of the R81.B packet)

  • Function-pattern signature typing under TS strict: returning [string] | null is the declared alternate return type of Chevrotain’s CustomPatternMatcherFunc. Explicit parameter types; compile-checked.
  • Sticky regex lastIndex not reset: every call reassigns lastIndex = startOffset. The regex is a module-level constant, so there is no per-call allocation or GC churn.
  • Unicode warnings still printed by Chevrotain: not an issue — the only TokenType using the u flag was Identifier, which now exposes no regex PATTERN for Chevrotain to analyze. No more warning.
  • Performance regression from one function call per dispatch: a single native .exec() per candidate; empirically immeasurable on the 82+ test suite.
  • longer_alt: Identifier keyword ordering: already covered by C1–C5 (plus new C6). Keyword tokens still defer to the Identifier matcher when the text continues past the keyword boundary. No logic change.

§FP9. Out of scope (re-stated)

  • Adding a second ADR, upgrading Chevrotain, rewriting any other token, touching integer-math.ts (R81.A), spawning parser/AST/interpreter. All deferred to Wave 3 or beyond.

§FP10. Summary

One regex → function swap, with a module-level sticky /uy constant backing the Identifier matcher. 2 regression tests pin the fix. The 21 failing tests become 21 passing; total lexer suite reaches 84/84 green. Full-suite total reaches 1207.

Next step: fix commit (code edit + tests), then verification doc.

