R83.C / P1.2.1 — Lexer retry — Fix Packet
Continuation of the R81.B 5-step chain. This packet gates the R83.C fix commit. See the root-cause analysis in
docs/audits/r83-c-lexer-retry-debug-audit.md.
§FP1. Plan at a glance
Exactly one surgical edit to src/domains/rules/lexer.ts plus two small test additions — total diff ≈ 40 lines. The fix replaces the Identifier TokenType’s regex pattern with a custom pattern function to bypass Chevrotain 11.0.3’s broken Unicode-u-flag optimiser. Every other token definition, every export, every doc-comment remains untouched.
§FP2. Target files
| File | Edit | Lines touched |
|---|---|---|
| `src/domains/rules/lexer.ts` | Replace Identifier token’s pattern + add module-level sticky regex constant. | ~20 lines around §2 |
| `src/__tests__/domains/rules/lexer.test.ts` | Add 2 regression tests (long ASCII ident, CJK ident boundary). | ~10 lines at end of Group I |
No changes to package.json, package-lock.json, or any other file. The chevrotain dep is already pinned at 11.0.3.
§FP3. The Identifier token — before and after
§FP3.1. Current (broken)
```ts
// src/domains/rules/lexer.ts §2 — Identifier + Variable
const Identifier: TokenType = createToken({
  name: 'Identifier',
  pattern: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
});
```
Chevrotain’s regex optimiser mis-parses \p{XID_Start} as the bracketed ASCII literal set [XIDStart_], poisons the dispatch table, and causes 21 tests to fail.
§FP3.2. Replacement (fix)
```ts
// src/domains/rules/lexer.ts §2 — Identifier + Variable

/**
 * Module-level sticky Unicode regex for Identifier matching.
 *
 * Chevrotain 11.0.3's internal regex-to-ast optimiser does NOT support the `u`
 * flag, which silently corrupts the dispatch table for any TokenType whose
 * PATTERN uses Unicode property escapes. See the debug audit
 * (`docs/audits/r83-c-lexer-retry-debug-audit.md`) for the full root-cause
 * analysis and the K2-test "AcceptCommitment" → "ptCo" smoking-gun diagnostic.
 *
 * The fix: express Identifier via a custom pattern function (Chevrotain's
 * documented escape hatch for Unicode patterns). The function delegates to
 * this sticky regex, which re-anchors at `startOffset` on every call and is
 * fully Unicode-aware.
 */
const IDENTIFIER_REGEX = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

const Identifier: TokenType = createToken({
  name: 'Identifier',
  pattern: (text: string, startOffset: number): [string] | null => {
    IDENTIFIER_REGEX.lastIndex = startOffset;
    const match = IDENTIFIER_REGEX.exec(text);
    return match === null ? null : [match[0]];
  },
  // Identifier matches never contain a line break, so Chevrotain can skip
  // the (optimiser-invoking) line-break probe and use the simple length delta
  // for position tracking. This is the idiomatic setting for custom matchers.
  line_breaks: false,
});
```
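The sticky re-anchoring behaviour the doc-comment describes can be sanity-checked in isolation — a minimal standalone sketch using the same regex:

```ts
const IDENTIFIER_REGEX = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

// Mirrors the custom pattern function: re-anchor at the lexer's offset, then
// attempt a match exactly there (the `y` flag forbids scanning ahead).
function matchIdentifier(text: string, startOffset: number): string | null {
  IDENTIFIER_REGEX.lastIndex = startOffset;
  const match = IDENTIFIER_REGEX.exec(text);
  return match === null ? null : match[0];
}

console.log(matchIdentifier('  foo日本語 bar', 2)); // 'foo日本語'
console.log(matchIdentifier('  foo日本語 bar', 0)); // null — sticky: no scan-ahead past the space
```

Because the constant lives at module level, the only per-call cost is one `lastIndex` assignment and one native `.exec()`.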
Three shape changes:

- Add a module-level constant `IDENTIFIER_REGEX` with the `uy` (Unicode + sticky) flags.
- Change `Identifier`’s `pattern` from a regex literal to a function.
- Set `line_breaks: false` on the `Identifier` TokenType.
Everything else — the token’s name, its position in `allTokens`, its export via `Literals`, its use as `longer_alt` on the 18 keywords — is byte-identical.
§FP4. New regression tests
Two new tests: one appended at the end of `describe('I. Unicode identifiers', …)`, plus a long-ASCII test in Group C:
```ts
// Group C — after C5
test('C6 — long ASCII identifier "AcceptCommitment" tokenizes as one Identifier (regression for K2)', () => {
  const r = tokenize('AcceptCommitment');
  expect(r.errors).toEqual([]);
  expect(r.tokens).toHaveLength(1);
  expect(r.tokens[0]!.tokenType).toBe(Identifier);
  expect(r.tokens[0]!.image).toBe('AcceptCommitment');
});
```
```ts
// Group I — after I3
test('I4 — 18-char CJK identifier tokenizes as one Identifier without breaks', () => {
  const r = tokenize('日本語日本語日本語日本語日本語日本語');
  expect(r.errors).toEqual([]);
  expect(r.tokens).toHaveLength(1);
  expect(r.tokens[0]!.image.length).toBe(18);
});
```
These two tests pin the fix — they specifically exercise the 16-char ASCII and 18-char CJK cases that the broken optimiser previously mangled. If a future Chevrotain upgrade or a lexer refactor regresses the fix, C6 and I4 catch it.
§FP5. Post-fix test count
Current baseline on main @ 657d4ef4: 1123 tests passing (post-R81 Wave 1 at #174).
Current lexer suite on feature/r81-b-p1-2-1-lexer @ 1eec2b7c: 82 total, 21 fail, 61 pass.
After the fix: 82 → 84 lexer tests (all passing) thanks to the +2 regressions.
Expected full-suite count post-R83.C: 1123 + 84 = 1207 tests. (The R81.B feat commit itself brings 82 new lexer tests on top of the 1123 baseline; 2 more are added by R83.C.)
Note on the R81 memory’s claim of “1123 baseline, lexer tests target 40+”: the R81.B feat commit already added 82 tests — above the 40 target — with the 21-fail/61-pass split described above. The “1123” figure in the R81 memory refers to the pre-lexer-suite baseline.
§FP6. Order of operations for the fix commit
- Edit `src/domains/rules/lexer.ts` per §FP3.2 — swap the Identifier pattern, add the `IDENTIFIER_REGEX` const.
- Edit `src/__tests__/domains/rules/lexer.test.ts` per §FP4 — add C6 and I4.
- `npm run build` — confirm TypeScript compiles with the function-pattern signature.
- `npm run lint` — confirm no eslint complaints; the pattern-function signature uses explicit `string` and `number` parameters, no `any`.
- `npm test -- --testPathPattern=rules/lexer` — confirm 84/84 lexer tests pass.
- `npm test` — confirm the full 1207-test suite passes (the pre-existing subprocess-smoke flake may need one re-run; unrelated).
- Single commit: `fix(r83-c-lexer-retry): swap Identifier to custom pattern fn (Chevrotain 11.0.3 /u flag workaround)`.
Do not amend the R81.B feat commit. The fix is a fresh commit on top — cleaner history, preserves the 5-step chain, and keeps blame on the R83.C layer.
§FP7. Acceptance criteria (gating Step 5)
- `npm run build` green.
- `npm run lint` green — no `any`, no `eslint-disable` comments.
- `npm test -- --testPathPattern=rules/lexer` → 84 passed, 0 failed.
- Full `npm test` → 1207 passed. (The pre-existing subprocess-smoke flake permits one re-run.)
- `Identifier` token’s `pattern` is a function that delegates to the `IDENTIFIER_REGEX` sticky constant.
- `line_breaks: false` is set on `Identifier`.
- No file other than `src/domains/rules/lexer.ts` and `src/__tests__/domains/rules/lexer.test.ts` is touched in the fix commit.
- Public API of the lexer module (`tokenize`, `Keywords`, `Operators`, `Delimiters`, `Literals`, `allTokens`, `FLOAT_REJECTED_MESSAGE`, `UNDERSCORE_INT_REJECTED_MESSAGE`) is unchanged.
§FP8. Risk register (fix-specific, supersedes §P7 of the R81.B packet)
| Risk | Mitigation |
|---|---|
| Function-pattern signature typing issue in TS strict | Returning `[string] \| null` is the declared Chevrotain `CustomPatternMatcherFunc` alternate return type. Explicit types; compile-checked. |
| Sticky regex `lastIndex` not reset | Every call reassigns `lastIndex = startOffset`. Module-level regex, so no GC churn. |
| Unicode warnings still printed to the console by Chevrotain | Not an issue — the only TokenType using the `u` flag was Identifier, which now has no PATTERN regex for Chevrotain to analyze. No more warning. |
| Performance regression from a function call per dispatch | Single native regex `.exec()` per candidate; empirically immeasurable on the 82+ test suite. |
| `longer_alt: Identifier` keyword ordering | Already tested by C1–C5 (+ new C6). The keyword tokens still fall through to the Identifier matcher when their prefix continues. No logic change. |
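The `longer_alt` risk row can be illustrated with a hypothetical standalone sketch of the fallthrough rule — the keyword name `Accept` and the `classify` helper are illustrative only; the real decision is made inside Chevrotain’s dispatch:

```ts
const IDENT = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

// Decide between a keyword token and its longer_alt Identifier at `offset`:
// if identifier characters continue past the keyword, the longer match wins.
function classify(text: string, offset: number, keyword: string): string {
  IDENT.lastIndex = offset;
  const ident = IDENT.exec(text)?.[0] ?? null;
  if (ident !== null && ident.length > keyword.length && ident.startsWith(keyword)) {
    return `Identifier(${ident})`; // longer_alt wins
  }
  if (text.startsWith(keyword, offset)) {
    return `Keyword(${keyword})`;
  }
  return ident !== null ? `Identifier(${ident})` : 'NoMatch';
}

console.log(classify('Accept rules', 0, 'Accept'));     // 'Keyword(Accept)'
console.log(classify('AcceptCommitment', 0, 'Accept')); // 'Identifier(AcceptCommitment)'
```

This is why C6 (“AcceptCommitment” as one Identifier) doubles as a `longer_alt` ordering check.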
§FP9. Out of scope (re-stated)
- Adding a second ADR, upgrading Chevrotain, rewriting any other token, touching integer-math.ts (R81.A), spawning parser/AST/interpreter. All deferred to Wave 3 or beyond.
§FP10. Summary
One regex-to-function swap, with a module-level sticky `/uy` constant backing the Identifier matcher. Two regression tests pin the fix. The 21 failing tests become 21 passing; the lexer suite reaches 84/84 green. The full-suite total reaches 1207.
Next step: fix commit (code edit + tests), then verification doc.