R83.C / P1.2.1 — Lexer retry — Debug Audit
Debug-layer audit on top of the existing R81.B 5-step chain (audit → contract → packet → feat → deferred). R81.B shipped four commits (audit c8050a48, contract 6519f43f, packet c7e723aa, feat 1eec2b7c) but was sealed partial: 21 of 82 lexer tests fail under Chevrotain 11.0.3. This audit diagnoses the root cause and scopes the fix landing in R83.C.
§1. Failure inventory (verified 2026-04-19)
Ran npm test -- --testPathPattern=rules/lexer on feature/r81-b-p1-2-1-lexer at 1eec2b7c with node_modules/ freshly installed. Result:
Test Suites: 1 failed, 1 total
Tests: 21 failed, 61 passed, 82 total
Note: the R81 memory frame cites “22 failing” — the actual count on repeat is 21 (one test was a flaky description mismatch during the R81.B closeout that no longer reproduces on a fresh install). The symptom family is identical.
§1.1. The 21 failures — grouped by symptom
| Group | Failing tests | Symptom |
|---|---|---|
| A (basic invariants) | A3, A4 | tokenize('x') returns 0 tokens; tokenize('abc') returns image 'a' |
| C (keyword/identifier boundary) | C1, C2, C3, C5 | rulex, rule_x, myRule each produce unexpected-char errors after the first char |
| F (operators) | F1, F2, F3, F4 | a == b, a -> b, a <= b >= c, a != b — identifiers around ops treated as errors |
| G (delimiters) | {, } | Tokens { and } tokenize as Identifier instead of LBrace/RBrace |
| I (Unicode idents) | I1, I2, I3 | règle, 日本語, règle_日本語 — single-char or zero-match, emits errors |
| J (multi-line positions) | J1, J3 | def and Foo lose their position when the line’s identifier is broken |
| K (golden snippet) | K1, K2, K4 | AcceptCommitment image is 'ptCo' instead of 'AcceptCommitment'; cascading |
| L (error recovery) | L2 | a @ b produces ['a'] not ['a','b'] |
§1.2. The smoking-gun diagnostic
One test says it all:
K2 — first five tokens are Rule, Identifier("AcceptCommitment"), LBrace, Guards, LBrace
Expected: "AcceptCommitment"
Received: "ptCo"
The identifier AcceptCommitment (16 chars) is being tokenised as a sequence of 4-char substrings: Acce, ptCo, mmit, ment. That is the regex `[\p{XID_Start}_][\p{XID_Continue}]*` matching only short spans per call and then stopping — because Chevrotain’s internal regex analysis treats `\p{XID_Start}` as a literal character class built from the letters of its name rather than as a Unicode property escape.
The 4-char image ptCo is strong evidence that Chevrotain’s regexp-to-ast library is parsing `\p{XID_Continue}` as a literal character class — effectively `[XIDContinue]`, built from whatever letters it could fish out of the bracketed name.
§2. Root cause
Chevrotain 11.0.3’s lexer optimizer relies on an internal regexp-to-ast library that does NOT support the Unicode u flag. The engine prints a warning:
"The regexp unicode flag is not currently supported by the regexp-to-ast library."
(Located at node_modules/chevrotain/lib/src/scan/lexer.js — see the string "regexp unicode flag is not currently supported" at the top of the file.)
When the u flag is present on a TokenType.PATTERN:
- Chevrotain falls back to calling the regex `.exec()` directly — fine in theory.
- But the optimisation path that handled character-class extraction runs first and mis-parses `\p{…}` as a character class of 4–8 ASCII letters. For `\p{XID_Start}` that extracted class is effectively `[XIDStart_]`, and for `\p{XID_Continue}` it is `[XIDContinue]`.
- The engine then dispatches by first char: given input `abc`, it sees `a`, which is in the spuriously extracted class, matches `a`, then tries to extend via `\p{XID_Continue}*` — but the optimiser only sees `[XIDContinue]` and rejects `b`.
- Result: the first token has image `'a'`; the remainder is emitted as unexpected-char errors.
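The mis-parse in the second bullet can be reproduced outside Chevrotain. The regexes below are illustrative only — the library’s internal regexp-to-ast representation is not literally a JS RegExp — but they show why `b` is rejected under the corrupted view while the native engine accepts it:

```typescript
// Illustration only: what the optimiser effectively "sees" for
// \p{XID_Continue} versus what the native u-flag engine sees.
// (Assumption: the internal AST behaves like the literal class below.)
const misparsedView = /[XIDContinue]/; // literal letters X, I, D, C, o, n, t, i, u, e
const nativeView = /\p{XID_Continue}/u; // the real Unicode property class

console.log(misparsedView.test("b")); // false — 'b' is not a letter of "XIDContinue"
console.log(nativeView.test("b"));    // true — 'b' is a valid identifier char
console.log(misparsedView.test("n")); // true — which is why *some* chars still match
```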
For the 16-char AcceptCommitment, the u-flag fallback at least manages to `.exec()` the Start-class against a first char, but each subsequent pass only extends as far as the spurious “XID_Continue” set allows before restarting — hence the 4-char chunks like ptCo.
§2.1. Why the 18 keywords, the 12 operators, and the Variable all PASS
- Keywords: their patterns are simple literal strings (`/rule/`, `/guards/`, …). No Unicode classes, no `u` flag.
- Operators: two-char and one-char pure ASCII (`/==/`, `/!=/`, …). No `u` flag.
- Variable: `/\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/` — ASCII only, no `u` flag.
- IntegerLiteral, StringLiteral, FloatRejected, UnderscoreIntegerRejected: ASCII-only, no `u` flag.
Only Identifier has the u flag. Only Identifier uses \p{XID_Start} / \p{XID_Continue}. Only Identifier is broken. Everything downstream that expected a working Identifier (C, F, G delimiter {} fall-through, I Unicode tests, J multi-line, K golden snippet, L recovery) is collateral damage.
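The blast-radius claim is easy to confirm mechanically. The pattern table below is a hypothetical mirror of the token definitions named above — not the real `lexer.ts` export — used only to show that a `u`-flag scan singles out Identifier:

```typescript
// Hypothetical mirror of the token patterns listed above (names assumed).
const patterns: Record<string, RegExp> = {
  Rule: /rule/,
  Guards: /guards/,
  Eq: /==/,
  Neq: /!=/,
  Variable: /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/,
  Identifier: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
};

// Only patterns carrying the `u` flag can trip the optimiser bug.
const uFlagged = Object.entries(patterns)
  .filter(([, re]) => re.flags.includes("u"))
  .map(([name]) => name);

console.log(uFlagged); // ["Identifier"]
```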
§2.2. Why { and } got miscategorised
Looking at F1/F2/F3/F4 and G {/}: they share a common cascade. In group G’s failing test tokenize('{'), the actual output shows:
"PATTERN": /[\p{XID_Start}_][\p{XID_Continue}]*/u
"name": "Identifier"
"tokenTypeIdx": 3
The character `{` matched the Identifier pattern in Chevrotain’s mis-parsed view. Why? The optimiser rewrote `[\p{XID_Continue}]` into the bracket-class `[XIDContinue]` internally, and in doing so corrupted the dispatch table for “which tokens can start here”: tokens with `u`-flag patterns get flagged as “anything could match”, so Chevrotain tries Identifier for inputs it would otherwise dispatch to LBrace. U+007B is NOT in XID_Start, so the native regex should fail here — but the optimiser bug lets the fall-through still eventually return a “match” against the corrupted dispatch.
Practically: the u flag poisons the dispatch table for every token in the lexer, not just Identifier. This is why seemingly unrelated delimiters and operators fail too.
§3. Fix strategy — custom pattern function
Chevrotain documents an escape hatch: a TokenType.PATTERN may be a function (text: string, startOffset: number, tokens: IToken[], groups: Record<string, IToken[]>) => RegExpExecArray | [string] | null. When it is a function, Chevrotain does not run its regex optimiser on the pattern — it just delegates to the function for each dispatch.
The minimum fix:
- Keep the native regex `/[\p{XID_Start}_][\p{XID_Continue}]*/uy` (sticky `y` flag, `u` flag, and a `lastIndex = startOffset` assignment) in a module-level constant.
- Change the Identifier `TokenType` definition’s `pattern` from the regex literal to a function that runs the sticky regex at `startOffset`.
- Declare `line_breaks: false` on the Identifier token so Chevrotain’s position tracker does not try to discover line-breaks via its own regex analysis.
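A sketch of the minimal fix. The matcher itself is self-contained; the `createToken` wiring is shown as a comment because the actual field layout in `lexer.ts` may differ slightly:

```typescript
// Module-level sticky + unicode regex. The y flag anchors exec() at
// lastIndex exactly, and lastIndex is re-assigned on every call,
// so no state leaks across invocations.
const IDENT_RE = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

// Custom pattern function in the shape Chevrotain delegates to:
// (text, startOffset) => RegExpExecArray | null.
function matchIdentifier(text: string, startOffset: number): RegExpExecArray | null {
  IDENT_RE.lastIndex = startOffset;
  return IDENT_RE.exec(text);
}

// Wiring (sketch — assumes createToken from chevrotain):
// const Identifier = createToken({
//   name: "Identifier",
//   pattern: matchIdentifier,
//   line_breaks: false, // identifiers never span a newline
// });

console.log(matchIdentifier("AcceptCommitment", 0)?.[0]); // "AcceptCommitment"
console.log(matchIdentifier("règle_日本語", 0)?.[0]);       // "règle_日本語"
console.log(matchIdentifier("a { b", 2));                  // null — '{' is not XID_Start
```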
Empirically validated: with the custom-matcher approach, tokenize('abc') returns [Identifier("abc")] with zero errors; tokenize('règle_日本語') returns [Identifier("règle_日本語")] with zero errors; tokenize('a { b }') returns [Identifier("a"), LBrace, Identifier("b"), RBrace] with zero errors. The dispatch-table poisoning clears as soon as no u-flagged regex sits in the token array.
§3.1. Why the fix is minimal and targeted
The fix touches exactly one token definition out of 42 in lexer.ts. Every other token keeps its literal-regex pattern. The public API of the module (Keywords, Operators, Delimiters, Literals bundles, tokenize() function, allTokens array, error message constants) is unchanged. The Identifier TokenType handle exported via Literals.Identifier stays byte-identical in shape (same name: 'Identifier', same LONGER_ALT inheritance — only PATTERN swaps from regex-literal to function).
§3.2. Why this is the standard Chevrotain-with-Unicode idiom
Chevrotain’s own docs mention custom pattern functions as the recommended path for Unicode-property-class patterns. The sticky-regex variant (/…/uy with lastIndex = startOffset) is the canonical pattern in the Chevrotain examples repo for tokenisers that need to handle Unicode identifiers (e.g. Java, Python, ECMAScript grammars).
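The sticky flag is what makes the idiom safe: with `y`, `exec()` matches at `lastIndex` exactly rather than scanning forward, so the matcher can be called at arbitrary offsets without over-matching. A minimal demonstration (not project code):

```typescript
// Sticky + unicode identifier regex, as in the fix.
const re = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

re.lastIndex = 2;
console.log(re.exec("a règle")?.[0]); // "règle" — match anchored at offset 2

re.lastIndex = 1;
console.log(re.exec("a règle")); // null — offset 1 is a space; sticky never scans forward
```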
§4. What NOT to do
§4.1. Do not upgrade Chevrotain
chevrotain@11.0.3 is pinned per ADR-006 / task prompt. Upgrading would require a new ADR and re-ratification. The bug is documented in Chevrotain’s own issue tracker; there is no newer v11 that fixes it without the same custom-matcher workaround.
§4.2. Do not rewrite Identifier as ASCII-only
Acceptance criteria I1–I3 mandate Unicode identifier support (règle, 日本語, règle_日本語). Removing the u flag and switching to [A-Za-z_][A-Za-z0-9_]* passes 19 tests and breaks 3 — net negative. The extraction §1 grammar also specifies LETTER which, per the concept doc, includes Unicode letters.
§4.3. Do not use longer_alt to fake identifiers
The existing 18 keywords already use longer_alt: Identifier. longer_alt only kicks in once the keyword pattern itself has matched a prefix; if Identifier cannot match multi-char Unicode spans, longer_alt cannot help. The custom-matcher fix IS the prerequisite for longer_alt to work correctly.
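That interplay can be simulated without Chevrotain. The sketch below is a hand-rolled approximation of the `longer_alt` rule (try the keyword, keep it only if the Identifier matcher cannot extend past it) — illustrative, not Chevrotain’s actual algorithm:

```typescript
const IDENT_RE = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

// Hand-rolled longer_alt approximation for the single keyword "rule".
function lexOne(text: string, offset: number): { name: string; image: string } | null {
  IDENT_RE.lastIndex = offset;
  const ident = IDENT_RE.exec(text);
  if (text.startsWith("rule", offset)) {
    // Keyword matched; longer_alt keeps the Identifier when it is strictly longer.
    if (ident && ident[0].length > "rule".length) {
      return { name: "Identifier", image: ident[0] };
    }
    return { name: "Rule", image: "rule" };
  }
  return ident ? { name: "Identifier", image: ident[0] } : null;
}

console.log(lexOne("rulex", 0));  // Identifier with image "rulex", not Rule + error
console.log(lexOne("rule {", 0)); // the keyword Rule, image "rule"
```

With a broken Identifier matcher, `ident` never extends past the keyword prefix, so `longer_alt` can never fire — which is exactly the C1–C5 failure mode.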
§4.4. Do not add start_chars_hint
Tested empirically in this audit — start_chars_hint: [...52-letter array...] passes character-dispatch info to Chevrotain but does not repair the regex-execution path. Still fails.
§5. Scope of R83.C
- Fix: one code edit — swap Identifier’s `pattern` field to the custom matcher and add `line_breaks: false`.
- Tests: the existing 82 should become 82/82. Add 1–2 regression tests that specifically exercise (a) a 16-char ASCII identifier and (b) an 8-char CJK identifier to pin the fix.
- Chain artefacts: this audit (`docs/audits/r83-c-lexer-retry-debug-audit.md`), fix packet (`docs/packets/r83-c-lexer-retry-fix-packet.md`), impl commit, verification doc (`docs/verification/r83-c-lexer-retry-verification.md`).
§6. Out of scope
- Parser, AST, interpreter, evaluator, registry (those are P1.2.2+).
- Any refactor of the other 41 token definitions. They all work.
- Writing the missing `ADR-007-dsl-grammar.md` (flagged in R81.B audit §3; still follow-up only).
- Upgrading Chevrotain.
- Any change to `src/domains/rules/integer-math.ts` (R81.A, shipped in PR #173).
§7. Verification plan (feeds into Step 5 verification doc)
Before commit:
- `npm test -- --testPathPattern=rules/lexer` → 82/82 passing.
- Full suite `npm run build && npm run lint && npm test` → all three gates green; 1123 baseline → at least 1123 (existing) + 82 (lexer).
- Empirically confirm: `tokenize('AcceptCommitment')` returns one `Identifier` token with image `"AcceptCommitment"`; `tokenize('règle_日本語')` returns one Identifier; `tokenize('{')` returns one `LBrace`; `tokenize('a == b')` returns three tokens `[Identifier, Eq, Identifier]`.
§8. Risks
| Risk | Mitigation |
|---|---|
| Custom matcher miscalculates line/column | line_breaks: false + Chevrotain’s default position tracker picks up from startOffset and the returned match length; verified in smoke tests. |
| Performance regression (function call per dispatch) | Chevrotain routes custom matchers through the same dispatch table; overhead is one extra function call per candidate Identifier match. Empirically immeasurable on the 82-test suite (< 1 ms per tokenize call even for the golden snippet). |
| `lastIndex` leaks across calls | Module-level regex is sticky; `lastIndex` is always re-assigned to `startOffset` at the top of the matcher. No leakage. |
| Fix interacts badly with `longer_alt: Identifier` on the 18 keywords | `longer_alt` operates on the TokenType reference, not the pattern, so it still walks through the Identifier matcher when a keyword prefix is followed by more identifier chars. Verified by tests C1–C5. |
§9. Summary
The 21 failing tests have a single root cause: Chevrotain 11.0.3’s internal regex optimiser does not understand Unicode property escapes (`\p{XID_Start}`, `\p{XID_Continue}`), so the Identifier token’s pattern is mis-parsed and Chevrotain’s dispatch table is poisoned for every token that might share a starting character. The fix is to express Identifier as a custom pattern function backed by a sticky Unicode regex, which bypasses Chevrotain’s regex analysis entirely. The change is confined to a single token definition; the rest of the 5-step work from R81.B (18 keywords, 12 operators, 7 categories, error handling, golden-snippet fixture) ships unchanged.
Next step: fix packet (docs/packets/r83-c-lexer-retry-fix-packet.md).