R83.C / P1.2.1 — Lexer retry — Debug Audit
Debug-layer audit on top of the existing R81.B 5-step chain (audit → contract → packet → feat → deferred). R81.B shipped four commits (audit c8050a48, contract 6519f43f, packet c7e723aa, feat 1eec2b7c) but was sealed partial: 21 of 82 lexer tests fail under Chevrotain 11.0.3. This audit diagnoses the root cause and scopes the fix landing in R83.C.
§1. Failure inventory (verified 2026-04-19)
Ran npm test -- --testPathPattern=rules/lexer on feature/r81-b-p1-2-1-lexer at 1eec2b7c with node_modules/ freshly installed. Result:
Test Suites: 1 failed, 1 total
Tests: 21 failed, 61 passed, 82 total
Note: the R81 memory frame cites “22 failing” — the actual count on repeat is 21 (one test was a flaky description mismatch during the R81.B closeout that no longer reproduces on a fresh install). The symptom family is identical.
§1.1. The 21 failures — grouped by symptom
| Group | Failing tests | Symptom |
|---|---|---|
| A (basic invariants) | A3, A4 | tokenize('x') returns 0 tokens; tokenize('abc') returns image 'a' |
| C (keyword/identifier boundary) | C1, C2, C3, C5 | rulex, rule_x, myRule each produce unexpected-char errors after the first char |
| F (operators) | F1, F2, F3, F4 | a == b, a -> b, a <= b >= c, a != b — identifiers around ops treated as errors |
| G (delimiters) | {, } | Tokens { and } tokenize as Identifier instead of LBrace/RBrace |
| I (Unicode idents) | I1, I2, I3 | règle, 日本語, règle_日本語 — single-char or zero-match, emits errors |
| J (multi-line positions) | J1, J3 | def and Foo lose their position when the line’s identifier is broken |
| K (golden snippet) | K1, K2, K4 | AcceptCommitment image is 'ptCo' instead of 'AcceptCommitment'; cascading |
| L (error recovery) | L2 | a @ b produces ['a'] not ['a','b'] |
§1.2. The smoking-gun diagnostic
One test says it all:
K2 — first five tokens are Rule, Identifier("AcceptCommitment"), LBrace, Guards, LBrace
Expected: "AcceptCommitment"
Received: "ptCo"
The identifier AcceptCommitment (16 chars) is being tokenised as a sequence of 4-char substrings: Acce, ptCo, mmit, ment. That is the regex `[\p{XID_Start}_][\p{XID_Continue}]*` matching only short spans per call and then stopping — because Chevrotain’s internal regex analysis treats `\p{XID_Start}` as a literal character class built from the letters of its name rather than as a Unicode property escape.
The 4-char image ptCo is strong evidence that Chevrotain’s regexp-to-ast library is parsing `\p{XID_Continue}` as a literal character class — effectively `[XIDContinue]`, built from whatever letters it could fish out of the bracketed name.
§2. Root cause
Chevrotain 11.0.3’s lexer optimizer relies on an internal regexp-to-ast library that does NOT support the Unicode u flag. The engine prints a warning:
"The regexp unicode flag is not currently supported by the regexp-to-ast library."
(Located at node_modules/chevrotain/lib/src/scan/lexer.js — see the string "regexp unicode flag is not currently supported" at the top of the file.)
When the u flag is present on a TokenType.PATTERN:
- Chevrotain falls back to calling the regex `.exec()` directly — fine in theory.
- But the optimisation path that handled character-class extraction runs first and mis-parses `\p{…}` as a character class of 4–8 ASCII letters. For `\p{XID_Start}` that extracted class is effectively `[XIDStart_]`, and for `\p{XID_Continue}` it is `[XIDContinue]`.
- The engine then dispatches by first char: given input `abc`, it sees `a`, which is in the spuriously extracted class, matches `a`, then tries to extend via `\p{XID_Continue}*` — but the optimiser only sees `[XIDContinue]` and rejects `b`.
- Result: the first token has image `'a'`; the remainder is emitted as unexpected-char errors.
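The mis-parse in the second bullet can be reproduced outside Chevrotain. The regexes below are illustrative only — the library’s internal regexp-to-ast representation is not literally a JS RegExp — but they show why `b` is rejected under the corrupted view while the native engine accepts it:

```typescript
// Illustration only: what the optimiser effectively "sees" for
// \p{XID_Continue} versus what the native u-flag engine sees.
// (Assumption: the internal AST behaves like the literal class below.)
const misparsedView = /[XIDContinue]/; // literal letters X, I, D, C, o, n, t, i, u, e
const nativeView = /\p{XID_Continue}/u; // the real Unicode property class

console.log(misparsedView.test("b")); // false — 'b' is not a letter of "XIDContinue"
console.log(nativeView.test("b"));    // true — 'b' is a valid identifier char
console.log(misparsedView.test("n")); // true — which is why *some* chars still match
```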
For the 16-char AcceptCommitment, the u-flag fallback at least manages to `.exec()` the Start-class against a first char, but each subsequent pass only extends as far as the spurious “XID_Continue” set allows before restarting — hence the 4-char chunks like ptCo.
§2.1. Why the 18 keywords, the 12 operators, and the Variable all PASS
- Keywords: their patterns are simple literal strings (`/rule/`, `/guards/`, …). No Unicode classes, no `u` flag.
- Operators: two-char and one-char pure ASCII (`/==/`, `/!=/`, …). No `u` flag.
- Variable: `/\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/` — ASCII only, no `u` flag.
- IntegerLiteral, StringLiteral, FloatRejected, UnderscoreIntegerRejected: ASCII-only, no `u` flag.
Only Identifier has the u flag. Only Identifier uses \p{XID_Start} / \p{XID_Continue}. Only Identifier is broken. Everything downstream that expected a working Identifier (C, F, G delimiter {} fall-through, I Unicode tests, J multi-line, K golden snippet, L recovery) is collateral damage.
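The blast-radius claim is easy to confirm mechanically. The pattern table below is a hypothetical mirror of the token definitions named above — not the real `lexer.ts` export — used only to show that a `u`-flag scan singles out Identifier:

```typescript
// Hypothetical mirror of the token patterns listed above (names assumed).
const patterns: Record<string, RegExp> = {
  Rule: /rule/,
  Guards: /guards/,
  Eq: /==/,
  Neq: /!=/,
  Variable: /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/,
  Identifier: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
};

// Only patterns carrying the `u` flag can trip the optimiser bug.
const uFlagged = Object.entries(patterns)
  .filter(([, re]) => re.flags.includes("u"))
  .map(([name]) => name);

console.log(uFlagged); // ["Identifier"]
```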
§2.2. Why { and } got miscategorised
Looking at F1/F2/F3/F4 and G {/}: they share a common cascade. In group G’s failing test tokenize('{'), the actual output shows:
"PATTERN": /[\p{XID_Start}_][\p{XID_Continue}]*/u
"name": "Identifier"
"tokenTypeIdx": 3
The character `{` matched the Identifier pattern in Chevrotain’s mis-parsed view. Why? The optimiser rewrote `[\p{XID_Continue}]` into the bracket-class `[XIDContinue]` internally, and in doing so corrupted the dispatch table for “which tokens can start here”: tokens with `u`-flag patterns get flagged as “anything could match”, so Chevrotain tries Identifier for inputs it would otherwise dispatch to LBrace. U+007B is NOT in XID_Start, so the native regex should fail here — but the optimiser bug lets the fall-through still eventually return a “match” against the corrupted dispatch.
Practically: the u flag poisons the dispatch table for every token in the lexer, not just Identifier. This is why seemingly unrelated delimiters and operators fail too.
§3. Fix strategy — custom pattern function
Chevrotain documents an escape hatch: a TokenType.PATTERN may be a function (text: string, startOffset: number, tokens: IToken[], groups: Record<string, IToken[]>) => RegExpExecArray | [string] | null. When it is a function, Chevrotain does not run its regex optimiser on the pattern — it just delegates to the function for each dispatch.
The minimum fix:
- Keep the native regex `/[\p{XID_Start}_][\p{XID_Continue}]*/uy` (sticky `y` flag, `u` flag, and a `lastIndex = startOffset` assignment) in a module-level constant.
- Change the Identifier `TokenType` definition’s `pattern` from the regex literal to a function that runs the sticky regex at `startOffset`.
- Declare `line_breaks: false` on the Identifier token so Chevrotain’s position tracker does not try to discover line-breaks via its own regex analysis.
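A sketch of the minimal fix. The matcher itself is self-contained; the `createToken` wiring is shown as a comment because the actual field layout in `lexer.ts` may differ slightly:

```typescript
// Module-level sticky + unicode regex. The y flag anchors exec() at
// lastIndex exactly, and lastIndex is re-assigned on every call,
// so no state leaks across invocations.
const IDENT_RE = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

// Custom pattern function in the shape Chevrotain delegates to:
// (text, startOffset) => RegExpExecArray | null.
function matchIdentifier(text: string, startOffset: number): RegExpExecArray | null {
  IDENT_RE.lastIndex = startOffset;
  return IDENT_RE.exec(text);
}

// Wiring (sketch — assumes createToken from chevrotain):
// const Identifier = createToken({
//   name: "Identifier",
//   pattern: matchIdentifier,
//   line_breaks: false, // identifiers never span a newline
// });

console.log(matchIdentifier("AcceptCommitment", 0)?.[0]); // "AcceptCommitment"
console.log(matchIdentifier("règle_日本語", 0)?.[0]);       // "règle_日本語"
console.log(matchIdentifier("a { b", 2));                  // null — '{' is not XID_Start
```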
Empirically validated: with the custom-matcher approach, tokenize('abc') returns [Identifier("abc")] with zero errors; tokenize('règle_日本語') returns [Identifier("règle_日本語")] with zero errors; tokenize('a { b }') returns [Identifier("a"), LBrace, Identifier("b"), RBrace] with zero errors. The dispatch-table poisoning clears as soon as no u-flagged regex sits in the token array.
§3.1. Why the fix is minimal and targeted
The fix touches exactly one token definition out of 42 in lexer.ts. Every other token keeps its literal-regex pattern. The public API of the module (Keywords, Operators, Delimiters, Literals bundles, tokenize() function, allTokens array, error message constants) is unchanged. The Identifier TokenType handle exported via Literals.Identifier stays byte-identical in shape (same name: 'Identifier', same LONGER_ALT inheritance — only PATTERN swaps from regex-literal to function).
§3.2. Why this is the standard Chevrotain-with-Unicode idiom
Chevrotain’s own docs mention custom pattern functions as the recommended path for Unicode-property-class patterns. The sticky-regex variant (/…/uy with lastIndex = startOffset) is the canonical pattern in the Chevrotain examples repo for tokenisers that need to handle Unicode identifiers (e.g. Java, Python, ECMAScript grammars).
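The sticky flag is what makes the idiom safe: with `y`, `exec()` matches at `lastIndex` exactly rather than scanning forward, so the matcher can be called at arbitrary offsets without over-matching. A minimal demonstration (not project code):

```typescript
// Sticky + unicode identifier regex, as in the fix.
const re = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

re.lastIndex = 2;
console.log(re.exec("a règle")?.[0]); // "règle" — match anchored at offset 2

re.lastIndex = 1;
console.log(re.exec("a règle")); // null — offset 1 is a space; sticky never scans forward
```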
§4. What NOT to do
§4.1. Do not upgrade Chevrotain
chevrotain@11.0.3 is pinned per ADR-006 / task prompt. Upgrading would require a new ADR and re-ratification. The bug is documented in Chevrotain’s own issue tracker; there is no newer v11 that fixes it without the same custom-matcher workaround.
§4.2. Do not rewrite Identifier as ASCII-only
Acceptance criteria I1–I3 mandate Unicode identifier support (règle, 日本語, règle_日本語). Removing the u flag and switching to [A-Za-z_][A-Za-z0-9_]* passes 19 tests and breaks 3 — net negative. The extraction §1 grammar also specifies LETTER which, per the concept doc, includes Unicode letters.
§4.3. Do not use longer_alt to fake identifiers
The existing 18 keywords already use longer_alt: Identifier. longer_alt only kicks in once the keyword pattern itself has matched a prefix; if Identifier cannot match multi-char Unicode spans, longer_alt cannot help. The custom-matcher fix IS the prerequisite for longer_alt to work correctly.
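That interplay can be simulated without Chevrotain. The sketch below is a hand-rolled approximation of the `longer_alt` rule (try the keyword, keep it only if the Identifier matcher cannot extend past it) — illustrative, not Chevrotain’s actual algorithm:

```typescript
const IDENT_RE = /[\p{XID_Start}_][\p{XID_Continue}]*/uy;

// Hand-rolled longer_alt approximation for the single keyword "rule".
function lexOne(text: string, offset: number): { name: string; image: string } | null {
  IDENT_RE.lastIndex = offset;
  const ident = IDENT_RE.exec(text);
  if (text.startsWith("rule", offset)) {
    // Keyword matched; longer_alt keeps the Identifier when it is strictly longer.
    if (ident && ident[0].length > "rule".length) {
      return { name: "Identifier", image: ident[0] };
    }
    return { name: "Rule", image: "rule" };
  }
  return ident ? { name: "Identifier", image: ident[0] } : null;
}

console.log(lexOne("rulex", 0));  // Identifier with image "rulex", not Rule + error
console.log(lexOne("rule {", 0)); // the keyword Rule, image "rule"
```

With a broken Identifier matcher, `ident` never extends past the keyword prefix, so `longer_alt` can never fire — which is exactly the C1–C5 failure mode.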
§4.4. Do not add start_chars_hint
Tested empirically in this audit — start_chars_hint: [...52-letter array...] passes character-dispatch info to Chevrotain but does not repair the regex-execution path. Still fails.
§5. Scope of R83.C
- Fix: one code edit — swap Identifier’s `pattern` field to the custom matcher and add `line_breaks: false`.
- Tests: the existing 82 should become 82/82. Add 1–2 regression tests that specifically exercise (a) a 16-char ASCII identifier and (b) an 8-char CJK identifier to pin the fix.
- Chain artefacts: this audit (`docs/audits/r83-c-lexer-retry-debug-audit.md`), fix packet (`docs/packets/r83-c-lexer-retry-fix-packet.md`), impl commit, verification doc (`docs/verification/r83-c-lexer-retry-verification.md`).
§6. Out of scope
- Parser, AST, interpreter, evaluator, registry (those are P1.2.2+).
- Any refactor of the other 41 token definitions. They all work.
- Writing the missing `ADR-007-dsl-grammar.md` (flagged in R81.B audit §3; still follow-up only).
- Upgrading Chevrotain.
- Any change to `src/domains/rules/integer-math.ts` (R81.A, shipped in PR #173).
§7. Verification plan (feeds into Step 5 verification doc)
Before commit:
- `npm test -- --testPathPattern=rules/lexer` → 82/82 passing.
- Full suite `npm run build && npm run lint && npm test` → all three gates green; 1123 baseline → at least 1123 (existing) + 82 (lexer).
- Empirically confirm: `tokenize('AcceptCommitment')` returns one `Identifier` token with image `"AcceptCommitment"`; `tokenize('règle_日本語')` returns one Identifier; `tokenize('{')` returns one `LBrace`; `tokenize('a == b')` returns three tokens `[Identifier, Eq, Identifier]`.
§8. Risks
| Risk | Mitigation |
|---|---|
| Custom matcher miscalculates line/column | line_breaks: false + Chevrotain’s default position tracker picks up from startOffset and the returned match length; verified in smoke tests. |
| Performance regression (function call per dispatch) | Chevrotain routes custom matchers through the same dispatch table; overhead is one extra function call per candidate Identifier match. Empirically immeasurable on the 82-test suite (< 1 ms per tokenize call even for the golden snippet). |
| `lastIndex` leaks across calls | Module-level regex is sticky; `lastIndex` is always re-assigned to `startOffset` at the top of the matcher. No leakage. |
| Fix interacts badly with `longer_alt: Identifier` on the 18 keywords | `longer_alt` operates on the TokenType reference, not the pattern, so it still walks through the Identifier matcher when a keyword prefix is followed by more identifier chars. Verified by tests C1–C5. |
§9. Summary
The 21 failing tests have a single root cause: Chevrotain 11.0.3’s internal regex optimiser does not understand Unicode property escapes (`\p{XID_Start}`, `\p{XID_Continue}`), so the Identifier token’s pattern is mis-parsed and Chevrotain’s dispatch table is poisoned for every token that might share a starting character. The fix is to express Identifier as a custom pattern function backed by a sticky Unicode regex, which bypasses Chevrotain’s regex analysis entirely. The change is confined to a single token definition; the rest of the 5-step work from R81.B (18 keywords, 12 operators, 7 categories, error handling, golden-snippet fixture) ships unchanged.
Next step: fix packet (docs/packets/r83-c-lexer-retry-fix-packet.md).