R81.B / P1.2.1 — κ DSL Lexer — Execution Packet
Step 3 of the 5-step executor chain. Builds on audit (§1) + contract (§2). Packets gate implementation: once approved, Step 4 is a mechanical translation of this plan.
§P1. Plan at a glance
Deliver, in a single implementation commit (Step 4):
- `chevrotain@11.0.3` pinned in `package.json` dependencies; `package-lock.json` regenerated.
- `src/domains/rules/lexer.ts` — one TypeScript file, roughly 200–260 lines including the JSDoc preamble and comments. No other source files under `src/domains/rules/` in this task.
- `src/__tests__/domains/rules/lexer.test.ts` — one Jest file, roughly 300–400 lines. Target 15–25 cases.
Packet is tight on purpose — no speculative APIs, no scaffolding for P1.2.2 beyond the re-exported allTokens array + named token objects already committed by the contract.
§P2. File layout (final)
src/domains/rules/
lexer.ts # NEW — only file in this task
src/__tests__/domains/rules/
lexer.test.ts # NEW — colocated per repo convention
No index.ts barrel, no parser.ts stub, no types.ts split. Single-file.
§P3. lexer.ts — structural sketch
The source file is organised into the following sections, in order:
§P3.1. Header JSDoc (~30 lines)
- File role + Greek letter κ.
- Links to audit, contract, packet, the concept doc, the extraction §1 grammar, and the s11/s12 specs.
- Note on the ADR-006 drift (one-line pointer to audit §3).
- “Pure module” declaration — no I/O, no side effects.
§P3.2. Imports (~4 lines)
import { createToken, Lexer } from 'chevrotain';
import type { IToken, TokenType, ILexingResult } from 'chevrotain';
§P3.3. Error messages (const strings) — so tests can match exactly
const FLOAT_REJECTED_MESSAGE =
'Float literals are not supported. Use integers with basis-point scaling.';
const UNDERSCORE_INT_REJECTED_MESSAGE =
'Underscore-separated integer literals are not supported. Write digits without separators.';
§P3.4. Identifier + Variable token definitions
Identifier uses \p{XID_Start}\p{XID_Continue}* with the /u flag. Underscore is additionally permitted as a start character (common in TS/JS/Python codebases): the extraction §1 grammar's IDENTIFIER = LETTER { LETTER | DIGIT | "_" } only allows underscore in continuation position, but downstream consumers commonly rely on leading underscores, so the start set is widened deliberately. This is declared explicitly in the packet to be visible on review.
const Identifier = createToken({
name: 'Identifier',
pattern: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
});
const Variable = createToken({
name: 'Variable',
pattern: /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/,
});
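The dot-path shape can be sanity-checked at the regex level alone (a quick sketch, independent of Chevrotain; the pattern is §P3.4's, anchored for standalone testing):

```typescript
// Anchored copy of the Variable pattern for standalone checks.
const variablePattern = /^\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*$/;

console.log(variablePattern.test('$actor'));                      // true
console.log(variablePattern.test('$actor.reputation.execution')); // true — one token, dots included
console.log(variablePattern.test('$'));                           // false — a name is required (test D4)
console.log(variablePattern.test('$actor.'));                     // false — a trailing dot is not part of the path
```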
§P3.5. Keyword token definitions (18)
Each keyword uses longer_alt: Identifier so that an identifier beginning with a keyword (e.g. `rulex`, `rulé`) lexes as a single Identifier rather than a keyword followed by a fragment. This is Chevrotain's idiomatic solution to keyword/identifier overlap.
const Rule = createToken({ name: 'Rule', pattern: /rule\b/, longer_alt: Identifier });
const Guards = createToken({ name: 'Guards', pattern: /guards\b/, longer_alt: Identifier });
const Effects = createToken({ name: 'Effects', pattern: /effects\b/, longer_alt: Identifier });
const When = createToken({ name: 'When', pattern: /when\b/, longer_alt: Identifier });
const Then = createToken({ name: 'Then', pattern: /then\b/, longer_alt: Identifier });
const If = createToken({ name: 'If', pattern: /if\b/, longer_alt: Identifier });
const Else = createToken({ name: 'Else', pattern: /else\b/, longer_alt: Identifier });
const And = createToken({ name: 'And', pattern: /and\b/, longer_alt: Identifier });
const Or = createToken({ name: 'Or', pattern: /or\b/, longer_alt: Identifier });
const Not = createToken({ name: 'Not', pattern: /not\b/, longer_alt: Identifier });
const True = createToken({ name: 'True', pattern: /true\b/, longer_alt: Identifier });
const False = createToken({ name: 'False', pattern: /false\b/, longer_alt: Identifier });
const Admit = createToken({ name: 'Admit', pattern: /admit\b/, longer_alt: Identifier });
const Reject = createToken({ name: 'Reject', pattern: /reject\b/, longer_alt: Identifier });
const Admission = createToken({ name: 'Admission', pattern: /admission\b/, longer_alt: Identifier });
const Transition = createToken({ name: 'Transition', pattern: /transition\b/, longer_alt: Identifier });
const Consequence = createToken({ name: 'Consequence', pattern: /consequence\b/, longer_alt: Identifier });
const Promotion = createToken({ name: 'Promotion', pattern: /promotion\b/, longer_alt: Identifier });
The \b anchors keep keyword recognition honest for ASCII continuations (e.g. `rule1` and `rulex` do not match `rule`); longer_alt covers Unicode continuations such as `rulé`, where the ASCII-only \b would wrongly report a boundary.
§P3.6. Operator token definitions (12)
Two-char operators declared before their single-char starts so longest-match wins.
const Eq = createToken({ name: 'Eq', pattern: /==/ });
const NotEq = createToken({ name: 'NotEq', pattern: /!=/ });
const Lte = createToken({ name: 'Lte', pattern: /<=/ });
const Gte = createToken({ name: 'Gte', pattern: />=/ });
const Arrow = createToken({ name: 'Arrow', pattern: /->/ });
const Lt = createToken({ name: 'Lt', pattern: /</ });
const Gt = createToken({ name: 'Gt', pattern: />/ });
const Plus = createToken({ name: 'Plus', pattern: /\+/ });
const Minus = createToken({ name: 'Minus', pattern: /-/ });
const Mul = createToken({ name: 'Mul', pattern: /\*/ });
const Div = createToken({ name: 'Div', pattern: /\// });
const Mod = createToken({ name: 'Mod', pattern: /%/ });
§P3.7. Delimiter token definitions (7)
const LBrace = createToken({ name: 'LBrace', pattern: /\{/ });
const RBrace = createToken({ name: 'RBrace', pattern: /\}/ });
const LParen = createToken({ name: 'LParen', pattern: /\(/ });
const RParen = createToken({ name: 'RParen', pattern: /\)/ });
const Comma = createToken({ name: 'Comma', pattern: /,/ });
const Colon = createToken({ name: 'Colon', pattern: /:/ });
const Dot = createToken({ name: 'Dot', pattern: /\./ });
§P3.8. Integer + String literals
const StringLiteral = createToken({
name: 'StringLiteral',
pattern: /"(?:\\.|[^"\\])*"/,
});
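The escape handling in the StringLiteral pattern can be exercised standalone (sketch; same regex, anchored):

```typescript
// Anchored copy of the StringLiteral pattern for standalone checks.
const stringLit = /^"(?:\\.|[^"\\])*"/;

console.log(stringLit.exec('"hello"')?.[0]);    // '"hello"'
console.log(stringLit.exec('"he\\"llo"')?.[0]); // '"he\\"llo"' — the \" stays inside the literal
console.log(stringLit.test('"abc'));            // false — unterminated (test H3)
```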
// Reject patterns — must beat IntegerLiteral by ordering
const FloatRejected = createToken({
name: 'FloatRejected',
pattern: /-?[0-9]+\.[0-9]+/,
});
const UnderscoreIntegerRejected = createToken({
name: 'UnderscoreIntegerRejected',
pattern: /-?[0-9]+(?:_[0-9]+)+/,
});
const IntegerLiteral = createToken({
name: 'IntegerLiteral',
pattern: /-?[0-9]+/,
});
Note on negative sign. The IntegerLiteral pattern includes an optional leading `-`. This is greedy, which is what the extraction grammar wants (`INTEGER = "-"? [0-9]+`): `a-5` lexes as Identifier + IntegerLiteral(`-5`). The parser will disambiguate `a - 5` vs `a + -5` via its Unary / Additive productions — not the lexer's job.
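The greedy consequence can be pinned down with a sticky regex standing in for the lexer's scan position (sketch only; in the real lexer, allTokens ordering enforces this):

```typescript
// IntegerLiteral's pattern, sticky so we can emulate "match at position".
const intLit = /-?[0-9]+/y;

// 'a-5': at the '-', the integer pattern wins, eating '-5' as one token.
intLit.lastIndex = 1;
console.log(intLit.exec('a-5')?.[0]);   // '-5'

// 'a - 5': at the '-', no digit follows, so IntegerLiteral fails and the
// lexer falls through to Minus. The parser sees a binary minus.
intLit.lastIndex = 2;
console.log(intLit.exec('a - 5')?.[0]); // undefined — Minus token instead
```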
§P3.9. Whitespace (skipped)
const Whitespace = createToken({
name: 'Whitespace',
pattern: /\s+/,
group: Lexer.SKIPPED,
});
§P3.10. allTokens array — the priority ladder
Ordering is load-bearing; the array is enumerated in full so reviewers can see the entire sequence:
const allTokens: TokenType[] = [
Whitespace, // first (skipped)
FloatRejected, UnderscoreIntegerRejected, // error patterns — beat IntegerLiteral
StringLiteral,
// Keywords (18) — before Identifier, each w/ longer_alt
Rule, Guards, Effects, When, Then, If, Else,
And, Or, Not, True, False,
Admit, Reject, Admission, Transition, Consequence, Promotion,
Variable,
Identifier,
IntegerLiteral,
// Operators — two-char first
Eq, NotEq, Lte, Gte, Arrow,
Lt, Gt, Plus, Minus, Mul, Div, Mod,
// Delimiters
LBrace, RBrace, LParen, RParen, Comma, Colon, Dot,
];
§P3.11. Lexer instance + tokenize wrapper
const lexerInstance = new Lexer(allTokens, {
positionTracking: 'full', // default; explicit for clarity
errorMessageProvider: {
buildUnexpectedCharactersMessage: (_full, _offset, length, _line, _column) =>
`Unexpected character(s) of length ${length}`,
buildUnableToPopLexerModeMessage: () => 'Unable to pop lexer mode',
},
});
export function tokenize(input: string): ILexingResult {
const result = lexerInstance.tokenize(input);
// Augment errors from FloatRejected / UnderscoreIntegerRejected tokens:
// Chevrotain tokenizes them fine (they match the pattern), but we need
// each to show up in errors with a helpful message. Post-process:
const augmentedErrors = [...result.errors];
const filteredTokens: IToken[] = [];
for (const tok of result.tokens) {
if (tok.tokenType === FloatRejected) {
augmentedErrors.push({
offset: tok.startOffset,
length: (tok.endOffset ?? tok.startOffset) - tok.startOffset + 1,
line: tok.startLine ?? 1,
column: tok.startColumn ?? 1,
message: FLOAT_REJECTED_MESSAGE,
});
continue; // drop token — it becomes an error, not a token
}
if (tok.tokenType === UnderscoreIntegerRejected) {
augmentedErrors.push({
offset: tok.startOffset,
length: (tok.endOffset ?? tok.startOffset) - tok.startOffset + 1,
line: tok.startLine ?? 1,
column: tok.startColumn ?? 1,
message: UNDERSCORE_INT_REJECTED_MESSAGE,
});
continue;
}
filteredTokens.push(tok);
}
return { ...result, tokens: filteredTokens, errors: augmentedErrors };
}
The “post-process” approach is cleaner than Chevrotain’s categories / custom matchers — it keeps the token definitions declarative while still surfacing both errors in result.errors. The filtered token stream never carries FloatRejected or UnderscoreIntegerRejected tokens onward to the parser.
§P3.12. Exports
export const Keywords = { Rule, Guards, Effects, When, Then, If, Else, And, Or, Not,
True, False, Admit, Reject, Admission, Transition, Consequence, Promotion };
export const Operators = { Eq, NotEq, Lte, Gte, Lt, Gt, Plus, Minus, Mul, Div, Mod, Arrow };
export const Delimiters = { LBrace, RBrace, LParen, RParen, Comma, Colon, Dot };
export const Literals = { Identifier, Variable, IntegerLiteral, StringLiteral };
export { allTokens }; // tokenize is already exported at its declaration (§P3.11)
export type { IToken, TokenType, ILexingResult, ILexingError } from 'chevrotain';
§P4. Test matrix — lexer.test.ts
§P4.1. Imports + fixtures
import {
tokenize,
Keywords, Operators, Delimiters, Literals,
} from '../../../domains/rules/lexer.js';
const { Rule, Guards, Effects, When, Then, If, Else, And, Or, Not,
True, False, Admit, Reject, Admission, Transition, Consequence, Promotion } = Keywords;
const { Eq, NotEq, Lte, Gte, Lt, Gt, Plus, Minus, Mul, Div, Mod, Arrow } = Operators;
const { LBrace, RBrace, LParen, RParen, Comma, Colon, Dot } = Delimiters;
const { Identifier, Variable, IntegerLiteral, StringLiteral } = Literals;
§P4.2. describe groups + test sketches
Group A — “basic lexing invariants”:
- A1. `tokenize('')` → zero tokens, zero errors, zero groups.
- A2. `tokenize(' \n\t ')` → zero tokens, zero errors.
- A3. token positions are 1-indexed: `tokenize('x').tokens[0]` has `startLine === 1 && startColumn === 1`.
- A4. `tokenize('abc').tokens[0].image === 'abc'`.
- A5. purity — `tokenize('x == 5')` called twice yields deep-equal `tokens` arrays.
Group B — “all 18 keywords”:
- B1 through B18 (either one loop-driven test or one test per keyword, depending on brevity preference; likely a single table-driven loop over `[['rule', Rule], ['guards', Guards], …]`).
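The table-driven shape could look like this (a sketch of the loop skeleton only — the real `tokenize`/`Keywords` assertion is shown as a comment, and here the loop just validates the table's own name/image consistency):

```typescript
// Sketch of the Group B table: 18 [image, tokenName] rows, one per keyword.
// In lexer.test.ts each row asserts
//   tokenize(image).tokens[0].tokenType.name === tokenName
const keywordTable: Array<[string, string]> = [
  ['rule', 'Rule'], ['guards', 'Guards'], ['effects', 'Effects'],
  ['when', 'When'], ['then', 'Then'], ['if', 'If'], ['else', 'Else'],
  ['and', 'And'], ['or', 'Or'], ['not', 'Not'],
  ['true', 'True'], ['false', 'False'],
  ['admit', 'Admit'], ['reject', 'Reject'], ['admission', 'Admission'],
  ['transition', 'Transition'], ['consequence', 'Consequence'], ['promotion', 'Promotion'],
];

console.log(keywordTable.length); // 18
for (const [image, tokenName] of keywordTable) {
  // Each token name is the capitalized keyword image — catches typo'd rows.
  if (tokenName.toLowerCase() !== image) throw new Error(`bad row: ${image}`);
}
```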
Group C — “keyword vs identifier boundary”:
- C1. `rulex` tokenizes as Identifier, not Rule + `x`.
- C2. `rule_x` tokenizes as Identifier.
- C3. `myRule` tokenizes as Identifier.
- C4. `rule` (standalone) tokenizes as Rule.
Group D — “variables”:
- D1. `$actor` → Variable with image `$actor`.
- D2. `$actor.reputation.execution` → one Variable with image `$actor.reputation.execution`.
- D3. `$a.b` → one Variable.
- D4. `$` alone (no name) → error.
Group E — “integer literals”:
- E1. `0`, `1`, `123`, `9999` → IntegerLiteral.
- E2. `-5` → IntegerLiteral with image `-5`.
- E3. `3.14` → lex error at line 1 col 1, message contains "Float literals".
- E4. `1_000_000` → lex error at line 1 col 1, message contains "Underscore-separated".
Group F — “operators (12)”:
- F1. Table-driven: each operator in isolation maps to its tokenType.
- F2. `==` lexes as a single Eq token (no single-char `=` exists in the token list; this test locks in that two-char recognition still works with adjacent whitespace: `a == b` → Identifier, Eq, Identifier).
- F3. `->` beats `-`: `a -> b` tokenizes as Identifier + Arrow + Identifier.
- F4. `<=`, `>=` beat `<`, `>`.
Group G — “delimiters”:
- G1. Each of `{` `}` `(` `)` `,` `:` `.` maps to its delimiter tokenType.
Group H — “string literals”:
- H1. `"hello"` → StringLiteral with image `"hello"`.
- H2. `"he\"llo"` → StringLiteral preserving the escaped quote.
- H3. Unterminated string `"abc` → lex error.
Group I — “Unicode identifiers”:
- I1. `règle` tokenizes as Identifier.
- I2. `日本語` tokenizes as Identifier.
- I3. `règle_日本語` tokenizes as a single Identifier.
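Group I's expectations follow directly from the `/u`-flagged property escapes (quick standalone sketch of §P3.4's Identifier regex, anchored):

```typescript
// Same Identifier pattern as §P3.4, anchored for standalone checks.
const identifier = /^[\p{XID_Start}_][\p{XID_Continue}]*$/u;

console.log(identifier.test('règle'));        // true — Latin with diacritics
console.log(identifier.test('日本語'));        // true — CJK
console.log(identifier.test('règle_日本語'));  // true — '_' is XID_Continue, so one identifier
console.log(identifier.test('1abc'));         // false — digits cannot start an identifier
```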
Group J — “position tracking on multi-line input”:
- J1. `'abc\n def'` — second token starts at line 2, column 3.
- J2. `'\n\n 3.14'` — float error reports line 3, column 3.
Group K — “AcceptCommitment golden snippet”:
- K1. The full `rule AcceptCommitment { … }` block from the concept doc (line 97–110) tokenizes without errors. The test asserts:
  - token sequence begins with `[Rule, Identifier("AcceptCommitment"), LBrace, Guards, …]`.
  - exactly 0 errors.
  - the snippet uses `guards { }` and `effects { }` from the extraction grammar (the task uses the extraction superset; the concept doc's `guard:`/`effects:` prefix style is a different surface and not tokenized in this test).
The AcceptCommitment fixture used by K1:
rule AcceptCommitment {
  guards {
    $actor.reputation.execution >= 100 -> admit
    else -> reject "reputation too low"
  }
  effects {
    transition($event.id, "PENDING", "ACCEPTED")
    consequence($actor, 10)
  }
}
This fixture is a synthesised superset-shaped example consistent with the extraction §1 grammar. It replaces the concept doc’s guard: / effects: style because Phase 1 commits to the superset syntax. Mentioned explicitly in the test so reviewers understand the divergence.
Group L — “error-recovery sanity”:
- L1. `@` (unknown char) → Chevrotain's default lex error; errors array non-empty.
- L2. `a @ b` — lexer recovers and still emits tokens around the `@`.
§P4.3. Target coverage
- Line/branch coverage on `lexer.ts` at 100% (no dead branches in a lexer file — every token definition is exercised by at least one positive test, each error path by at least one negative test).
- No snapshot tests — assertions are explicit on `tokenType.name` and `image`.
Test count estimate: ~22 tests grouped into 12 describe blocks.
§P5. Order of operations in Step 4
1. `npm install chevrotain@11.0.3` inside the worktree. This updates `package.json` + `package-lock.json`.
2. Write `src/domains/rules/lexer.ts` per §P3.
3. Write `src/__tests__/domains/rules/lexer.test.ts` per §P4.
4. `npm run build` — confirms TS compiles.
5. `npm run lint` — confirms eslint passes. If any `any` leaks in (from Chevrotain token shapes), handle with explicit types, not with `eslint-disable`.
6. `npm test` — confirms 1085 + N tests pass.
7. Commit everything in one commit: `feat(r81-b-p1-2-1-lexer): Chevrotain-based κ DSL lexer (18 keywords, 12 ops, 7 categories)`.

If any gate fails, fix in place and repeat steps 4–6. Do not split the commit.
§P6. Acceptance-criteria → plan mapping
| AC (from task prompt) | §P3 ref | §P4 ref |
|---|---|---|
| Chevrotain pinned exact in dependencies | §P5.1 + §P1 | — |
| 7 token categories | §P3.4–P3.9 | §P4.2 Groups D/E/F/G/H + implicit Whitespace |
| 18 keywords | §P3.5 | §P4.2 Group B |
| 12 operators | §P3.6 | §P4.2 Group F |
| Variables `$`-prefixed dot-path | §P3.4 + §P3.5 | §P4.2 Group D |
| Line/column per token | §P3.11 (`positionTracking: 'full'`) | §P4.2 A3, J1–2 |
| Float rejected with position | §P3.8 + §P3.11 | §P4.2 E3, J2 |
| Underscore int rejected | §P3.8 + §P3.11 | §P4.2 E4 |
| Unicode identifiers via `\p{XID_Start}` + `/u` | §P3.4 | §P4.2 Group I |
Every AC has a §P3 implementation point and a §P4 test reference.
§P7. Risk register
| Risk | Mitigation |
|---|---|
| Chevrotain 11 type surface differs from older tutorials | Rely on `import type { … } from 'chevrotain'`; TS compiler catches drift. |
| Keyword `\b` fails across Unicode boundaries | Use `longer_alt: Identifier` as the primary mechanism; `\b` is defensive. |
| IntegerLiteral negative-sign ambiguity in expressions | Out of scope — parser handles; lexer is greedy per extraction grammar. |
| Jest ESM + ts-jest surprise with new file | Config is already proven by ε/δ domains. No config delta. |
| `package-lock.json` merge conflict with R81.A in-flight | R81.A lands in `src/domains/execution/integer-math.ts` — disjoint from `package.json`. R81.A does not add a new dep. No conflict expected. Still: commit both files together so git can resolve cleanly if merge order changes. |
| Pre-existing subprocess-smoke flake | Known; documented in MEMORY.md. Re-run once if flaky. |
| `chevrotain@11.0.3` transitive deps | Chevrotain has zero runtime deps — the package-lock delta is minimal. |
§P8. Post-impl acceptance check
Before calling Step 4 done:
- `npm run build` green.
- `npm run lint` green — no `any`, no disables.
- `npm test` green — new tests green; prior 1085 unchanged or higher.
- `chevrotain@11.0.3` appears in `package.json` `dependencies` exactly (no `^`, no `~`).
- `package-lock.json` modified alongside `package.json`.
- `src/domains/rules/lexer.ts` exists; no other files under `src/domains/rules/` yet.
- `src/__tests__/domains/rules/lexer.test.ts` exists.
- No edits to any file outside of: `package.json`, `package-lock.json`, `src/domains/rules/lexer.ts`, `src/__tests__/domains/rules/lexer.test.ts`, and the step-4 commit message.
Packet is gating — if any of the above is unclear on review, reviewer asks before Step 4 starts. (Self-review by the executing agent satisfies this gate for a non-interactive run.)
§P9. Summary
Single impl file, single test file, one new dep (chevrotain@11.0.3), 18+12+7 token grammar, ~22 tests, 100% coverage. No surprises. Drift note from audit §3 re-propagated into the PR body and the verification doc in Step 5.
Next step: implement (Step 4 of 5).