R81.B / P1.2.1 — κ DSL Lexer — Execution Packet

Step 3 of the 5-step executor chain. Builds on audit (§1) + contract (§2). Packets gate implementation: once approved, Step 4 is a mechanical translation of this plan.

§P1. Plan at a glance

Deliver, in a single implementation commit (Step 4):

  1. chevrotain@11.0.3 pinned in package.json dependencies.
  2. package-lock.json regenerated.
  3. src/domains/rules/lexer.ts — one TypeScript file, roughly 200–260 lines including the JSDoc preamble and comments. No other source files under src/domains/rules/ in this task.
  4. src/__tests__/domains/rules/lexer.test.ts — one Jest file, roughly 300–400 lines. Target 15–25 cases.

Packet is tight on purpose — no speculative APIs, no scaffolding for P1.2.2 beyond the re-exported allTokens array + named token objects already committed by the contract.

§P2. File layout (final)

src/domains/rules/
  lexer.ts                              # NEW — only file in this task
src/__tests__/domains/rules/
  lexer.test.ts                         # NEW — colocated per repo convention

No index.ts barrel, no parser.ts stub, no types.ts split. Single-file.

§P3. lexer.ts — structural sketch

The source file is organised into the following sections, in order:

§P3.1. Header JSDoc (~30 lines)

  • File role + Greek letter κ.
  • Links to audit, contract, packet, the concept doc, the extraction §1 grammar, and the s11/s12 specs.
  • Note on the ADR-006 drift (one-line pointer to audit §3).
  • “Pure module” declaration — no I/O, no side effects.

§P3.2. Imports (~4 lines)

import { createToken, Lexer } from 'chevrotain';
import type { IToken, TokenType, ILexingResult } from 'chevrotain';

§P3.3. Error messages (const strings) — so tests can match exactly

const FLOAT_REJECTED_MESSAGE =
  'Float literals are not supported. Use integers with basis-point scaling.';
const UNDERSCORE_INT_REJECTED_MESSAGE =
  'Underscore-separated integer literals are not supported. Write digits without separators.';

§P3.4. Identifier + Variable token definitions

Identifier uses \p{XID_Start}\p{XID_Continue}* with the /u flag. Underscore is additionally permitted as a start character (common in TS/JS/Python codebases), even though the extraction §1 grammar's IDENTIFIER = LETTER { LETTER | DIGIT | "_" } only admits underscore after the first character; downstream consumers commonly expect leading underscores. The deviation is declared explicitly in the packet so it is visible on review.

const Identifier = createToken({
  name: 'Identifier',
  pattern: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
});

const Variable = createToken({
  name: 'Variable',
  pattern: /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/,
});
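As a quick sanity check, the Variable pattern can be probed with an anchored plain RegExp, no Chevrotain required (the probe itself is not part of the deliverable):

```typescript
// Anchored probe of the Variable pattern against the shapes used in Group D.
const variableRe = /^\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*$/;

variableRe.test('$actor');                      // true
variableRe.test('$actor.reputation.execution'); // true — one token, dot-path included
variableRe.test('$');                           // false — a bare '$' is a lex error
```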

§P3.5. Keyword token definitions (18)

Each keyword uses longer_alt: Identifier so that rulex lexes as a single Identifier, not Rule followed by an Identifier. This is Chevrotain's idiomatic solution to keyword/identifier overlap.

const Rule       = createToken({ name: 'Rule',       pattern: /rule\b/,       longer_alt: Identifier });
const Guards     = createToken({ name: 'Guards',     pattern: /guards\b/,     longer_alt: Identifier });
const Effects    = createToken({ name: 'Effects',    pattern: /effects\b/,    longer_alt: Identifier });
const When       = createToken({ name: 'When',       pattern: /when\b/,       longer_alt: Identifier });
const Then       = createToken({ name: 'Then',       pattern: /then\b/,       longer_alt: Identifier });
const If         = createToken({ name: 'If',         pattern: /if\b/,         longer_alt: Identifier });
const Else       = createToken({ name: 'Else',       pattern: /else\b/,       longer_alt: Identifier });
const And        = createToken({ name: 'And',        pattern: /and\b/,        longer_alt: Identifier });
const Or         = createToken({ name: 'Or',         pattern: /or\b/,         longer_alt: Identifier });
const Not        = createToken({ name: 'Not',        pattern: /not\b/,        longer_alt: Identifier });
const True       = createToken({ name: 'True',       pattern: /true\b/,       longer_alt: Identifier });
const False      = createToken({ name: 'False',      pattern: /false\b/,      longer_alt: Identifier });
const Admit      = createToken({ name: 'Admit',      pattern: /admit\b/,      longer_alt: Identifier });
const Reject     = createToken({ name: 'Reject',     pattern: /reject\b/,     longer_alt: Identifier });
const Admission  = createToken({ name: 'Admission',  pattern: /admission\b/,  longer_alt: Identifier });
const Transition = createToken({ name: 'Transition', pattern: /transition\b/, longer_alt: Identifier });
const Consequence = createToken({ name: 'Consequence', pattern: /consequence\b/, longer_alt: Identifier });
const Promotion  = createToken({ name: 'Promotion',  pattern: /promotion\b/,  longer_alt: Identifier });

\b anchors keep keyword recognition honest against ASCII continuations (e.g. rule1 does not lex as Rule); longer_alt covers the cases \b misses, such as ruleé, because \b only recognises ASCII word characters and treats a Unicode continuation as a boundary.
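The boundary behaviour is easy to verify with plain RegExps, no Chevrotain required:

```typescript
// Sketch (plain RegExp): \b handles ASCII continuations, but not Unicode
// ones — which is exactly what longer_alt is for.
const ruleKw = /^rule\b/;

ruleKw.test('rule');  // true  — boundary at end of input
ruleKw.test('rule1'); // false — '1' is an ASCII word char, no boundary
ruleKw.test('rulex'); // false — 'x' is an ASCII word char, no boundary
ruleKw.test('ruleé'); // true(!) — 'é' is not \w, so \b alone misfires here
```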

§P3.6. Operator token definitions (12)

Two-char operators declared before their single-char starts so longest-match wins.

const Eq     = createToken({ name: 'Eq',     pattern: /==/   });
const NotEq  = createToken({ name: 'NotEq',  pattern: /!=/   });
const Lte    = createToken({ name: 'Lte',    pattern: /<=/   });
const Gte    = createToken({ name: 'Gte',    pattern: />=/   });
const Arrow  = createToken({ name: 'Arrow',  pattern: /->/   });
const Lt     = createToken({ name: 'Lt',     pattern: /</    });
const Gt     = createToken({ name: 'Gt',     pattern: />/    });
const Plus   = createToken({ name: 'Plus',   pattern: /\+/   });
const Minus  = createToken({ name: 'Minus',  pattern: /-/    });
const Mul    = createToken({ name: 'Mul',    pattern: /\*/   });
const Div    = createToken({ name: 'Div',    pattern: /\//   });
const Mod    = createToken({ name: 'Mod',    pattern: /%/    });

§P3.7. Delimiter token definitions (7)

const LBrace = createToken({ name: 'LBrace', pattern: /\{/ });
const RBrace = createToken({ name: 'RBrace', pattern: /\}/ });
const LParen = createToken({ name: 'LParen', pattern: /\(/ });
const RParen = createToken({ name: 'RParen', pattern: /\)/ });
const Comma  = createToken({ name: 'Comma',  pattern: /,/  });
const Colon  = createToken({ name: 'Colon',  pattern: /:/  });
const Dot    = createToken({ name: 'Dot',    pattern: /\./ });

§P3.8. Integer + String literals

const StringLiteral = createToken({
  name: 'StringLiteral',
  pattern: /"(?:\\.|[^"\\])*"/,
});

// Reject patterns — must beat IntegerLiteral by ordering
const FloatRejected = createToken({
  name: 'FloatRejected',
  pattern: /-?[0-9]+\.[0-9]+/,
});
const UnderscoreIntegerRejected = createToken({
  name: 'UnderscoreIntegerRejected',
  pattern: /-?[0-9]+(?:_[0-9]+)+/,
});

const IntegerLiteral = createToken({
  name: 'IntegerLiteral',
  pattern: /-?[0-9]+/,
});
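Assuming plain anchored RegExp probes (not part of the deliverable), the StringLiteral escape handling can be sanity-checked directly:

```typescript
// Anchored whole-input probe of the StringLiteral pattern.
const stringRe = /^"(?:\\.|[^"\\])*"$/;

stringRe.test('"hello"');    // true
stringRe.test('"he\\"llo"'); // true — the escaped quote stays inside the literal
stringRe.test('"abc');       // false — unterminated strings fail to match
```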

Note on negative sign. The IntegerLiteral pattern includes an optional leading -. This is greedy, which is what the extraction grammar wants (INTEGER = "-"? [0-9]+). The parser will disambiguate a - 5 vs a + -5 via its Unary / Additive productions — not the lexer’s job.
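The greedy sign capture can be sketched with a sticky regex probing a single offset — a simplified stand-in for the lexer's per-position matching, not the real engine:

```typescript
// Hypothetical sticky-regex probe of the IntegerLiteral pattern at one offset.
const INT = /-?[0-9]+/y;

function matchIntAt(input: string, offset: number): string | null {
  INT.lastIndex = offset;
  const m = INT.exec(input);
  return m ? m[0] : null;
}

matchIntAt('-5', 0);    // '-5' — the sign is consumed by IntegerLiteral
matchIntAt('a - 5', 2); // null — no digit follows '- ', so Minus matches instead
matchIntAt('a -5', 2);  // '-5'
```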

§P3.9. Whitespace (skipped)

const Whitespace = createToken({
  name: 'Whitespace',
  pattern: /\s+/,
  group: Lexer.SKIPPED,
});

§P3.10. allTokens array — the priority ladder

Ordering is load-bearing; the array is enumerated in full so reviewers can see the exact sequence:

const allTokens: TokenType[] = [
  Whitespace,                               // first (skipped)
  FloatRejected, UnderscoreIntegerRejected, // error patterns — beat IntegerLiteral
  StringLiteral,
  // Keywords (18) — before Identifier, each w/ longer_alt
  Rule, Guards, Effects, When, Then, If, Else,
  And, Or, Not, True, False,
  Admit, Reject, Admission, Transition, Consequence, Promotion,
  Variable,
  Identifier,
  IntegerLiteral,
  // Operators — two-char first
  Eq, NotEq, Lte, Gte, Arrow,
  Lt, Gt, Plus, Minus, Mul, Div, Mod,
  // Delimiters
  LBrace, RBrace, LParen, RParen, Comma, Colon, Dot,
];
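Why the ordering is load-bearing can be sketched with a toy first-match ladder. This is a hypothetical, simplified model of how Chevrotain walks allTokens in order at each offset, not the actual engine:

```typescript
// Toy first-match-wins ladder: the first pattern in the list that matches
// at the offset wins, so FloatRejected must precede IntegerLiteral.
type Spec = { name: string; re: RegExp };

const ladder: Spec[] = [
  { name: 'FloatRejected', re: /-?[0-9]+\.[0-9]+/y },
  { name: 'IntegerLiteral', re: /-?[0-9]+/y },
];

function firstMatch(input: string): string | null {
  for (const { name, re } of ladder) {
    re.lastIndex = 0;
    if (re.exec(input)) return name;
  }
  return null;
}

firstMatch('3.14'); // 'FloatRejected' — listed first, so it beats IntegerLiteral
firstMatch('314');  // 'IntegerLiteral'
```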

§P3.11. Lexer instance + tokenize wrapper

const lexerInstance = new Lexer(allTokens, {
  positionTracking: 'full',           // default; explicit for clarity
  errorMessageProvider: {
    buildUnexpectedCharactersMessage: (_full, _offset, length, _line, _column) =>
      `Unexpected character(s) of length ${length}`,
    buildUnableToPopLexerModeMessage: () => 'Unable to pop lexer mode',
  },
});

export function tokenize(input: string): ILexingResult {
  const result = lexerInstance.tokenize(input);

  // Augment errors from FloatRejected / UnderscoreIntegerRejected tokens:
  // Chevrotain tokenizes them fine (they match the pattern), but we need
  // each to show up in errors with a helpful message. Post-process:
  const augmentedErrors = [...result.errors];
  const filteredTokens: IToken[] = [];
  for (const tok of result.tokens) {
    if (tok.tokenType === FloatRejected) {
      augmentedErrors.push({
        offset: tok.startOffset,
        length: (tok.endOffset ?? tok.startOffset) - tok.startOffset + 1,
        line: tok.startLine ?? 1,
        column: tok.startColumn ?? 1,
        message: FLOAT_REJECTED_MESSAGE,
      });
      continue;   // drop token — it becomes an error, not a token
    }
    if (tok.tokenType === UnderscoreIntegerRejected) {
      augmentedErrors.push({
        offset: tok.startOffset,
        length: (tok.endOffset ?? tok.startOffset) - tok.startOffset + 1,
        line: tok.startLine ?? 1,
        column: tok.startColumn ?? 1,
        message: UNDERSCORE_INT_REJECTED_MESSAGE,
      });
      continue;
    }
    filteredTokens.push(tok);
  }
  return { ...result, tokens: filteredTokens, errors: augmentedErrors };
}

The “post-process” approach is cleaner than Chevrotain’s categories / custom matchers — it keeps the token definitions declarative while still surfacing both errors in result.errors. The filtered token stream never carries FloatRejected or UnderscoreIntegerRejected tokens onward to the parser.

§P3.12. Exports

export const Keywords = { Rule, Guards, Effects, When, Then, If, Else, And, Or, Not,
  True, False, Admit, Reject, Admission, Transition, Consequence, Promotion };
export const Operators = { Eq, NotEq, Lte, Gte, Lt, Gt, Plus, Minus, Mul, Div, Mod, Arrow };
export const Delimiters = { LBrace, RBrace, LParen, RParen, Comma, Colon, Dot };
export const Literals = { Identifier, Variable, IntegerLiteral, StringLiteral };
export { allTokens };   // tokenize is already exported at its definition
export type { IToken, TokenType, ILexingResult, ILexingError } from 'chevrotain';

§P4. Test matrix — lexer.test.ts

§P4.1. Imports + fixtures

import {
  tokenize,
  Keywords, Operators, Delimiters, Literals,
} from '../../../domains/rules/lexer.js';

const { Rule, Guards, Effects, When, Then, If, Else, And, Or, Not,
  True, False, Admit, Reject, Admission, Transition, Consequence, Promotion } = Keywords;
const { Eq, NotEq, Lte, Gte, Lt, Gt, Plus, Minus, Mul, Div, Mod, Arrow } = Operators;
const { LBrace, RBrace, LParen, RParen, Comma, Colon, Dot } = Delimiters;
const { Identifier, Variable, IntegerLiteral, StringLiteral } = Literals;

§P4.2. describe groups + test sketches

Group A — “basic lexing invariants”:

  • A1. tokenize('') → zero tokens, zero errors, zero groups.
  • A2. tokenize(' \n\t ') → zero tokens, zero errors.
  • A3. token positions are 1-indexed: tokenize('x')[0].startLine === 1 && startColumn === 1.
  • A4. tokenize('abc')[0].image === 'abc'.
  • A5. purity — tokenize('x == 5') called twice yields deep-equal tokens arrays.

Group B — “all 18 keywords”:

  • B1 through B18 (loop-driven inside one test or 18 tests-per-keyword depending on brevity preference; likely one test per group with a table-driven loop of [['rule', Rule], ['guards', Guards], …]).
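A plausible shape for that table-driven loop, probed here with plain \b-anchored RegExps rather than the real lexer (the real test would assert tokenType against tokenize output):

```typescript
// Hypothetical sketch: each \b-anchored keyword pattern must accept the bare
// keyword and reject it with a trailing word character appended.
const keywords = [
  'rule', 'guards', 'effects', 'when', 'then', 'if', 'else',
  'and', 'or', 'not', 'true', 'false', 'admit', 'reject',
  'admission', 'transition', 'consequence', 'promotion',
];

const allOk = keywords.every((kw) => {
  const re = new RegExp(`^${kw}\\b`);
  return re.test(kw) && !re.test(`${kw}x`);
});
// allOk === true
```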

Group C — “keyword vs identifier boundary”:

  • C1. rulex tokenizes as Identifier, not Rule + x.
  • C2. rule_x tokenizes as Identifier.
  • C3. myRule tokenizes as Identifier.
  • C4. rule (standalone) tokenizes as Rule.

Group D — “variables”:

  • D1. $actor → Variable with image $actor.
  • D2. $actor.reputation.execution → one Variable with image $actor.reputation.execution.
  • D3. $a.b → one Variable.
  • D4. $ alone (no name) → error.

Group E — “integer literals”:

  • E1. 0, 1, 123, 9999 → IntegerLiteral.
  • E2. -5 → IntegerLiteral with image -5.
  • E3. 3.14 → lex error at line 1 col 1, message contains “Float literals”.
  • E4. 1_000_000 → lex error at line 1 col 1, message contains “Underscore-separated”.

Group F — “operators (12)”:

  • F1. Table-driven: each operator in isolation maps to its tokenType.
  • F2. == lexes as a single Eq token (there is no single-char = in the token list); the fixture a == b locks in two-char recognition with adjacent whitespace.
  • F3. -> beats -: a -> b tokenizes as Identifier + Arrow + Identifier.
  • F4. <=, >= beat <, >.

Group G — “delimiters”:

  • G1. Each of { } ( ) , : . maps to its delimiter tokenType.

Group H — “string literals”:

  • H1. "hello" → StringLiteral with image "hello".
  • H2. "he\"llo" → StringLiteral preserving escaped quote.
  • H3. Unterminated string "abc → lex error.

Group I — “Unicode identifiers”:

  • I1. règle tokenizes as Identifier.
  • I2. 日本語 tokenizes as Identifier.
  • I3. règle_日本語 tokenizes as a single Identifier.
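The Unicode identifier claims can be checked against the raw pattern with a plain /u RegExp (anchored here for whole-input matching):

```typescript
// Anchored probe of the Identifier pattern from §P3.4.
const identRe = /^[\p{XID_Start}_][\p{XID_Continue}]*$/u;

identRe.test('règle');        // true
identRe.test('日本語');        // true
identRe.test('règle_日本語');  // true — '_' is in XID_Continue
identRe.test('1abc');         // false — digits cannot start an identifier
```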

Group J — “position tracking on multi-line input”:

  • J1. 'abc\n def' — second token starts line 2, column 2.
  • J2. '\n\n 3.14' — float error reports line 3, column 2.
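Chevrotain's 1-based line/column convention can be mirrored by a small hypothetical helper (lineCol is illustration only, not part of the deliverable):

```typescript
// Compute 1-based line/column for a given offset, the same convention
// Chevrotain uses with positionTracking: 'full'.
function lineCol(input: string, offset: number): { line: number; column: number } {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset; i++) {
    if (input[i] === '\n') { line++; column = 1; } else { column++; }
  }
  return { line, column };
}

lineCol('abc\n def', 5); // { line: 2, column: 2 } — start of the 'def' token
```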

Group K — “AcceptCommitment golden snippet”:

  • K1. The full rule AcceptCommitment { … } block (fixture below; adapted from the concept doc, lines 97–110) tokenizes without errors. The test asserts:
    • the token sequence begins with [Rule, Identifier("AcceptCommitment"), LBrace, Guards, …].
    • exactly 0 errors.
    • the snippet uses guards { } and effects { } from the extraction grammar (Phase 1 commits to the extraction superset; the concept doc's guard: / effects: prefix style is a different surface and is not tokenized in this test).

The AcceptCommitment fixture used by K1:

rule AcceptCommitment {
  guards {
    $actor.reputation.execution >= 100 -> admit
    else -> reject "reputation too low"
  }
  effects {
    transition($event.id, "PENDING", "ACCEPTED")
    consequence($actor, 10)
  }
}

This fixture is a synthesised superset-shaped example consistent with the extraction §1 grammar. It replaces the concept doc’s guard: / effects: style because Phase 1 commits to the superset syntax. Mentioned explicitly in the test so reviewers understand the divergence.

Group L — “error-recovery sanity”:

  • L1. @ (unknown char) → Chevrotain’s default lex error, errors array non-empty.
  • L2. a @ b — lexer recovers and still emits tokens around the @.

§P4.3. Target coverage

  • Line/branch coverage on lexer.ts at 100% (no dead branches in a lexer file — every token definition is exercised by at least one positive test, each error path by at least one negative test).
  • No snapshot tests — assertions are explicit on tokenType.name and image.

Test count estimate: ~22 tests grouped into 12 describe blocks.

§P5. Order of operations in Step 4

  1. npm install chevrotain@11.0.3 inside the worktree. This updates package.json + package-lock.json.
  2. Write src/domains/rules/lexer.ts per §P3.
  3. Write src/__tests__/domains/rules/lexer.test.ts per §P4.
  4. npm run build — confirms TS compiles.
  5. npm run lint — confirms eslint passes. If an any type leaks in (from Chevrotain token shapes), fix it with explicit types, not with eslint-disable.
  6. npm test — confirms 1085 + N tests pass.
  7. Commit everything in one commit: feat(r81-b-p1-2-1-lexer): Chevrotain-based κ DSL lexer (18 keywords, 12 ops, 7 categories).

If any gate fails, fix in-place and repeat steps 4–6. Do not split the commit.

§P6. Acceptance-criteria → plan mapping

| AC (from task prompt) | §P3 ref | §P4 ref |
| --- | --- | --- |
| Chevrotain pinned exact in dependencies | §P5.1 + §P1 | — |
| 7 token categories | §P3.4–P3.9 | §P4.2 Groups D/E/F/G/H + implicit Whitespace |
| 18 keywords | §P3.5 | §P4.2 Group B |
| 12 operators | §P3.6 | §P4.2 Group F |
| Variables ($-prefixed dot-path) | §P3.4 + §P3.5 | §P4.2 Group D |
| Line/column per token | §P3.11 (positionTracking: 'full') | §P4.2 A3, J1–2 |
| Float rejected with position | §P3.8 + §P3.11 | §P4.2 E3, J2 |
| Underscore int rejected | §P3.8 + §P3.11 | §P4.2 E4 |
| Unicode identifiers via \p{XID_Start} + /u | §P3.4 | §P4.2 Group I |

Every AC has a §P3 implementation point and a §P4 test reference.

§P7. Risk register

| Risk | Mitigation |
| --- | --- |
| Chevrotain 11 type surface differs from older tutorials | Rely on import type { … } from 'chevrotain'; TS compiler catches drift. |
| Keyword \b fails across Unicode boundaries | Use longer_alt: Identifier as the primary mechanism; \b is defensive. |
| IntegerLiteral negative-sign ambiguity in expressions | Out of scope — parser handles it; lexer is greedy per the extraction grammar. |
| Jest ESM + ts-jest surprise with the new file | Config is already proven by the ε/δ domains. No config delta. |
| package-lock.json merge conflict with R81.A in-flight | R81.A lands in src/domains/execution/integer-math.ts — disjoint from package.json — and adds no new dep. No conflict expected. Still: commit both files together so git can resolve cleanly if merge order changes. |
| Pre-existing subprocess-smoke flake | Known; documented in MEMORY.md. Re-run once if flaky. |
| chevrotain@11.0.3 transitive deps | Chevrotain has zero runtime dependencies — the package-lock delta is minimal. |

§P8. Post-impl acceptance check

Before calling Step 4 done:

  • npm run build green.
  • npm run lint green — no any, no disables.
  • npm test green — new tests green; prior 1085 unchanged or higher.
  • chevrotain@11.0.3 appears in package.json dependencies exactly (no ^, no ~).
  • package-lock.json modified alongside package.json.
  • src/domains/rules/lexer.ts exists; no other files under src/domains/rules/ yet.
  • src/__tests__/domains/rules/lexer.test.ts exists.
  • No edits to any file outside of: package.json, package-lock.json, src/domains/rules/lexer.ts, src/__tests__/domains/rules/lexer.test.ts, and the step-4 commit message.

Packet is gating — if any of the above is unclear on review, reviewer asks before Step 4 starts. (Self-review by the executing agent satisfies this gate for a non-interactive run.)

§P9. Summary

Single impl file, single test file, one new dep (chevrotain@11.0.3), 18+12+7 token grammar, ~22 tests, 100% coverage. No surprises. Drift note from audit §3 re-propagated into the PR body and the verification doc in Step 5.

Next step: implement (Step 4 of 5).

