R81.B / P1.2.1 — κ DSL Lexer — Behavioral Contract

Step 2 of the 5-step executor chain. Builds on docs/audits/r81-b-p1-2-1-lexer-audit.md. Defines the public surface, semantics, and invariants for src/domains/rules/lexer.ts.

§1. Module identity

  • Path: src/domains/rules/lexer.ts
  • Axis: κ — Rule Engine (Phase 1)
  • Kind: pure synchronous module; no I/O, no DB, no network, no env reads, no console output
  • Runtime dependency: chevrotain@11.0.3 — exact pin
  • Internal dependencies: none (does not import from src/domains/*, src/db/*, src/middleware/*)

§2. Public API

The module exports six named entities — five token registries plus the tokenize entry point — along with a handful of typed re-exports:

// Token-type registry (for the parser to consume)
export const allTokens: TokenType[];

// Named tokens — the parser consumes these by object reference (not by string name)
export const Keywords: {
  Rule: TokenType; Guards: TokenType; Effects: TokenType;
  When: TokenType; Then: TokenType; If: TokenType; Else: TokenType;
  And: TokenType; Or: TokenType; Not: TokenType;
  True: TokenType; False: TokenType;
  Admit: TokenType; Reject: TokenType;
  Admission: TokenType; Transition: TokenType;
  Consequence: TokenType; Promotion: TokenType;
};

export const Operators: {
  Eq: TokenType; NotEq: TokenType;
  Lte: TokenType; Gte: TokenType; Lt: TokenType; Gt: TokenType;
  Plus: TokenType; Minus: TokenType; Mul: TokenType; Div: TokenType; Mod: TokenType;
  Arrow: TokenType;
};

export const Delimiters: {
  LBrace: TokenType; RBrace: TokenType;
  LParen: TokenType; RParen: TokenType;
  Comma: TokenType; Colon: TokenType; Dot: TokenType;
};

export const Literals: {
  Identifier: TokenType;
  Variable: TokenType;
  IntegerLiteral: TokenType;
  StringLiteral: TokenType;
};

// The entry point — wraps Chevrotain's Lexer.tokenize
export function tokenize(input: string): ILexingResult;

Where TokenType and ILexingResult are re-exported from chevrotain:

export type { IToken, TokenType, ILexingResult, ILexingError } from 'chevrotain';

This gives the future parser (src/domains/rules/parser.ts, P1.2.2) a clean import surface without reaching into Chevrotain directly.
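
For illustration, a downstream consumer might use this surface as follows (a hypothetical sketch; the rule source is invented):

// Hypothetical consumer sketch, assuming the §2 surface:
import { tokenize, Keywords } from './lexer';

const result = tokenize('rule AcceptCommitment { }');
if (result.errors.length === 0) {
  const [first] = result.tokens;
  // Token types compare by object reference, never by name:
  const startsWithRuleKeyword = first.tokenType === Keywords.Rule;
}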

§3. Function semantics — tokenize

Signature: tokenize(input: string): ILexingResult

Behavior:

  • Delegates to a module-level Chevrotain Lexer instance constructed once from allTokens (see the sketch below).
  • Never throws. All lexical errors are returned as entries in result.errors with line/column/length/offset/message.
  • Returns { tokens: IToken[], groups: Record<string, IToken[]>, errors: ILexingError[] }.
  • input === '' returns { tokens: [], groups: {}, errors: [] }.
  • Whitespace (space, \t, \r, \n) is consumed and does not appear in tokens or errors.
  • Every token in tokens carries startLine, startColumn, endLine, endColumn, startOffset, endOffset, and image.

Purity:

  • No time reads, no random reads, no DB reads, no network calls, no file-system access.
  • No side effects on import; tokenize('x') called at module load yields no observable effect beyond the returned value.
  • Calling tokenize(s) twice with equal s returns structurally-equal tokens arrays (deep equality excluding object identity).
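
A minimal sketch of this delegation, assuming a single module-level Lexer instance (the shipped lexer.ts may organise it differently):

import { Lexer, type ILexingResult, type TokenType } from 'chevrotain';

declare const allTokens: TokenType[]; // the ordered registry from §2/§4

const lexerInstance = new Lexer(allTokens); // constructed once, at module load

export function tokenize(input: string): ILexingResult {
  // Chevrotain's Lexer.tokenize never throws; lexical errors land in result.errors.
  return lexerInstance.tokenize(input);
}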

§4. Token taxonomy

The lexer recognises seven token categories (see audit §5), implemented as Chevrotain TokenType instances via createToken. Chevrotain attempts token types in the order of the allTokens array (first match wins), so that ordering pins the following priorities, from highest to lowest:

  1. Whitespace — /\s+/, with group: Lexer.SKIPPED.
  2. Error-recovery literals (must beat IntegerLiteral):
    • FloatRejected — /-?[0-9]+\.[0-9]+/; a custom matcher records an error and advances past the match.
    • UnderscoreIntegerRejected — /-?[0-9]+(?:_[0-9]+)+/, same pattern.
  3. StringLiteral — /"(?:\\.|[^"\\])*"/.
  4. Keywords (18 of them, listed in audit §5.1) — each declared as its own TokenType with a fixed-string pattern. longer_alt: Identifier ensures a source like rulex lexes as a single Identifier, not Rule followed by an Identifier.
  5. Variable — /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/.
  6. Identifier — /[\p{XID_Start}_][\p{XID_Continue}]*/u.
  7. IntegerLiteral — /-?[0-9]+/ (the negative sign binds when there is no preceding operand; the parser disambiguates).
  8. Operators (12, two-char before one-char) — ==, !=, <=, >=, ->, then <, >, +, -, *, /, %.
  9. Delimiters — {, }, (, ), and the comma, colon, and dot.

Rule 4 uses Chevrotain’s longer_alt feature so keyword tokens degrade gracefully when they appear as identifier prefixes. This avoids the “keywords before identifier” footgun while still preserving the ordering discipline for other tokens.
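
To make rule 4 concrete, here is an illustrative createToken declaration set (a sketch, not the shipped source):

import { createToken, Lexer } from 'chevrotain';

const Identifier = createToken({
  name: 'Identifier',
  pattern: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
});

// A keyword that is only a prefix of a longer word (e.g. `rulex`)
// falls through to Identifier via longer_alt:
const Rule = createToken({ name: 'Rule', pattern: /rule/, longer_alt: Identifier });

const Whitespace = createToken({
  name: 'Whitespace',
  pattern: /\s+/,
  group: Lexer.SKIPPED, // consumed; never surfaces in tokens or errors
});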

§5. Invariants

The following invariants hold for every tokenize(input) call:

ID    Invariant                                                                      Verification
I1    Function returns an ILexingResult (never throws)                               Jest no-throw expectation
I2    Empty input → empty tokens, empty errors                                       test in __tests__
I3    Whitespace-only input → empty tokens, empty errors                             test
I4    Every token has startLine ≥ 1, startColumn ≥ 1 (Chevrotain is 1-indexed)       test
I5    Every token's image equals the source substring [startOffset, endOffset], inclusive   test
I6    Float literal N.M produces an entry in errors at the right line/column         test
I7    Underscore integer 1_000 produces an entry in errors at the right line/column  test
I8    A keyword followed by - and an identifier does not merge across the operator   test
I9    Keyword is a prefix of an identifier → the identifier wins (longer_alt)        test
I10   Variable's image equals the full $… span, dots included                        test
I11   Two-char operator beats its one-char prefix (== is not = then =)               test
I12   String literal preserves inner escapes verbatim in image                       test
I13   Unicode identifier (e.g. règle, 日本語) tokenizes as a single Identifier       test
I14   Full AcceptCommitment concept-doc snippet tokenizes without errors             test (golden sequence)
I15   tokenize(s) is referentially transparent (purity)                              test (run twice, deep-equal)
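
A sketch of how a few of these invariants might be exercised in Jest (test names and file layout are assumptions, not the shipped suite):

import { tokenize } from '../lexer';

test('I1/I2: empty input never throws and yields no tokens or errors', () => {
  const result = tokenize('');
  expect(result.tokens).toEqual([]);
  expect(result.errors).toEqual([]);
});

test('I15: tokenize is referentially transparent', () => {
  const a = tokenize('rule X { }');
  const b = tokenize('rule X { }');
  expect(a.tokens).toEqual(b.tokens); // structural equality, not identity
});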

§6. Error model

Lexer-level errors are reported via result.errors: ILexingError[], never thrown. Each error has:

{
  offset: number;       // 0-based absolute character offset
  length: number;       // span length
  line: number;         // 1-based line
  column: number;       // 1-based column
  message: string;      // human-readable reason
}

Three error sources:

  1. Unrecognised character — Chevrotain’s default recovery for inputs that match no token.
  2. Float literal rejected — custom matcher attached to FloatRejected; message includes the offending substring and a hint ("Float literals are not supported. Use integers with basis-point scaling.").
  3. Underscore integer rejected — same pattern, message: "Underscore-separated integer literals are not supported. Write digits without separators.".

The lexer does not do parser-level diagnostics (missing brace, unbalanced paren, etc.) — that is strictly P1.2.2.
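
The contract fixes the error shape, not the mechanism behind sources 2 and 3. One plausible wiring — assumed here for illustration only; the shipped lexer.ts may differ — matches the rejected literal in a dedicated token group and lifts each match into result.errors:

import { createToken, type ILexingResult } from 'chevrotain';

const FloatRejected = createToken({
  name: 'FloatRejected',
  pattern: /-?[0-9]+\.[0-9]+/,
  group: 'rejected', // collected under result.groups.rejected, kept out of tokens
});

function liftRejectedIntoErrors(result: ILexingResult): ILexingResult {
  for (const tok of result.groups['rejected'] ?? []) {
    result.errors.push({
      offset: tok.startOffset,
      length: tok.image.length,
      line: tok.startLine ?? 1,   // startLine/startColumn are optional on IToken
      column: tok.startColumn ?? 1,
      message: `"${tok.image}": Float literals are not supported. Use integers with basis-point scaling.`,
    });
  }
  return result;
}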

§7. Dependency rules

Inbound: only chevrotain. Outbound: the exported types feed the P1.2.2 parser (future).

Explicitly forbidden (at least until the κ engine is more complete):

  • No imports from src/db/* — the lexer must not touch SQLite.
  • No imports from src/middleware/* — the lexer is outside the MCP pipeline.
  • No imports from src/domains/tasks/*, src/domains/skills/*, src/domains/trail/*, src/domains/proof/*, src/domains/router/*, src/domains/integrations/* — κ is a peer axis, not a dependent.
  • No Node built-ins (fs, path, crypto, os, child_process, http, net, …) — the lexer operates on an in-memory string only.
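
These rules lend themselves to a mechanical check. A hypothetical guard test (the path and regex are assumptions; the lexer itself imports no fs, but its test suite may):

import { readFileSync } from 'node:fs';

test('lexer.ts imports nothing but chevrotain', () => {
  const src = readFileSync('src/domains/rules/lexer.ts', 'utf8');
  const specifiers = [...src.matchAll(/from\s+['"]([^'"]+)['"]/g)].map((m) => m[1]);
  expect(specifiers.every((s) => s === 'chevrotain')).toBe(true);
});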

Assembly wiring (Phase 1 follow-up): the κ engine module root src/domains/rules/index.ts is not part of this task. P1.2.1 ships lexer.ts only. Downstream tasks (P1.2.2 parser, P1.2.3 interpreter, P1.2.4 registry) will add more files and the eventual index.ts barrel.

§8. Performance envelope (informational, not gated)

  • Short rule (~50 tokens) tokenizes in < 1 ms on a modern laptop — Chevrotain is production-grade.
  • Memory: tokens array size proportional to input length; no caching across calls.
  • No memoization — callers (parser) may cache if they wish.
  • This task defines no performance SLO; if SLOs are needed later, they belong in P1.2.x or ADR-007.

§9. Non-goals (re-stated from audit §10)

The contract explicitly excludes:

  • Parser, AST, interpreter, evaluator.
  • Comment syntax, interpolation, triple-quoted strings, heredocs.
  • Float support, underscore separators, hex/octal/binary, scientific notation.
  • Run-time evaluation of any rule.
  • Integration with β/ε/ζ/η/ν/δ axes or any MCP tool.
  • A new ADR (see drift finding in audit §3).
  • src/domains/rules/index.ts barrel file.
  • Tokenization performance SLOs.

§10. Change log

  • v1 (this commit) — initial contract.

Any subsequent change to the public surface of lexer.ts MUST land a contract revision in the same PR. Backward-incompatible changes MUST advance the version note above (v2, v3, …).

§11. Traceability

Requirement               Where defined                     Where tested
18 keywords               audit §5.1 + contract §4 rule 4   lexer.test.ts keyword matrix
12 operators              audit §5.2 + contract §4 rule 8   lexer.test.ts operator matrix
7 categories              audit §5.3 + contract §2          lexer.test.ts (positive matrix)
Line/column tracking      contract §3 / I4–I5               lexer.test.ts position test
Float rejected            contract §6 / I6                  lexer.test.ts "3.14 rejects"
Underscore int rejected   contract §6 / I7                  lexer.test.ts "1_000 rejects"
Unicode identifier        contract §4 rule 6 / I13          lexer.test.ts "règle_日本語"
Variable with dot path    contract §4 rule 5 / I10          lexer.test.ts "$actor.reputation.execution"
AcceptCommitment golden   audit §6.1 / I14                  lexer.test.ts "AcceptCommitment snippet"

§12. Summary

src/domains/rules/lexer.ts exports a single function tokenize(input: string): ILexingResult plus typed token-type re-exports. It uses Chevrotain 11.0.3 to recognise 18 keywords, 12 operators, and 7 token categories, never throws, reports all lexical errors with 1-indexed line/column, and is strictly pure (no I/O). Float literals and underscore-separated integers are rejected with positioned errors. The contract is narrow on purpose — the parser, AST, and interpreter are separate tasks.

Next step: packet (Step 3 of 5).

