R81.B / P1.2.1 — κ DSL Lexer — Behavioral Contract

Step 2 of the 5-step executor chain. Builds on docs/audits/r81-b-p1-2-1-lexer-audit.md. Defines the public surface, semantics, and invariants for src/domains/rules/lexer.ts.

§1. Module identity

  • Path: src/domains/rules/lexer.ts
  • Axis: κ — Rule Engine (Phase 1)
  • Kind: pure synchronous module; no I/O, no DB, no network, no env reads, no console output
  • Runtime dependency: chevrotain@11.0.3 — exact pin
  • Internal dependencies: none (does not import from src/domains/*, src/db/*, src/middleware/*)

§2. Public API

The module exports six named entities — five token registries plus the tokenize entry point — along with a handful of typed re-exports:

// Token-type registry (for the parser to consume)
export const allTokens: TokenType[];

// Named tokens — the parser consumes these by object reference (not by string name)
export const Keywords: {
  Rule: TokenType; Guards: TokenType; Effects: TokenType;
  When: TokenType; Then: TokenType; If: TokenType; Else: TokenType;
  And: TokenType; Or: TokenType; Not: TokenType;
  True: TokenType; False: TokenType;
  Admit: TokenType; Reject: TokenType;
  Admission: TokenType; Transition: TokenType;
  Consequence: TokenType; Promotion: TokenType;
};

export const Operators: {
  Eq: TokenType; NotEq: TokenType;
  Lte: TokenType; Gte: TokenType; Lt: TokenType; Gt: TokenType;
  Plus: TokenType; Minus: TokenType; Mul: TokenType; Div: TokenType; Mod: TokenType;
  Arrow: TokenType;
};

export const Delimiters: {
  LBrace: TokenType; RBrace: TokenType;
  LParen: TokenType; RParen: TokenType;
  Comma: TokenType; Colon: TokenType; Dot: TokenType;
};

export const Literals: {
  Identifier: TokenType;
  Variable: TokenType;
  IntegerLiteral: TokenType;
  StringLiteral: TokenType;
};

// The entry point — wraps Chevrotain's Lexer.tokenize
export function tokenize(input: string): ILexingResult;

Where TokenType and ILexingResult are re-exported from chevrotain:

export type { IToken, TokenType, ILexingResult, ILexingError } from 'chevrotain';

This gives the future parser (src/domains/rules/parser.ts, P1.2.2) a clean import surface without reaching into Chevrotain directly.
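
For illustration, a downstream consumer might use this surface as follows (a hypothetical sketch; the rule source is invented):

// Hypothetical consumer sketch, assuming the §2 surface:
import { tokenize, Keywords } from './lexer';

const result = tokenize('rule AcceptCommitment { }');
if (result.errors.length === 0) {
  const [first] = result.tokens;
  // Token types compare by object reference, never by name:
  const startsWithRuleKeyword = first.tokenType === Keywords.Rule;
}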

§3. Function semantics — tokenize

Signature: tokenize(input: string): ILexingResult

Behavior:

  • Delegates to a module-level Chevrotain Lexer instance constructed once from allTokens (see the sketch below).
  • Never throws. All lexical errors are returned as entries in result.errors with line/column/length/offset/message.
  • Returns { tokens: IToken[], groups: Record<string, IToken[]>, errors: ILexingError[] }.
  • input === '' returns { tokens: [], groups: {}, errors: [] }.
  • Whitespace (space, \t, \r, \n) is consumed and does not appear in tokens or errors.
  • Every token in tokens carries startLine, startColumn, endLine, endColumn, startOffset, endOffset, and image.

Purity:

  • No time reads, no random reads, no DB reads, no network calls, no file-system access.
  • No side effects on import; tokenize('x') called at module load yields no observable effect beyond the returned value.
  • Calling tokenize(s) twice with equal s returns structurally-equal tokens arrays (deep equality excluding object identity).
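
A minimal sketch of this delegation, assuming a single module-level Lexer instance (the shipped lexer.ts may organise it differently):

import { Lexer, type ILexingResult, type TokenType } from 'chevrotain';

declare const allTokens: TokenType[]; // the ordered registry from §2/§4

const lexerInstance = new Lexer(allTokens); // constructed once, at module load

export function tokenize(input: string): ILexingResult {
  // Chevrotain's Lexer.tokenize never throws; lexical errors land in result.errors.
  return lexerInstance.tokenize(input);
}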

§4. Token taxonomy

The lexer recognises seven token categories (see audit §5), implemented as Chevrotain TokenType instances via createToken. Chevrotain attempts token types in the order of the allTokens array (first match wins), so that ordering pins the following priorities, from highest to lowest:

  1. Whitespace — /\s+/, with group: Lexer.SKIPPED.
  2. Error-recovery literals (must beat IntegerLiteral):
    • FloatRejected — /-?[0-9]+\.[0-9]+/; a custom matcher records an error and advances past the match.
    • UnderscoreIntegerRejected — /-?[0-9]+(?:_[0-9]+)+/, same pattern.
  3. StringLiteral — /"(?:\\.|[^"\\])*"/.
  4. Keywords (18 of them, listed in audit §5.1) — each declared as its own TokenType with a fixed-string pattern. longer_alt: Identifier ensures a source like rulex lexes as a single Identifier, not Rule followed by an Identifier.
  5. Variable — /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/.
  6. Identifier — /[\p{XID_Start}_][\p{XID_Continue}]*/u.
  7. IntegerLiteral — /-?[0-9]+/ (the negative sign binds when there is no preceding operand; the parser disambiguates).
  8. Operators (12, two-char before one-char) — ==, !=, <=, >=, ->, then <, >, +, -, *, /, %.
  9. Delimiters — {, }, (, ), and the comma, colon, and dot.

Rule 4 uses Chevrotain’s longer_alt feature so keyword tokens degrade gracefully when they appear as identifier prefixes. This avoids the “keywords before identifier” footgun while still preserving the ordering discipline for other tokens.
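
To make rule 4 concrete, here is an illustrative createToken declaration set (a sketch, not the shipped source):

import { createToken, Lexer } from 'chevrotain';

const Identifier = createToken({
  name: 'Identifier',
  pattern: /[\p{XID_Start}_][\p{XID_Continue}]*/u,
});

// A keyword that is only a prefix of a longer word (e.g. `rulex`)
// falls through to Identifier via longer_alt:
const Rule = createToken({ name: 'Rule', pattern: /rule/, longer_alt: Identifier });

const Whitespace = createToken({
  name: 'Whitespace',
  pattern: /\s+/,
  group: Lexer.SKIPPED, // consumed; never surfaces in tokens or errors
});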

§5. Invariants

The following invariants hold for every tokenize(input) call:

ID    Invariant                                                                      Verification
I1    Function returns an ILexingResult (never throws)                               Jest no-throw expectation
I2    Empty input → empty tokens, empty errors                                       test in __tests__
I3    Whitespace-only input → empty tokens, empty errors                             test
I4    Every token has startLine ≥ 1, startColumn ≥ 1 (Chevrotain is 1-indexed)       test
I5    Every token's image equals the source substring [startOffset, endOffset], inclusive   test
I6    Float literal N.M produces an entry in errors at the right line/column         test
I7    Underscore integer 1_000 produces an entry in errors at the right line/column  test
I8    A keyword followed by - and an identifier does not merge across the operator   test
I9    Keyword is a prefix of an identifier → the identifier wins (longer_alt)        test
I10   Variable's image equals the full $… span, dots included                        test
I11   Two-char operator beats its one-char prefix (== is not = then =)               test
I12   String literal preserves inner escapes verbatim in image                       test
I13   Unicode identifier (e.g. règle, 日本語) tokenizes as a single Identifier       test
I14   Full AcceptCommitment concept-doc snippet tokenizes without errors             test (golden sequence)
I15   tokenize(s) is referentially transparent (purity)                              test (run twice, deep-equal)
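
A sketch of how a few of these invariants might be exercised in Jest (test names and file layout are assumptions, not the shipped suite):

import { tokenize } from '../lexer';

test('I1/I2: empty input never throws and yields no tokens or errors', () => {
  const result = tokenize('');
  expect(result.tokens).toEqual([]);
  expect(result.errors).toEqual([]);
});

test('I15: tokenize is referentially transparent', () => {
  const a = tokenize('rule X { }');
  const b = tokenize('rule X { }');
  expect(a.tokens).toEqual(b.tokens); // structural equality, not identity
});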

§6. Error model

Lexer-level errors are reported via result.errors: ILexingError[], never thrown. Each error has:

{
  offset: number;       // 0-based absolute character offset
  length: number;       // span length
  line: number;         // 1-based line
  column: number;       // 1-based column
  message: string;      // human-readable reason
}

Three error sources:

  1. Unrecognised character — Chevrotain’s default recovery for inputs that match no token.
  2. Float literal rejected — custom matcher attached to FloatRejected; message includes the offending substring and a hint ("Float literals are not supported. Use integers with basis-point scaling.").
  3. Underscore integer rejected — same pattern, message: "Underscore-separated integer literals are not supported. Write digits without separators.".

The lexer does not do parser-level diagnostics (missing brace, unbalanced paren, etc.) — that is strictly P1.2.2.
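
The contract fixes the error shape, not the mechanism behind sources 2 and 3. One plausible wiring — assumed here for illustration only; the shipped lexer.ts may differ — matches the rejected literal in a dedicated token group and lifts each match into result.errors:

import { createToken, type ILexingResult } from 'chevrotain';

const FloatRejected = createToken({
  name: 'FloatRejected',
  pattern: /-?[0-9]+\.[0-9]+/,
  group: 'rejected', // collected under result.groups.rejected, kept out of tokens
});

function liftRejectedIntoErrors(result: ILexingResult): ILexingResult {
  for (const tok of result.groups['rejected'] ?? []) {
    result.errors.push({
      offset: tok.startOffset,
      length: tok.image.length,
      line: tok.startLine ?? 1,   // startLine/startColumn are optional on IToken
      column: tok.startColumn ?? 1,
      message: `"${tok.image}": Float literals are not supported. Use integers with basis-point scaling.`,
    });
  }
  return result;
}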

§7. Dependency rules

Inbound: only chevrotain. Outbound: the exported types feed the P1.2.2 parser (future).

Explicitly forbidden (at least until the κ engine is more complete):

  • No imports from src/db/* — the lexer must not touch SQLite.
  • No imports from src/middleware/* — the lexer is outside the MCP pipeline.
  • No imports from src/domains/tasks/*, src/domains/skills/*, src/domains/trail/*, src/domains/proof/*, src/domains/router/*, src/domains/integrations/* — κ is a peer axis, not a dependent.
  • No Node built-ins (fs, path, crypto, os, child_process, http, net, …) — the lexer operates on an in-memory string only.
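
These rules lend themselves to a mechanical check. A hypothetical guard test (the path and regex are assumptions; the lexer itself imports no fs, but its test suite may):

import { readFileSync } from 'node:fs';

test('lexer.ts imports nothing but chevrotain', () => {
  const src = readFileSync('src/domains/rules/lexer.ts', 'utf8');
  const specifiers = [...src.matchAll(/from\s+['"]([^'"]+)['"]/g)].map((m) => m[1]);
  expect(specifiers.every((s) => s === 'chevrotain')).toBe(true);
});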

Assembly wiring (Phase 1 follow-up): the κ engine module root src/domains/rules/index.ts is not part of this task. P1.2.1 ships lexer.ts only. Downstream tasks (P1.2.2 parser, P1.2.3 interpreter, P1.2.4 registry) will add more files and the eventual index.ts barrel.

§8. Performance envelope (informational, not gated)

  • Short rule (~50 tokens) tokenizes in < 1 ms on a modern laptop — Chevrotain is production-grade.
  • Memory: tokens array size proportional to input length; no caching across calls.
  • No memoization — callers (parser) may cache if they wish.
  • This task defines no performance SLO; if SLOs are needed later, they belong in P1.2.x or ADR-007.

§9. Non-goals (re-stated from audit §10)

The contract explicitly excludes:

  • Parser, AST, interpreter, evaluator.
  • Comment syntax, interpolation, triple-quoted strings, heredocs.
  • Float support, underscore separators, hex/octal/binary, scientific notation.
  • Run-time evaluation of any rule.
  • Integration with β/ε/ζ/η/ν/δ axes or any MCP tool.
  • A new ADR (see drift finding in audit §3).
  • src/domains/rules/index.ts barrel file.
  • Tokenization performance SLOs.

§10. Change log

  • v1 (this commit) — initial contract.

Any subsequent change to the public surface of lexer.ts MUST land a contract revision in the same PR. Backward-incompatible changes MUST advance the version note above (v2, v3, …).

§11. Traceability

Requirement               Where defined                     Where tested
18 keywords               audit §5.1 + contract §4 rule 4   lexer.test.ts keyword matrix
12 operators              audit §5.2 + contract §4 rule 8   lexer.test.ts operator matrix
7 categories              audit §5.3 + contract §2          lexer.test.ts (positive matrix)
Line/column tracking      contract §3 / I4–I5               lexer.test.ts position test
Float rejected            contract §6 / I6                  lexer.test.ts "3.14 rejects"
Underscore int rejected   contract §6 / I7                  lexer.test.ts "1_000 rejects"
Unicode identifier        contract §4 rule 6 / I13          lexer.test.ts "règle_日本語"
Variable with dot path    contract §4 rule 5 / I10          lexer.test.ts "$actor.reputation.execution"
AcceptCommitment golden   audit §6.1 / I14                  lexer.test.ts "AcceptCommitment snippet"

§12. Summary

src/domains/rules/lexer.ts exports a single function tokenize(input: string): ILexingResult plus typed token-type re-exports. It uses Chevrotain 11.0.3 to recognise 18 keywords, 12 operators, and 7 token categories, never throws, reports all lexical errors with 1-indexed line/column, and is strictly pure (no I/O). Float literals and underscore-separated integers are rejected with positioned errors. The contract is narrow on purpose — the parser, AST, and interpreter are separate tasks.

Next step: packet (Step 3 of 5).

