R81.B / P1.2.1 — κ DSL Lexer — Behavioral Contract
Step 2 of the 5-step executor chain. Builds on `docs/audits/r81-b-p1-2-1-lexer-audit.md`. Defines the public surface, semantics, and invariants for `src/domains/rules/lexer.ts`.
§1. Module identity
- Path: `src/domains/rules/lexer.ts`
- Axis: κ — Rule Engine (Phase 1)
- Kind: pure synchronous module; no I/O, no DB, no network, no env reads, no console output
- Runtime dependency: `chevrotain@11.0.3` — exact pin
- Internal dependencies: none (does not import from `src/domains/*`, `src/db/*`, `src/middleware/*`)
§2. Public API
The module exports six named values — `allTokens`, `Keywords`, `Operators`, `Delimiters`, `Literals`, and `tokenize` — plus a handful of typed re-exports:
```ts
// Token-type registry (for the parser to consume)
export const allTokens: TokenType[];

// Named tokens — the parser references these by reference (not by name)
export const Keywords: {
  Rule: TokenType; Guards: TokenType; Effects: TokenType;
  When: TokenType; Then: TokenType; If: TokenType; Else: TokenType;
  And: TokenType; Or: TokenType; Not: TokenType;
  True: TokenType; False: TokenType;
  Admit: TokenType; Reject: TokenType;
  Admission: TokenType; Transition: TokenType;
  Consequence: TokenType; Promotion: TokenType;
};

export const Operators: {
  Eq: TokenType; NotEq: TokenType;
  Lte: TokenType; Gte: TokenType; Lt: TokenType; Gt: TokenType;
  Plus: TokenType; Minus: TokenType; Mul: TokenType; Div: TokenType; Mod: TokenType;
  Arrow: TokenType;
};

export const Delimiters: {
  LBrace: TokenType; RBrace: TokenType;
  LParen: TokenType; RParen: TokenType;
  Comma: TokenType; Colon: TokenType; Dot: TokenType;
};

export const Literals: {
  Identifier: TokenType;
  Variable: TokenType;
  IntegerLiteral: TokenType;
  StringLiteral: TokenType;
};

// The entry point — wraps Chevrotain's Lexer.tokenize
export function tokenize(input: string): ILexingResult;
```
Where `TokenType` and `ILexingResult` are re-exported from `chevrotain`:

```ts
export type { IToken, TokenType, ILexingResult, ILexingError } from 'chevrotain';
```
This gives the future parser (`src/domains/rules/parser.ts`, P1.2.2) a clean import surface without reaching into Chevrotain directly.
§3. Function semantics — tokenize
Signature: `tokenize(input: string): ILexingResult`

Behavior:
- Delegates to a module-level Chevrotain `Lexer` instance constructed once from `allTokens`.
- Never throws. All lexical errors are returned as entries in `result.errors` with `line`/`column`/`length`/`offset`/`message`.
- Returns `{ tokens: IToken[], groups: Record<string, IToken[]>, errors: ILexingError[] }`.
- `input === ''` returns `{ tokens: [], groups: {}, errors: [] }`.
- Whitespace (` `, `\t`, `\r`, `\n`) is consumed and does not appear in `tokens` or `errors`.
- Every token in `tokens` carries `startLine`, `startColumn`, `endLine`, `endColumn`, `startOffset`, `endOffset`, and `image`.

Purity:
- No time reads, no random reads, no DB reads, no network calls, no file-system access.
- No side effects on import; `tokenize('x')` called at module load yields no observable effect beyond the returned value.
- Calling `tokenize(s)` twice with equal `s` returns structurally-equal `tokens` arrays (deep equality excluding object identity).
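As a caller-side illustration of the never-throws contract, lexical failure is detected by inspecting `errors`, never by `catch`. This is a sketch only: `LexError`, `LexResult`, and `formatLexErrors` are hypothetical local names that mirror the contract's shapes, not exports of the module or of Chevrotain.

```typescript
// Local structural types mirroring the contract's result shape.
// These are NOT the Chevrotain types themselves.
interface LexError {
  offset: number; length: number; line: number; column: number; message: string;
}
interface LexResult {
  tokens: unknown[]; groups: Record<string, unknown[]>; errors: LexError[];
}

// Turn the errors array into human-readable "line:column message" strings.
function formatLexErrors(result: LexResult): string[] {
  return result.errors.map(e => `${e.line}:${e.column} ${e.message}`);
}

// Per the contract, tokenize('') yields everything empty, nothing thrown.
const emptyResult: LexResult = { tokens: [], groups: {}, errors: [] };
```

A downstream parser would call `tokenize`, run such a formatter over `result.errors`, and abort before parsing whenever the list is non-empty.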
§4. Token taxonomy
The lexer recognises seven token categories (see audit §5), implemented as Chevrotain `TokenType` instances via `createToken`. Token priority follows position in Chevrotain's `allTokens` array; the following ordering is pinned, from highest to lowest:
1. Whitespace — `/\s+/` with `group: Lexer.SKIPPED`.
2. Error-recovery literals (must beat `IntegerLiteral`): `FloatRejected` — `/-?[0-9]+\.[0-9]+/`, custom matcher pushes an error and advances past the match; `UnderscoreIntegerRejected` — `/-?[0-9]+(?:_[0-9]+)+/`, same pattern.
3. StringLiteral — `/"(?:\\.|[^"\\])*"/`.
4. Keywords (18 of them, listed in audit §5.1) — each declared as its own `TokenType` with a fixed-string pattern. `longer_alt: Identifier` ensures a source like `rulex` lexes as `Identifier`, not `Rule` then `Identifier`.
5. Variable — `/\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/`.
6. Identifier — `/[\p{XID_Start}_][\p{XID_Continue}]*/u`.
7. IntegerLiteral — `/-?[0-9]+/` (the negative sign binds when there is no preceding operand; the parser disambiguates).
8. Operators (12, two-char before one-char) — `==`, `!=`, `<=`, `>=`, `->`, then `<`, `>`, `+`, `-`, `*`, `/`, `%`.
9. Delimiters — `{`, `}`, `(`, `)`, `,`, `:`, `.`.
Rule 4 uses Chevrotain's `longer_alt` feature so keyword tokens degrade gracefully when they appear as identifier prefixes. This avoids the "keywords before identifier" footgun while still preserving the ordering discipline for other tokens.
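The priority discipline above can be illustrated with a tiny first-match-wins simulation. This is a sketch only: the real module uses Chevrotain `createToken` instances, and keywords, operators, and delimiters are omitted here for brevity.

```typescript
// First-match-wins over the §4 ordering, patterns anchored to the input
// start. Because FloatRejected precedes IntegerLiteral, "3.14" resolves
// to the error-recovery token instead of lexing "3" as an integer.
const ordered: Array<[name: string, pattern: RegExp]> = [
  ['FloatRejected',             /^-?[0-9]+\.[0-9]+/],
  ['UnderscoreIntegerRejected', /^-?[0-9]+(?:_[0-9]+)+/],
  ['StringLiteral',             /^"(?:\\.|[^"\\])*"/],
  ['Variable',                  /^\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/],
  ['Identifier',                /^[\p{XID_Start}_][\p{XID_Continue}]*/u],
  ['IntegerLiteral',            /^-?[0-9]+/],
];

// Returns the first category whose pattern matches at offset 0.
function firstMatch(src: string): string | null {
  for (const [name, pattern] of ordered) {
    if (pattern.test(src)) return name;
  }
  return null;
}
```

The same mechanism explains rule 8: placing `==` ahead of `<` and `>` in the array is what makes the two-character operators win.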
§5. Invariants
The following invariants hold for every `tokenize(input)` call:

| ID | Invariant | Verification |
|---|---|---|
| I1 | Function returns an `ILexingResult` (never throws) | Jest expects no-throw |
| I2 | Empty input → empty tokens, empty errors | test in `__tests__` |
| I3 | Whitespace-only input → empty tokens, empty errors | test |
| I4 | Every token has `startLine` ≥ 1, `startColumn` ≥ 1 (Chevrotain is 1-indexed) | test |
| I5 | Every token's `image` equals the source substring `[startOffset, endOffset]` inclusive | test |
| I6 | Float literal `N.M` produces an entry in `errors` at the right line/column | test |
| I7 | Underscore integer `1_000` produces an entry in `errors` at the right line/column | test |
| I8 | A keyword followed by `-` and an identifier does not merge across the operator | test |
| I9 | Keyword is a prefix of an identifier → identifier wins (`longer_alt`) | test |
| I10 | Variable's `image` equals the full `$…` span, dots included | test |
| I11 | Two-char operator beats one-char prefix (`==`, not `=` then `=`) | test |
| I12 | String literal preserves inner escapes verbatim in `image` | test |
| I13 | Unicode identifier (e.g. `règle`, `日本語`) tokenizes as a single `Identifier` | test |
| I14 | Full AcceptCommitment concept-doc snippet tokenizes without errors | test (golden sequence) |
| I15 | `tokenize(s)` is referentially transparent (purity) | test (run twice, deep-equal) |
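Invariant I5 can be stated as a one-line predicate. Note the inclusive `endOffset`, hence the `+ 1` in the slice; `Positioned` and `holdsI5` are hypothetical test helpers standing in for Chevrotain's `IToken` fields, not exports of the module.

```typescript
// Sketch of the I5 predicate over the contract's position fields.
interface Positioned { image: string; startOffset: number; endOffset: number }

// True iff the token's image is exactly the inclusive source span.
function holdsI5(input: string, tok: Positioned): boolean {
  return tok.image === input.slice(tok.startOffset, tok.endOffset + 1);
}
```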
§6. Error model
Lexer-level errors are reported via `result.errors: ILexingError[]`, never thrown. Each error has:

```ts
{
  offset: number;  // 0-based absolute character offset
  length: number;  // span length
  line: number;    // 1-based line
  column: number;  // 1-based column
  message: string; // human-readable reason
}
```
Three error sources:
- Unrecognised character — Chevrotain's default recovery for inputs that match no token.
- Float literal rejected — custom matcher attached to `FloatRejected`; the message includes the offending substring and a hint (`"Float literals are not supported. Use integers with basis-point scaling."`).
- Underscore integer rejected — same pattern; message: `"Underscore-separated integer literals are not supported. Write digits without separators."`.
The lexer does not do parser-level diagnostics (missing brace, unbalanced paren, etc.) — that is strictly P1.2.2.
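The 0-based `offset` and the 1-based `line`/`column` in the error shape are related as follows. Chevrotain computes this internally; `lineColOf` is an illustrative helper that only makes the convention concrete.

```typescript
// Map a 0-based absolute offset to the 1-based line/column reported in
// ILexingError: count newlines before the offset, reset column at each.
function lineColOf(src: string, offset: number): { line: number; column: number } {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset; i++) {
    if (src[i] === '\n') { line += 1; column = 1; }
    else { column += 1; }
  }
  return { line, column };
}
```

For example, in the source `'rule {\n3.14'` the rejected float starts at offset 7, which this convention reports as line 2, column 1.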
§7. Dependency rules
In: only `chevrotain`.
Out: exported types reach the P1.2.2 parser (future).
Explicitly forbidden (at least until the κ engine is more complete):
- No imports from `src/db/*` — the lexer must not touch SQLite.
- No imports from `src/middleware/*` — the lexer is outside the MCP pipeline.
- No imports from `src/domains/tasks/*`, `src/domains/skills/*`, `src/domains/trail/*`, `src/domains/proof/*`, `src/domains/router/*`, `src/domains/integrations/*` — κ is a peer axis, not a dependent.
- No Node built-ins (`fs`, `path`, `crypto`, `os`, `child_process`, `http`, `net`, …) — the lexer operates on an in-memory string only.
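One way to enforce these boundaries mechanically is a lint rule. The following hypothetical flat-config fragment (not a P1.2.1 deliverable; file name and pattern list are illustrative assumptions) uses ESLint's core `no-restricted-imports`:

```typescript
// Hypothetical eslint.config.ts fragment — illustrative only.
// Fails the lint run if lexer.ts imports outside its declared surface.
export default [
  {
    files: ['src/domains/rules/lexer.ts'],
    rules: {
      'no-restricted-imports': ['error', {
        patterns: ['src/db/*', 'src/middleware/*', 'src/domains/*', 'node:*'],
      }],
    },
  },
];
```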
Assembly wiring (Phase 1 follow-up): the κ engine module root `src/domains/rules/index.ts` is not part of this task. P1.2.1 ships `lexer.ts` only. Downstream tasks (P1.2.2 parser, P1.2.3 interpreter, P1.2.4 registry) will add more files and the eventual `index.ts` barrel.
§8. Performance envelope (informational, not gated)
- A short rule (~50 tokens) tokenizes in < 1 ms on a modern laptop — Chevrotain is production-grade.
- Memory: `tokens` array size is proportional to input length; no caching across calls.
- No memoization — callers (the parser) may cache if they wish.
- This task defines no performance SLO; if SLOs are needed later, they belong in P1.2.x or ADR-007.
§9. Non-goals (re-stated from audit §10)
The contract explicitly excludes:
- Parser, AST, interpreter, evaluator.
- Comment syntax, interpolation, triple-quoted strings, heredocs.
- Float support, underscore separators, hex/octal/binary, scientific notation.
- Run-time evaluation of any rule.
- Integration with β/ε/ζ/η/ν/δ axes or any MCP tool.
- A new ADR (see drift finding in audit §3).
- `src/domains/rules/index.ts` barrel file.
- Tokenization performance SLOs.
§10. Change log
- v1 (this commit) — initial contract.
Any subsequent change to the public surface of lexer.ts MUST land a contract revision in the same PR. Backward-incompatible changes MUST advance a minor version note here.
§11. Traceability
| Requirement | Where defined | Where tested |
|---|---|---|
| 18 keywords | audit §5.1 + contract §4 rule 4 | lexer.test.ts keyword matrix |
| 12 operators | audit §5.2 + contract §4 rule 8 | lexer.test.ts operator matrix |
| 7 categories | audit §5.3 + contract §2 | lexer.test.ts (positive matrix) |
| Line/column tracking | contract §3 / I4-5 | lexer.test.ts position test |
| Float rejected | contract §6 / I6 | lexer.test.ts “3.14 rejects” |
| Underscore int rejected | contract §6 / I7 | lexer.test.ts “1_000 rejects” |
| Unicode identifier | contract §4 rule 6 / I13 | lexer.test.ts “règle_日本語” |
| Variable with dot path | contract §4 rule 5 / I10 | lexer.test.ts “$actor.reputation.execution” |
| AcceptCommitment golden | audit §6.1 / I14 | lexer.test.ts “AcceptCommitment snippet” |
§12. Summary
`src/domains/rules/lexer.ts` exports `tokenize(input: string): ILexingResult` together with the token-type registry and typed re-exports. It uses Chevrotain 11.0.3 to recognise 18 keywords, 12 operators, and 7 token categories, never throws, reports all lexical errors with 1-indexed line/column, and is strictly pure (no I/O). Float literals and underscore-separated integers are rejected with positioned errors. The contract is narrow on purpose — the parser, AST, and interpreter are separate tasks.
Next step: packet (Step 3 of 5).