R81.B / P1.2.1 — Lexer / Tokenizer — Audit

Step 1 of the 5-step executor chain (audit → contract → packet → implement → verify). First Chevrotain integration in the repo. Greenfield src/domains/rules/.

§1. Surface inventory

§1.1. Target files (greenfield)

Path	Exists?	Purpose
`src/domains/rules/`	No (greenfield; created in impl step)	κ Rule Engine source root
`src/domains/rules/lexer.ts`	No	Chevrotain-based tokenizer
`src/__tests__/domains/rules/lexer.test.ts`	No	Jest lexer tests (see §1.3 layout reconciliation)

§1.2. Touched but not owned

Path	Delta	Purpose
`package.json`	add `"chevrotain": "11.0.3"` to `dependencies`	Pin exact version (per task prompt + ADR-006 target)
`package-lock.json`	regenerated by `npm install`	Commit alongside `package.json`

§1.3. Test-file layout reconciliation

The task prompt places the test at src/domains/rules/__tests__/lexer.test.ts (colocated under the domain). The shipped Phase 0 convention is that tests live under src/__tests__/domains/<name>/, confirmed by inspection:

src/__tests__/domains/router/{scoring,fallback}.test.ts
src/__tests__/domains/skills/{repository,capability-index}.test.ts
src/__tests__/domains/tasks/{repository,tools,writeback}.test.ts
src/__tests__/domains/proof/…, src/__tests__/domains/trail/… (same)

Jest testMatch in jest.config.ts accepts both layouts (the **/__tests__/**/*.test.ts glob matches either path). To stay consistent with the Phase 0 corpus, the test file will live at src/__tests__/domains/rules/lexer.test.ts (not src/domains/rules/__tests__/lexer.test.ts). This is a convention reconciliation, not a spec deviation; the verification doc will re-cite it.

§2. Authoritative grammar sources

The task prompt lists five pre-flight reads. One of them (docs/architecture/decisions/ADR-006-dsl-grammar.md) does not exist — see §3 drift finding. For authoritative grammar, the lexer relies on:

Source	Path	Weight
Heritage extraction, full EBNF	`docs/reference/extractions/kappa-rule-engine-extraction.md` §1	Authoritative superset (per prompt)
Concept doc, EBNF fragment	`docs/3-world/physics/laws/rule-engine.md` §DSL grammar	Narrower phrasing
DSL spec	`docs/spec/s12-dsl.md`	Load-bearing, high-level
Rule engine spec	`docs/spec/s11-rule-engine.md`	Load-bearing, semantic level

Where the concept doc and the extraction differ at the surface (e.g. extraction uses guards { } / effects { } block syntax; concept uses guard: / effects: prefix), the extraction wins — the task prompt designates it as the superset and the keyword list (rule, guards, effects, …) matches the extraction.

§3. Drift finding — ADR-006-dsl-grammar does not exist

The task prompt (line 482 of docs/guides/implementation/task-prompts/p1.1-kappa-rule-engine.md) asks the agent to read docs/architecture/decisions/ADR-006-dsl-grammar.md for Chevrotain ratification. This ADR is not in the repo.

Actual ADR-006 in repo: docs/architecture/decisions/ADR-006-executable-meaning.md — different subject (executable-meaning, not DSL grammar).
The concept doc docs/3-world/physics/laws/rule-engine.md line 206 also dangling-links to ADR-006-dsl-grammar.md.
The ADR index (docs/architecture/decisions/index.md) should be cross-checked; it lists ADR-001–006 but none under the name dsl-grammar.

Sigma flagged this drift ahead of dispatch. Scope of this task: note the drift, do not write the ADR. Proceed using the extraction + concept + spec triad cited in §2 as the authoritative grammar.

Follow-up (post-R81 Wave 1):

Write ADR-007-dsl-grammar.md (or rename the mislabelled ADR-006 slot if the team decides) ratifying Chevrotain 11.x for κ parser-stack.
Fix the dangling ref in docs/3-world/physics/laws/rule-engine.md line 206.
Candidate round: R82, or a co-wave of R81 Wave 2 if Sigma schedules it.

The verification doc for this task will re-note the drift so PR reviewers see it too.

§4. Chevrotain version choice

The task prompt pins chevrotain@11.0.3 exactly. Justification:

Chevrotain 11.x is the stable major line at project-planning time (pinned per task-breakdown.md and the task prompt).
Exact-pin (no ^) in dependencies (not devDependencies) — the lexer is production code, not test scaffolding.
No prior Chevrotain entry in package.json (verified: grep -n chevrotain package.json → nothing).
@types/chevrotain is not needed in v11 — Chevrotain ships its own TypeScript types. (Pre-existing merkletreejs @types gap is unrelated.)

§5. Token inventory

18 keywords, 12 operators, and 7 token categories, per task prompt acceptance criteria and the extraction §1 grammar.

§5.1. Keywords (18, declared before Identifier)

Structural (7): rule, guards, effects, when, then, if, else
Logical (3): and, or, not
Literal (2): true, false
Action (2): admit, reject
Domain anchor (4): admission, transition, consequence, promotion

Source: task prompt + extraction §1 GuardClause = ( Expression | "else" ) "->" Action and Action = "admit" | "reject" STRING. The domain-anchor keywords (admission, transition, consequence, promotion) come from the concept doc’s event-type vocabulary and the task-breakdown spec list.
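
The keyword-versus-Identifier priority can be illustrated without the library. The sketch below is a dependency-free toy scanner, not the Chevrotain API: it takes the first pattern that matches in array order and models Chevrotain's longer_alt option with a hypothetical longerAlt field. The names ToyToken and scanOne are assumptions made for illustration, the identifier pattern is ASCII-only for brevity, and only two of the 18 keywords are listed.

```typescript
// Toy first-match scanner (illustrative only; not the Chevrotain API).
interface ToyToken {
  name: string;
  pattern: RegExp;       // anchored with ^ so it only matches at the start
  longerAlt?: ToyToken;  // hypothetical analogue of Chevrotain's longer_alt
}

// ASCII-only identifier to keep the sketch small; the audit specifies a Unicode pattern.
const Identifier: ToyToken = { name: "Identifier", pattern: /^[A-Za-z_][A-Za-z0-9_]*/ };
const Rule: ToyToken = { name: "Rule", pattern: /^rule/, longerAlt: Identifier };
const Guards: ToyToken = { name: "Guards", pattern: /^guards/, longerAlt: Identifier };

// Keywords first, Identifier last.
const keywordFirst: ToyToken[] = [Rule, Guards, Identifier];

// Returns the name and image of the token matched at the start of `input`, or null.
function scanOne(
  input: string,
  tokens: ToyToken[],
  useLongerAlt: boolean
): { name: string; image: string } | null {
  for (const tok of tokens) {
    const m = tok.pattern.exec(input);
    if (m === null) continue;
    if (useLongerAlt && tok.longerAlt) {
      // Prefer the alternative only when it consumes strictly more input.
      const alt = tok.longerAlt.pattern.exec(input);
      if (alt !== null && alt[0].length > m[0].length) {
        return { name: tok.longerAlt.name, image: alt[0] };
      }
    }
    return { name: tok.name, image: m[0] };
  }
  return null;
}

console.log(scanOne("rule", keywordFirst, true));   // { name: "Rule", image: "rule" }
console.log(scanOne("ruleX", keywordFirst, false)); // { name: "Rule", image: "rule" }
console.log(scanOne("ruleX", keywordFirst, true));  // { name: "Identifier", image: "ruleX" }
```

The second call shows why keyword-first ordering alone is not enough for the ruleX case in §6.1: without the longer alternative, the keyword wins and X is left over.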

§5.2. Operators (12)

Comparison (6): ==, !=, <=, >=, <, >
Arithmetic (5): +, -, *, /, %
Arrow (1): ->

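Ordering discipline: the two-char ops that share a first character with a declared one-char op (<=, >=, ->) must be declared before those one-char ops (<, >, -). Chevrotain's lexer does not pick the longest match; it takes the first pattern in the allTokens array that matches, so a one-char op listed first would win and split the two-char op. == and != have no declared one-char prefix in Phase 1 (= and ! are not tokens), so they cannot be split this way. The impl step pins this order.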
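
A minimal sketch of the ordering hazard, using plain string-prefix matching rather than Chevrotain itself. The function name firstOperator and the two arrays are hypothetical; the only behaviour carried over from the lexer is "first entry that matches wins".

```typescript
// Toy first-match operator matcher (illustrative; not the Chevrotain API).
// Returns the first operator in `ops` that is a prefix of `input`, or null.
function firstOperator(input: string, ops: string[]): string | null {
  for (const op of ops) {
    if (input.slice(0, op.length) === op) return op;
  }
  return null;
}

// The 12 operators from §5.2, two-char forms ahead of their one-char prefixes.
const safeOrder = ["==", "!=", "<=", ">=", "->", "<", ">", "+", "-", "*", "/", "%"];
// A deliberately wrong order: one-char forms first.
const unsafeOrder = ["<", ">", "-", "+", "*", "/", "%", "==", "!=", "<=", ">=", "->"];

console.log(firstOperator("<= b", safeOrder));       // "<="
console.log(firstOperator("<= b", unsafeOrder));     // "<"
console.log(firstOperator("-> admit", safeOrder));   // "->"
console.log(firstOperator("-> admit", unsafeOrder)); // "-"
```

With the unsafe order, <= would surface as < followed by an unlexable =, and -> as - followed by >.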

§5.3. Token categories (7 + implicit EOF)

Keyword — 18 strings listed above.
Identifier — Unicode \p{XID_Start}\p{XID_Continue}* with /u flag. Must follow all 18 keywords in the token array.
Variable — $ + dot-path: /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/.
Integer — /-?[0-9]+/ with two rejection patterns at higher priority:
- Float /-?[0-9]+\.[0-9]+/ → custom error token → raises at lex time with position.
- Underscore-int /-?[0-9]+(_[0-9]+)+/ → custom error token → raises at lex time with position.
String: double-quoted, with \" and \\ escapes: /"(?:\\.|[^"\\])*"/.
Operator — the 12 listed above.
Delimiter: the single-character tokens { } ( ) , : and . (dot).
Plus implicit EOF, Chevrotain’s default terminal.
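
The literal patterns above can be exercised directly as JavaScript regexes. This is a sketch for checking the patterns in isolation, not the lexer: the regexes are copied from the list and anchored with ^...$ so each must match the whole input, and classifyNumeric is a hypothetical helper that tries the two rejection patterns before Integer.

```typescript
// Patterns copied from §5.3, anchored so each must match the whole input.
const VARIABLE = /^\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*$/;
const FLOAT_REJECT = /^-?[0-9]+\.[0-9]+$/;
const UNDERSCORE_INT_REJECT = /^-?[0-9]+(_[0-9]+)+$/;
const INTEGER = /^-?[0-9]+$/;
const STRING = /^"(?:\\.|[^"\\])*"$/;

type NumericKind = "Integer" | "FloatRejected" | "UnderscoreIntRejected" | "NotNumeric";

// Hypothetical helper: rejection patterns are tried before Integer.
function classifyNumeric(image: string): NumericKind {
  if (FLOAT_REJECT.test(image)) return "FloatRejected";
  if (UNDERSCORE_INT_REJECT.test(image)) return "UnderscoreIntRejected";
  if (INTEGER.test(image)) return "Integer";
  return "NotNumeric";
}

console.log(classifyNumeric("-5"));        // "Integer"
console.log(classifyNumeric("3.14"));      // "FloatRejected"
console.log(classifyNumeric("1_000_000")); // "UnderscoreIntRejected"

// Why the rejection patterns need priority: an unanchored Integer prefix match
// would silently take "3" from "3.14" and leave ".14" behind.
const integerPrefix = /^-?[0-9]+/.exec("3.14");
console.log(integerPrefix ? integerPrefix[0] : null); // "3"

console.log(VARIABLE.test("$actor.reputation.execution")); // true
console.log(VARIABLE.test("$actor."));                     // false: trailing dot
console.log(STRING.test('"he\\"llo"'));                    // true: escaped quote inside
console.log(STRING.test('"he\\"'));                        // false: closing quote is escaped
```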

Whitespace (space, \t, \r, \n) is skipped (not a category; handled by Chevrotain’s SKIPPED group).

Comments: not tokenized at this layer. The grammar has no comment syntax in Phase 1.

§6. Test strategy

The acceptance-criteria matrix from the task prompt maps to ~15–25 Jest tests in src/__tests__/domains/rules/lexer.test.ts. The verification doc will cite the AC → test map.

§6.1. Positive tests (≥10)

rule AcceptCommitment { tokenizes as Keyword(rule) → Identifier(AcceptCommitment) → LBrace.
All 18 keywords tokenize as Keyword, not Identifier, when they appear standalone.
A keyword as prefix of an identifier (ruleX) tokenizes as Identifier, not rule + X.
$actor.reputation.execution tokenizes as one Variable token with image $actor.reputation.execution.
$actor (no dot path) tokenizes as one Variable with image $actor.
Integer 123, 0, -5 each tokenize as IntegerLiteral.
All 12 operators tokenize as Operator in isolation, each with the right tokenType.name.
a == b and a <= b: each two-char operator lexes as a single token, not as a one-char operator plus a remainder.
String "hello" tokenizes as StringLiteral with image "hello".
String "he\"llo" and "he\\llo": escape handling (escaped quote and escaped backslash).
Unicode identifier règle_日本語 tokenizes as one Identifier.
Full AcceptCommitment snippet from rule-engine.md concept tokenizes end-to-end without errors; token-name sequence matches a golden list.
Line/column tracked: token at row 3, col 5 reports {startLine: 3, startColumn: 5}.

§6.2. Rejection tests (negative — each must produce a lex error with position)

3.14 → lex error, position {line: 1, column: 1}.
1_000_000 → lex error, position {line: 1, column: 1}.
Multi-line: $actor\n  3.14\n  next (continuation lines indented two spaces) → error on line 2, column 3.
Unknown char @ → Chevrotain default “unexpected character” error.
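
The expected positions in these rejection tests can be cross-checked with a small offset-to-position helper. This is a sketch under two assumptions: positions are 1-indexed for both line and column (which §6.3 plans to lock with a test), and the 3.14 in the multi-line case is indented by two spaces, which is what column 3 implies. positionAt is a hypothetical helper, not part of Chevrotain.

```typescript
// Hypothetical helper: 1-indexed line/column of a character offset,
// treating "\n" as the only line break.
function positionAt(text: string, offset: number): { line: number; column: number } {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset && i < text.length; i++) {
    if (text.charAt(i) === "\n") {
      line++;
      column = 1;
    } else {
      column++;
    }
  }
  return { line, column };
}

const multiLine = "$actor\n  3.14\n  next";
console.log(positionAt(multiLine, multiLine.indexOf("3.14"))); // { line: 2, column: 3 }
console.log(positionAt("3.14", 0));                            // { line: 1, column: 1 }
```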

§6.3. Property / boundary tests (3–5)

Empty input → zero tokens, zero errors.
Whitespace-only input → zero tokens, zero errors.
Line + column are 1-indexed (Chevrotain default — verify in one test to lock behavior).

Target: 15–25 tests, 100% coverage on src/domains/rules/lexer.ts.

§7. Dependency graph

New dependency (impl step only):

chevrotain@11.0.3 — added to dependencies. Chevrotain 11 ships its own TS types.

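No change to existing deps. No upgrades, no transitive conflicts expected. Note that Chevrotain 11 is not dependency-free: it installs its own @chevrotain/* sub-packages and lodash-es, so the regenerated package-lock.json will gain those entries.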

Internal imports in lexer.ts:

Only chevrotain. No imports from other src/domains/, src/db/, or src/middleware/. κ lexer is a pure function — it neither reads the DB nor touches MCP transport.

Who imports the lexer (future):

P1.2.2 Parser will import { tokenize, allTokens } from './lexer.js'.
No callers in Phase 0 β / ε / ζ / η / ν / δ axes.

§8. Commit plan

Step	Files touched	Commit
1 (this doc)	`docs/audits/r81-b-p1-2-1-lexer-audit.md`	`audit(r81-b-p1-2-1-lexer): inventory lexer surface + drift flag`
2	`docs/contracts/r81-b-p1-2-1-lexer-contract.md`	`contract(r81-b-p1-2-1-lexer): DSL lexer contract`
3	`docs/packets/r81-b-p1-2-1-lexer-packet.md`	`packet(r81-b-p1-2-1-lexer): execution plan + token matrix`
4	`package.json` + `package-lock.json` + `src/domains/rules/lexer.ts` + `src/__tests__/domains/rules/lexer.test.ts`	`feat(r81-b-p1-2-1-lexer): Chevrotain-based κ DSL lexer (18 keywords, 12 ops, 7 categories)`
5	`docs/verification/r81-b-p1-2-1-lexer-verification.md`	`verify(r81-b-p1-2-1-lexer): test evidence + tokenization traces`

§9. Quality-gate plan

Per CLAUDE.md §5 — all three commands must pass:

npm run build       # tsc, strict
npm run lint        # eslint src
npm test            # jest --coverage

Pre-existing baseline at main 77e579b8: 1085 tests in 26+ suites. Post-task target: 1085 + N where N is the lexer test count (15–25). Pre-existing startup — subprocess smoke flake is not a regression.

§10. Non-goals

No parser (that is P1.2.2 — the next task in R81 Wave 2 or beyond).
No AST construction; lexer.ts returns Chevrotain’s native ILexingResult.
No integration with the β task pipeline, ε skill registry, ζ decision trail, or any MCP tool.
No interpreter, evaluator, rule-registry, or rule-version hash logic.
No comment syntax, no multi-line strings, no string interpolation, no triple-quoted strings.
No float support (deliberately rejected — see §5.3).
No underscore-as-digit-separator (deliberately rejected — see §5.3).
No LSP, formatter, or syntax-highlighting ancillaries.
No ADR rewrite / ADR-007 creation — drift flagged for follow-up only.

§11. Risks + gotchas captured upfront

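Keyword-before-identifier order: if Identifier appears before the keyword tokens in allTokens, rule tokenizes as Identifier. Prevented by declaring all 18 keywords first. The reverse hazard then applies: with first-match lexing, ruleX would split into rule + X, so the keyword tokens are expected to set Chevrotain’s longer_alt option to Identifier; the §6.1 prefix test guards this.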
Variable needs custom pattern — Chevrotain’s default Identifier pattern does not allow the $ prefix. The Variable token uses an explicit regex (see §5.3).
Unicode regex gotcha — without /u flag, \p{XID_Start} does not work in JS regex. The identifier regex uses the u flag explicitly.
Two-char operator ordering: = and ! are not declared as tokens in Phase 1 (assignment is not in the grammar, guards use -> only, and negation is the keyword not), so == and != have no one-char rival. The pairs that do need ordering are <= before <, >= before >, and -> before -. We declare the longer pattern first.
Float and underscore-int ordering — the error-recovery patterns (float-rejected, underscore-int-rejected) must match before the normal Integer pattern. They are declared before IntegerLiteral in the token array.
Jest ESM + ts-jest — existing config works (router/skills/tasks tests pass). No config change needed.
No regressions in existing 1085 tests — new domain, no edits to existing src files.
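
The Unicode regex gotcha listed above is easy to confirm in isolation. A minimal sketch using plain JavaScript regexes (no Chevrotain involved); the variable names are illustrative, and the RegExp constructor is used only to make the flag difference explicit.

```typescript
// Same pattern source, with and without the u flag.
const withU = new RegExp("^\\p{XID_Start}\\p{XID_Continue}*$", "u");
const withoutU = new RegExp("^\\p{XID_Start}\\p{XID_Continue}*$");

console.log(withU.test("règle_日本語"));      // true
console.log(withU.test("AcceptCommitment"));  // true
console.log(withU.test("9lives"));            // false: a digit is not XID_Start
console.log(withoutU.test("règle_日本語"));   // false: \p is not a property escape without u
```

Without the flag the pattern still compiles, so the failure is silent: identifiers simply stop matching rather than the regex throwing.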

§12. Summary

Greenfield Phase 1 κ lexer task. Only production dep change: add chevrotain@11.0.3. All grammar sourced from extraction §1 (authoritative superset) + concept doc + s11/s12 specs. One drift finding flagged: ADR-006-dsl-grammar.md does not exist (follow-up only). Tests live under src/__tests__/domains/rules/ per repo convention, not the prompt’s colocated path. Estimated 15–25 new tests, 100% coverage, zero regressions.

Next step: contract (Step 2 of 5).