R81.B / P1.2.1 — Lexer / Tokenizer — Audit

Step 1 of the 5-step executor chain (audit → contract → packet → implement → verify). First Chevrotain integration in the repo. Greenfield src/domains/rules/.

§1. Surface inventory

§1.1. Target files (greenfield)

| Path | Exists? | Purpose |
|---|---|---|
| src/domains/rules/ | No (greenfield; created in impl step) | κ Rule Engine source root |
| src/domains/rules/lexer.ts | No | Chevrotain-based tokenizer |
| src/__tests__/domains/rules/lexer.test.ts | No | Jest lexer tests (see §1.3 layout reconciliation) |

§1.2. Touched but not owned

| Path | Delta | Purpose |
|---|---|---|
| package.json | add "chevrotain": "11.0.3" to dependencies | Pin exact version (per task prompt + ADR-006 target) |
| package-lock.json | regenerated by npm install | Commit alongside package.json |

§1.3. Test-file layout reconciliation

The task prompt places the test at src/domains/rules/__tests__/lexer.test.ts (colocated under the domain). The shipped Phase 0 convention is tests live under src/__tests__/domains/<name>/, confirmed by inspection:

  • src/__tests__/domains/router/{scoring,fallback}.test.ts
  • src/__tests__/domains/skills/{repository,capability-index}.test.ts
  • src/__tests__/domains/tasks/{repository,tools,writeback}.test.ts
  • src/__tests__/domains/proof/…, src/__tests__/domains/trail/… (same)

Jest testMatch in jest.config.ts accepts both patterns (**/__tests__/**/*.test.ts). To stay consistent with the Phase 0 corpus, the test file will live at src/__tests__/domains/rules/lexer.test.ts (not src/domains/rules/__tests__/lexer.test.ts). This is a convention reconciliation, not a spec deviation; the verification doc will re-cite.

§2. Authoritative grammar sources

The task prompt lists five pre-flight reads. One of them (docs/architecture/decisions/ADR-006-dsl-grammar.md) does not exist — see §3 drift finding. For authoritative grammar, the lexer relies on:

| Source | Path | Weight |
|---|---|---|
| Heritage extraction, full EBNF | docs/reference/extractions/kappa-rule-engine-extraction.md §1 | Authoritative superset (per prompt) |
| Concept doc, EBNF fragment | docs/3-world/physics/laws/rule-engine.md §DSL grammar | Narrower phrasing |
| DSL spec | docs/spec/s12-dsl.md | Load-bearing, high-level |
| Rule engine spec | docs/spec/s11-rule-engine.md | Load-bearing, semantic level |

Where the concept doc and the extraction differ at the surface (e.g. extraction uses guards { } / effects { } block syntax; concept uses guard: / effects: prefix), the extraction wins — the task prompt designates it as the superset and the keyword list (rule, guards, effects, …) matches the extraction.

§3. Drift finding — ADR-006-dsl-grammar does not exist

The task prompt (line 482 of docs/guides/implementation/task-prompts/p1.1-kappa-rule-engine.md) asks the agent to read docs/architecture/decisions/ADR-006-dsl-grammar.md for Chevrotain ratification. This ADR is not in the repo.

  • Actual ADR-006 in repo: docs/architecture/decisions/ADR-006-executable-meaning.md — different subject (executable-meaning, not DSL grammar).
  • The concept doc docs/3-world/physics/laws/rule-engine.md line 206 also dangling-links to ADR-006-dsl-grammar.md.
  • The ADR index (docs/architecture/decisions/index.md) should be cross-checked; it lists ADR-001–006 but none under the name dsl-grammar.

Sigma flagged this drift ahead of dispatch. Scope of this task: note the drift, do not write the ADR. Proceed using the extraction + concept + spec triad cited in §2 as the authoritative grammar.

Follow-up (post-R81 Wave 1):

  • Write ADR-007-dsl-grammar.md (or rename the mislabelled ADR-006 slot if the team decides) ratifying Chevrotain 11.x for κ parser-stack.
  • Fix the dangling ref in docs/3-world/physics/laws/rule-engine.md line 206.
  • Candidate round: R82, or a co-wave of R81 Wave 2 if Sigma schedules it.

The verification doc for this task will re-note the drift so PR reviewers see it too.

§4. Chevrotain version choice

The task prompt pins chevrotain@11.0.3 exactly. Justification:

  • Chevrotain 11.x is the stable major line at project-planning time (pinned per task-breakdown.md and the task prompt).
  • Exact-pin (no ^) in dependencies (not devDependencies) — the lexer is production code, not test scaffolding.
  • No prior Chevrotain entry in package.json (verified: grep -n chevrotain package.json → nothing).
  • @types/chevrotain is not needed in v11 — Chevrotain ships its own TypeScript types. (Pre-existing merkletreejs @types gap is unrelated.)
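
The exact-pin rule above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named isExactPin that is not part of the repo or of npm; it only tests the shape of the version string.

```typescript
// Hypothetical helper (not part of the repo or of npm): returns true only for
// a plain MAJOR.MINOR.PATCH string with no range operator such as ^ or ~.
function isExactPin(range: string): boolean {
  return /^[0-9]+\.[0-9]+\.[0-9]+$/.test(range);
}

// The dependency entry this audit plans to add to package.json.
const plannedDeps: { [name: string]: string } = { chevrotain: "11.0.3" };

console.log(isExactPin(plannedDeps["chevrotain"])); // true
console.log(isExactPin("^11.0.3"));                 // false: caret range
console.log(isExactPin("11.x"));                    // false: wildcard
```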

§5. Token inventory

18 keywords, 12 operators, and 7 token categories, per the task prompt acceptance criteria and the extraction §1 grammar.

§5.1. Keywords (18, declared before Identifier)

  • Structural (7): rule, guards, effects, when, then, if, else
  • Logical (3): and, or, not
  • Literal (2): true, false
  • Action (2): admit, reject
  • Domain anchor (4): admission, transition, consequence, promotion

Source: task prompt + extraction §1 GuardClause = ( Expression | "else" ) "->" Action and Action = "admit" | "reject" STRING. The domain-anchor keywords (admission, transition, consequence, promotion) come from the concept doc’s event-type vocabulary and the task-breakdown spec list.
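
The keyword/identifier split can be illustrated without Chevrotain. The sketch below is a dependency-free stand-in, not the planned implementation: it reads the longest identifier-shaped word first and then looks it up in the keyword list, which yields the outcome the acceptance criteria ask for (ruleX stays one Identifier). The names KEYWORDS, IDENTIFIER, and classifyWord are hypothetical. In Chevrotain itself the same outcome is usually obtained with the longer_alt token option rather than a lookup table.

```typescript
// The 18 keywords listed in §5.1.
const KEYWORDS: string[] = [
  "rule", "guards", "effects", "when", "then", "if", "else",
  "and", "or", "not", "true", "false", "admit", "reject",
  "admission", "transition", "consequence", "promotion",
];

// Built with the RegExp constructor so the "u" flag and the \p{...} classes
// do not depend on the compiler's regex-literal target.
const IDENTIFIER = new RegExp("^\\p{XID_Start}\\p{XID_Continue}*", "u");

type WordKind = "Keyword" | "Identifier" | "None";

// Hypothetical helper: read the longest identifier-shaped prefix of `input`,
// then decide whether that whole word is a keyword.
function classifyWord(input: string): { kind: WordKind; image: string } {
  const m = IDENTIFIER.exec(input);
  if (m === null) {
    return { kind: "None", image: "" };
  }
  const image: string = m[0] || "";
  const kind: WordKind = KEYWORDS.indexOf(image) >= 0 ? "Keyword" : "Identifier";
  return { kind, image };
}

console.log(classifyWord("rule").kind);  // Keyword
console.log(classifyWord("ruleX").kind); // Identifier (not rule + X)
console.log(classifyWord("1abc").kind);  // None (a digit is not XID_Start)
```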

§5.2. Operators (12)

  • Comparison (6): ==, !=, <=, >=, <, >
  • Arithmetic (5): +, -, *, /, %
  • Arrow (1): ->

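Ordering discipline: Chevrotain’s lexer does not pick the longest match; at each offset it takes the first token type in the allTokens array whose pattern matches. The two-char operators that share a first character with a declared one-char operator (<= with <, >= with >, -> with -) must therefore be declared before those prefixes. == and != have no declared one-char prefix (= and ! are not Phase 1 tokens) but are declared in the same leading group. The impl step pins this order.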
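
The effect of declaration order can be seen in a small first-match scanner. This is a dependency-free sketch, not Chevrotain’s API; the token names (Eq, LtEq, Arrow, and so on) and the scanOps helper are hypothetical and chosen only for illustration.

```typescript
// Hypothetical stand-in for a first-match lexer; not Chevrotain's API.
type Op = { name: string; text: string };

// Order under test: every two-char operator precedes the one-char operators.
const longerFirst: Op[] = [
  { name: "Eq", text: "==" }, { name: "NotEq", text: "!=" },
  { name: "LtEq", text: "<=" }, { name: "GtEq", text: ">=" },
  { name: "Arrow", text: "->" },
  { name: "Lt", text: "<" }, { name: "Gt", text: ">" },
  { name: "Plus", text: "+" }, { name: "Minus", text: "-" },
  { name: "Star", text: "*" }, { name: "Slash", text: "/" },
  { name: "Percent", text: "%" },
];

// Scan `input`, always taking the FIRST operator in `ops` that matches at the
// current offset. Spaces are skipped; anything else is reported as "Unknown".
function scanOps(input: string, ops: Op[]): string[] {
  const names: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    if (input.charAt(pos) === " ") { pos += 1; continue; }
    let matched: Op | undefined;
    for (const op of ops) {
      if (input.slice(pos, pos + op.text.length) === op.text) { matched = op; break; }
    }
    if (matched === undefined) { names.push("Unknown"); pos += 1; continue; }
    names.push(matched.name);
    pos += matched.text.length;
  }
  return names;
}

// Same 12 operators, but with the one-char operators moved to the front.
const shorterFirst: Op[] = longerFirst.slice(5).concat(longerFirst.slice(0, 5));

console.log(scanOps("<= ->", longerFirst).join(" "));  // LtEq Arrow
console.log(scanOps("<= ->", shorterFirst).join(" ")); // Lt Unknown Minus Gt
```

With the one-char operators first, <= degrades to Lt followed by an unmatched =, and -> degrades to Minus followed by Gt, which is the failure the ordering rule is meant to prevent.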

§5.3. Token categories (7 + implicit EOF)

  1. Keyword: the 18 strings listed above.
  2. Identifier: Unicode \p{XID_Start}\p{XID_Continue}* with the /u flag. Must follow all 18 keywords in the token array.
  3. Variable: $ followed by a dot-path, /\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/.
  4. Integer: /-?[0-9]+/, with two rejection patterns at higher priority:
    • Float /-?[0-9]+\.[0-9]+/ → custom error token → raises at lex time with position.
    • Underscore-int /-?[0-9]+(_[0-9]+)+/ → custom error token → raises at lex time with position.
  5. String: double-quoted, with backslash escapes for the quote (\") and the backslash (\\): /"(?:\\.|[^"\\])*"/.
  6. Operator: the 12 listed above.
  7. Delimiter: {, }, (, ), the comma, :, and the dot.

Plus implicit EOF (Chevrotain’s default terminal).

Whitespace ( , \t, \r, \n) is skipped (not a category; Chevrotain SKIPPED group).

Comments: not tokenized at this layer. The grammar has no comment syntax in Phase 1.
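
The literal patterns in §5.3 can be exercised directly as plain regexes. The sketch below is dependency-free and is not the planned Chevrotain token definitions; the rule names FloatRejected and UnderscoreIntRejected and the firstLiteral helper are hypothetical. It tries the patterns in priority order, so the two rejection patterns are consulted before the plain integer pattern.

```typescript
type Rule = { name: string; pattern: RegExp };

// Patterns copied from §5.3, anchored at the start of the input.
// Order matters: the rejection patterns come before IntegerLiteral.
const LITERAL_RULES: Rule[] = [
  { name: "Variable", pattern: /^\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/ },
  { name: "FloatRejected", pattern: /^-?[0-9]+\.[0-9]+/ },
  { name: "UnderscoreIntRejected", pattern: /^-?[0-9]+(_[0-9]+)+/ },
  { name: "IntegerLiteral", pattern: /^-?[0-9]+/ },
  { name: "StringLiteral", pattern: /^"(?:\\.|[^"\\])*"/ },
];

// Hypothetical helper: return the first rule that matches at the start of
// `input`, together with the matched text, or null if none matches.
function firstLiteral(input: string): { name: string; image: string } | null {
  for (const rule of LITERAL_RULES) {
    const m = rule.pattern.exec(input);
    if (m !== null) {
      return { name: rule.name, image: m[0] || "" };
    }
  }
  return null;
}

console.log(firstLiteral("$actor.reputation.execution")); // Variable, whole path
console.log(firstLiteral("3.14"));                        // FloatRejected
console.log(firstLiteral("1_000_000"));                   // UnderscoreIntRejected
console.log(firstLiteral("-5"));                          // IntegerLiteral
```

If IntegerLiteral were listed first, 3.14 would match as the integer 3 and the float would never be rejected as a unit.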

§6. Test strategy

The acceptance-criteria matrix from the task prompt maps to 15–25 Jest tests in src/__tests__/domains/rules/lexer.test.ts. The verification doc will cite the AC → test map.

§6.1. Positive tests (≥10)

  1. rule AcceptCommitment { tokenizes as Keyword(rule) → Identifier(AcceptCommitment) → LBrace.
  2. All 18 keywords tokenize as Keyword, not Identifier, when they appear standalone.
  3. A keyword as prefix of an identifier (ruleX) tokenizes as Identifier, not rule + X.
  4. $actor.reputation.execution tokenizes as one Variable token with image $actor.reputation.execution.
  5. $actor (no dot path) tokenizes as one Variable with image $actor.
  6. Integer 123, 0, -5 each tokenize as IntegerLiteral.
  7. All 12 operators tokenize as Operator in isolation, each with the right tokenType.name.
  8. a == b and a <= b: each two-char operator is emitted as one token and is never split at its first character.
  9. String "hello" tokenizes as StringLiteral with image "hello".
  10. Strings with an escaped quote ("he\"llo") and an escaped backslash ("he\\llo") each tokenize as one StringLiteral.
  11. Unicode identifier règle_日本語 tokenizes as one Identifier.
  12. Full AcceptCommitment snippet from rule-engine.md concept tokenizes end-to-end without errors; token-name sequence matches a golden list.
  13. Line/column tracked: token at row 3, col 5 reports {startLine: 3, startColumn: 5}.

§6.2. Rejection tests (negative — each must produce a lex error with position)

  1. 3.14 → lex error, position {line: 1, column: 1}.
  2. 1_000_000 → lex error, position {line: 1, column: 1}.
  3. Multi-line: $actor\n  3.14\n  next (3.14 indented by two spaces) → error on line 2, column 3.
  4. Unknown char @ → Chevrotain default “unexpected character” error.

§6.3. Property / boundary tests (3–5)

  1. Empty input → zero tokens, zero errors.
  2. Whitespace-only input → zero tokens, zero errors.
  3. Line + column are 1-indexed (Chevrotain default — verify in one test to lock behavior).
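
The 1-indexed convention can be pinned down with a small reference calculation that the tests could compare against. This is a sketch under stated assumptions: positionAt is a hypothetical helper, it treats only \n as a line break, and it is independent of how Chevrotain computes startLine and startColumn.

```typescript
// Hypothetical helper: 1-indexed line and column of `offset` within `text`.
// Assumes "\n" is the only line terminator; "\r" counts as an ordinary char.
function positionAt(text: string, offset: number): { line: number; column: number } {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset && i < text.length; i += 1) {
    if (text.charAt(i) === "\n") {
      line += 1;
      column = 1;
    } else {
      column += 1;
    }
  }
  return { line, column };
}

// A float on the second line, indented by two spaces.
const sample = "$actor\n  3.14\n  next";
console.log(positionAt(sample, sample.indexOf("3.14"))); // line 2, column 3
console.log(positionAt("3.14", 0));                      // line 1, column 1
```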

Target: 15–25 tests, 100% coverage on src/domains/rules/lexer.ts.

§7. Dependency graph

New dependency (impl step only):

  • chevrotain@11.0.3 — added to dependencies. Chevrotain 11 ships its own TS types.

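No change to existing deps and no upgrades. Chevrotain 11 brings in a small transitive set (its own @chevrotain/* packages plus lodash-es), so no conflicts are expected; confirm against the regenerated package-lock.json.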

Internal imports in lexer.ts:

  • Only chevrotain. No imports from other src/domains/, src/db/, or src/middleware/. κ lexer is a pure function — it neither reads the DB nor touches MCP transport.

Who imports the lexer (future):

  • P1.2.2 Parser will import { tokenize, allTokens } from './lexer.js';.
  • No callers in Phase 0 β / ε / ζ / η / ν / δ axes.

§8. Commit plan

| Step | Files touched | Commit |
|---|---|---|
| 1 (this doc) | docs/audits/r81-b-p1-2-1-lexer-audit.md | audit(r81-b-p1-2-1-lexer): inventory lexer surface + drift flag |
| 2 | docs/contracts/r81-b-p1-2-1-lexer-contract.md | contract(r81-b-p1-2-1-lexer): DSL lexer contract |
| 3 | docs/packets/r81-b-p1-2-1-lexer-packet.md | packet(r81-b-p1-2-1-lexer): execution plan + token matrix |
| 4 | package.json + package-lock.json + src/domains/rules/lexer.ts + src/__tests__/domains/rules/lexer.test.ts | feat(r81-b-p1-2-1-lexer): Chevrotain-based κ DSL lexer (18 keywords, 12 ops, 7 categories) |
| 5 | docs/verification/r81-b-p1-2-1-lexer-verification.md | verify(r81-b-p1-2-1-lexer): test evidence + tokenization traces |

§9. Quality-gate plan

Per CLAUDE.md §5 — all three commands must pass:

npm run build       # tsc, strict
npm run lint        # eslint src
npm test            # jest --coverage

Pre-existing baseline at main 77e579b8: 1085 tests in 26+ suites. Post-task target: 1085 + N, where N is the lexer test count (15–25). The pre-existing flake in the "startup — subprocess smoke" test is not a regression.

§10. Non-goals

  • No parser (that is P1.2.2 — the next task in R81 Wave 2 or beyond).
  • No AST construction; lexer.ts returns Chevrotain’s native ILexingResult.
  • No integration with the β task pipeline, ε skill registry, ζ decision trail, or any MCP tool.
  • No interpreter, evaluator, rule-registry, or rule-version hash logic.
  • No comment syntax, no multi-line strings, no string interpolation, no triple-quoted strings.
  • No float support (deliberately rejected — see §5.3).
  • No underscore-as-digit-separator (deliberately rejected — see §5.3).
  • No LSP, formatter, or syntax-highlighting ancillaries.
  • No ADR rewrite / ADR-007 creation — drift flagged for follow-up only.

§11. Risks + gotchas captured upfront

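  • Keyword-before-identifier order: if Identifier appears before the keyword tokens in allTokens, rule tokenizes as Identifier. Prevented by declaring all 18 keywords first. Declaration order alone would then split ruleX into rule + X, so each keyword also needs a guard against longer identifiers (in Chevrotain, typically the longer_alt: Identifier token option) to satisfy §6.1 test 3.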
  • Variable needs custom pattern — Chevrotain’s default Identifier pattern does not allow the $ prefix. The Variable token uses an explicit regex (see §5.3).
  • Unicode regex gotcha — without /u flag, \p{XID_Start} does not work in JS regex. The identifier regex uses the u flag explicitly.
  • Two-char operator ordering: <= must be declared before <, >= before >, and -> before -. = and ! are not declared as standalone tokens in Phase 1 (assignment is not in the grammar, which covers guards only), so == and != have no one-char competitor. All two-char operators are still declared first.
  • Float and underscore-int ordering — the error-recovery patterns (float-rejected, underscore-int-rejected) must match before the normal Integer pattern. They are declared before IntegerLiteral in the token array.
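  • Jest ESM + ts-jest: the existing config works for the current suites (router/skills/tasks tests pass). Chevrotain 11 is published as ESM only, so the impl step should confirm that the same config resolves it before assuming no config change is needed.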
  • No regressions in existing 1085 tests — new domain, no edits to existing src files.

§12. Summary

Greenfield Phase 1 κ lexer task. Only production dep change: add chevrotain@11.0.3. All grammar is sourced from extraction §1 (authoritative superset) plus the concept doc and the s11/s12 specs. One drift finding is flagged for follow-up only: ADR-006-dsl-grammar.md does not exist. Tests live under src/__tests__/domains/rules/ per repo convention, not at the prompt’s colocated path. Estimated 15–25 new tests, 100% coverage, zero regressions.

Next step: contract (Step 2 of 5).


Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.