R81.B / P1.2.1 — Lexer / Tokenizer — Audit
Step 1 of the 5-step executor chain (audit → contract → packet → implement → verify). First Chevrotain integration in the repo. Greenfield under `src/domains/rules/`.
§1. Surface inventory
§1.1. Target files (greenfield)
| Path | Exists? | Purpose |
|---|---|---|
| `src/domains/rules/` | No (greenfield; created in impl step) | κ Rule Engine source root |
| `src/domains/rules/lexer.ts` | No | Chevrotain-based tokenizer |
| `src/__tests__/domains/rules/lexer.test.ts` | No | Jest lexer tests (see §1.3 layout reconciliation) |
§1.2. Touched but not owned
| Path | Delta | Purpose |
|---|---|---|
| `package.json` | add `"chevrotain": "11.0.3"` to `dependencies` | Pin exact version (per task prompt + ADR-006 target) |
| `package-lock.json` | regenerated by `npm install` | Commit alongside `package.json` |
§1.3. Test-file layout reconciliation
The task prompt places the test at `src/domains/rules/__tests__/lexer.test.ts` (colocated under the domain). The shipped Phase 0 convention is that tests live under `src/__tests__/domains/<name>/`, confirmed by inspection:
- `src/__tests__/domains/router/{scoring,fallback}.test.ts`
- `src/__tests__/domains/skills/{repository,capability-index}.test.ts`
- `src/__tests__/domains/tasks/{repository,tools,writeback}.test.ts`
- `src/__tests__/domains/proof/…`, `src/__tests__/domains/trail/…` (same)
Jest testMatch in jest.config.ts accepts both patterns (**/__tests__/**/*.test.ts). To stay consistent with the Phase 0 corpus, the test file will live at src/__tests__/domains/rules/lexer.test.ts (not src/domains/rules/__tests__/lexer.test.ts). This is a convention reconciliation, not a spec deviation; the verification doc will re-cite.
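The claim that both layouts satisfy the glob can be checked by hand. The sketch below is a simplified stand-in, not Jest's matcher (Jest delegates to micromatch); the helper `globToRegExp` is hypothetical and only understands the two wildcards used in this pattern.

```typescript
// Simplified glob-to-regex conversion (hypothetical helper, not Jest's own code).
// Handles only "**/" and "*", which is all the testMatch pattern needs.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === "**/") {
      source += "(?:[^/]+/)*"; // zero or more directory segments
      i += 3;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // any run of characters inside one segment
      i += 1;
    } else {
      // Escape regex metacharacters so literal path characters match themselves.
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

const testMatch = globToRegExp("**/__tests__/**/*.test.ts");
const phase0Layout = "src/__tests__/domains/rules/lexer.test.ts";
const promptLayout = "src/domains/rules/__tests__/lexer.test.ts";
const noTestsDir = "src/domains/rules/lexer.test.ts";

console.log(testMatch.test(phase0Layout), testMatch.test(promptLayout), testMatch.test(noTestsDir));
```

Both candidate paths match because `**/` may consume zero or more directories on either side of `__tests__/`; a path with no `__tests__` segment does not.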
§2. Authoritative grammar sources
The task prompt lists five pre-flight reads. One of them (docs/architecture/decisions/ADR-006-dsl-grammar.md) does not exist — see §3 drift finding. For authoritative grammar, the lexer relies on:
| Source | Path | Weight |
|---|---|---|
| Heritage extraction, full EBNF | `docs/reference/extractions/kappa-rule-engine-extraction.md` §1 | Authoritative superset (per prompt) |
| Concept doc, EBNF fragment | `docs/3-world/physics/laws/rule-engine.md` §DSL grammar | Narrower phrasing |
| DSL spec | `docs/spec/s12-dsl.md` | Load-bearing, high-level |
| Rule engine spec | `docs/spec/s11-rule-engine.md` | Load-bearing, semantic level |
Where the concept doc and the extraction differ at the surface (e.g. extraction uses guards { } / effects { } block syntax; concept uses guard: / effects: prefix), the extraction wins — the task prompt designates it as the superset and the keyword list (rule, guards, effects, …) matches the extraction.
§3. Drift finding — ADR-006-dsl-grammar does not exist
The task prompt (line 482 of docs/guides/implementation/task-prompts/p1.1-kappa-rule-engine.md) asks the agent to read docs/architecture/decisions/ADR-006-dsl-grammar.md for Chevrotain ratification. This ADR is not in the repo.
- Actual ADR-006 in repo: `docs/architecture/decisions/ADR-006-executable-meaning.md`, a different subject (executable meaning, not DSL grammar).
- The concept doc `docs/3-world/physics/laws/rule-engine.md` line 206 also dangling-links to `ADR-006-dsl-grammar.md`.
- The ADR index (`docs/architecture/decisions/index.md`) should be cross-checked; it lists ADR-001–006 but none under the name `dsl-grammar`.
Sigma flagged this drift ahead of dispatch. Scope of this task: note the drift, do not write the ADR. Proceed using the extraction + concept + spec triad cited in §2 as the authoritative grammar.
Follow-up (post-R81 Wave 1):
- Write `ADR-007-dsl-grammar.md` (or rename the mislabelled `ADR-006` slot if the team decides) ratifying Chevrotain 11.x for κ parser-stack.
- Fix the dangling ref in `docs/3-world/physics/laws/rule-engine.md` line 206.
- Candidate round: R82, or a co-wave of R81 Wave 2 if Sigma schedules it.
The verification doc for this task will re-note the drift so PR reviewers see it too.
§4. Chevrotain version choice
The task prompt pins chevrotain@11.0.3 exactly. Justification:
- Chevrotain 11.x is the stable major line at project-planning time (pinned per task-breakdown.md and the task prompt).
- Exact-pin (no `^`) in `dependencies` (not `devDependencies`): the lexer is production code, not test scaffolding.
- No prior Chevrotain entry in `package.json` (verified: `grep -n chevrotain package.json` → nothing).
- `@types/chevrotain` is not needed in v11: Chevrotain ships its own TypeScript types. (Pre-existing `merkletreejs` `@types` gap is unrelated.)
§5. Token inventory
18 keywords, 12 operators, and 7 token categories, per task prompt acceptance criteria and the extraction §1 grammar.
§5.1. Keywords (18, longest-match priority before Identifier)
Structural: rule, guards, effects, when, then, if, else
Logical: and, or, not
Literal: true, false
Action: admit, reject
Domain anchor: admission, transition, consequence, promotion
Source: task prompt + extraction §1 GuardClause = ( Expression | "else" ) "->" Action and Action = "admit" | "reject" STRING. The domain-anchor keywords (admission, transition, consequence, promotion) come from the concept doc’s event-type vocabulary and the task-breakdown spec list.
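As a cross-check on the count and on the whole-word requirement, the keyword inventory can be written out as data. This is a plain TypeScript sketch, not the Chevrotain token definitions; `KEYWORDS` and `classifyWord` are hypothetical names. In Chevrotain itself the usual way to get the same whole-word behavior is the `longer_alt` option on each keyword token, pointing at the Identifier token.

```typescript
// The 18 keywords from the inventory above, grouped as in the audit.
const KEYWORDS: readonly string[] = [
  "rule", "guards", "effects", "when", "then", "if", "else", // structural
  "and", "or", "not",                                         // logical
  "true", "false",                                            // literal
  "admit", "reject",                                          // action
  "admission", "transition", "consequence", "promotion",      // domain anchor
];

type WordKind = "Keyword" | "Identifier";

// Classify a complete word: only an exact, whole-word match counts as a keyword,
// so a keyword that is merely a prefix of a longer word stays an identifier.
function classifyWord(word: string): WordKind {
  return KEYWORDS.indexOf(word) !== -1 ? "Keyword" : "Identifier";
}

console.log(KEYWORDS.length);
console.log(classifyWord("rule"), classifyWord("ruleX"), classifyWord("admission"));
```

The sketch assumes the input has already been split into words; the real lexer has to make the same decision while scanning, which is where token order and `longer_alt` come in.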
§5.2. Operators (12)
Comparison (6): ==, !=, <=, >=, <, >
Arithmetic (5): +, -, *, /, %
Arrow (1): ->
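Ordering discipline: the two-char ops (`==`, `!=`, `<=`, `>=`, `->`) must be declared before any single-char token that is a prefix of one of them (`<`, `>`, `-`; `=` and `!` are not tokens in Phase 1). Chevrotain's lexer does not pick the longest match: it takes the first token type in the allTokens array whose pattern matches, so a single-char op declared earlier would split the two-char op. The impl step pins this order.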
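The effect of declaration order can be seen with a small first-match scanner. This is a hypothetical stand-in written in plain TypeScript to mimic first-match-wins behavior; it is not Chevrotain code, and `scanOperators` is an illustrative name.

```typescript
// First-match scanner: at each position, try the candidates in array order and
// take the first one that is a prefix of the remaining input.
function scanOperators(input: string, order: readonly string[]): string[] {
  const out: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = "";
    for (const op of order) {
      if (input.slice(pos, pos + op.length) === op) {
        matched = op;
        break; // first match wins, even if a longer candidate comes later
      }
    }
    if (matched === "") {
      throw new Error("unexpected character at offset " + pos);
    }
    out.push(matched);
    pos += matched.length;
  }
  return out;
}

// Two-char operators first (the order the audit requires) versus last.
const longerFirst = ["==", "!=", "<=", ">=", "->", "<", ">", "+", "-", "*", "/", "%"];
const shorterFirst = ["<", ">", "+", "-", "*", "/", "%", "==", "!=", "<=", ">=", "->"];

console.log(scanOperators("->", longerFirst));  // one token
console.log(scanOperators("->", shorterFirst)); // split into two tokens
```

With the two-char operators declared first, `->` comes out as a single token; with them declared last, the same input is split into `-` followed by `>`.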
§5.3. Token categories (7 + implicit EOF)
- Keyword: 18 strings listed above.
- Identifier: Unicode `\p{XID_Start}\p{XID_Continue}*` with `/u` flag. Must follow all 18 keywords in the token array.
- Variable: `$` + dot-path: `/\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/`.
- Integer: `/-?[0-9]+/` with two rejection patterns at higher priority:
  - Float `/-?[0-9]+\.[0-9]+/` → custom error token → raises at lex time with position.
  - Underscore-int `/-?[0-9]+(_[0-9]+)+/` → custom error token → raises at lex time with position.
- String: double-quoted with `\"` and `\\` escapes: `/"(?:\\.|[^"\\])*"/`.
- Operator: the 12 listed above.
- Delimiter: `{`, `}`, `(`, `)`, `,`, `:`, `.`.

Plus implicit EOF (Chevrotain's default terminal).

Whitespace (space, `\t`, `\r`, `\n`) is skipped (not a category; Chevrotain SKIPPED group).
Comments: not tokenized at this layer. The grammar has no comment syntax in Phase 1.
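The category patterns can be exercised directly before they are wrapped in token definitions. The sketch below copies the regexes from the list above, anchors them at the start of the input, and tries the two rejection patterns before the plain integer pattern. The names `classifyNumeric` and `matchImage` are hypothetical, and this is plain TypeScript rather than Chevrotain code.

```typescript
// Patterns from the category list, anchored with ^ so they apply at the input start.
const FLOAT = /^-?[0-9]+\.[0-9]+/;
const UNDERSCORE_INT = /^-?[0-9]+(_[0-9]+)+/;
const INTEGER = /^-?[0-9]+/;
const VARIABLE = /^\$[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*/;
const STRING = /^"(?:\\.|[^"\\])*"/;

type NumericKind = "FloatRejected" | "UnderscoreIntRejected" | "IntegerLiteral" | "NotNumeric";

// Rejection patterns are checked first, mirroring their higher priority.
// If INTEGER ran first, "3.14" would yield the integer "3" and leave ".14" behind.
function classifyNumeric(text: string): NumericKind {
  if (FLOAT.test(text)) return "FloatRejected";
  if (UNDERSCORE_INT.test(text)) return "UnderscoreIntRejected";
  if (INTEGER.test(text)) return "IntegerLiteral";
  return "NotNumeric";
}

// Return the matched text (the token image) or null when the pattern does not apply.
function matchImage(pattern: RegExp, text: string): string | null {
  const m = pattern.exec(text);
  return m ? m[0] : null;
}

console.log(classifyNumeric("3.14"), classifyNumeric("1_000_000"), classifyNumeric("-5"));
console.log(matchImage(VARIABLE, "$actor.reputation.execution"));
console.log(matchImage(STRING, '"he\\"llo" rest'));
```

Note that the integer pattern on its own still matches the `3` in `3.14`, which is why the rejection patterns have to come first.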
§6. Test strategy
The acceptance-criteria matrix from the task prompt maps to 15–25 Jest tests in `src/__tests__/domains/rules/lexer.test.ts`. The verification doc will cite the AC → test map.
§6.1. Positive tests (≥10)
- `rule AcceptCommitment {` tokenizes as Keyword(rule) → Identifier(AcceptCommitment) → LBrace.
- All 18 keywords tokenize as Keyword, not Identifier, when they appear standalone.
- A keyword as prefix of an identifier (`ruleX`) tokenizes as Identifier, not `rule` + `X`.
- `$actor.reputation.execution` tokenizes as one Variable token with image `$actor.reputation.execution`.
- `$actor` (no dot path) tokenizes as one Variable with image `$actor`.
- Integer `123`, `0`, `-5` each tokenize as IntegerLiteral.
- All 12 operators tokenize as Operator in isolation, each with the right `tokenType.name`.
- `a == b` and `a <= b`: two-char ops beat one-char at start.
- String `"hello"` tokenizes as StringLiteral with image `"hello"`.
- String `"he\"llo"` and `"he\\llo"`: escape handling.
- Unicode identifier `règle_日本語` tokenizes as one Identifier.
- Full AcceptCommitment snippet from rule-engine.md concept tokenizes end-to-end without errors; token-name sequence matches a golden list.
- Line/column tracked: token at row 3, col 5 reports `{startLine: 3, startColumn: 5}`.
§6.2. Rejection tests (negative — each must produce a lex error with position)
- `3.14` → lex error, position `{line: 1, column: 1}`.
- `1_000_000` → lex error, position `{line: 1, column: 1}`.
- Multi-line: `$actor\n  3.14\n next` → error on line 2, column 3 (two leading spaces before `3.14`).
- Unknown char `@` → Chevrotain default “unexpected character” error.
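The expected positions in the rejection tests follow from 1-indexed line and column counting. The helper below is a hypothetical illustration in plain TypeScript (Chevrotain computes positions itself); it assumes `\n` is the only line terminator and that the indented line in the multi-line case carries two leading spaces, which is what column 3 implies.

```typescript
interface Position {
  line: number;
  column: number;
}

// Convert a 0-based character offset into a 1-based line/column pair.
// Assumption: "\n" is the only line terminator in the input.
function positionOf(input: string, offset: number): Position {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset && i < input.length; i++) {
    if (input.charAt(i) === "\n") {
      line += 1;
      column = 1; // a new line restarts the column count
    } else {
      column += 1;
    }
  }
  return { line, column };
}

const multiLine = "$actor\n  3.14\n next";
const floatOffset = multiLine.indexOf("3.14");

console.log(positionOf("3.14", 0));             // first character of the input
console.log(positionOf(multiLine, floatOffset)); // the float on the second line
```

A float at the very start of the input lands on line 1, column 1, and the indented float in the multi-line case lands on line 2, column 3.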
§6.3. Property / boundary tests (3–5)
- Empty input → zero tokens, zero errors.
- Whitespace-only input → zero tokens, zero errors.
- Line + column are 1-indexed (Chevrotain default — verify in one test to lock behavior).
Target: 15–25 tests, 100% coverage on src/domains/rules/lexer.ts.
§7. Dependency graph
New dependency (impl step only):
- `chevrotain@11.0.3`, added to `dependencies`. Chevrotain 11 ships its own TS types.
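No change to existing deps. No upgrades, no transitive conflicts expected. Chevrotain 11 does bring its own runtime dependencies (its `@chevrotain/*` packages and `lodash-es`); these arrive through the regenerated package-lock.json and should be confirmed at install time.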
Internal imports in lexer.ts:
- Only `chevrotain`. No imports from other `src/domains/`, `src/db/`, or `src/middleware/`. κ lexer is a pure function: it neither reads the DB nor touches MCP transport.
Who imports the lexer (future):
- P1.2.2 Parser will `import { tokenize, allTokens } from './lexer.js';`.
- No callers in Phase 0 β / ε / ζ / η / ν / δ axes.
§8. Commit plan
| Step | Files touched | Commit |
|---|---|---|
| 1 (this doc) | `docs/audits/r81-b-p1-2-1-lexer-audit.md` | `audit(r81-b-p1-2-1-lexer): inventory lexer surface + drift flag` |
| 2 | `docs/contracts/r81-b-p1-2-1-lexer-contract.md` | `contract(r81-b-p1-2-1-lexer): DSL lexer contract` |
| 3 | `docs/packets/r81-b-p1-2-1-lexer-packet.md` | `packet(r81-b-p1-2-1-lexer): execution plan + token matrix` |
| 4 | `package.json` + `package-lock.json` + `src/domains/rules/lexer.ts` + `src/__tests__/domains/rules/lexer.test.ts` | `feat(r81-b-p1-2-1-lexer): Chevrotain-based κ DSL lexer (18 keywords, 12 ops, 7 categories)` |
| 5 | `docs/verification/r81-b-p1-2-1-lexer-verification.md` | `verify(r81-b-p1-2-1-lexer): test evidence + tokenization traces` |
§9. Quality-gate plan
Per CLAUDE.md §5 — all three commands must pass:
npm run build # tsc, strict
npm run lint # eslint src
npm test # jest --coverage
Pre-existing baseline at main 77e579b8: 1085 tests in 26+ suites. Post-task target: 1085 + N where N is the lexer test count (15–25). Pre-existing startup — subprocess smoke flake is not a regression.
§10. Non-goals
- No parser (that is P1.2.2 — the next task in R81 Wave 2 or beyond).
- No AST construction; `lexer.ts` returns Chevrotain’s native `ILexingResult`.
- No integration with the β task pipeline, ε skill registry, ζ decision trail, or any MCP tool.
- No interpreter, evaluator, rule-registry, or rule-version hash logic.
- No comment syntax, no multi-line strings, no string interpolation, no triple-quoted strings.
- No float support (deliberately rejected — see §5.3).
- No underscore-as-digit-separator (deliberately rejected — see §5.3).
- No LSP, formatter, or syntax-highlighting ancillaries.
- No ADR rewrite / ADR-007 creation — drift flagged for follow-up only.
§11. Risks + gotchas captured upfront
- Keyword-before-identifier order: if `Identifier` appears before the keyword tokens in `allTokens`, `rule` tokenizes as Identifier. Prevented by declaring all 18 keywords first. Because Chevrotain takes the first matching pattern, each keyword token also needs `longer_alt: Identifier`; otherwise `ruleX` lexes as `rule` + `X` instead of one Identifier.
- Variable needs custom pattern: the Identifier pattern (§5.3) does not allow the `$` prefix. The Variable token uses an explicit regex (see §5.3).
- Unicode regex gotcha: without the `u` flag, `\p{XID_Start}` does not work in JS regex. The identifier regex uses the `u` flag explicitly.
- Two-char operator ordering: `==` must beat `=` (though we don’t declare `=` as a separate token in Phase 1 since assignment is not in the grammar; guards use `->` only). Still, `!=` before `!`, `<=` before `<`, `>=` before `>`, `->` before `-`. We declare longer first.
- Float and underscore-int ordering: the error-recovery patterns (float-rejected, underscore-int-rejected) must match before the normal Integer pattern. They are declared before `IntegerLiteral` in the token array.
- Jest ESM + ts-jest: existing config works (router/skills/tasks tests pass). No config change needed.
- No regressions in existing 1085 tests — new domain, no edits to existing src files.
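The Unicode gotcha in the list above is easy to reproduce outside the lexer. The sketch below builds the identifier pattern twice with the `RegExp` constructor, once with the `u` flag and once without, so the difference is visible. The variable names are illustrative, and the pattern is anchored with `^` and `$` for whole-string testing, which the token pattern itself would not be.

```typescript
// Same source text, compiled with and without the "u" flag.
const identifierSource = "^\\p{XID_Start}\\p{XID_Continue}*$";
const withUnicode = new RegExp(identifierSource, "u");
const withoutUnicode = new RegExp(identifierSource);

// With "u", \p{...} is a Unicode property escape, so non-ASCII identifiers match.
console.log(withUnicode.test("règle_日本語"));
console.log(withUnicode.test("AcceptCommitment"));

// "$" is not XID_Start, which is why Variable needs its own pattern.
console.log(withUnicode.test("$actor"));

// Without "u", \p is just the letter "p" followed by literal braces,
// so the pattern silently fails to match ordinary identifiers.
console.log(withoutUnicode.test("rule"));
```

The failure mode without the flag is silent: the pattern compiles but matches the wrong language, so a dedicated test for a non-ASCII identifier is the practical guard.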
§12. Summary
Greenfield Phase 1 κ lexer task. Only production dep change: add chevrotain@11.0.3. All grammar sourced from extraction §1 (authoritative superset) + concept doc + s11/s12 specs. One drift finding flagged: ADR-006-dsl-grammar.md does not exist (follow-up only). Tests live under `src/__tests__/domains/rules/` per repo convention, not at the prompt’s colocated path. Estimated 15–25 new tests, 100% coverage, zero regressions.
Next step: contract (Step 2 of 5).