P1.5.4 — Canonical Serialization — Contract
Step 2 of the 5-step chain (CLAUDE.md §6). Approval gate: this contract drives Step 3 (packet) and Step 4 (implement).
§1. Module identity
Module path: src/domains/rules/canonical.ts
Test path: src/__tests__/domains/rules/canonical.test.ts
Outbound imports: none (no parser, no lexer — pure module). Tests will import the parser to express round-trip properties; the canonicalizer itself is dependency-free.
§2. Public surface
2.1 canonicalize(value: unknown): string
Walk value and return the byte-canonical JSON encoding (single-line, no whitespace, sorted object keys, deterministic).
2.2 byteLength(value: unknown): number
Return the UTF-8 byte length of canonicalize(value). Test/diagnostic helper backed by Buffer.byteLength(...).
2.3 No other exports
Internal helpers stay file-local.
§3. Invariants
I1. Determinism (load-bearing)
For all value for which canonicalize returns successfully:
- canonicalize(value) === canonicalize(value) for repeated calls in the same process.
- canonicalize(value) === canonicalize(structuralClone(value)), where structuralClone preserves all values but may reorder object keys at insertion time.
- Two distinct processes (any platform, any Node version ≥ 20) produce byte-identical output for the same value.
I2. Sorted keys at every object level
Whenever the walker encodes an object, its keys are emitted in Array.prototype.sort() (default comparator) order. The default comparator on strings is the JavaScript abstract < operator, which compares 16-bit UTF-16 code units pairwise — equivalent to codepoint order for keys composed of BMP characters and well-defined for surrogate-pair keys.
Forbidden: Intl.Collator, String.prototype.localeCompare, any locale-aware comparator.
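The ordering rule in I2 can be illustrated with a short sketch. The helper name sortedKeys is hypothetical and not part of the public surface; the sketch only shows that the default comparator orders keys by UTF-16 code unit, so uppercase ASCII sorts ahead of lowercase and non-ASCII letters sort after both.

```typescript
// Hypothetical helper (name is an assumption): returns keys in the order I2 requires.
function sortedKeys(obj: Record<string, unknown>): string[] {
  // Default comparator: compares UTF-16 code units, consults no locale.
  return Object.keys(obj).sort();
}

// "Z" is 0x5A, "a" is 0x61, "b" is 0x62, "é" is 0xE9.
const ordered = sortedKeys({ b: 1, a: 2, Z: 3, "é": 4 });
console.log(ordered.join(",")); // Z,a,b,é
```

A locale-aware comparator could place "é" next to "e" or interleave cases, which is why I2 forbids it.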
I3. No whitespace
Output contains only the canonical structural characters ({, }, [, ], ,, :, ", escape sequences, digits, sign, null, true, false) plus the literal characters of strings. No spaces, no newlines, no tabs between tokens.
A \n inside a string literal is encoded as the two-character escape sequence \n (i.e. backslash + n); the canonical output has no raw newline character.
I4. Integer literal form
bigint → decimal-string toString form, no n suffix.
number (only integers per I8) → decimal-string toString form, no 1e3 notation.
No leading zeros, no decimal point, no exponent. (BigInt’s toString() never emits leading zeros.)
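A minimal sketch of I4, assuming a hypothetical file-local helper named encodeInteger. Number.prototype.toString switches to exponent notation for magnitudes of 1e21 and above, so this sketch routes integer-valued numbers through BigInt to keep the plain decimal form. The contract does not prescribe that technique, and TypeError is only a stand-in for the I8 error type.

```typescript
// Hypothetical file-local helper; the name and the BigInt detour are assumptions.
function encodeInteger(value: number | bigint): string {
  if (typeof value === "bigint") {
    return value.toString(); // decimal digits, optional leading "-", no "n" suffix
  }
  if (!Number.isInteger(value)) {
    // NaN, Infinity and fractional values are rejected (I8); TypeError is a stand-in.
    throw new TypeError(`not an integer: ${String(value)}`);
  }
  // BigInt(value) is exact for integer-valued numbers and never prints an exponent.
  return BigInt(value).toString();
}

console.log(encodeInteger(BigInt(13))); // 13
console.log(encodeInteger(1e21));       // 1000000000000000000000
```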
I5. String escape canonical form
Each character c of an input string is encoded as follows:
| Codepoint | Output |
|---|---|
| 0x22 (") | \" |
| 0x5C (\) | \\ |
| 0x08 (BS) | \b |
| 0x0C (FF) | \f |
| 0x0A (LF) | \n |
| 0x0D (CR) | \r |
| 0x09 (HT) | \t |
| 0x00–0x1F (other) | \u00XX (lowercase hex, 4 hex digits, zero-padded) |
| ≥ 0x20 (other than 0x22 and 0x5C) | the literal character (UTF-16 code unit) |
Note: lowercase hex is chosen to match JSON.stringify’s output and reduce diff churn against existing tooling. The / character is not escaped (the \/ form is legal JSON but optional; we don’t use it).
The string is wrapped in double quotes.
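The escape table can be sketched as a loop over UTF-16 code units. The names SHORT_ESCAPES and encodeString are hypothetical; this is one possible shape for the file-local escaper, not the mandated implementation.

```typescript
// Two-character escapes from the I5 table, keyed by UTF-16 code unit.
const SHORT_ESCAPES: Record<number, string> = {
  0x22: '\\"',
  0x5c: "\\\\",
  0x08: "\\b",
  0x0c: "\\f",
  0x0a: "\\n",
  0x0d: "\\r",
  0x09: "\\t",
};

// Hypothetical file-local escaper (name is an assumption).
function encodeString(input: string): string {
  let out = '"';
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const short = SHORT_ESCAPES[unit];
    if (short !== undefined) {
      out += short;
    } else if (unit < 0x20) {
      // Remaining control characters: \u00XX, lowercase hex, zero-padded to 4 digits.
      out += "\\u" + unit.toString(16).padStart(4, "0");
    } else {
      out += input[i]; // literal code unit; "/" is deliberately left unescaped
    }
  }
  return out + '"';
}

console.log(encodeString("hello\nworld")); // "hello\nworld"
```

For strings without lone surrogates this sketch agrees with JSON.stringify, which is a convenient cross-check in tests.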
I6. Idempotence
Two successive canonicalizations of the same input produce byte-identical output. Feeding a canonical string back into canonicalize does not return it unchanged: canonicalize sees a string and emits its quoted, escaped form as a JSON string literal:
canonicalize(canonicalize(x)) === '"' + escape(canonicalize(x)) + '"'
Here escape denotes the string escaping defined in I5.
The stronger AST-level property: if parse(dsl) succeeds and its ast field is a RuleNode[], then
canonicalize(parse(dsl).ast) === canonicalize(parse(dsl).ast)
across separate calls and processes. (Trivially satisfied by I1, but the property test exercises 100 random ASTs.)
I7. Locale independence
canonicalize consults no locale state. Setting process.env.LANG or any Intl.* default at any point during the test run does not alter output.
I8. Error model
For inputs the encoder cannot represent, throw a CanonicalSerializationError (extends Error). Reasons:
- undefined value (JSON has no undefined)
- function value
- non-integer number (NaN, Infinity, or fractional)
- Symbol value
- Non-plain object (Map, Set, Date, RegExp, Promise, class instance whose prototype isn’t Object.prototype or null)
- Cycles (detected by a Set of object refs along the descent path; bound the recursion)
Errors mention the offending value’s runtime tag and a path string (e.g. at /guards/0/condition). Errors are deterministic given the input.
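The error model can be sketched as a recursive walker that carries the path and the set of ancestor objects. The function name walk, the message format, and the use of JSON.stringify as a stand-in for the I5 escaper are all assumptions made for illustration; only CanonicalSerializationError and canonicalize are named by this contract.

```typescript
class CanonicalSerializationError extends Error {
  constructor(tag: string, path: string) {
    super(`cannot canonicalize ${tag} at ${path === "" ? "/" : path}`); // assumed format
    this.name = "CanonicalSerializationError";
    Object.setPrototypeOf(this, new.target.prototype); // keeps instanceof working on ES5 targets
  }
}

// Hypothetical internal walker; `ancestors` holds only objects on the current descent path.
function walk(value: unknown, path: string, ancestors: Set<object>): string {
  if (value === null) return "null";
  if (typeof value === "boolean") return value ? "true" : "false";
  if (typeof value === "string") return JSON.stringify(value); // stand-in for the I5 escaper
  if (typeof value === "bigint") return value.toString();
  if (typeof value === "number") {
    if (!Number.isInteger(value)) throw new CanonicalSerializationError("non-integer number", path);
    return BigInt(value).toString();
  }
  if (typeof value !== "object") {
    throw new CanonicalSerializationError(typeof value, path); // undefined, function, symbol
  }
  const obj = value as object;
  if (ancestors.has(obj)) throw new CanonicalSerializationError("cycle", path);
  const proto: unknown = Object.getPrototypeOf(obj);
  if (!Array.isArray(obj) && proto !== Object.prototype && proto !== null) {
    throw new CanonicalSerializationError(Object.prototype.toString.call(obj), path);
  }
  ancestors.add(obj);
  try {
    if (Array.isArray(obj)) {
      // Array.from visits holes as undefined, so sparse arrays are rejected too.
      const items = Array.from(obj, (item: unknown, i) => walk(item, `${path}/${i}`, ancestors));
      return "[" + items.join(",") + "]";
    }
    const record = obj as Record<string, unknown>;
    const members = Object.keys(record).sort().map(
      (key) => JSON.stringify(key) + ":" + walk(record[key], `${path}/${key}`, ancestors),
    );
    return "{" + members.join(",") + "}";
  } finally {
    ancestors.delete(obj); // a shared (non-cyclic) reference may legitimately appear again
  }
}

function canonicalize(value: unknown): string {
  return walk(value, "", new Set<object>());
}

console.log(canonicalize({ b: 1, a: [true, null] })); // {"a":[true,null],"b":1}
```

Removing each object from the set on the way back up is what distinguishes a true cycle from the same object appearing twice as siblings, which this sketch accepts.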
I9. Pure module
No I/O, no process.* reads, no env reads, no logging, no Date.*, no Math.*, no async. The module passes the determinism corpus self-scan in determinism.test.ts §Group 12.
I10. Stable byteLength
byteLength(value) === Buffer.byteLength(canonicalize(value), 'utf8'). This is a tautology by definition; documented so future refactors don’t decouple them.
§4. Behavioural test matrix (8 fixtures)
Mapped from §IMPLEMENTATION SPEC F1–F7 plus an error-model fixture.
| ID | Description | Acceptance |
|---|---|---|
| F1 | Object with two keys, inserted reverse-alpha | Output keys sorted: {"a":2,"b":1} |
| F2 | bigint 13n | Output 13 (no n, no exponent) |
| F3 | String 'hello\nworld' | Output "hello\nworld" (escape sequence, single line) |
| F4 | Nested array of objects with un-sorted keys | Each object internally sorted |
| F5 | Real DSL → parse via P1.2.2 → canonicalize twice → equal | canonicalize(ast)===canonicalize(ast) |
| F6 | Locale forced to Turkish | Output unchanged from default-locale run |
| F7 | Property: 100 random AST shapes; canonicalize twice idempotent | All 100 pass |
| F8 | Error model: undefined, Symbol, function, NaN, Map, Date all throw CanonicalSerializationError | Each throws with informative message |
Plus baseline coverage:
- null, true, false, empty {}, empty [], deeply-nested mixed
- Negative bigint
- Strings containing every escape (backslash, quote, control char)
- Object key sorting through a randomized insertion order
- Cycle detection (object holding a ref to itself)
§5. Acceptance criteria (CI gate)
- npm run build: TypeScript compiles canonical.ts cleanly under strict mode.
- npm run lint: ESLint reports no errors in canonical.ts or its tests.
- npm test: every fixture above passes; existing 1467-test baseline is preserved (no regression).
- Coverage on canonical.ts ≥ 95% lines, 90% branches.
- Determinism corpus self-scan (determinism.test.ts §Group 12) still green; canonical.ts introduces no forbidden tokens.
§6. Non-functional
- The encoder must be linear in the total node count n, apart from key sorting, which is the only super-linear step and costs O(k log k) for each object with k keys.
- No global state. Re-entrant. Safe to call concurrently (Phase 1 is single-threaded but the contract permits future use).
- No memoization, no caching across calls.
§7. Out of scope
- Decoding JSON (out of scope; consumers use JSON.parse)
- The hash function in P1.5.1 (canonical’s output is its INPUT)
- A streaming variant (Phase 1 corpus easily fits memory)
§8. Open questions
None. The audit closed every spec question. Sign-off: ready to write packet.