P1.5.4 — Canonical Serialization — Contract

Step 2 of the 5-step chain (CLAUDE.md §6). Approval gate: this contract drives Step 3 (packet) and Step 4 (implement).

§1. Module identity

Module path: src/domains/rules/canonical.ts
Test path: src/__tests__/domains/rules/canonical.test.ts

Outbound imports: none (no parser, no lexer — pure module). Tests will import the parser to express round-trip properties; the canonicalizer itself is dependency-free.

§2. Public surface

2.1 canonicalize(value: unknown): string

Walk value and return the byte-canonical JSON encoding (single-line, no whitespace, sorted object keys, deterministic).

2.2 byteLength(value: unknown): number

Return the UTF-8 byte length of canonicalize(value). Test/diagnostic helper backed by Buffer.byteLength(...).

2.3 No other exports

Internal helpers stay file-local.

§3. Invariants

I1. Determinism (load-bearing)

For every value on which canonicalize returns successfully:

  • canonicalize(value) === canonicalize(value) for repeated calls in the same process.
  • canonicalize(value) === canonicalize(structuralClone(value)) where structuralClone preserves all values but may reorder object keys at insertion time.
  • Two distinct processes (any platform, any Node version ≥ 20) produce byte-identical output for the same value.

I2. Sorted keys at every object level

Whenever the walker encodes an object, its keys are emitted in the order produced by Array.prototype.sort() with no comparator argument. For strings, that default ordering compares UTF-16 code units pairwise, as the JavaScript < operator does. It coincides with code point order when every key consists of BMP characters; for keys containing surrogate pairs it remains well-defined but can differ from code point order.

Forbidden: Intl.Collator, String.prototype.localeCompare, any locale-aware comparator.
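
The ordering rule in I2 can be sketched as follows. The helper name sortedKeys is hypothetical and not part of the contract; the snippet only shows how the default comparator treats a few awkward key sets.

```typescript
// Hypothetical helper (not part of the contract's public surface):
// returns an object's own enumerable string keys in default-sort order.
function sortedKeys(obj: Record<string, unknown>): string[] {
  // No comparator argument: keys are compared as strings, by UTF-16 code unit.
  return Object.keys(obj).sort();
}

// Uppercase ASCII sorts before lowercase; non-ASCII sorts after both.
const mixedCase = sortedKeys({ b: 1, a: 2, B: 3, "\u00E4": 4 });
// → ["B", "a", "b", "ä"]

// Integer-like keys are enumerated numerically by the engine ("9" before "10"),
// but the default sort compares them as strings, so "10" comes first.
const numericLooking = sortedKeys({ "9": 0, "10": 0 });
// → ["10", "9"]

// U+1F600 is stored as the surrogate pair D83D DE00, so it sorts before
// U+FF5E even though its code point is larger.
const astral = sortedKeys({ "\uFF5E": 0, "\u{1F600}": 0 });
// → ["\u{1F600}", "\uFF5E"]
```

The second case is why the explicit sort matters even for objects built in a fixed insertion order: engine enumeration order and default-sort order are not the same thing.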

I3. No whitespace

Output contains only the canonical structural characters ({, }, [, ], ,, :, ", escape sequences, digits, sign, null, true, false) plus the literal characters of strings. No spaces, no newlines, no tabs between tokens.

A newline character (U+000A) inside a string is encoded as the two-character escape sequence \n (backslash followed by n); the canonical output never contains a raw newline character.

I4. Integer literal form

bigint → decimal-string toString form, no n suffix. number (only integers, per I8) → decimal-string toString form, no 1e3 notation. No leading zeros, no decimal point, no exponent. (BigInt’s toString() never emits leading zeros.)
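
A minimal sketch of the I4 rule, under stated assumptions: the helper name encodeInteger and the detour through BigInt are illustrative choices, not something the contract prescribes. The detour is there because Number.prototype.toString switches to exponent form for magnitudes of 1e21 and above, which would conflict with the "no exponent" requirement.

```typescript
// Hypothetical helper illustrating I4; the name and the BigInt detour are assumptions.
function encodeInteger(value: number | bigint): string {
  if (typeof value === "bigint") {
    // Decimal digits with an optional leading "-", never an "n" suffix.
    return value.toString();
  }
  if (!Number.isInteger(value)) {
    // NaN, Infinity, -Infinity and fractional numbers are rejected (see I8).
    throw new RangeError("non-integer number");
  }
  // Integral numbers go through BigInt so large magnitudes keep plain digits.
  return BigInt(value).toString();
}

const small = encodeInteger(13n);      // "13"
const negative = encodeInteger(-7n);   // "-7"
const thousand = encodeInteger(1000);  // "1000"
const huge = encodeInteger(1e21);      // "1000000000000000000000"
const naive = String(1e21);            // "1e+21" (what plain toString would give)
```

An implementation could equally reject numbers outside the safe-integer range; the contract text does not say which, so this sketch leaves that choice open.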

I5. String escape canonical form

Each character c of an input string is encoded as follows:

| Codepoint | Output |
|---|---|
| 0x22 (") | \" |
| 0x5C (\) | \\ |
| 0x08 (BS) | \b |
| 0x0C (FF) | \f |
| 0x0A (LF) | \n |
| 0x0D (CR) | \r |
| 0x09 (HT) | \t |
| 0x00–0x1F (other) | \u00XX (lowercase hex, 4 hex digits, zero-padded) |
| ≥ 0x20 | the literal character (UTF-16 code unit) |

Note: lowercase hex is chosen to match JSON.stringify’s output and reduce diff churn against existing tooling. The / character is not escaped (the \/ form is legal JSON but optional; we don’t use it).

The string is wrapped in double quotes.
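
The escape rules of I5 can be sketched as a single pass over UTF-16 code units. The names encodeString and SHORT_ESCAPES are hypothetical; this is an illustration of the rules as listed, not the module's actual code.

```typescript
// Short escapes from I5, keyed by code unit (hypothetical constant name).
const SHORT_ESCAPES = new Map<number, string>([
  [0x22, '\\"'],
  [0x5c, "\\\\"],
  [0x08, "\\b"],
  [0x0c, "\\f"],
  [0x0a, "\\n"],
  [0x0d, "\\r"],
  [0x09, "\\t"],
]);

// Hypothetical helper: encodes one string as a quoted JSON string literal.
function encodeString(input: string): string {
  let out = '"';
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const short = SHORT_ESCAPES.get(unit);
    if (short !== undefined) {
      out += short;
    } else if (unit < 0x20) {
      // Remaining control characters: \u00XX with lowercase, zero-padded hex.
      out += "\\u" + unit.toString(16).padStart(4, "0");
    } else {
      // Everything from 0x20 upward is copied through as-is, including "/".
      out += input.charAt(i);
    }
  }
  return out + '"';
}

const twoLines = encodeString("hello\nworld"); // "hello\nworld" with a backslash and an n
const controls = encodeString("\u0001\u001f"); // "\u0001\u001f"
```

For strings without lone surrogates this produces the same text as JSON.stringify, which is the compatibility the note above aims for.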

I6. Idempotence

Two successive canonicalizations of the same input produce byte-identical output (see I1). Feeding the output back into canonicalize is a different operation: the second call sees a string and emits it as a quoted JSON string literal, so

canonicalize(canonicalize(x)) === '"' + escape(canonicalize(x)) + '"'

where escape is the I5 string escaping. The result of the second call is a wrapped copy of the first result, not the first result itself.
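
A small illustration of the equation above, using JSON.stringify as a stand-in for the I5 string encoder (the two agree for strings without lone surrogates). The canonical form of fixture F1 is written out by hand; no canonicalize implementation is assumed here.

```typescript
// Canonical form of fixture F1, written by hand.
const once = '{"a":2,"b":1}';

// Stand-in for canonicalize(once): a string input becomes a quoted,
// escaped JSON string literal.
const twice = JSON.stringify(once);
// twice holds the text "{\"a\":2,\"b\":1}" including the outer quotes.

// Decoding the second result recovers the first one unchanged.
const recovered: unknown = JSON.parse(twice);
```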

The stronger AST-level property: if parse(dsl) succeeds and its ast field is a RuleNode[], then

canonicalize(parse(dsl).ast) === canonicalize(parse(dsl).ast)

across separate calls and processes. (Trivially satisfied by I1, but the property test exercises 100 random ASTs.)

I7. Locale independence

canonicalize consults no locale state. Setting process.env.LANG or any Intl.* default at any point during the test run does not alter output.

I8. Error model

For inputs the encoder cannot represent, throw a CanonicalSerializationError (extends Error). Reasons:

  • undefined value (JSON has no undefined)
  • function value
  • non-integer number (NaN, Infinity, or fractional)
  • Symbol value
  • Non-plain object (Map, Set, Date, RegExp, Promise, class instance whose prototype isn’t Object.prototype or null)
  • Cycles (detected by a Set of object refs along the descent path; bound the recursion)

Errors mention the offending value’s runtime tag and a path string (e.g. at /guards/0/condition). Errors are deterministic given the input.
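
The error model can be sketched end to end as below. Only canonicalize and CanonicalSerializationError are names from this contract; encode, tagOf, isPlainObject, the message wording, and the unescaped path segments are assumptions made for illustration. JSON.stringify stands in for the I5 string encoder.

```typescript
class CanonicalSerializationError extends Error {
  constructor(tag: string, path: string) {
    super(`cannot canonicalize ${tag} at ${path === "" ? "/" : path}`);
    this.name = "CanonicalSerializationError";
  }
}

// Runtime tag such as "[object Map]" or "[object Date]".
function tagOf(value: object): string {
  return Object.prototype.toString.call(value);
}

function isPlainObject(value: object): boolean {
  const proto: unknown = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

function encode(value: unknown, path: string, ancestors: Set<object>): string {
  if (typeof value === "boolean") return value ? "true" : "false";
  if (typeof value === "bigint") return value.toString();
  if (typeof value === "string") return JSON.stringify(value); // stand-in for I5
  if (typeof value === "number") {
    if (!Number.isInteger(value)) {
      throw new CanonicalSerializationError("non-integer number", path);
    }
    return BigInt(value).toString();
  }
  if (typeof value !== "object") {
    // undefined, function, symbol
    throw new CanonicalSerializationError(typeof value, path);
  }
  if (value === null) return "null";
  if (ancestors.has(value)) throw new CanonicalSerializationError("cycle", path);
  if (!Array.isArray(value) && !isPlainObject(value)) {
    throw new CanonicalSerializationError(tagOf(value), path);
  }
  ancestors.add(value); // track only the current descent path
  try {
    if (Array.isArray(value)) {
      // Array.from visits holes as undefined, so sparse arrays are rejected too.
      const items = Array.from(value, (item: unknown, i) =>
        encode(item, `${path}/${i}`, ancestors));
      return "[" + items.join(",") + "]";
    }
    const record = value as Record<string, unknown>;
    const members = Object.keys(record).sort().map((key) =>
      JSON.stringify(key) + ":" + encode(record[key], `${path}/${key}`, ancestors));
    return "{" + members.join(",") + "}";
  } finally {
    ancestors.delete(value); // a shared, non-cyclic reference is not an error
  }
}

function canonicalize(value: unknown): string {
  return encode(value, "", new Set<object>());
}
```

Because the Set holds only ancestors on the current path, the same object may appear twice as a sibling without being reported as a cycle.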

I9. Pure module

No I/O, no process.* reads, no env reads, no logging, no Date.*, no Math.*, no async. The module passes the determinism corpus self-scan in determinism.test.ts §Group 12.

I10. Stable byteLength

byteLength(value) === Buffer.byteLength(canonicalize(value), 'utf8'). This holds by definition; it is documented so that future refactors do not decouple the two functions.
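
A short sketch of why the byte length is defined on the UTF-8 encoding rather than on string length. The helper name utf8ByteLength is hypothetical, the canonical strings are written out by hand, and Buffer is assumed to be the Node.js global.

```typescript
// Hypothetical helper: UTF-8 byte length of an already-canonical string.
function utf8ByteLength(canonical: string): number {
  return Buffer.byteLength(canonical, "utf8");
}

const asciiOnly = '{"a":2,"b":1}'; // 13 code units, 13 bytes
const accented = '"\u00E9"';       // 3 code units, 4 bytes (é takes 2 bytes)
const emoji = '"\u{1F600}"';       // 4 code units, 6 bytes (U+1F600 takes 4 bytes)

const lengths = [asciiOnly, accented, emoji].map((s) => ({
  codeUnits: s.length,
  bytes: utf8ByteLength(s),
}));
```

The two measures agree only for pure-ASCII output, so a consumer that sizes buffers from .length would under-allocate for the other two cases.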

§4. Behavioural test matrix (8 fixtures)

Mapped from §IMPLEMENTATION SPEC F1–F7 plus an error-model fixture.

| ID | Description | Acceptance |
|---|---|---|
| F1 | Object with two keys, inserted reverse-alpha | Output keys sorted: {"a":2,"b":1} |
| F2 | bigint 13n | Output 13 (no n, no exponent) |
| F3 | String 'hello\nworld' | Output "hello\nworld" (escape sequence, single line) |
| F4 | Nested array of objects with un-sorted keys | Each object internally sorted |
| F5 | Real DSL → parse via P1.2.2 → canonicalize twice → equal | canonicalize(ast)===canonicalize(ast) |
| F6 | Locale forced to Turkish | Output unchanged from default-locale run |
| F7 | Property: 100 random AST shapes; canonicalize twice idempotent | All 100 pass |
| F8 | Error model: undefined, Symbol, function, NaN, Map, Date all throw CanonicalSerializationError | Each throws with informative message |

Plus baseline coverage:

  • null, true, false, empty {}, empty [], deeply-nested mixed
  • Negative bigint
  • Strings containing every escape (backslash, quote, control char)
  • Object key sorting through a randomized insertion order
  • Cycle detection (object holding a ref to itself)
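
The randomized-insertion-order check in the baseline list could look roughly like this. It is a sketch under assumptions: canonicalizeFlat is a stand-in that handles only flat objects with integer values, and makeRng and shuffled are hypothetical test helpers. A seeded generator is used so the check itself is reproducible.

```typescript
// Stand-in canonicalizer for flat objects with integer values only.
function canonicalizeFlat(obj: Record<string, number>): string {
  const members = Object.keys(obj)
    .sort()
    .map((key) => JSON.stringify(key) + ":" + String(obj[key]));
  return "{" + members.join(",") + "}";
}

// Small linear congruential generator returning values in [0, 1).
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 0x100000000;
  };
}

// Fisher-Yates shuffle driven by the generator above.
function shuffled<T>(items: readonly T[], rng: () => number): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = copy[i];
    copy[i] = copy[j];
    copy[j] = tmp;
  }
  return copy;
}

const entries: [string, number][] = [
  ["delta", 4], ["alpha", 1], ["Charlie", 3], ["bravo", 2],
];
const rng = makeRng(42);
const outputs = new Set<string>();
for (let round = 0; round < 100; round++) {
  const obj: Record<string, number> = {};
  for (const [key, val] of shuffled(entries, rng)) obj[key] = val;
  outputs.add(canonicalizeFlat(obj));
}
// Every insertion order yields {"Charlie":3,"alpha":1,"bravo":2,"delta":4}.
```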

§5. Acceptance criteria (CI gate)

  • npm run build — TypeScript compiles canonical.ts cleanly under strict mode.
  • npm run lint — ESLint reports no errors in canonical.ts or its tests.
  • npm test — every fixture above passes; existing 1467-test baseline is preserved (no regression).
  • Coverage on canonical.ts ≥ 95% lines, 90% branches.
  • Determinism corpus self-scan (determinism.test.ts §Group 12) still green — canonical.ts introduces no forbidden tokens.

§6. Non-functional

  • The encoder must be O(n) in the total node count n, apart from key sorting: sorting is the only super-linear step and costs O(k log k) for an object with k keys.
  • No global state. Re-entrant. Safe to call concurrently (Phase 1 is single-threaded but the contract permits future use).
  • No memoization, no caching across calls.

§7. Out of scope

  • Decoding JSON (consumers use JSON.parse)
  • The hash function in P1.5.1 (it takes the canonical output as its input)
  • A streaming variant (Phase 1 corpus easily fits memory)

§8. Open questions

None. The audit closed every spec question. Sign-off: ready to write packet.



Colibri — documentation-first MCP runtime. Apache 2.0 + Commons Clause.
