ADR-003 — BFT Consensus: Build from scratch vs libp2p

Status: PROPOSED
Date: 2026-04-07
Domain: Legitimacy (θ Consensus)

Context

Colibri’s θ Consensus layer requires Byzantine Fault Tolerant (BFT) consensus: nodes must agree on which events are valid even when up to one third of the nodes are malicious or offline (the classic bound: at most f faulty nodes out of n ≥ 3f + 1).

The reference algorithm is a PBFT-inspired implementation documented in docs/reference/extractions/:

  • ~450 lines of domain logic
  • Includes: quorum calculation, equivocation detection, view change protocol, Byzantine leader rotation, slashing conditions
  • No network transport layer — pure consensus state machine
  • Three quorum types: simple majority (>50%), super majority (>66%), unanimous
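
A minimal sketch of the three quorum types listed above, as a hypothetical helper. The names and thresholds here follow this summary (>50%, >66%, unanimous); the reference extraction is authoritative for the exact cut-offs and vote-counting rules.

```js
// quorum.js — hypothetical sketch of the three quorum checks summarized above.
const QUORUM = {
  SIMPLE_MAJORITY: (yes, total) => yes > total * 0.5,
  // Listed as ">66%" in this ADR; classic PBFT requires strictly more than 2/3.
  SUPER_MAJORITY: (yes, total) => yes > total * (2 / 3),
  UNANIMOUS: (yes, total) => yes === total,
};

export function hasQuorum(type, votes, totalNodes) {
  // votes: Map<nodeId, boolean> of collected, non-equivocating votes
  const yes = [...votes.values()].filter(Boolean).length;
  return QUORUM[type](yes, totalNodes);
}

// Example: 7 nodes (tolerates f = 2 Byzantine), 5 yes votes:
// hasQuorum('SUPER_MAJORITY', votes, 7) → true, since 5 > 7 * 2/3 ≈ 4.67
```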

The decision is specifically about the Node.js implementation. The question is whether to build the consensus and gossip layers from scratch following the reference design, or to build on libp2p’s networking and pubsub primitives.

Decision

TBD — requires PM decision.

This is the highest-risk architectural decision in Colibri’s implementation plan (flagged as such in MASTER-TASKS.md: “BFT consensus (phase 3) is most complex; recommend 2-week spike on gossip protocol alone”). The PM should commission a time-boxed spike before choosing.

Options

Option A: Build BFT from scratch

Implement BFT consensus in Node.js following the reference algorithm documentation, building:

  • Core BFT state machine (~450 lines equivalent)
  • Quorum calculation and vote collection
  • Equivocation detection and slashing (see the sketch after this list)
  • View change protocol for Byzantine leader rotation
  • Peer discovery and connection management
  • IHAVE/IWANT gossip protocol
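
For the equivocation-detection item, a hedged sketch of the double-vote check. The class name, the recordVote shape, and the slashing callback are illustrative, not taken from the reference extraction.

```js
// equivocation.js — hypothetical sketch: flag a node that signs two different
// votes for the same (view, sequence) slot.
export class EquivocationDetector {
  constructor(onSlash) {
    this.seen = new Map(); // key "view:seq:voter" → digest of the first vote seen
    this.onSlash = onSlash; // callback into the slashing / experience-token logic
  }

  recordVote({ view, seq, voter, digest }) {
    const key = `${view}:${seq}:${voter}`;
    const prior = this.seen.get(key);
    if (prior === undefined) {
      this.seen.set(key, digest);
      return { equivocation: false };
    }
    if (prior !== digest) {
      // Same voter, same slot, two different values: Byzantine behaviour.
      this.onSlash({ voter, view, seq, digests: [prior, digest] });
      return { equivocation: true };
    }
    return { equivocation: false }; // an exact duplicate of the same vote is harmless
  }
}
```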

Pros:

  • Full control over every part of the consensus logic
  • The reference design is already aligned with Colibri’s specific constraints (subjective finality, experience-token slashing, fork integration)
  • No framework dependency; easier to audit
  • Consensus logic is already designed to integrate with governance (voting mode selection) and fork (fork trigger on quorum failure)

Cons:

  • High implementation risk — BFT protocols are notoriously easy to get subtly wrong
  • Network layer (gossip, peer discovery) must also be built
  • Testing distributed consensus correctly requires multi-node integration tests
  • Estimated 5-6 weeks (per implementation phase estimates) with high variance

Target files (Phase 3 implementation tasks):

  • src/consensus/voting.js — core BFT voting
  • src/consensus/equivocation.js — double-vote detection
  • src/consensus/view-change.js — PBFT view change
  • src/consensus/finality.js — 5-level finality tracking
  • src/gossip/ — IHAVE/IWANT gossip protocol
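
For orientation, a hedged sketch of the pull-based IHAVE/IWANT exchange that src/gossip/ would implement: a node advertises the event IDs it holds, a peer asks for the ones it is missing, and the node then sends the full events. Message and field names are illustrative; the reference extraction defines the actual protocol.

```js
// gossip.js — hypothetical sketch of one IHAVE/IWANT anti-entropy round.
export function makeIHave(store) {
  return { type: 'IHAVE', ids: [...store.keys()] };
}

export function handleIHave(store, ihave) {
  const missing = ihave.ids.filter((id) => !store.has(id));
  return missing.length ? { type: 'IWANT', ids: missing } : null;
}

export function handleIWant(store, iwant) {
  return { type: 'EVENTS', events: iwant.ids.map((id) => store.get(id)) };
}

export function handleEvents(store, msg, validate) {
  for (const event of msg.events) {
    if (validate(event)) store.set(event.id, event); // only admit validated events
  }
}
```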

Option B: Build on libp2p

Use libp2p (the js-libp2p stack) as the networking layer, with @chainsafe/libp2p-gossipsub for message propagation and a custom BFT layer on top.
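
A hedged sketch of what the Option B wiring could look like. Option names and module choices (TCP transport, noise encryption, yamux muxing, the topic name, and the handleVoteBytes/broadcastVote hooks) are assumptions for illustration and have changed across libp2p major versions; verify against the installed release.

```js
// Hypothetical Option B wiring; the custom BFT layer stays in src/consensus/.
import { createLibp2p } from 'libp2p';
import { tcp } from '@libp2p/tcp';
import { noise } from '@chainsafe/libp2p-noise';
import { yamux } from '@chainsafe/libp2p-yamux';
import { gossipsub } from '@chainsafe/libp2p-gossipsub';

const VOTE_TOPIC = 'colibri/consensus/votes'; // topic name is illustrative

const node = await createLibp2p({
  addresses: { listen: ['/ip4/0.0.0.0/tcp/0'] },
  transports: [tcp()],
  connectionEncryption: [noise()],
  streamMuxers: [yamux()],
  services: { pubsub: gossipsub() },
});

// Inbound: hand raw payloads to the custom BFT layer for decoding and validation.
node.services.pubsub.subscribe(VOTE_TOPIC);
node.services.pubsub.addEventListener('message', (evt) => {
  if (evt.detail.topic !== VOTE_TOPIC) return;
  handleVoteBytes(evt.detail.data);
});

// Outbound: broadcast an encoded vote produced by the custom BFT layer.
async function broadcastVote(vote) {
  const bytes = new TextEncoder().encode(JSON.stringify(vote));
  await node.services.pubsub.publish(VOTE_TOPIC, bytes);
}

function handleVoteBytes(bytes) {
  // Hypothetical hook: decode and feed the consensus state machine (not shown).
}
```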

Pros:

  • Battle-tested P2P networking: peer discovery, NAT traversal, multiplexed streams
  • GossipSub (libp2p’s pubsub) is production-grade and replaces the custom IHAVE/IWANT implementation
  • Reduces implementation scope by ~40% (network layer is provided)
  • Strong TypeScript types and active maintenance

Cons:

  • libp2p is a complex framework with its own abstractions (PeerId, Multiaddr, Dialer)
  • libp2p pulls in roughly 15-20 transitive npm dependencies
  • libp2p provides no consensus layer itself; its ecosystem is heavily Ethereum-oriented (CL clients), and adapting the usual GossipSub-based patterns to Colibri’s subjective-finality model requires significant customization
  • The gossip semantics differ: GossipSub combines mesh-based eager push with lazy IHAVE/IWANT gossip keyed by message IDs, which does not map directly onto the reference gossip.py’s simpler IHAVE/IWANT exchange
  • Integration with Colibri’s fork-scoped event logs (ι State Fork) is non-trivial

Option C: Two-phase approach

Phase 3a (spike, 2 weeks): build a minimal BFT state machine (src/consensus/voting.js) without network transport. Use direct function calls between nodes in integration tests. Validate that the consensus logic is correct.
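
A hedged sketch of what the spike’s transport-free test setup could look like, using Node’s built-in test runner. ConsensusNode, vote, receiveVote, and hasQuorum are hypothetical names for the Phase 3a state machine API, not an existing interface.

```js
// test/consensus.spike.test.js — hypothetical multi-node test with no network:
// votes are delivered by direct function calls between in-memory nodes.
import { test } from 'node:test';
import assert from 'node:assert/strict';
import { ConsensusNode } from '../src/consensus/voting.js'; // Phase 3a module (hypothetical API)

test('4 honest nodes reach super-majority quorum on one proposed event', () => {
  const nodes = Array.from({ length: 4 }, (_, i) => new ConsensusNode({ id: `n${i}`, total: 4 }));

  // Each node casts a vote on the same proposal; delivery is a direct call,
  // standing in for the gossip layer that Phase 3b will add.
  const votes = nodes.map((n) => n.vote('event-42'));
  for (const node of nodes) {
    for (const vote of votes) node.receiveVote(vote);
  }

  // With 4/4 votes collected, every quorum type is satisfied.
  for (const node of nodes) {
    assert.equal(node.hasQuorum('event-42', 'SUPER_MAJORITY'), true);
  }
});
```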

Phase 3b (decision point): after the spike, choose between Option A (full port, network layer included) and Option B (libp2p for transport only), keeping the custom BFT logic from Phase 3a in either case.

This is the recommended approach per MASTER-TASKS.md.

Consequences

If Option A (full scratch port):

  • 5-6 weeks total; highest variance
  • Complete control; fully auditable
  • Risk: subtle BFT bugs discovered late

If Option B (libp2p):

  • 3-4 weeks for libp2p integration (network layer); 2-3 weeks for the custom BFT layer on top; ~5-7 weeks total
  • Higher dependency footprint
  • Risk: libp2p abstraction friction with Colibri’s subjective-finality model

If Option C (two-phase):

  • 2-week spike produces validated BFT logic
  • Network transport decision deferred until consensus logic is confirmed correct
  • Allows an informed Option A vs B decision with real code evidence

Alternatives Considered

  • Tendermint / CometBFT: too heavy (ships as a Go binary); not compatible with the Node.js MCP server process model
  • Hyperledger Fabric ordering service: enterprise-oriented, overly complex for Colibri’s scale
  • Raft consensus (not BFT): Raft tolerates crash failures, not Byzantine failures; does not satisfy AX-03 (no absolute authority) because a Raft leader can act arbitrarily

References

  • Reference algorithm in docs/reference/extractions/theta-consensus-extraction.md (~450 lines)
  • θ — Consensus concept — reader-friendly introduction
  • S06 — Consensus — full BFT specification
  • ADR-002 — VRF library (used for leader election inside BFT)
  • Phase 3 implementation tasks in task breakdown
