ADR-003 — BFT Consensus: Build from scratch vs libp2p
Status: PROPOSED
Date: 2026-04-07
Domain: Legitimacy (θ Consensus)
Context
Colibri’s θ Consensus layer requires Byzantine Fault Tolerant (BFT) consensus: nodes must agree on which events are valid even when up to one third of nodes (strictly, f faulty nodes out of n ≥ 3f + 1) are malicious or offline.
The reference algorithm is a PBFT-inspired implementation documented in docs/reference/extractions/:
- ~450 lines of domain logic
- Includes: quorum calculation, equivocation detection, view change protocol, Byzantine leader rotation, slashing conditions
- No network transport layer — pure consensus state machine
- Three quorum types: simple majority (>50%), super majority (>66%), unanimous
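A rough sketch of the three quorum checks listed above (function and constant names are hypothetical, not taken from the reference extraction; "super majority" is read as strictly more than two thirds):

```js
// Hypothetical quorum checks for the three quorum types described above.
// Integer-safe comparisons: >50%, >2/3 (the usual BFT reading of ">66%"), unanimous.
const QUORUM = {
  SIMPLE_MAJORITY: (votes, total) => votes * 2 > total,
  SUPER_MAJORITY: (votes, total) => votes * 3 > total * 2,
  UNANIMOUS: (votes, total) => votes === total,
};

function hasQuorum(type, approvals, totalNodes) {
  const check = QUORUM[type];
  if (!check) throw new Error(`unknown quorum type: ${type}`);
  return check(approvals, totalNodes);
}

// Example: 7 of 10 nodes approved an event.
hasQuorum('SUPER_MAJORITY', 7, 10); // true (21 > 20)
```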
The decision is specifically about the Node.js implementation. The question is whether to build from scratch following the reference design, or build on libp2p’s networking and pubsub primitives.
Decision
TBD — requires PM decision.
This is the highest-risk architectural decision in Colibri’s implementation plan (flagged as such in MASTER-TASKS.md: “BFT consensus (phase 3) is most complex; recommend 2-week spike on gossip protocol alone”). The PM should commission a time-boxed spike before choosing.
Options
Option A: Build BFT from scratch
Implement BFT consensus in Node.js following the reference algorithm documentation, building:
- Core BFT state machine (~450 lines equivalent)
- Quorum calculation and vote collection
- Equivocation detection and slashing (see the sketch after this list)
- View change protocol for Byzantine leader rotation
- Peer discovery and connection management
- IHAVE/IWANT gossip protocol
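As an illustration of the equivocation-detection component, a minimal sketch of double-vote detection (all names and message shapes are hypothetical; the reference extraction defines the actual rules and slashing conditions):

```js
// Hypothetical equivocation detector: a node that votes for two different
// event hashes in the same (view, sequence) slot is flagged for slashing.
class EquivocationDetector {
  constructor() {
    this.seen = new Map(); // key: `${nodeId}:${view}:${seq}` -> eventHash
  }

  // Returns an evidence object if the vote conflicts with an earlier vote
  // from the same node in the same slot, otherwise null.
  record(vote) {
    const key = `${vote.nodeId}:${vote.view}:${vote.seq}`;
    const prior = this.seen.get(key);
    if (prior !== undefined && prior !== vote.eventHash) {
      return {
        type: 'EQUIVOCATION',
        nodeId: vote.nodeId,
        view: vote.view,
        seq: vote.seq,
        conflictingHashes: [prior, vote.eventHash],
      };
    }
    this.seen.set(key, vote.eventHash);
    return null;
  }
}
```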
Pros:
- Full control over every part of the consensus logic
- The reference design is already aligned with Colibri’s specific constraints (subjective finality, experience-token slashing, fork integration)
- No framework dependency; easier to audit
- Consensus logic is already designed to integrate with governance (voting mode selection) and fork (fork trigger on quorum failure)
Cons:
- High implementation risk — BFT protocols are notoriously easy to get subtly wrong
- Network layer (gossip, peer discovery) must also be built
- Testing distributed consensus correctly requires multi-node integration tests
- Estimated 5-6 weeks (per implementation phase estimates) with high variance
Target files (Phase 3 implementation tasks):
- src/consensus/voting.js — core BFT voting
- src/consensus/equivocation.js — double-vote detection
- src/consensus/view-change.js — PBFT view change
- src/consensus/finality.js — 5-level finality tracking
- src/gossip/ — IHAVE/IWANT gossip protocol
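For orientation, a minimal sketch of the IHAVE/IWANT exchange intended for src/gossip/ (message shapes are illustrative assumptions, not the reference wire format):

```js
// Hypothetical IHAVE/IWANT exchange: a node periodically advertises the event
// IDs it holds (IHAVE); a peer replies with the subset it is missing (IWANT),
// and the full events are then sent back.
function buildIHave(store) {
  return { type: 'IHAVE', ids: [...store.keys()] };
}

function buildIWant(ihave, store) {
  const missing = ihave.ids.filter((id) => !store.has(id));
  return missing.length > 0 ? { type: 'IWANT', ids: missing } : null;
}

function respondToIWant(iwant, store) {
  return { type: 'EVENTS', events: iwant.ids.map((id) => store.get(id)) };
}
```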
Option B: Build on libp2p
Use libp2p as the networking layer, with a libp2p-compatible pubsub module for gossip (e.g., @chainsafe/libp2p-gossipsub) and a custom BFT layer built on top.
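A minimal sketch of what Option B’s wiring might look like, assuming a recent js-libp2p release (module names, option keys, and the vote topic are assumptions; they vary between versions and are not yet defined for Colibri):

```js
// Sketch only: option keys (e.g. connectionEncryption) differ across libp2p versions.
import { createLibp2p } from 'libp2p';
import { tcp } from '@libp2p/tcp';
import { noise } from '@chainsafe/libp2p-noise';
import { yamux } from '@chainsafe/libp2p-yamux';
import { gossipsub } from '@chainsafe/libp2p-gossipsub';

const VOTE_TOPIC = 'colibri/consensus/votes/1'; // hypothetical topic name

const node = await createLibp2p({
  addresses: { listen: ['/ip4/0.0.0.0/tcp/0'] },
  transports: [tcp()],
  connectionEncryption: [noise()],
  streamMuxers: [yamux()],
  services: { pubsub: gossipsub() },
});

// The custom BFT layer would sit behind these two calls: incoming votes are
// decoded and fed to the state machine, outgoing votes are published.
node.services.pubsub.subscribe(VOTE_TOPIC);
node.services.pubsub.addEventListener('message', (evt) => {
  const vote = JSON.parse(new TextDecoder().decode(evt.detail.data));
  console.log('received vote', vote); // placeholder for the BFT state machine
});

await node.services.pubsub.publish(
  VOTE_TOPIC,
  new TextEncoder().encode(
    JSON.stringify({ nodeId: node.peerId.toString(), view: 0, seq: 1 })
  )
);
```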
Pros:
- Battle-tested P2P networking: peer discovery, NAT traversal, multiplexed streams
- GossipSub (libp2p’s pubsub) is production-grade and replaces the custom IHAVE/IWANT implementation
- Reduces implementation scope by ~40% (network layer is provided)
- Strong TypeScript types and active maintenance
Cons:
- libp2p is a complex framework with its own abstractions (PeerId, Multiaddr, Dialer)
- Pulls in ~15-20 transitive npm dependencies
- The libp2p consensus ecosystem is oriented toward Ethereum consensus-layer (CL) clients; adapting it to Colibri’s subjective-finality model requires significant customization
- The gossip semantics differ: GossipSub combines mesh-based eager push with IHAVE/IWANT control messages keyed by message ID, rather than the simple IHAVE/IWANT exchange of the reference gossip.py
- Integration with Colibri’s fork-scoped event logs (ι State Fork) is non-trivial
Option C: Two-phase approach
Phase 3a (spike, 2 weeks): build a minimal BFT state machine (src/consensus/voting.js) without network transport. Use direct function calls between nodes in integration tests. Validate that the consensus logic is correct.
Phase 3b (decision point): after the spike, choose between Option A full port (network layer included) or Option B libp2p for transport only, keeping the custom BFT logic from Phase 3a.
This is the recommended approach per MASTER-TASKS.md.
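To make the Phase 3a spike concrete, a sketch of how nodes could exchange votes through direct function calls instead of a network transport (all names are hypothetical; the real interface of src/consensus/voting.js is not yet defined):

```js
// Hypothetical in-process "transport" for the Phase 3a spike: each node's
// consensus state machine receives messages via direct function calls, so
// the BFT logic can be integration-tested without gossip or sockets.
class LocalCluster {
  constructor(nodeFactory, count) {
    this.nodes = Array.from({ length: count }, (_, i) =>
      nodeFactory(i, (msg) => this.broadcast(i, msg))
    );
  }

  // Deliver a message from `senderId` to every other node synchronously.
  broadcast(senderId, msg) {
    for (const [i, node] of this.nodes.entries()) {
      if (i !== senderId) node.receive(senderId, msg);
    }
  }
}

// Usage in an integration test, assuming a ConsensusNode with receive()/propose():
// const cluster = new LocalCluster((id, send) => new ConsensusNode(id, send), 4);
// cluster.nodes[0].propose({ eventHash: '0xabc' });
```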
Consequences
If Option A (full scratch port):
- 5-6 weeks total; highest variance
- Complete control; fully auditable
- Risk: subtle BFT bugs discovered late
If Option B (libp2p):
- 3-4 weeks to integrate the libp2p network layer; 2-3 weeks for the BFT layer on top; ~5-7 weeks total
- Higher dependency footprint
- Risk: libp2p abstraction friction with Colibri’s subjective-finality model
If Option C (two-phase):
- 2-week spike produces validated BFT logic
- Network transport decision deferred until consensus logic is confirmed correct
- Allows an informed Option A vs B decision with real code evidence
Alternatives Considered
- Tendermint / CometBFT: too heavy (Go binary); not compatible with Node.js MCP server process model
- Hyperledger Fabric ordering service: enterprise-oriented, overly complex for Colibri’s scale
- Raft consensus (not BFT): Raft tolerates crash failures, not Byzantine failures; it does not satisfy AX-03 (no absolute authority) because a malicious Raft leader could censor or rewrite the log unchecked
References
- Reference algorithm in docs/reference/extractions/theta-consensus-extraction.md (~450 lines)
- θ — Consensus concept — reader-friendly introduction
- S06 — Consensus — full BFT specification
- ADR-002 — VRF library (used for leader election inside BFT)
- Phase 3 implementation tasks in task breakdown