ADR-003 — BFT Consensus: Build from scratch vs libp2p
Status: PROPOSED
Date: 2026-04-07
Domain: Legitimacy (θ Consensus)
Context
Colibri’s θ Consensus layer requires Byzantine Fault Tolerant (BFT) consensus: nodes must agree on which events are valid even when up to one third of nodes (strictly, f faulty nodes out of n ≥ 3f + 1) are malicious or offline.
The reference algorithm is a PBFT-inspired implementation documented in docs/reference/extractions/:
- ~450 lines of domain logic
- Includes: quorum calculation, equivocation detection, view change protocol, Byzantine leader rotation, slashing conditions
- No network transport layer — pure consensus state machine
- Three quorum types: simple majority (>50%), super majority (>66%), unanimous
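A rough sketch of the three quorum checks listed above (function and constant names are hypothetical, not taken from the reference extraction; "super majority" is read as strictly more than two thirds):

```js
// Hypothetical quorum checks for the three quorum types described above.
// Integer-safe comparisons: >50%, >2/3 (the usual BFT reading of ">66%"), unanimous.
const QUORUM = {
  SIMPLE_MAJORITY: (votes, total) => votes * 2 > total,
  SUPER_MAJORITY: (votes, total) => votes * 3 > total * 2,
  UNANIMOUS: (votes, total) => votes === total,
};

function hasQuorum(type, approvals, totalNodes) {
  const check = QUORUM[type];
  if (!check) throw new Error(`unknown quorum type: ${type}`);
  return check(approvals, totalNodes);
}

// Example: 7 of 10 nodes approved an event.
hasQuorum('SUPER_MAJORITY', 7, 10); // true (21 > 20)
```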
The decision is specifically about the Node.js implementation. The question is whether to build from scratch following the reference design, or build on libp2p’s networking and pubsub primitives.
Decision
TBD — requires PM decision.
This is the highest-risk architectural decision in Colibri’s implementation plan (flagged as such in MASTER-TASKS.md: “BFT consensus (phase 3) is most complex; recommend 2-week spike on gossip protocol alone”). The PM should commission a time-boxed spike before choosing.
Options
Option A: Build BFT from scratch
Implement BFT consensus in Node.js following the reference algorithm documentation, building:
- Core BFT state machine (~450 lines equivalent)
- Quorum calculation and vote collection
- Equivocation detection and slashing (see the sketch after this list)
- View change protocol for Byzantine leader rotation
- Peer discovery and connection management
- IHAVE/IWANT gossip protocol
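As an illustration of the equivocation-detection component, a minimal sketch of double-vote detection (all names and message shapes are hypothetical; the reference extraction defines the actual rules and slashing conditions):

```js
// Hypothetical equivocation detector: a node that votes for two different
// event hashes in the same (view, sequence) slot is flagged for slashing.
class EquivocationDetector {
  constructor() {
    this.seen = new Map(); // key: `${nodeId}:${view}:${seq}` -> eventHash
  }

  // Returns an evidence object if the vote conflicts with an earlier vote
  // from the same node in the same slot, otherwise null.
  record(vote) {
    const key = `${vote.nodeId}:${vote.view}:${vote.seq}`;
    const prior = this.seen.get(key);
    if (prior !== undefined && prior !== vote.eventHash) {
      return {
        type: 'EQUIVOCATION',
        nodeId: vote.nodeId,
        view: vote.view,
        seq: vote.seq,
        conflictingHashes: [prior, vote.eventHash],
      };
    }
    this.seen.set(key, vote.eventHash);
    return null;
  }
}
```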
Pros:
- Full control over every part of the consensus logic
- The reference design is already aligned with Colibri’s specific constraints (subjective finality, experience-token slashing, fork integration)
- No framework dependency; easier to audit
- Consensus logic is already designed to integrate with governance (voting mode selection) and fork (fork trigger on quorum failure)
Cons:
- High implementation risk — BFT protocols are notoriously easy to get subtly wrong
- Network layer (gossip, peer discovery) must also be built
- Testing distributed consensus correctly requires multi-node integration tests
- Estimated 5-6 weeks (per implementation phase estimates) with high variance
Target files (Phase 3 implementation tasks):
- src/consensus/voting.js — core BFT voting
- src/consensus/equivocation.js — double-vote detection
- src/consensus/view-change.js — PBFT view change
- src/consensus/finality.js — 5-level finality tracking
- src/gossip/ — IHAVE/IWANT gossip protocol
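For orientation, a minimal sketch of the IHAVE/IWANT exchange intended for src/gossip/ (message shapes are illustrative assumptions, not the reference wire format):

```js
// Hypothetical IHAVE/IWANT exchange: a node periodically advertises the event
// IDs it holds (IHAVE); a peer replies with the subset it is missing (IWANT),
// and the full events are then sent back.
function buildIHave(store) {
  return { type: 'IHAVE', ids: [...store.keys()] };
}

function buildIWant(ihave, store) {
  const missing = ihave.ids.filter((id) => !store.has(id));
  return missing.length > 0 ? { type: 'IWANT', ids: missing } : null;
}

function respondToIWant(iwant, store) {
  return { type: 'EVENTS', events: iwant.ids.map((id) => store.get(id)) };
}
```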
Option B: Build on libp2p
Use libp2p as the networking layer, with a libp2p-compatible pubsub module for gossip (e.g., @chainsafe/libp2p-gossipsub) and a custom BFT layer built on top.
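A minimal sketch of what Option B’s wiring might look like, assuming a recent js-libp2p release (module names, option keys, and the vote topic are assumptions; they vary between versions and are not yet defined for Colibri):

```js
// Sketch only: option keys (e.g. connectionEncryption) differ across libp2p versions.
import { createLibp2p } from 'libp2p';
import { tcp } from '@libp2p/tcp';
import { noise } from '@chainsafe/libp2p-noise';
import { yamux } from '@chainsafe/libp2p-yamux';
import { gossipsub } from '@chainsafe/libp2p-gossipsub';

const VOTE_TOPIC = 'colibri/consensus/votes/1'; // hypothetical topic name

const node = await createLibp2p({
  addresses: { listen: ['/ip4/0.0.0.0/tcp/0'] },
  transports: [tcp()],
  connectionEncryption: [noise()],
  streamMuxers: [yamux()],
  services: { pubsub: gossipsub() },
});

// The custom BFT layer would sit behind these two calls: incoming votes are
// decoded and fed to the state machine, outgoing votes are published.
node.services.pubsub.subscribe(VOTE_TOPIC);
node.services.pubsub.addEventListener('message', (evt) => {
  const vote = JSON.parse(new TextDecoder().decode(evt.detail.data));
  console.log('received vote', vote); // placeholder for the BFT state machine
});

await node.services.pubsub.publish(
  VOTE_TOPIC,
  new TextEncoder().encode(
    JSON.stringify({ nodeId: node.peerId.toString(), view: 0, seq: 1 })
  )
);
```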
Pros:
- Battle-tested P2P networking: peer discovery, NAT traversal, multiplexed streams
- GossipSub (libp2p’s pubsub) is production-grade and replaces the custom IHAVE/IWANT implementation
- Reduces implementation scope by ~40% (network layer is provided)
- Strong TypeScript types and active maintenance
Cons:
- libp2p is a complex framework with its own abstractions (PeerId, Multiaddr, Dialer)
- Pulls in ~15-20 transitive npm dependencies
- The libp2p consensus ecosystem is oriented toward Ethereum consensus-layer (CL) clients; adapting it to Colibri’s subjective-finality model requires significant customization
- The gossip semantics differ: GossipSub combines mesh-based eager push with IHAVE/IWANT control messages keyed by message ID, rather than the simple IHAVE/IWANT exchange of the reference gossip.py
- Integration with Colibri’s fork-scoped event logs (ι State Fork) is non-trivial
Option C: Two-phase approach
Phase 3a (spike, 2 weeks): build a minimal BFT state machine (src/consensus/voting.js) without network transport. Use direct function calls between nodes in integration tests. Validate that the consensus logic is correct.
Phase 3b (decision point): after the spike, choose between Option A full port (network layer included) or Option B libp2p for transport only, keeping the custom BFT logic from Phase 3a.
This is the recommended approach per MASTER-TASKS.md.
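To make the Phase 3a spike concrete, a sketch of how nodes could exchange votes through direct function calls instead of a network transport (all names are hypothetical; the real interface of src/consensus/voting.js is not yet defined):

```js
// Hypothetical in-process "transport" for the Phase 3a spike: each node's
// consensus state machine receives messages via direct function calls, so
// the BFT logic can be integration-tested without gossip or sockets.
class LocalCluster {
  constructor(nodeFactory, count) {
    this.nodes = Array.from({ length: count }, (_, i) =>
      nodeFactory(i, (msg) => this.broadcast(i, msg))
    );
  }

  // Deliver a message from `senderId` to every other node synchronously.
  broadcast(senderId, msg) {
    for (const [i, node] of this.nodes.entries()) {
      if (i !== senderId) node.receive(senderId, msg);
    }
  }
}

// Usage in an integration test, assuming a ConsensusNode with receive()/propose():
// const cluster = new LocalCluster((id, send) => new ConsensusNode(id, send), 4);
// cluster.nodes[0].propose({ eventHash: '0xabc' });
```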
Consequences
If Option A (full scratch port):
- 5-6 weeks total; highest variance
- Complete control; fully auditable
- Risk: subtle BFT bugs discovered late
If Option B (libp2p):
- 3-4 weeks to integrate the libp2p network layer; 2-3 weeks for the BFT layer on top; ~5-7 weeks total
- Higher dependency footprint
- Risk: libp2p abstraction friction with Colibri’s subjective-finality model
If Option C (two-phase):
- 2-week spike produces validated BFT logic
- Network transport decision deferred until consensus logic is confirmed correct
- Allows an informed Option A vs B decision with real code evidence
Alternatives Considered
- Tendermint / CometBFT: too heavy (Go binary); not compatible with Node.js MCP server process model
- Hyperledger Fabric ordering service: enterprise-oriented, overly complex for Colibri’s scale
- Raft consensus (not BFT): Raft tolerates crash failures, not Byzantine failures; it does not satisfy AX-03 (no absolute authority) because a malicious Raft leader could censor or rewrite the log unchecked
References
- Reference algorithm in docs/reference/extractions/theta-consensus-extraction.md (~450 lines)
- θ — Consensus concept — reader-friendly introduction
- S06 — Consensus — full BFT specification
- ADR-002 — VRF library (used for leader election inside BFT)
- Phase 3 implementation tasks in task breakdown