# VSR

Documentation for (roughly) code in the `src/vsr` directory.
## Glossary

Consensus:
- checkpoint: Ensure that all updates from the past wrap of the WAL are durable in the grid, then advance the replica's recovery point by updating the superblock. After a checkpoint, the checkpointed WAL entries are safe to be overwritten by the next wrap. (Sidenote: in consensus literature this is sometimes called snapshotting. But we use that term to mean something else.)
- header: Identifier for many kinds of messages, including each entry in the VSR log. Passed around instead of the entry when the full entry is not needed (such as view change).
- journal: The in-memory data structure that manages the WAL.
- nack: Short for negative acknowledgement. Used to determine (during a view change) which entries can be truncated from the log. See Protocol Aware Recovery.
- op: Short for op-number. An op is assigned to each request that is submitted by the user before being stored in the log. An op is a monotonically increasing integer identifying each message to be handled by consensus. When messages with the same op in different views conflict, view change picks one version to commit. Each user batch (which may contain many batch entries) corresponds to one op. Each op is identified (once inside the VSR log) by a header.
- superblock: All local state for the replica that cannot be replicated remotely. Loss is protected against by storing `config.superblock_copies` copies of the superblock.
- view: Each view has at most one primary. Views are monotonically increasing integers that are incremented each time a new primary is selected.
Storage:
- zone: The TigerBeetle data file is made up of zones. The superblock is one zone.
- grid: The zone on disk where LSM trees and metadata for them reside.
- WAL: Write-ahead log. It is implemented as two on-disk ring buffers. Entries are only overwritten after they have been checkpointed.
- state sync: The process of syncing checkpointed data (LSM root information, the grid, and the superblock freeset). When a replica lags behind the cluster far enough that their WALs no longer intersect, the lagging replica must state sync to catch up.
## Protocols

### Commands

| `vsr.Header.Command` | Source  | Target       | Protocols                                          |
| -------------------- | ------- | ------------ | -------------------------------------------------- |
| `ping`               | replica | replica      | Ping (Replica-Replica)                             |
| `pong`               | replica | replica      | Ping (Replica-Replica)                             |
| `ping_client`        | client  | replica      | Ping (Replica-Client)                              |
| `pong_client`        | replica | client       | Ping (Replica-Client)                              |
| `request`            | client  | primary      | Normal                                             |
| `prepare`            | replica | backup       | Normal, Repair WAL                                 |
| `prepare_ok`         | replica | primary      | Normal, Repair WAL                                 |
| `reply`              | primary | client       | Normal, Repair Client Replies, Sync Client Replies |
| `commit`             | primary | backup       | Normal                                             |
| `start_view_change`  | replica | all replicas | Start-View-Change                                  |
| `do_view_change`     | replica | all replicas | View-Change                                        |
| `start_view`         | primary | backup       | Request/Start View, State Sync                     |
| `request_start_view` | backup  | primary      | Request/Start View                                 |
| `request_headers`    | replica | replica      | Repair Journal                                     |
| `request_prepare`    | replica | replica      | Repair WAL                                         |
| `request_reply`      | replica | replica      | Repair Client Replies, Sync Client Replies         |
| `headers`            | replica | replica      | Repair Journal                                     |
| `eviction`           | primary | client       | Client                                             |
| `request_blocks`     | replica | replica      | Sync Forest, Repair Grid                           |
| `block`              | replica | replica      | Sync Forest, Repair Grid                           |
### Recovery

Unlike VRR, TigerBeetle does not implement the Recovery protocol (see VRR §4.3). Instead, replicas persist their VSR state to the superblock. This ensures that a recovering replica never backtracks to an older view (from the point of view of the cluster).
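A minimal sketch of the invariant behind this, using assumed names (`VsrState`, `updateViewDurably`) rather than the actual `vsr.SuperBlock` layout: the view is made durable in the superblock and only ever moves forward, so a restarting replica resumes from it instead of from zero.

```zig
const std = @import("std");

// Sketch only: the real superblock is written to disk with multiple copies.
const VsrState = struct { view: u64 = 0 };

const SuperBlock = struct {
    vsr_state: VsrState = .{},

    /// The persisted view may only move forward; it is made durable before it is acted on.
    fn updateViewDurably(self: *SuperBlock, view: u64) void {
        std.debug.assert(view >= self.vsr_state.view);
        self.vsr_state.view = view;
    }
};

test "a recovering replica never backtracks to an older view" {
    var superblock = SuperBlock{};
    superblock.updateViewDurably(3);
    // After a crash and restart, the replica reloads its view from the superblock,
    // so it can never rejoin the cluster in a view older than 3.
    try std.testing.expect(superblock.vsr_state.view == 3);
}
```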
### Protocol: Ping (Replica-Replica)

Replicas send `command=ping`/`command=pong` messages to one another to synchronize clocks.
### Protocol: Ping (Replica-Client)

Clients send `command=ping_client` (and receive `command=pong_client`) messages to (from) replicas to learn the cluster's current view.
### Protocol: Normal

Normal protocol prepares and commits requests (from clients) and sends replies (to clients).

1. The client sends a `command=request` message to the primary. (If the client's view is outdated, the receiver will forward the message on to the actual primary.)
2. The primary converts the `command=request` to a `command=prepare` (assigning it an `op` and `timestamp`).
3. Each replica (in a chain beginning with the primary) performs the following steps concurrently:
    - Write the prepare to the WAL.
    - Forward the prepare to the next replica in the chain.
4. Each replica sends a `command=prepare_ok` message to the primary once it has written the prepare to the WAL.
5. When the primary collects a replication quorum of `prepare_ok`s and it has committed all preceding prepares, it commits the prepare (see the sketch after this list).
6. The primary replies to the client.
7. The backups are informed that the prepare was committed by either:
    - a subsequent prepare, or
    - a periodic `command=commit` heartbeat message.
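A minimal sketch of step 5, using hypothetical names (`PrepareQuorum`, `onPrepareOk`) rather than the actual `replica.zig` types: the primary counts distinct `prepare_ok`s for one prepare and commits it only once a replication quorum is reached and all preceding prepares are committed.

```zig
const std = @import("std");

const PrepareQuorum = struct {
    op: u64,
    quorum_replication: u8,
    /// One flag per replica, so a duplicate prepare_ok is never counted twice.
    prepare_ok_from_replica: [6]bool = [_]bool{false} ** 6,

    /// Returns true once this prepare may be committed.
    fn onPrepareOk(self: *PrepareQuorum, replica: u8, commit_max: u64) bool {
        self.prepare_ok_from_replica[replica] = true;
        var acks: u8 = 0;
        for (self.prepare_ok_from_replica) |acked| {
            if (acked) acks += 1;
        }
        // Commit strictly in op order: all preceding prepares must already be committed.
        return acks >= self.quorum_replication and commit_max + 1 == self.op;
    }
};

test "a prepare commits after a replication quorum of prepare_oks" {
    var quorum = PrepareQuorum{ .op = 10, .quorum_replication = 2 };
    try std.testing.expect(!quorum.onPrepareOk(0, 9)); // The primary's own ack: no quorum yet.
    try std.testing.expect(!quorum.onPrepareOk(0, 9)); // A duplicate changes nothing.
    try std.testing.expect(quorum.onPrepareOk(1, 9)); // A backup's ack completes the quorum.
}
```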
See also:
- VRR §4.1
### Protocol: Start-View-Change

The Start-View-Change (SVC) protocol initiates view-changes with minimal disruption.

Unlike the Start-View-Change described in VRR §4.2, this protocol runs in both `status=normal` and `status=view_change` (not just `status=view_change`).

1. Depending on the replica's status (see the sketch below):
    - `status=normal` & primary: When the replica has not recently received a `prepare_ok` (and it has a prepare in flight), pause broadcasting `command=commit`.
    - `status=normal` & backup: When the replica has not recently received a `command=commit`, broadcast `command=start_view_change` to all replicas (including self).
    - `status=view_change`: If the replica has not completed a view-change recently, send a `command=start_view_change` to all replicas (including self).
2. (Periodically retry sending the SVC.)
3. If the backup receives a `command=commit` or changes views (respectively), stop the `command=start_view_change` retries.
4. If the replica collects a view-change quorum of SVC messages, transition to `status=view_change` for the next view. (That is, increment the replica's view and start sending a DVC.)
This protocol approach enables liveness under asymmetric network partitions. For example, a replica which can send to the cluster but not receive may send SVCs, but if the remainder of the cluster is healthy, they will never achieve a quorum, so the view is stable. When the partition heals, the formerly-isolated replica may rejoin the original view (if it was isolated in `status=normal`) or a new view (if it was isolated in `status=view_change`).
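A minimal sketch of step 1, with assumed names (`onProgressTimeout`, `SvcAction`) standing in for the actual replica timeouts: the action taken when a replica notices a lack of progress depends on its status and role.

```zig
const std = @import("std");

const Status = enum { normal, view_change };
const SvcAction = enum { pause_commit_heartbeat, send_start_view_change, none };

fn onProgressTimeout(status: Status, is_primary: bool, prepare_in_flight: bool) SvcAction {
    if (status == .view_change) {
        // Already view-changing but stuck: keep nudging the cluster with SVCs.
        return .send_start_view_change;
    }
    if (is_primary) {
        // Primary: a prepare is stuck without prepare_oks, so stop advertising liveness.
        return if (prepare_in_flight) .pause_commit_heartbeat else .none;
    }
    // Backup: no commit heartbeat seen recently, so ask the cluster for a view-change.
    return .send_start_view_change;
}

test "SVC actions depend on status and role" {
    try std.testing.expect(onProgressTimeout(.normal, true, true) == .pause_commit_heartbeat);
    try std.testing.expect(onProgressTimeout(.normal, false, false) == .send_start_view_change);
    try std.testing.expect(onProgressTimeout(.view_change, false, false) == .send_start_view_change);
}
```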
See also:
- Raft does not Guarantee Liveness in the face of Network Faults ("PreVote and CheckQuorum")
- "Consensus: Bridging Theory and Practice" §6.2 "Leaders" describes periodically committing a heartbeat to detect stale leaders.
### Protocol: View-Change

A replica sends `command=do_view_change` to all replicas, with the `view` it is attempting to start.

- The primary of the `view` collects a view-change quorum of DVCs.
- The backup of the `view` uses the `do_view_change` to update its current `view` (transitioning to `status=view_change`).
DVCs include headers from prepares which are:
- present: A valid header, corresponding to a valid prepare in the replica's WAL.
- missing: A valid header, corresponding to a prepare that the replica has not prepared/acked.
- corrupt: A valid header, corresponding to a corrupt prepare in the replica's WAL.
- blank: A placeholder (fake) header, corresponding to a header that the replica has never seen.
- fault: A placeholder (fake) header, corresponding to a header that the replica may have prepared/acked.
If the new primary collects a nack quorum of blank headers for a particular possibly-uncommitted op, it truncates the log.
These cases are further distinguished during WAL repair.
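A minimal sketch of the nack-quorum rule, with an assumed `DvcHeader` classification standing in for the real header/nack encoding: an op that is possibly uncommitted may be truncated only when a nack quorum of DVCs report a blank header for it.

```zig
const std = @import("std");

const DvcHeader = enum { present, missing, corrupt, blank, fault };

fn canTruncate(headers_for_op: []const DvcHeader, quorum_nack: u8) bool {
    var nacks: u8 = 0;
    for (headers_for_op) |header| {
        // Only "blank" counts as a nack: that replica has never seen the op at all.
        // "fault" (may have prepared/acked) and "corrupt" must not be nacked (per PAR).
        if (header == .blank) nacks += 1;
    }
    return nacks >= quorum_nack;
}

test "a nack quorum of blanks truncates a possibly-uncommitted op" {
    // One entry per DVC in the quorum; nack quorum of 2 (see the Quorums table below).
    try std.testing.expect(canTruncate(&[_]DvcHeader{ .blank, .blank, .fault }, 2));
    try std.testing.expect(!canTruncate(&[_]DvcHeader{ .blank, .fault, .corrupt }, 2));
}
```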
When the primary collects its DVC quorum:

1. If any DVC in the quorum is ahead of the primary by more than one checkpoint, the new primary "forfeits" (that is, it immediately triggers another view change).
2. If any DVC in the quorum is ahead of the primary by one checkpoint, and any messages in the next checkpoint are possibly committed, the new primary forfeits.
3. The primary installs the headers to its suffix.
4. Then the primary repairs its headers (Protocol: Repair Journal).
5. Then the primary repairs its prepares (Protocol: Repair WAL), and potentially truncates uncommitted ops.
6. Then the primary commits all prepares which are not known to be uncommitted.
7. Then the primary transitions to `status=normal` and broadcasts a `command=start_view`.
### Protocol: Request/Start View

#### `request_start_view`

A backup sends a `command=request_start_view` to the primary of a view when any of the following occur:

- the backup learns about a newer view via a `command=commit` message, or
- the backup learns about a newer view via a `command=prepare` message, or
- the backup discovers `commit_max` exceeds `min(op_head, op_checkpoint_next_trigger)` (during repair),
- the backup can't make progress committing and needs to state sync, or
- a replica recovers to `status=recovering_head`.
#### `start_view`

When a `status=normal` primary receives a `command=request_start_view`, it replies with a `command=start_view`.

A `command=start_view` includes:

- The view's current suffix: the headers of the latest messages in the view.
- The current checkpoint (see State Sync).

Together, the checkpoint and the view headers fully specify the logical and physical state of the view.

Upon receiving a `start_view` for the new view, the backup installs the checkpoint if needed, installs the suffix, transitions to `status=normal`, and begins repair.

A `start_view` contains the following headers (which may overlap):

- The suffix: `pipeline_prepare_queue_max` headers from the head op down.
- The "hooks": the headers of any previous checkpoint triggers within our repairable range. This helps a lagging replica catch up. (There are at most 2.)
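A minimal sketch of how the suffix is selected, using a hypothetical `suffixOps` helper (the real message also carries the hook headers and the checkpoint): `pipeline_prepare_queue_max` ops counting down from the head op, or fewer while the log is still short.

```zig
const std = @import("std");

/// `buffer` must have room for at least `suffix_max` ops.
fn suffixOps(op_head: u64, suffix_max: usize, buffer: []u64) []u64 {
    var count: usize = 0;
    var op = op_head;
    while (count < suffix_max) : (count += 1) {
        buffer[count] = op;
        if (op == 0) {
            count += 1; // The continue expression does not run after `break`.
            break;
        }
        op -= 1;
    }
    return buffer[0..count];
}

test "the start_view suffix counts down from the head op" {
    var buffer: [4]u64 = undefined;
    try std.testing.expectEqualSlices(u64, &[_]u64{ 20, 19, 18, 17 }, suffixOps(20, 4, &buffer));
    try std.testing.expectEqualSlices(u64, &[_]u64{ 1, 0 }, suffixOps(1, 4, &buffer));
}
```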
### Protocol: Repair Journal

`request_headers` and `headers` repair gaps or breaks in a replica's journal headers.

Repaired headers are a prerequisite for repairing prepares.

Because the headers are repaired backwards (from the head) by hash-chaining, it is safe for both backups and transitioning primaries.

Gaps/breaks in a replica's journal headers may occur:

- On a backup, receiving nonconsecutive ops, leaving a gap in its headers.
- On a backup which has not finished repair.
- On a new primary during a view-change, which has not finished repair.
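A minimal sketch of backwards hash-chain verification, with an assumed simplified `Header` (the real `vsr.Header` has many more fields, and the journal is a ring rather than an array indexed by op): everything below the first gap or break, counting down from the head, still needs `request_headers`.

```zig
const std = @import("std");

const Header = struct { op: u64, checksum: u64, parent: u64 };

/// Returns the lowest op still connected to the head by an intact hash chain;
/// all ops below it must be requested via `request_headers`.
fn chainedSuffixStart(headers: []const ?Header) usize {
    var op = headers.len - 1; // Start from the head op.
    while (op > 0) : (op -= 1) {
        const child = headers[op] orelse return op + 1; // Gap: no header for this op.
        const parent = headers[op - 1] orelse return op; // Gap below: the chain ends here.
        if (child.parent != parent.checksum) return op; // Break: the chain does not connect.
    }
    return 0;
}

test "a break in the hash chain bounds the trusted suffix" {
    const headers = [_]?Header{
        .{ .op = 0, .checksum = 10, .parent = 0 },
        .{ .op = 1, .checksum = 11, .parent = 99 }, // Break: the parent checksum should be 10.
        .{ .op = 2, .checksum = 12, .parent = 11 },
        .{ .op = 3, .checksum = 13, .parent = 12 },
    };
    try std.testing.expect(chainedSuffixStart(&headers) == 1);
}
```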
### Protocol: Repair WAL

The replica's journal tracks which prepares the WAL requires, i.e. headers for which either:

- no prepare was ever received, or
- the prepare was received and written, but was since discovered to be corrupt.

During repair, missing/damaged prepares are requested & repaired chronologically, which:

- improves the chances that older entries will be available, i.e. not yet overwritten, and
- enables better pipelining of repair and commit.

In response to a `request_prepare`:

- Reply with a `command=prepare` carrying the requested prepare, if it is available and valid.
- Otherwise do not reply (e.g. when the corresponding slot in the WAL is corrupt).

Per PAR's CTRL Protocol, we do not nack corrupt entries, since they might be the prepare being requested.

See also the State Sync protocol: the extent of WAL that the replica can/should repair depends on the checkpoint.
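A minimal sketch of the responder's side, with assumed types (a linear scan stands in for the journal's slot lookup): a prepare is returned only when it matches the requested op and checksum and is not corrupt; in every other case the replica stays silent rather than nacking.

```zig
const std = @import("std");

const Prepare = struct { op: u64, checksum: u64, corrupt: bool };

fn onRequestPrepare(wal: []const ?Prepare, op: u64, checksum: u64) ?Prepare {
    for (wal) |slot| {
        const prepare = slot orelse continue; // Empty slot: nothing to offer.
        if (prepare.op != op) continue; // A different op (or this op was overwritten).
        if (prepare.checksum != checksum) return null; // A different version of this op.
        if (prepare.corrupt) return null; // Corrupt on disk: stay silent, never nack (CTRL).
        return prepare; // Reply with command=prepare.
    }
    return null; // Never received: no reply.
}

test "request_prepare is only answered with a valid, matching prepare" {
    const wal = [_]?Prepare{
        .{ .op = 4, .checksum = 44, .corrupt = false },
        .{ .op = 5, .checksum = 55, .corrupt = true },
    };
    try std.testing.expect(onRequestPrepare(&wal, 4, 44) != null);
    try std.testing.expect(onRequestPrepare(&wal, 5, 55) == null); // Corrupt: silence.
    try std.testing.expect(onRequestPrepare(&wal, 2, 22) == null); // Not in the WAL.
}
```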
### Protocol: Repair Client Replies

The replica stores the latest reply to each active client.

During repair, corrupt client replies are requested & repaired.

In response to a `request_reply`:

- Respond with the `command=reply` (the requested reply), if available and valid.
- Otherwise do not reply.
### Protocol: Client

1. The client sends `command=request operation=register` to register with the cluster, starting a new request-reply hashchain. (See also: Protocol: Normal.)
2. The client receives `command=reply operation=register` from the cluster. (If the cluster is at the maximum number of clients, it evicts the oldest.)
3. Repeat (see the sketch below):
    - Send `command=request` to the cluster.
    - If the client has been evicted, receive `command=eviction` from the cluster. (The client must re-register before sending more requests.)
    - If the client has not been evicted, receive `command=reply` from the cluster.
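A minimal sketch of the session lifecycle in step 3, using assumed names (`Session`, `onMessage`) rather than the actual client implementation: a normal reply leaves the session registered, while an eviction forces re-registration before any further requests.

```zig
const std = @import("std");

const Session = enum { unregistered, registered, evicted };
const Received = enum { reply_register, reply, eviction };

fn onMessage(session: Session, received: Received) Session {
    return switch (received) {
        .reply_register => .registered, // command=reply operation=register
        .reply => session, // A normal reply leaves the session as it is.
        .eviction => .evicted, // The client must re-register before sending more requests.
    };
}

fn maySendRequest(session: Session) bool {
    return session == .registered;
}

test "an evicted client must re-register before sending requests" {
    var session: Session = .unregistered;
    session = onMessage(session, .reply_register);
    try std.testing.expect(maySendRequest(session));
    session = onMessage(session, .eviction);
    try std.testing.expect(!maySendRequest(session));
}
```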
### Protocol: Repair Grid

Grid repair is triggered when a replica discovers a corrupt (or missing) grid block.

1. The repairing replica sends a `command=request_blocks` to any other replica. The message body contains a list of block `address`/`checksum`s.
2. Upon receiving a `command=request_blocks`, a replica reads its own grid to check for the requested blocks. For each matching block found, it replies with a `command=block` message (the block itself); see the sketch below.
3. Upon receiving a `command=block`, a replica writes the block to its grid, and resolves the reads that were blocked on it.

Note that both sides of grid repair can run while the grid is being opened during replica startup. That is, a replica can help other replicas repair and repair itself simultaneously.
TODO Describe state sync fallback.
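A minimal sketch of step 2, with assumed types (a linear scan stands in for the grid's block lookup): each requested address/checksum pair is answered with a `command=block` only when the local grid holds exactly that block.

```zig
const std = @import("std");

const Block = struct { address: u64, checksum: u64 };

/// Collects the blocks that would each be sent back as a command=block message.
/// `replies` must have room for at least `requests.len` blocks.
fn onRequestBlocks(grid: []const Block, requests: []const Block, replies: []Block) usize {
    var count: usize = 0;
    for (requests) |request| {
        for (grid) |block| {
            if (block.address == request.address and block.checksum == request.checksum) {
                replies[count] = block;
                count += 1;
                break;
            }
        }
        // No match: simply skip this request (the requester will retry elsewhere).
    }
    return count;
}

test "only matching blocks are returned" {
    const grid = [_]Block{ .{ .address = 1, .checksum = 11 }, .{ .address = 2, .checksum = 22 } };
    const requests = [_]Block{ .{ .address = 2, .checksum = 22 }, .{ .address = 3, .checksum = 33 } };
    var replies: [2]Block = undefined;
    try std.testing.expect(onRequestBlocks(&grid, &requests, &replies) == 1);
    try std.testing.expect(replies[0].address == 2);
}
```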
### Protocol: Sync Client Replies

Sync missed client replies using Protocol: Repair Client Replies.

See State Sync for details.
### Protocol: Sync Forest

Sync missed LSM manifest and table blocks using Protocol: Repair Grid.

See State Sync for details.
### Protocol: Reconfiguration

TODO (Unimplemented)
## Quorums
- The replication quorum is the minimum number of replicas required to complete a commit.
- The view-change quorum is the minimum number of replicas required to complete a view-change.
- The nack quorum is the minimum number of unique nacks required to truncate an uncommitted op.
With the default configuration:
| Replica Count      | 1 | 2 | 3 | 4 | 5 | 6 |
| ------------------ | - | - | - | - | - | - |
| Replication Quorum | 1 | 2 | 2 | 2 | 3 | 3 |
| View-Change Quorum | 1 | 2 | 2 | 3 | 3 | 4 |
| Nack Quorum        | 1 | 1 | 2 | 3 | 3 | 4 |
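As a sanity check of the table above, the default quorums satisfy the Flexible Paxos requirement that every replication quorum intersects every view-change quorum:

```zig
const std = @import("std");

test "default quorums intersect (Flexible Paxos)" {
    // Values copied from the table above.
    const replica_count = [_]u8{ 1, 2, 3, 4, 5, 6 };
    const quorum_replication = [_]u8{ 1, 2, 2, 2, 3, 3 };
    const quorum_view_change = [_]u8{ 1, 2, 2, 3, 3, 4 };

    var i: usize = 0;
    while (i < replica_count.len) : (i += 1) {
        // Two quorums drawn from the same replica_count replicas must share a replica.
        try std.testing.expect(quorum_replication[i] + quorum_view_change[i] > replica_count[i]);
    }
}
```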
See also:

- `constants.quorum_replication_max` for configuration.
- Flexible Paxos