State Sync

State sync synchronizes the state of a lagging replica with the healthy cluster.

State sync is used when a lagging replica's log no longer intersects with the cluster's current log, so WAL repair cannot catch the replica up.

(VRR refers to state sync as "state transfer", but we already have transfers elsewhere.)

In the context of state sync, "state" refers to:

  1. the superblock vsr_state.checkpoint
  2. the grid (manifest, free set, and client sessions blocks)
  3. the grid (LSM table data; acquired blocks only)
  4. client replies
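
As a rough sketch (hypothetical names, not the actual superblock layout), these four pieces can be pictured as one aggregate:

    // Illustrative placeholder types only (not TigerBeetle's actual layout).
    struct CheckpointState {
        op: u64,
        checksum: u128,
        parent_checkpoint_id: u128,
    }
    type BlockAddress = u64;
    struct Reply {
        client: u128,
        body: Vec<u8>,
    }

    /// The four pieces of state that state sync reconciles on a lagging replica.
    struct SyncedState {
        // 1. The superblock's vsr_state.checkpoint.
        checkpoint: CheckpointState,
        // 2. Grid blocks holding the manifest, free set, and client sessions.
        grid_metadata_blocks: Vec<BlockAddress>,
        // 3. Grid blocks holding LSM table data (acquired blocks only).
        grid_table_blocks: Vec<BlockAddress>,
        // 4. The latest reply for each active client session.
        client_replies: Vec<Reply>,
    }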

State sync consists of four protocols:

The target of superblock-sync is the latest checkpoint of the healthy cluster. When we catch up to the latest checkpoint (or very close to it), then we can transition back to a healthy state.

Glossary

Replica roles:

  • syncing replica: A replica performing superblock-sync. (Any step within 1-10 of the sync algorithm)
  • healthy replica: A replica not performing superblock-sync — part of the active cluster.
  • divergent replica: A replica with a checkpoint that is not (and can never become) canonical.

Checkpoints:

Algorithm

  1. Sync is needed.
  2. Trigger sync.
  3. Wait for non-grid commit operation to finish.
  4. Wait for grid IO to finish. (See Grid.cancel().)
  5. Wait for a usable sync target to arrive. (Usually we already have one.)
  6. Begin sync-superblock protocol.
  7. Request superblock checkpoint state.
  8. Update the superblock headers with:
    • Bump vsr_state.checkpoint.commit_min/vsr_state.checkpoint.commit_min_checksum to the sync target op/op-checksum.
    • Bump vsr_state.checkpoint.parent_checkpoint_id to the id of the checkpoint immediately before our sync target (i.e. not our own previous checkpoint).
    • Bump replica.commit_min. (If replica.commit_min exceeds replica.op, transition to status=recovering_head).
    • Set vsr_state.sync_op_min to the minimum op which has not been repaired.
    • Set vsr_state.sync_op_max to the maximum op which has not been repaired.
  9. Sync-superblock protocol is done.
  10. Repair the client replies, free set, client sessions, and manifest blocks, as well as the table blocks that were created within the sync_op_{min,max} range.
  11. Update the superblock with:
    • Set vsr_state.sync_op_min = 0
    • Set vsr_state.sync_op_max = 0

If a newer sync target is discovered during steps 5-6 or 9, go to step 4.

If the replica starts up with vsr_state.sync_op_max ≠ 0, go to step 9.
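
To make the shape of steps 8 and 11 concrete, here is a hedged sketch in Rust. The field names, and in particular the exact sync_op_{min,max} bounds, are illustrative assumptions rather than TigerBeetle's implementation:

    /// Hypothetical flattening of the superblock fields that step 8 updates.
    struct VsrState {
        checkpoint_commit_min: u64,           // vsr_state.checkpoint.commit_min
        checkpoint_commit_min_checksum: u128, // vsr_state.checkpoint.commit_min_checksum
        parent_checkpoint_id: u128,           // vsr_state.checkpoint.parent_checkpoint_id
        sync_op_min: u64,
        sync_op_max: u64,
    }

    struct SyncTarget {
        checkpoint_op: u64,
        checkpoint_op_checksum: u128,
        parent_checkpoint_id: u128,
    }

    /// Step 8: jump the superblock to the sync target's checkpoint and record
    /// the op range that still needs repair (cleared again by step 11).
    /// The sync_op_{min,max} bounds below are an assumption for illustration.
    fn sync_superblock(state: &mut VsrState, target: &SyncTarget) {
        assert!(target.checkpoint_op > state.checkpoint_commit_min);

        state.sync_op_min = state.checkpoint_commit_min + 1;
        state.sync_op_max = target.checkpoint_op;

        state.checkpoint_commit_min = target.checkpoint_op;
        state.checkpoint_commit_min_checksum = target.checkpoint_op_checksum;
        state.parent_checkpoint_id = target.parent_checkpoint_id;
    }

    /// Step 11: once replies, free set, client sessions, manifest, and table
    /// blocks in the sync range are repaired, clear the range.
    fn sync_done(state: &mut VsrState) {
        state.sync_op_min = 0;
        state.sync_op_max = 0;
    }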

0: Scenarios

Scenarios requiring state sync:

  1. A replica was down/partitioned/slow for a while and the rest of the cluster moved on. The lagging replica is too far behind to catch up via WAL repair.
  2. A replica was just formatted and is being added to the cluster (i.e. via reconfiguration). The new replica is too far behind to catch up via WAL repair.
  3. (Rare) A replica diverged from the cluster: its checkpoint is not, and can never become, canonical.

Causes of scenario 3:

  • A storage determinism bug.
  • An upgraded replica (e.g. a canary) running a different version of the code from the remainder of the cluster, which unexpectedly changes its history. (The change either has a bug or should have been gated behind a feature flag.)

1: Triggers

State sync is initially triggered by any of the following:

  • The replica receives an SV which indicates that it has lagged so far behind the cluster that its log cannot possibly intersect with the cluster's current log.
  • repair_sync_timeout fires, and:
    • a WAL or grid repair is in progress and,
    • the replica's checkpoint is lagging behind the cluster's (far enough that the repair may never complete).
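
A hedged sketch of the second trigger (the threshold and all names are illustrative, not the actual repair_sync_timeout logic):

    /// Hypothetical check run when repair_sync_timeout fires: abandon WAL/grid
    /// repair in favour of state sync if the local checkpoint lags the
    /// cluster's by more than some threshold.
    fn should_trigger_sync(
        repair_in_progress: bool,
        local_checkpoint_op: u64,
        cluster_checkpoint_op: u64,
        lag_threshold_ops: u64,
    ) -> bool {
        repair_in_progress
            && cluster_checkpoint_op > local_checkpoint_op
            && cluster_checkpoint_op - local_checkpoint_op > lag_threshold_ops
    }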

6: Request Superblock Checkpoint State

The syncing replica sends command=request_sync_checkpoint messages (with the sync target identifier attached to each) until it receives a command=sync_checkpoint with a matching checkpoint identifier.
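
In outline, this is a retry loop keyed by the target checkpoint identifier. The sketch below uses hypothetical message and handler names:

    /// Hypothetical sketch: keep requesting the target checkpoint until a
    /// matching sync_checkpoint response arrives.
    struct SyncClient {
        target_checkpoint_id: u128,
    }

    enum Message {
        RequestSyncCheckpoint { checkpoint_id: u128 },
        SyncCheckpoint { checkpoint_id: u128, checkpoint_state: Vec<u8> },
    }

    impl SyncClient {
        /// Called on each request timeout tick: re-send the request.
        fn on_request_timeout(&self) -> Message {
            Message::RequestSyncCheckpoint { checkpoint_id: self.target_checkpoint_id }
        }

        /// Accept the checkpoint state only if the reply matches our target;
        /// replies for other checkpoints are ignored.
        fn on_sync_checkpoint(&self, message: Message) -> Option<Vec<u8>> {
            match message {
                Message::SyncCheckpoint { checkpoint_id, checkpoint_state }
                    if checkpoint_id == self.target_checkpoint_id =>
                {
                    Some(checkpoint_state)
                }
                _ => None,
            }
        }
    }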

Concepts

Syncing Replica

Syncing replicas may:

Syncing replicas must not:

  • ack
  • commit prepares
  • be a primary

Syncing replicas write prepares to their WAL.

When the replica completes superblock-sync, an up-to-date WAL and journal allow it to quickly catch up (i.e. commit) to the current cluster state.

Syncing replicas don't ack prepares.

If syncing replicas did ack prepares:

Consider a cluster of 3 replicas:

  • the primary,
  • a normal backup, and
  • a syncing backup.
  1. Primary prepares many ops...
  2. Syncing backup prepares and acknowledges all of those messages.
  3. Normal backup is partitioned, so it is not seeing any of these prepares.
  4. Primary is receiving prepare_oks from the syncing backup, so it is committing.
  5. Primary eventually checkpoints.
  6. (This cycle repeats — primary keeps preparing/committing, syncing backup keeps preparing, and normal backup is still partitioned.)

But now primary is so far ahead that the normal backup needs to sync! Having 2/3 replicas syncing means that a single grid-block corruption on the primary could make the cluster permanently unavailable.
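
Putting the two rules together, a syncing backup's prepare path can be pictured as below (illustrative names only, not the replica's actual handler):

    /// Illustrative only: a backup's prepare path while syncing. Syncing
    /// replicas keep writing prepares to the WAL, but never send prepare_ok.
    struct Prepare {
        op: u64,
    }

    enum Ack {
        PrepareOk { op: u64 },
    }

    struct Backup {
        syncing: bool,
        wal: Vec<Prepare>,
    }

    impl Backup {
        fn on_prepare(&mut self, prepare: Prepare) -> Option<Ack> {
            let op = prepare.op;
            // Keep the WAL current so the replica can commit quickly once synced.
            self.wal.push(prepare);
            // Do not ack while syncing: a prepare_ok from a syncing replica
            // would let the primary commit and checkpoint on the strength of a
            // replica that cannot yet vouch for the history below its sync target.
            if self.syncing {
                None
            } else {
                Some(Ack::PrepareOk { op })
            }
        }
    }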

Checkpoint Identifier

A checkpoint id is a hash of the superblock CheckpointState.
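
Conceptually (the hash function and field set below are placeholders, not the actual CheckpointState or checksum):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    /// Illustrative stand-in for the superblock's CheckpointState.
    #[derive(Hash)]
    struct CheckpointState {
        commit_min: u64,
        commit_min_checksum: u64,
        parent_checkpoint_id: u64,
        // ... manifest / free set / client sessions references, etc.
    }

    /// A checkpoint id hashes the whole CheckpointState, so two replicas agree
    /// on the id only if they agree on every field of the checkpoint.
    fn checkpoint_id(state: &CheckpointState) -> u64 {
        let mut hasher = DefaultHasher::new();
        state.hash(&mut hasher);
        hasher.finish()
    }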

A checkpoint identifier is attached to the following message types:

  • command=commit: Current checkpoint identifier of sender.
  • command=ping: Current checkpoint identifier of sender.
  • command=prepare: The attached checkpoint id is the checkpoint id during which the corresponding prepare was originally prepared.
  • command=prepare_ok: The attached checkpoint id is the checkpoint id during which the corresponding prepare was originally prepared.
  • command=request_sync_checkpoint: Requested checkpoint identifier.
  • command=sync_checkpoint: Current checkpoint identifier of sender.

Sync Target

A sync target is the checkpoint identifier of the checkpoint that the superblock-sync is syncing towards.

Not all checkpoint identifiers are valid sync targets.

Every sync target must:

  • have an op greater than the syncing replica's current checkpoint op.
  • either:
    • be committed atop – i.e. the syncing replica can sync to healthy_replica.checkpoint_op when trigger_for_checkpoint(healthy_replica.checkpoint_op) < healthy_replica.commit_min – ensuring that the checkpoint has been reached by a quorum of replicas, or
    • be more than 1 checkpoint ahead of our current checkpoint.
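
These conditions translate into a small predicate. In the sketch below, trigger_for_checkpoint and the checkpoint spacing are stand-ins for the real checkpoint schedule:

    const OPS_PER_CHECKPOINT: u64 = 1 << 10; // assumption for illustration only

    /// Stand-in for the checkpoint schedule: the op at which the checkpoint
    /// after `checkpoint_op` is triggered.
    fn trigger_for_checkpoint(checkpoint_op: u64) -> u64 {
        checkpoint_op + OPS_PER_CHECKPOINT
    }

    fn valid_sync_target(
        syncing_checkpoint_op: u64,
        candidate_checkpoint_op: u64,
        candidate_commit_min: u64,
    ) -> bool {
        // Must be ahead of the syncing replica's current checkpoint...
        candidate_checkpoint_op > syncing_checkpoint_op
            // ...and either "committed atop" (a quorum has moved past it)...
            && (trigger_for_checkpoint(candidate_checkpoint_op) < candidate_commit_min
                // ...or more than one checkpoint ahead of us.
                || candidate_checkpoint_op >= syncing_checkpoint_op + 2 * OPS_PER_CHECKPOINT)
    }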

Storage Determinism

When everything works, storage is deterministic. If non-determinism is detected (via checkpoint id mismatches) the replica which detects the mismatch will panic. This scenario should prompt operator investigation and manual intervention.
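
For instance, a hedged sketch of such a mismatch check (illustrative, not the replica's actual assertion):

    /// Illustrative only: if two replicas disagree on the id of the same
    /// checkpoint, storage was not deterministic, and continuing would let
    /// them silently diverge. Panic so an operator investigates.
    fn check_checkpoint_id(
        checkpoint_op: u64,
        local_checkpoint_id: u128,
        peer_checkpoint_id: u128,
    ) {
        if local_checkpoint_id != peer_checkpoint_id {
            panic!(
                "storage determinism violated at checkpoint op {}: local id {:x} != peer id {:x}",
                checkpoint_op, local_checkpoint_id, peer_checkpoint_id,
            );
        }
    }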