
Monitoring

TigerBeetle supports emitting metrics via StatsD, and uses the DogStatsD format for tags.

This requires a StatsD-compatible agent running locally. The Datadog Agent works out of the box with its default configuration, as does Telegraf’s statsd plugin with datadog_extensions enabled.

You can enable emitting metrics by adding the following CLI flags to each replica, depending on your deployment method:

--experimental --statsd=127.0.0.1:8125

The --statsd argument must be specified as an IP:Port address (IPv4 or IPv6). DNS names are not currently supported.

All TigerBeetle metrics are namespaced under tb. and are tagged with cluster (the cluster ID specified at format time) and replica (the replica index). Specific metrics might have additional tags; run tigerbeetle inspect metrics to see the full list of metrics and their cardinality.
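To check what an agent would receive, it can help to decode a DogStatsD line by hand. The sketch below parses the general DogStatsD layout (name:value|type, optionally followed by |@rate and |#tag:value,...). The sample line is hypothetical: the metric type and tag values are assumptions for illustration, not captured TigerBeetle output.

```python
def parse_dogstatsd(line: str) -> dict:
    """Parse one DogStatsD line: name:value|type[|@rate][|#tag:val,...]."""
    head, *sections = line.strip().split("|")
    name, _, value = head.rpartition(":")
    if not name or not sections:
        raise ValueError(f"not a DogStatsD metric line: {line!r}")
    metric = {
        "name": name,
        "value": float(value),
        "type": sections[0],
        "sample_rate": 1.0,
        "tags": {},
    }
    for section in sections[1:]:
        if section.startswith("@"):
            # Optional sample rate, e.g. "@0.5".
            metric["sample_rate"] = float(section[1:])
        elif section.startswith("#"):
            # Comma-separated tags, each "key:value".
            for tag in section[1:].split(","):
                key, _, tag_value = tag.partition(":")
                metric["tags"][key] = tag_value
    return metric


# Hypothetical sample line, assumed for illustration only.
sample = "tb.replica_status:0|g|#cluster:0,replica:2"
parsed = parse_dogstatsd(sample)
print(parsed["name"], parsed["value"], parsed["tags"])
```

The cluster and replica tags end up in the tags dictionary, which is what lets a dashboard or alert distinguish one replica from another.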

Specific Metrics

Overall status

The replica_status metric corresponds to the overall status of the replica. Any value other than 0 indicates a non-normal status and should be alerted on. The possible values are:

| Value | Status          | Explanation |
|-------|-----------------|-------------|
| 0     | normal          | The replica is functioning normally. |
| 1     | view_change     | The replica is doing a view change. |
| 2     | recovering      | The replica is recovering. Usually, this will be present on startup before immediately transitioning to normal. |
| 3     | recovering_head | The replica’s persistent state is corrupted, and it can’t participate in consensus. It will try to recover from the remainder of the cluster. |
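The status values above can be turned into a small lookup so that alert messages carry a readable name. This is a minimal sketch: the function name is hypothetical, and the "alert on anything other than 0" rule simply mirrors the guidance in this section.

```python
# Status values as listed in this section.
REPLICA_STATUS = {
    0: "normal",
    1: "view_change",
    2: "recovering",
    3: "recovering_head",
}


def describe_replica_status(value: int) -> tuple:
    """Return (status name, should_alert) for a replica_status value."""
    name = REPLICA_STATUS.get(value, f"unknown({value})")
    # Any non-zero value indicates a non-normal status.
    return name, value != 0


for value in (0, 1, 3, 7):
    name, alert = describe_replica_status(value)
    print(value, name, "ALERT" if alert else "ok")
```

Because recovering is usually brief at startup, an alerting rule may want to require the non-zero value to persist for some time before firing; that duration is a deployment choice and is not specified here.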

State sync status

The replica_sync_stage metric corresponds to the state sync stage. Any value other than 0 means the replica is undergoing state sync, which should be alerted on.

Operations timing

The replica_request timing metric shows how long requests are taking. It is tagged with the operation type (e.g., create_accounts) and is the closest measure of how long a request takes end to end, from the replica’s point of view.

It’s also recommended to add metrics around your TigerBeetle client code to measure the full request latency, including factors such as network delay, which aren’t captured here.
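One way to capture client-side latency is to wrap each client call in a timer and emit the result as a DogStatsD timing. The sketch below is an assumption-laden illustration: the metric name app.tigerbeetle.request, the helper names, and the time.sleep stand-in for a real client call are all hypothetical and not part of TigerBeetle.

```python
import socket
import time
from contextlib import contextmanager


def format_timing(name: str, millis: float, tags: dict) -> str:
    """Format a DogStatsD timing line: name:value|ms[|#tag:val,...]."""
    line = f"{name}:{millis:.3f}|ms"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return line


@contextmanager
def timed_request(operation: str, emit, clock=time.perf_counter):
    """Time the enclosed block and pass one timing line to emit()."""
    start = clock()
    try:
        yield
    finally:
        millis = (clock() - start) * 1000.0
        # Hypothetical metric name, chosen for this example.
        emit(format_timing("app.tigerbeetle.request", millis, {"operation": operation}))


def udp_emitter(host: str = "127.0.0.1", port: int = 8125):
    """Return an emit() function that sends lines to a local StatsD agent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    return lambda line: sock.sendto(line.encode("ascii"), (host, port))


# Collect lines in a list here; pass emit=udp_emitter() to send to an agent.
lines = []
with timed_request("create_accounts", emit=lines.append):
    time.sleep(0.01)  # stand-in for the real client call
print(lines[0])
```

Comparing this client-side timing with replica_request for the same operation gives a rough view of how much latency is added outside the replica.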

Cache monitoring and sizing

The grid_cache_hits and grid_cache_misses metrics can help determine whether your grid cache (--cache-grid) is too small for your workload.
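A common way to read these two metrics together is as a hit ratio. The sketch below assumes you have already obtained hits and misses for the same time window from your metrics backend; the function name and the sample numbers are hypothetical, and no particular ratio is documented as a target.

```python
from typing import Optional


def cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Return hits / (hits + misses), or None if there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Hypothetical sample values for one time window.
ratio = cache_hit_ratio(hits=9500, misses=500)
print(f"grid cache hit ratio: {ratio:.2%}")
```

A ratio that stays low or trends downward under a steady workload suggests trying a larger --cache-grid value and comparing the results.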

System Monitoring

In addition to TigerBeetle’s own metrics, it’s recommended to monitor and alert on a few additional system-level metrics. These are:

  • Disk space used, on the path that has the TigerBeetle data file.
  • NTP clock sync status.
  • Memory utilization: once started, TigerBeetle uses a fixed amount of memory that does not change. A change in memory utilization can indicate a problem with other processes on the server.
  • CPU utilization: TigerBeetle will use at most a single core at present. CPU utilization exceeding a single core can indicate a problem with other processes on the server.
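As an illustration of the first check in the list above, disk usage on the data file's filesystem can be read with the Python standard library. This is a minimal sketch: the 0.8 threshold and the "." path are placeholder assumptions, and in practice a monitoring agent would usually collect this for you.

```python
import shutil


def disk_used_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(path: str, threshold: float = 0.8) -> bool:
    """Return True when usage exceeds the (hypothetical) threshold."""
    return disk_used_fraction(path) > threshold


# "." is a placeholder; point this at the directory holding the data file.
fraction = disk_used_fraction(".")
print(f"disk used: {fraction:.1%}, alert: {disk_alert('.')}")
```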

Specific alerting thresholds are hard to define for the following metrics, but they are useful to monitor when diagnosing problems:

  • Network bandwidth utilization.
  • Disk bandwidth utilization.