Sui Validator Alert Reference

When you run a Sui validator or full node, monitor the health of your nodes and set up alerts for problems. In addition to alerting on crashes and other common issues, consider configuring alerts based on the following example rules.

The following sections list the alert queries. Treat the details, such as thresholds and trigger durations, as starting points and customize them to suit your infrastructure.

High-priority health alerts

These alerts should receive the most immediate attention from you or your team.

Crash loop

Key       Value
Name      Crash loop
Summary   Node is crash looping
Duration  Recommended to trigger after 15m

max without(version) (uptime) < 60 or absent(uptime)

The node is not staying up for longer than 60 seconds, or it is not reporting the uptime metric at all. Possible reasons:

  • Binary version too old.
  • Incorrect configuration.
  • Software bug.

If you cannot resolve the problem on your own, notify the Sui community on Discord.
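As a rough sketch, the table and query in this section could be written as a Prometheus alerting rule, assuming the node metrics are scraped by Prometheus. The group name, alert name, and severity label are illustrative assumptions and not part of the reference. Name maps to the alert name, Summary to an annotation, and Duration to the "for" field.

```yaml
# Hypothetical rule file; group name, alert name, and severity label
# are assumptions chosen for illustration.
groups:
  - name: sui-high-priority
    rules:
      - alert: SuiCrashLoop
        # Query copied from this section.
        expr: max without(version) (uptime) < 60 or absent(uptime)
        # The condition must hold for 15 minutes before the alert fires.
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Node is crash looping
```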

Consensus proposals failure

Key       Value
Name      Consensus proposals failure
Summary   Consensus block proposal rate is low
Duration  Recommended to trigger after 1h

sum without(force) (rate(consensus_proposed_blocks[5m])) < 1.0

Validators with a slow consensus proposal rate can hurt network latency and throughput. A slow rate is usually caused by network, disk, or CPU performance issues.
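One possible way to express this alert as a Prometheus rule is sketched below; the group name, alert name, and severity label are assumptions. The 5m window inside the query only smooths the rate, while the separate "for" value requires the rate to stay below the threshold at every evaluation for a full hour before the alert fires.

```yaml
# Hypothetical rule file; names and labels are illustrative assumptions.
groups:
  - name: sui-high-priority
    rules:
      - alert: SuiConsensusProposalsFailure
        # Proposal rate averaged over 5 minutes; query copied from this section.
        expr: sum without(force) (rate(consensus_proposed_blocks[5m])) < 1.0
        # A short dip does not fire; the low rate must persist for 1 hour.
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: Consensus block proposal rate is low
```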

Checkpoint execution rate is low

Key       Value
Name      Checkpoint execution rate is low
Summary   Validator is not executing checkpoints quickly enough
Duration  Recommended to trigger after 1h

rate(last_executed_checkpoint[5m]) < 1.0

Validators and full nodes with slow checkpoint execution do not have up-to-date information from the network. Slow execution is usually caused by network, disk, or CPU performance issues.
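A minimal sketch of this alert as a Prometheus rule follows, again with an assumed group name, alert name, and severity label. The description annotation is an optional addition that uses Prometheus alert templating to report the measured rate; it assumes the scraped series carry the usual instance label.

```yaml
# Hypothetical rule file; names, labels, and the description text are
# illustrative assumptions.
groups:
  - name: sui-high-priority
    rules:
      - alert: SuiCheckpointExecutionRateLow
        # Query copied from this section.
        expr: rate(last_executed_checkpoint[5m]) < 1.0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: Validator is not executing checkpoints quickly enough
          # $value is the per-second rate returned by the expression.
          description: "Checkpoint execution rate is {{ $value }} per second on {{ $labels.instance }}"
```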

Safe mode during reconfiguration

Key       Value
Name      Safe mode during reconfiguration
Summary   Validator or full node failed to advance the epoch and entered safe mode
Duration  Recommended to trigger after 15m

is_safe_mode > 0.5

This issue is usually outside the control of validator or full node operators. Notify the Sui community on Discord when you observe it.
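The safe mode alert could be written as the following Prometheus rule, sketched under the assumption that is_safe_mode is a flag-style metric reporting 0 or 1 (which the 0.5 threshold suggests). The group name, alert name, and severity label are illustrative.

```yaml
# Hypothetical rule file; names and labels are illustrative assumptions.
groups:
  - name: sui-high-priority
    rules:
      - alert: SuiSafeMode
        # Query copied from this section; fires while the flag reads above 0.5.
        expr: is_safe_mode > 0.5
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Validator or full node failed to advance the epoch and entered safe mode
```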

Non-urgent and warning alerts

All alerts are important, but the following alerts and warnings can be addressed within the normal node maintenance workflow.

System invariant violations

Key       Value
Name      System invariant violations
Summary   The node reports an invariant violation
Duration  Recommended to trigger after 15m

system_invariant_violations > 0

This issue is usually outside the control of validator or full node operators. Notify the Sui community on Discord when you observe it.
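Because this alert is non-urgent, one option is to give it a lower severity label so that an alert router such as Alertmanager can send it to a non-paging channel. The sketch below assumes Prometheus alerting rules; the group name, alert name, and the warning severity value are assumptions, and routing on that label has to be configured separately.

```yaml
# Hypothetical rule file; names and the severity value are illustrative
# assumptions.
groups:
  - name: sui-warnings
    rules:
      - alert: SuiSystemInvariantViolations
        # Query copied from this section.
        expr: system_invariant_violations > 0
        for: 15m
        labels:
          # Lower severity than the high-priority alerts.
          severity: warning
        annotations:
          summary: The node reports an invariant violation
```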