Monitoring & alerts

What the background monitor checks, how alerts reach you in Telegram, and per-user preferences for mutes and quiet hours.

What runs every 5 minutes

The background monitor loops over every server in the database and runs four checks:

Gateway health: runs openclaw health --json and fires a gateway_down alert if the command exits non-zero.
Session tokens: parses the output of openclaw sessions --all-agents --json and fires token_overflow if any session is above 80%.
Disk: runs df / --output=pcent and fires disk_full (warning) if root usage is above 90%.
API provider errors: greps the gateway's journalctl output for recent "All models failed" messages and fires api_errors if any are found.
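
As an illustration of the disk check, here is a minimal sketch of how the output of df / --output=pcent could be turned into an alert decision. The function names are hypothetical and this is not the monitor's actual code. It assumes GNU df, which prints a "Use%" header followed by a value such as " 42%".

```python
DISK_THRESHOLD_PCT = 90  # from the list above: disk_full fires above 90%


def parse_df_pcent(output: str) -> int:
    """Extract the usage percentage from `df / --output=pcent` output.

    Assumes GNU df output: a header line ("Use%") and one value line (" 42%").
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError(f"unexpected df output: {output!r}")
    return int(lines[-1].rstrip("%"))


def disk_alert(output: str, threshold: int = DISK_THRESHOLD_PCT):
    """Return the alert type to fire, or None if usage is at or below the threshold."""
    return "disk_full" if parse_df_pcent(output) > threshold else None
```

In this sketch the comparison is strict, so a disk at exactly 90% does not trigger the alert, which is one reading of "above 90%".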

Consecutive failure threshold

If the monitor itself can't reach a server (SSH error, timeout, etc.), it waits for 3 consecutive failures before firing a monitor_error alert. This prevents flapping on a server that's restarting or on a jittery network.
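
The threshold logic can be sketched as a per-server streak counter. The class and method names below are hypothetical. The sketch also assumes the alert fires once, on the cycle where the streak reaches 3, rather than on every failed cycle after that; the source does not say which.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 3  # from the text: 3 consecutive failures


class FailureTracker:
    """Counts consecutive unreachable cycles per server (illustrative only)."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD):
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, server_id: str, reachable: bool) -> bool:
        """Record one monitor cycle; return True if monitor_error should fire now."""
        if reachable:
            # Any successful cycle resets the streak.
            self._streaks[server_id] = 0
            return False
        self._streaks[server_id] += 1
        # Assumption: fire exactly once, when the streak hits the threshold.
        return self._streaks[server_id] == self.threshold
```

A server that fails twice and then recovers never triggers an alert, which is the flapping case the threshold is meant to absorb.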

Watchdog

A separate timer runs every minute and checks when the monitor last completed a full cycle. If it's been more than 15 minutes, it bypasses the normal alert pipeline and sends a direct Telegram message to all super-admins: "EXMER monitor is STALE — last completed N min ago". This catches hangs that the monitor can't notice itself.
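
A minimal sketch of the staleness test the watchdog performs each minute. The function name is hypothetical, and how the last-completed timestamp is stored and how the Telegram message is sent are left out because the source does not describe them.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=15)  # from the text: more than 15 minutes


def stale_minutes(last_completed: datetime, now: datetime) -> Optional[int]:
    """Return the age of the last full cycle in whole minutes if it is stale, else None."""
    age = now - last_completed
    if age <= STALE_AFTER:
        return None
    return int(age.total_seconds() // 60)


# Usage: the returned value would fill the "N" in the watchdog message.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(stale_minutes(now - timedelta(minutes=20), now))
```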

Who gets alerts

For each fired alert, the recipient list is built from:

The server's owner
Admin members (not viewers) from server_access
All super-admins from ADMIN_USER_IDS

Each recipient then passes through their personal notification preferences.
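
The recipient rules can be sketched as a set union. The function name, the (user_id, role) row shape, and the "admin" / "viewer" role strings are assumptions made for illustration; the real server_access schema is not shown in this page. Using a set also assumes that a user who matches more than one rule is notified only once.

```python
from typing import Iterable, Set, Tuple


def build_recipients(
    owner_id: int,
    access_rows: Iterable[Tuple[int, str]],  # hypothetical (user_id, role) rows
    super_admin_ids: Iterable[int],          # e.g. parsed from ADMIN_USER_IDS
) -> Set[int]:
    """Collect the owner, admin members, and super-admins, without duplicates."""
    recipients = {owner_id}
    # Viewers are skipped; only admin members receive alerts.
    recipients.update(uid for uid, role in access_rows if role == "admin")
    recipients.update(super_admin_ids)
    return recipients
```

Each ID in the resulting set would then be filtered by that user's notification preferences before anything is sent.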

Notification preferences

Every authenticated user can mute individual alert types and set a quiet-hours window. By default, every alert type is on and no quiet hours are set. Preferences are updated with PUT /api/me/notification-prefs. Quiet hours may wrap past midnight (e.g. 22:00–07:00 for an overnight window).
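
Wraparound windows are the part of quiet hours that is easy to get wrong, so here is a minimal sketch of the check. The function name is hypothetical, and two details are assumptions: the window includes its start and excludes its end, and equal start and end times mean no quiet hours.

```python
from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the quiet-hours window [start, end)."""
    if start == end:
        # Assumption: an empty window means quiet hours are off.
        return False
    if start < end:
        # Same-day window, e.g. 12:00 to 14:00.
        return start <= now < end
    # Wraparound window, e.g. 22:00 to 07:00: late evening OR early morning.
    return now >= start or now < end
```

With a 22:00 to 07:00 window, 23:30 and 06:59 are both inside the window, while 07:00 and 12:00 are outside it.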

Alert type	Fires when	Severity
`gateway_down`	openclaw health returns non-zero	critical
`connection_failed`	SSH connection to server fails	critical
`token_overflow`	Session tokens above 80%	warning
`disk_full`	Root disk above 90%	warning
`api_errors`	Provider errors in gateway logs	warning / critical
`monitor_error`	3 consecutive monitor failures	critical

Alert storage

Every fired alert is also stored in the alerts table with read/unread status. The Alerts page in the Mini App shows the full history.