Monitoring & alerts
What the background monitor checks, how alerts reach you in Telegram, and per-user preferences for mutes and quiet hours.
What runs every 5 minutes
The background monitor loops over every server in the database and runs four checks:
- Gateway health: openclaw health --json. If it exits non-zero, fire a gateway_down alert.
- Session tokens: parses openclaw sessions --all-agents --json and fires token_overflow if any session is above 80%.
- Disk: df / --output=pcent. Fires disk_full (warning) if root is above 90%.
- API provider errors: greps the gateway journalctl output for recent "All models failed" messages and fires api_errors if any show up.
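
The decision logic behind the four checks can be sketched as a pure function over results that have already been collected. This is a minimal illustration, not the monitor's actual code: the names parse_df_pcent and evaluate_checks are hypothetical, and the highest session usage is passed in as a precomputed percentage because the JSON layout of openclaw sessions is not described here.

```python
import re

TOKEN_THRESHOLD_PCT = 80  # from the text: a session above 80% fires token_overflow
DISK_THRESHOLD_PCT = 90   # from the text: root above 90% fires disk_full


def parse_df_pcent(df_output: str) -> int:
    """Pull the usage percentage out of `df / --output=pcent` output.

    GNU df prints a "Use%" header line followed by a value such as " 42%".
    """
    match = re.search(r"(\d+)%", df_output)
    if match is None:
        raise ValueError(f"no percentage found in df output: {df_output!r}")
    return int(match.group(1))


def evaluate_checks(health_exit_code: int,
                    max_session_pct: float,
                    df_output: str,
                    journal_text: str) -> list[str]:
    """Return the alert types that should fire for one server (hypothetical helper)."""
    alerts = []
    if health_exit_code != 0:
        alerts.append("gateway_down")
    if max_session_pct > TOKEN_THRESHOLD_PCT:
        alerts.append("token_overflow")
    if parse_df_pcent(df_output) > DISK_THRESHOLD_PCT:
        alerts.append("disk_full")
    if "All models failed" in journal_text:
        alerts.append("api_errors")
    return alerts
```

Keeping the thresholds in one place and the function free of I/O makes the rules easy to unit-test separately from SSH and command execution.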
Consecutive failure threshold
If the monitor itself can't reach a server (SSH error, timeout, etc.), it waits for 3 consecutive failures before firing a monitor_error alert. This prevents flapping on a server that's restarting or on a jittery network.
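
A per-server streak counter is one way to implement this threshold. The sketch below is illustrative only: the FailureTracker class is hypothetical, and firing exactly once when the streak reaches 3 (rather than on every later failure) is an assumption, since the text does not say how repeat failures are handled.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 3  # from the text: 3 consecutive failures


class FailureTracker:
    """Counts consecutive unreachable cycles per server (hypothetical helper)."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD):
        self.threshold = threshold
        self._streaks = defaultdict(int)

    def record(self, server_id: str, reachable: bool) -> bool:
        """Record one monitor cycle; return True if monitor_error should fire now."""
        if reachable:
            # Any successful cycle resets the streak.
            self._streaks[server_id] = 0
            return False
        self._streaks[server_id] += 1
        # Assumption: alert only at the moment the streak hits the threshold.
        return self._streaks[server_id] == self.threshold
```

With a 5-minute cycle, a server that fails twice during a restart and then recovers never triggers an alert, because the successful cycle resets its streak to zero.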
Watchdog
A separate timer runs every minute and checks when the monitor last completed a full cycle. If it's been more than 15 minutes, it bypasses the normal alert pipeline and sends a direct Telegram message to all super-admins: "EXMER monitor is STALE — last completed N min ago". This catches hangs that the monitor can't notice itself.
Who gets alerts
For each fired alert, the recipient list is built from:
- The server's owner
- Admin members (not viewers) from server_access
- All super-admins from ADMIN_USER_IDS
Each recipient then passes through their personal notification preferences.
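
The recipient list described above amounts to a union of three sources. In the sketch below, the shape of server_access (a mapping from user ID to role), the role names "admin" and "viewer", and de-duplication through a set are all assumptions made for illustration; build_recipients is a hypothetical name.

```python
def build_recipients(owner_id: int,
                     server_access: dict[int, str],
                     admin_user_ids: list[int]) -> set[int]:
    """Collect everyone who should be considered for an alert on one server."""
    recipients = {owner_id}
    # Only admin members of the server are included; viewers are skipped.
    recipients.update(uid for uid, role in server_access.items() if role == "admin")
    # Super-admins are always included.
    recipients.update(admin_user_ids)
    return recipients
```

Using a set means a user who is both the owner and a super-admin is considered once, before personal preferences are applied.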
Notification preferences
Every authenticated user can mute individual alert types and set a quiet-hours window. Defaults: everything on, no quiet hours. Preferences are updated with PUT /api/me/notification-prefs. Quiet hours may wrap past midnight (e.g. 22:00–07:00 for overnight).
| Alert type | Fires when | Severity |
|---|---|---|
| gateway_down | openclaw health returns non-zero | critical |
| connection_failed | SSH connection to server fails | critical |
| token_overflow | Session tokens above 80% | warning |
| disk_full | Root disk above 90% | warning |
| api_errors | Provider errors in gateway logs | warning / critical |
| monitor_error | 3 consecutive monitor failures | critical |
Alert storage
Every fired alert is also stored in the alerts table with read/unread status. The Alerts page in the Mini App shows the full history.