Properties
category: spec
tags: [monitoring, observability, plan]
last_updated: 2026-03-18
confidence: high
Monitoring Dashboard Plan
Current State
Infrastructure already in place:
- healthcheck role: cron every minute, checks all 4 services (systemctl + HTTP), emails on failure/recovery via msmtp. Binary up/down only.
- diskmon role: cron checks `/` and `/srv` at 80%/90% thresholds, emails alerts.
- smoke-test.sh: post-deploy liveness + content checks for all 4 services + each wiki slug.
- No health endpoints in any app code (api_server.py, auth_server.py, mcp_entry.py have no `/health` or `/status` routes).
- No metrics collection (no Prometheus, no structured logs beyond journald).
Recommendation: Option C — /app/admin/stats page in the management UI
Rationale: The management UI (port 8002, /app/*) already exists with auth, templates, and DB access. A stats page requires zero new infrastructure and is immediately accessible in-browser. Option A (enhanced health endpoint) gives machine-readable data but no dashboard. Option B (Prometheus) is new infra.
What to Build
One new route in app/api_server.py: GET /app/admin/stats
Access control: Require a valid platform JWT cookie and the is_admin flag on the user record (or, more simply, check the user's DID against an allow-list of admin DIDs read from the env var ROBOT_ADMIN_DIDS).
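The simpler env-var variant of the access check could look roughly like the sketch below. It assumes ROBOT_ADMIN_DIDS holds a comma-separated list (the plan does not specify the format), and the helper names `parse_admin_dids` and `is_admin` are hypothetical. JWT validation is assumed to have already happened and to have yielded the caller's DID.

```python
from __future__ import annotations

import os
from typing import Mapping


def parse_admin_dids(raw: str) -> frozenset[str]:
    """Split a comma-separated DID list, dropping blanks and stray whitespace.

    The comma-separated format is an assumption, not part of the plan.
    """
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def is_admin(did: str | None, env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if `did` is listed in ROBOT_ADMIN_DIDS.

    An unset or empty variable yields an empty allow-list, so the check
    fails closed: nobody is an admin until the variable is configured.
    """
    if not did:
        return False
    return did in parse_admin_dids(env.get("ROBOT_ADMIN_DIDS", ""))
```

Failing closed on a missing variable means a misconfigured deploy hides the stats page rather than exposing it.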
Page content (all point-in-time, no history):
| Section | Source |
|---|---|
| Service status | systemctl is-active for each of the 4 units |
| HTTP liveness | curl -s -o /dev/null -w "%{http_code}" http://localhost:{port}/ for each |
| Disk usage | df -h /srv |
| Wiki count | SELECT COUNT(*) FROM wikis |
| User count | SELECT COUNT(*) FROM users |
| Recent wikis | SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10 |
| Journal tail | journalctl -u robot-otterwiki -u robot-api -u robot-auth -u robot-mcp -n 50 --no-pager |
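The three database-backed rows in the table can be gathered in one small helper. This is a minimal sketch that assumes a DB-API connection with an `execute` shortcut such as `sqlite3` provides; the plan does not name the database engine, and `collect_db_stats` is a hypothetical name. The queries themselves are taken verbatim from the table.

```python
import sqlite3


def collect_db_stats(conn: sqlite3.Connection) -> dict:
    """Run the three read-only queries from the page-content table."""
    wiki_count = conn.execute("SELECT COUNT(*) FROM wikis").fetchone()[0]
    user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    recent_wikis = conn.execute(
        "SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10"
    ).fetchall()
    return {
        "wiki_count": wiki_count,
        "user_count": user_count,
        "recent_wikis": recent_wikis,  # list of (slug, created_at) tuples
    }
```

The returned dict can be passed straight to the template, which keeps the route handler free of SQL.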
Implementation notes:
- Call `subprocess.run` with `capture_output=True` for systemctl/df/journalctl (the same pattern already used in `_init_wiki_repo`).
- Use existing Jinja2 template system (Bootstrap already pulled in via otterwiki static assets).
- No JS required. Plain HTML table.
- Rate-limit the route (already have `flask_limiter` wired up).
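The `subprocess.run` pattern from the notes above might be wrapped as follows. This is a sketch, not the existing `_init_wiki_repo` code: `run_cmd`, `collect_system_stats`, and the 5-second timeout are assumptions, while the unit names come from the journalctl command in the table. Commands are passed as argument lists, so no shell is involved.

```python
from __future__ import annotations

import subprocess

# Unit names as listed in the journalctl row of the page-content table.
UNITS = ["robot-otterwiki", "robot-api", "robot-auth", "robot-mcp"]


def run_cmd(args: list[str], timeout: float = 5.0) -> str:
    """Run a command without a shell and return its trimmed output.

    Failures come back as text so one broken probe cannot take down
    the whole stats page.
    """
    try:
        proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"unavailable: {exc.__class__.__name__}"
    # `systemctl is-active` exits non-zero for inactive units but still
    # prints the state on stdout, so prefer stdout and fall back to stderr.
    return (proc.stdout or proc.stderr).strip()


def collect_system_stats(run=run_cmd) -> dict:
    """Gather the shell-sourced sections; `run` is injectable for tests."""
    unit_args: list[str] = []
    for unit in UNITS:
        unit_args += ["-u", unit]
    return {
        "services": {unit: run(["systemctl", "is-active", unit]) for unit in UNITS},
        "disk": run(["df", "-h", "/srv"]),
        "journal": run(["journalctl", *unit_args, "-n", "50", "--no-pager"]),
    }
```

Injecting `run` lets the route test exercise the page without systemd present, for example by passing a stub that always returns `"active"`.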
What NOT to Build
- No time-series data, no graphs, no retention.
- No `/health` JSON endpoint (the healthcheck cron already handles alerting; a JSON endpoint adds no dashboard value).
- No Prometheus (new infra, not justified yet).
Ansible Changes
Add healthcheck and diskmon roles to deploy.yml (they are currently absent — only database and deploy roles run on deploy). This ensures the cron jobs are always present and current after a deploy.

    # deploy.yml: add to the roles list
    - healthcheck
    - diskmon

The stats page itself requires no Ansible changes (it's app code deployed via the existing deploy role).
Files to Touch
- `app/api_server.py`: add `/app/admin/stats` route
- `app/management/templates/admin_stats.html`: new template
- `ansible/deploy.yml`: add `healthcheck` + `diskmon` to roles
- `tests/`: one test for the route (auth required, returns 200 with admin creds)
Out of Scope
- Per-wiki metrics (page count, edit frequency) — future work
- Historical data / trending — future work
- Alerting changes — healthcheck cron already handles this
