Properties

category: spec
tags: [monitoring, observability, plan]
last_updated: 2026-03-18
confidence: high

Monitoring Dashboard Plan

Current State

Infrastructure already in place:

healthcheck role: cron every minute, checks all 4 services (systemctl + HTTP), emails on failure/recovery via msmtp. Binary up/down only.
diskmon role: cron checks / and /srv at 80%/90% thresholds, emails alerts.
smoke-test.sh: post-deploy liveness + content checks for all 4 services + each wiki slug.
No health endpoints in any app code (api_server.py, auth_server.py, mcp_entry.py have no /health or /status routes).
No metrics collection (no Prometheus, no structured logs beyond journald).

Recommendation: Option C — `/app/admin/stats` page in the management UI

Rationale: The management UI (port 8002, /app/*) already exists with auth, templates, and DB access, so a stats page requires no new infrastructure and is immediately accessible in-browser. Of the alternatives, Option A (an enhanced health endpoint) gives machine-readable data but no dashboard, and Option B (Prometheus) means deploying and maintaining new infrastructure.

What to Build

One new route in app/api_server.py: GET /app/admin/stats

Access control: Require a valid platform JWT cookie AND an is_admin flag on the user record (or, simpler, check the user's DID against a set of admin DIDs read from the env var ROBOT_ADMIN_DIDS).
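
The simpler env-var variant could look roughly like the sketch below. It assumes ROBOT_ADMIN_DIDS holds a comma-separated list (the plan does not fix a format), and the helper names load_admin_dids and is_admin are hypothetical. Extracting the DID from the platform JWT cookie is left to the existing auth code.

```python
import os
from typing import Mapping, Optional


def load_admin_dids(env: Mapping[str, str] = os.environ) -> frozenset:
    """Parse ROBOT_ADMIN_DIDS into a set of DIDs.

    Assumption: the variable is a comma-separated list; blank entries
    and surrounding whitespace are ignored.
    """
    raw = env.get("ROBOT_ADMIN_DIDS", "")
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def is_admin(did: Optional[str], admin_dids: frozenset) -> bool:
    """Fail closed: a missing DID or an empty allowlist never grants access."""
    return bool(did) and did in admin_dids
```

With the variable unset, the allowlist is empty and every request is denied, which seems the safer default for an admin-only page.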

Page content (all point-in-time, no history):

Section	Source
Service status	`systemctl is-active` for each of the 4 units
HTTP liveness	`curl -s -o /dev/null -w "%{http_code}" http://localhost:{port}/` for each
Disk usage	`df -h /srv`
Wiki count	`SELECT COUNT(*) FROM wikis`
User count	`SELECT COUNT(*) FROM users`
Recent wikis	`SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10`
Journal tail	`journalctl -u robot-otterwiki -u robot-api -u robot-auth -u robot-mcp -n 50 --no-pager`
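
As an illustration of the three database rows in the table, the queries can be gathered into one helper that returns plain values for the template. This is a sketch only: it assumes a DB-API connection (sqlite3 is used here purely for illustration, since the plan does not name the database driver), and collect_db_stats is a hypothetical name.

```python
import sqlite3


def collect_db_stats(conn: sqlite3.Connection) -> dict:
    """Run the three read-only queries from the table above."""
    wiki_count = conn.execute("SELECT COUNT(*) FROM wikis").fetchone()[0]
    user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    recent = conn.execute(
        "SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10"
    ).fetchall()
    return {
        "wiki_count": wiki_count,
        "user_count": user_count,
        # Convert rows to dicts so the template does not depend on row types.
        "recent_wikis": [{"slug": slug, "created_at": ts} for slug, ts in recent],
    }
```

Returning a plain dict keeps the route thin: it only has to merge this with the command output and pass the result to the template.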

Implementation notes:

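Call subprocess.run with capture_output=True for systemctl/df/journalctl, the same pattern already used in _init_wiki_repo. Pass arguments as a list (no shell=True) and set a timeout so a hung command cannot stall the page.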
Use existing Jinja2 template system (Bootstrap already pulled in via otterwiki static assets).
No JS required. Plain HTML table.
Rate-limit the route (flask_limiter is already wired up).
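
The subprocess pattern from the notes above might be wrapped as follows. This is a minimal sketch, not the code in _init_wiki_repo: run_cmd and service_status are hypothetical helpers, and the 5-second timeout is an assumed default.

```python
import subprocess


def run_cmd(args, timeout=5.0):
    """Run a command without a shell; return a dict that is safe to render."""
    try:
        proc = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout, check=False
        )
    except OSError as exc:
        # Binary missing or not executable: report it instead of raising a 500.
        return {"ok": False, "output": str(exc)}
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": f"{args[0]}: timed out after {timeout}s"}
    output = (proc.stdout or proc.stderr).strip()
    return {"ok": proc.returncode == 0, "output": output}


def service_status(unit):
    """`systemctl is-active` prints the unit state and exits non-zero unless active."""
    return run_cmd(["systemctl", "is-active", unit])
```

For example, service_status("robot-api") would feed the Service status row, and run_cmd(["df", "-h", "/srv"]) the Disk usage row. Catching errors inside the helper means one failing command shows up as a failed row rather than breaking the whole page.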

What NOT to Build

No time-series data, no graphs, no retention.
No /health JSON endpoint (the healthcheck cron already handles alerting; a JSON endpoint adds no dashboard value).
No Prometheus (new infra, not justified yet).

Ansible Changes

Add healthcheck and diskmon roles to deploy.yml (they are currently absent — only database and deploy roles run on deploy). This ensures the cron jobs are always present and current after a deploy.

# deploy.yml — add to roles list:
    - healthcheck
    - diskmon

The stats page itself requires no Ansible changes (it's app code deployed via the existing deploy role).

Files to Touch

app/api_server.py — add /app/admin/stats route
app/management/templates/admin_stats.html — new template
ansible/deploy.yml — add healthcheck + diskmon to roles
tests/ — one test for the route (auth required, returns 200 with admin creds)

Out of Scope

Per-wiki metrics (page count, edit frequency) — future work
Historical data / trending — future work
Alerting changes — healthcheck cron already handles this