---
category: spec
tags: [monitoring, observability, plan]
last_updated: 2026-03-18
confidence: high
---

# Monitoring Dashboard Plan

## Current State

Infrastructure already in place:
- **healthcheck role**: cron every minute, checks all 4 services (systemctl + HTTP), emails on failure/recovery via msmtp. Binary up/down only.
- **diskmon role**: cron checks `/` and `/srv` at 80%/90% thresholds, emails alerts.
- **smoke-test.sh**: post-deploy liveness + content checks for all 4 services + each wiki slug.
- **No health endpoints** in any app code (api_server.py, auth_server.py, mcp_entry.py have no `/health` or `/status` routes).
- **No metrics collection** (no Prometheus, no structured logs beyond journald).

## Recommendation: Option C, a `/app/admin/stats` page in the management UI

Rationale: The management UI (port 8002, `/app/*`) already exists with auth, templates, and DB access. A stats page requires zero new infrastructure and is immediately accessible in-browser. Option A (enhanced health endpoint) gives machine-readable data but no dashboard. Option B (Prometheus) requires new infrastructure.

## What to Build

One new route in `app/api_server.py`: `GET /app/admin/stats`

**Access control**: Require a valid platform JWT cookie AND an `is_admin` flag on the user record (or, more simply, check the user's DID against an allowlist of admin DIDs read from the env var `ROBOT_ADMIN_DIDS`).

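The simpler allowlist variant of the access check could look roughly like the sketch below. It assumes `ROBOT_ADMIN_DIDS` holds a comma-separated list of DIDs; the plan does not specify the format, and the helper names `load_admin_dids` and `is_admin` are hypothetical. JWT validation is left to the existing auth code.

```python
import os


def load_admin_dids(env=os.environ):
    """Parse ROBOT_ADMIN_DIDS into a set of DIDs.

    Assumption: the value is comma-separated. Blank entries and
    surrounding whitespace are ignored.
    """
    raw = env.get("ROBOT_ADMIN_DIDS", "")
    return {did.strip() for did in raw.split(",") if did.strip()}


def is_admin(did, admin_dids):
    """Return True only for a non-empty DID that is on the allowlist."""
    return bool(did) and did in admin_dids
```

With an unset or empty variable the set is empty, so the check fails closed and nobody is treated as an admin. The route would presumably respond with 403 whenever `is_admin` returns `False` for the DID taken from the validated JWT.
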
**Page content** (all point-in-time, no history):

| Section | Source |
|---------|--------|
| Service status | `systemctl is-active` for each of the 4 units |
| HTTP liveness | `curl -s -o /dev/null -w "%{http_code}" http://localhost:{port}/` for each |
| Disk usage | `df -h /srv` |
| Wiki count | `SELECT COUNT(*) FROM wikis` |
| User count | `SELECT COUNT(*) FROM users` |
| Recent wikis | `SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10` |
| Journal tail | `journalctl -u robot-otterwiki -u robot-api -u robot-auth -u robot-mcp -n 50 --no-pager` |

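As a rough illustration, the three database rows of the table can be gathered in one small helper. The sketch assumes a DB-API style connection; the plan does not say which database or access layer the management UI uses. The function names and the in-memory SQLite schema are hypothetical stand-ins, present only so the example runs on its own.

```python
import sqlite3


def collect_db_stats(conn):
    """Run the three read-only queries from the table on a DB-API connection."""
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM wikis")
    wiki_count = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM users")
    user_count = cur.fetchone()[0]
    cur.execute(
        "SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10"
    )
    recent_wikis = cur.fetchall()
    return {
        "wiki_count": wiki_count,
        "user_count": user_count,
        "recent_wikis": recent_wikis,
    }


def demo_connection():
    """Hypothetical stand-in schema; the real tables have more columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wikis (slug TEXT, created_at TEXT)")
    conn.execute("CREATE TABLE users (did TEXT)")
    conn.executemany(
        "INSERT INTO wikis VALUES (?, ?)",
        [("alpha", "2026-01-01"), ("beta", "2026-02-01")],
    )
    conn.execute("INSERT INTO users VALUES ('did:example:1')")
    return conn
```

Returning a plain dict keeps the template simple: the route can pass the result straight into the Jinja2 context.
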
**Implementation notes**:
- Call `subprocess.run` with `capture_output=True` for systemctl/df/journalctl; this is the same pattern already used in `_init_wiki_repo`.
- Use the existing Jinja2 template system (Bootstrap is already pulled in via otterwiki static assets).
- No JS required. Plain HTML table.
- Rate-limit the route (`flask_limiter` is already wired up).

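A minimal sketch of the `subprocess.run` pattern from the first note follows. The helper names `run_command` and `service_status` and the 5-second timeout are assumptions, not taken from `_init_wiki_repo`. The idea is that a hung or missing binary should degrade one table cell rather than fail the whole page.

```python
import subprocess


def run_command(args, timeout=5):
    """Run a command without a shell and return (ok, text); never raises.

    ok is True when the command exits with status 0. text is stdout,
    or stderr when stdout is empty, with surrounding whitespace removed.
    """
    try:
        proc = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or a command that hung.
        return False, f"{type(exc).__name__}: {exc}"
    output = (proc.stdout or proc.stderr).strip()
    return proc.returncode == 0, output


def service_status(unit):
    """Wrap `systemctl is-active`, which exits 0 only for an active unit."""
    return run_command(["systemctl", "is-active", unit])
```

For example, `service_status("robot-api")` would return `(True, "active")` on a host where that unit is running. Passing the arguments as a list avoids shell interpretation, which matters if a unit name or path ever comes from configuration.
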
## What NOT to Build

- No time-series data, no graphs, no retention.
- No `/health` JSON endpoint (the healthcheck cron already handles alerting; a JSON endpoint adds no dashboard value).
- No Prometheus (new infra, not justified yet).

## Ansible Changes

Add the `healthcheck` and `diskmon` roles to `deploy.yml` (they are currently absent; only the `database` and `deploy` roles run on deploy). This ensures the cron jobs are always present and current after a deploy.

```yaml
# deploy.yml: add to the roles list
- healthcheck
- diskmon
```

The stats page itself requires no Ansible changes (it's app code deployed via the existing `deploy` role).

## Files to Touch

- `app/api_server.py`: add `/app/admin/stats` route
- `app/management/templates/admin_stats.html`: new template
- `ansible/deploy.yml`: add `healthcheck` + `diskmon` to roles
- `tests/`: one test for the route (auth required, returns 200 with admin creds)

## Out of Scope

- Per-wiki metrics (page count, edit frequency): future work
- Historical data / trending: future work
- Alerting changes: healthcheck cron already handles this
