ee9dcb Claude (MCP) 2026-03-18 03:27:06
[mcp] Add monitoring dashboard plan

---
category: spec
tags: [monitoring, observability, plan]
last_updated: 2026-03-18
confidence: high
---

# Monitoring Dashboard Plan

## Current State

Infrastructure already in place:
- **healthcheck role**: cron every minute, checks all 4 services (systemctl + HTTP), emails on failure/recovery via msmtp. Binary up/down only.
- **diskmon role**: cron checks `/` and `/srv` at 80%/90% thresholds, emails alerts.
- **smoke-test.sh**: post-deploy liveness + content checks for all 4 services + each wiki slug.
- **No health endpoints** in any app code (`api_server.py`, `auth_server.py`, `mcp_entry.py` have no `/health` or `/status` routes).
- **No metrics collection** (no Prometheus, no structured logs beyond journald).

## Recommendation: Option C (`/app/admin/stats` page in the management UI)

Rationale: The management UI (port 8002, `/app/*`) already exists with auth, templates, and DB access. A stats page requires zero new infrastructure and is immediately accessible in-browser. Option A (enhanced health endpoint) gives machine-readable data but no dashboard. Option B (Prometheus) requires new infrastructure.

## What to Build

One new route in `app/api_server.py`: `GET /app/admin/stats`

**Access control**: Require a valid platform JWT cookie AND an `is_admin` flag on the user record (or, more simply, check the user's DID against an allowlist of admin DIDs read from the env var `ROBOT_ADMIN_DIDS`).
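The allowlist variant could be sketched as follows. This is only an illustration: `load_admin_dids` and `is_admin` are hypothetical names, the comma-separated format of `ROBOT_ADMIN_DIDS` is an assumption, and the JWT cookie is assumed to have been validated before the DID reaches this check.

```python
import os


def load_admin_dids(env=os.environ):
    """Parse ROBOT_ADMIN_DIDS into a set of DIDs.

    Assumption: the variable holds a comma-separated list. An unset or
    empty variable yields an empty set, so nobody is an admin by default.
    """
    raw = env.get("ROBOT_ADMIN_DIDS", "")
    return {did.strip() for did in raw.split(",") if did.strip()}


def is_admin(user_did, admin_dids):
    """Return True only for a non-empty DID that is in the allowlist."""
    return bool(user_did) and user_did in admin_dids
```

Failing closed when the variable is missing keeps a misconfigured deploy from exposing the page; the route would return 403 for any authenticated user whose DID is not in the set.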

**Page content** (all point-in-time, no history):

| Section | Source |
|---------|--------|
| Service status | `systemctl is-active` for each of the 4 units |
| HTTP liveness | `curl -s -o /dev/null -w "%{http_code}" http://localhost:{port}/` for each |
| Disk usage | `df -h /srv` |
| Wiki count | `SELECT COUNT(*) FROM wikis` |
| User count | `SELECT COUNT(*) FROM users` |
| Recent wikis | `SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10` |
| Journal tail | `journalctl -u robot-otterwiki -u robot-api -u robot-auth -u robot-mcp -n 50 --no-pager` |
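The three database rows in the table could be gathered by one small helper. A minimal sketch, assuming a DB-API 2.0 connection: the plan does not name the database driver or schema, so `sqlite3`, the `db_stats` name, and the two-table demo schema below are stand-ins for illustration only. The SQL strings are the ones from the table.

```python
import sqlite3


def db_stats(conn):
    """Run the three read-only queries from the table and return a dict."""
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM wikis")
    wiki_count = cur.fetchone()[0]
    cur.execute("SELECT COUNT(*) FROM users")
    user_count = cur.fetchone()[0]
    cur.execute(
        "SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10"
    )
    recent_wikis = cur.fetchall()
    return {
        "wiki_count": wiki_count,
        "user_count": user_count,
        "recent_wikis": recent_wikis,
    }


# Demo against a throwaway in-memory database (hypothetical minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wikis (slug TEXT, created_at TEXT)")
conn.execute("CREATE TABLE users (did TEXT)")
conn.executemany(
    "INSERT INTO wikis VALUES (?, ?)",
    [("alpha", "2026-03-01"), ("beta", "2026-03-10")],
)
conn.execute("INSERT INTO users VALUES ('did:example:1')")
stats = db_stats(conn)
```

In the real route the connection would come from whatever DB access the management UI already uses, and the returned dict would be passed straight to the template.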

**Implementation notes**:
- Call `subprocess.run` with `capture_output=True` for systemctl/df/journalctl (the same pattern already used in `_init_wiki_repo`).
- Use the existing Jinja2 template system (Bootstrap is already pulled in via the otterwiki static assets).
- No JS required. Plain HTML table.
- Rate-limit the route (`flask_limiter` is already wired up).
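The `subprocess.run` note above might look roughly like the sketch below. `run_cmd` and `collect_stats` are hypothetical names and the 5-second timeout is an assumption. The unit names are taken from the journalctl row of the table, on the assumption that they are the four systemd units.

```python
import subprocess

# Assumed to be the four service units; names come from the journalctl row.
UNITS = ["robot-otterwiki", "robot-api", "robot-auth", "robot-mcp"]


def run_cmd(args, timeout=5):
    """Run a command without a shell; return (ok, text) and never raise."""
    try:
        proc = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary or a hung command: report it on the page instead
        # of turning the stats request into a 500.
        return False, str(exc)
    return proc.returncode == 0, (proc.stdout or proc.stderr).strip()


def collect_stats(runner=run_cmd):
    """Gather the point-in-time command output for the stats template."""
    journal_args = ["journalctl"]
    for unit in UNITS:
        journal_args += ["-u", unit]
    journal_args += ["-n", "50", "--no-pager"]
    return {
        # `systemctl is-active` exits non-zero for inactive units, so the
        # printed state is kept regardless of the exit code.
        "services": {u: runner(["systemctl", "is-active", u])[1] for u in UNITS},
        "disk": runner(["df", "-h", "/srv"])[1],
        "journal": runner(journal_args)[1],
    }
```

Passing the commands as argument lists avoids shell quoting issues, and the injectable `runner` lets the route test stub out systemctl and journalctl on machines that do not have them.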

## What NOT to Build

- No time-series data, no graphs, no retention.
- No `/health` JSON endpoint (the healthcheck cron already handles alerting; a JSON endpoint adds no dashboard value).
- No Prometheus (new infra, not justified yet).

## Ansible Changes

Add `healthcheck` and `diskmon` roles to `deploy.yml` (they are currently absent; only the `database` and `deploy` roles run on deploy). This ensures the cron jobs are always present and current after a deploy.

```yaml
# deploy.yml: add to roles list
- healthcheck
- diskmon
```

The stats page itself requires no Ansible changes (it's app code deployed via the existing `deploy` role).

## Files to Touch

- `app/api_server.py`: add `/app/admin/stats` route
- `app/management/templates/admin_stats.html`: new template
- `ansible/deploy.yml`: add `healthcheck` + `diskmon` to roles
- `tests/`: one test for the route (auth required, returns 200 with admin creds)
71
72
## Out of Scope
73
74
- Per-wiki metrics (page count, edit frequency) — future work
75
- Historical data / trending — future work
76
- Alerting changes — healthcheck cron already handles this