commit ee9dcb

2026-03-18 03:27:06 Claude (MCP): [mcp] Add monitoring dashboard plan

`/dev/null` .. `Plans/Monitoring_Dashboard.md`
@@ 0,0 1,76 @@
+	---
+	category: spec
+	tags: [monitoring, observability, plan]
+	last_updated: 2026-03-18
+	confidence: high
+	---
+
+	# Monitoring Dashboard Plan
+
+	## Current State
+
+	Infrastructure already in place:
+	- healthcheck role: cron every minute, checks all 4 services (systemctl + HTTP), emails on failure/recovery via msmtp. Binary up/down only.
+	- diskmon role: cron checks `/` and `/srv` at 80%/90% thresholds, emails alerts.
+	- smoke-test.sh: post-deploy liveness + content checks for all 4 services + each wiki slug.
+	- No health endpoints in any app code (api_server.py, auth_server.py, mcp_entry.py have no `/health` or `/status` routes).
+	- No metrics collection (no Prometheus, no structured logs beyond journald).
+
+	## Recommendation: Option C — `/app/admin/stats` page in the management UI
+
+	Rationale: The management UI (port 8002, `/app/*`) already exists with auth, templates, and DB access. A stats page requires zero new infrastructure and is immediately accessible in-browser. Option A (enhanced health endpoint) gives machine-readable data but no dashboard. Option B (Prometheus) is new infra.
+
+	## What to Build
+
+	One new route in `app/api_server.py`: `GET /app/admin/stats`
+
+	Access control: Require a valid platform JWT cookie AND an `is_admin` flag on the user record (or, simpler, check the user's DID against an allowlist of admin DIDs read from the env var `ROBOT_ADMIN_DIDS`).
+
+	Page content (all point-in-time, no history):
+
+	| Section | Source |
+	|---------|--------|
+	| Service status | `systemctl is-active` for each of the 4 units |
+	| HTTP liveness | `curl -s -o /dev/null -w "%{http_code}" http://localhost:{port}/` for each |
+	| Disk usage | `df -h /srv` |
+	| Wiki count | `SELECT COUNT(*) FROM wikis` |
+	| User count | `SELECT COUNT(*) FROM users` |
+	| Recent wikis | `SELECT slug, created_at FROM wikis ORDER BY created_at DESC LIMIT 10` |
+	| Journal tail | `journalctl -u robot-otterwiki -u robot-api -u robot-auth -u robot-mcp -n 50 --no-pager` |
+
+	Implementation notes:
+	- Call `subprocess.run` with `capture_output=True` for systemctl/df/journalctl — same pattern already used in `_init_wiki_repo`.
+	- Use existing Jinja2 template system (Bootstrap already pulled in via otterwiki static assets).
+	- No JS required. Plain HTML table.
+	- Rate-limit the route (already have `flask_limiter` wired up).
+
+	## What NOT to Build
+
+	- No time-series data, no graphs, no retention.
+	- No `/health` JSON endpoint (the healthcheck cron already handles alerting; a JSON endpoint adds no dashboard value).
+	- No Prometheus (new infra, not justified yet).
+
+	## Ansible Changes
+
+	Add `healthcheck` and `diskmon` roles to `deploy.yml` (they are currently absent — only `database` and `deploy` roles run on deploy). This ensures the cron jobs are always present and current after a deploy.
+
+	```yaml
+	# deploy.yml — add to roles list:
+	- healthcheck
+	- diskmon
+	```
+
+	The stats page itself requires no Ansible changes (it's app code deployed via the existing `deploy` role).
+
+	## Files to Touch
+
+	- `app/api_server.py` — add `/app/admin/stats` route
+	- `app/management/templates/admin_stats.html` — new template
+	- `ansible/deploy.yml` — add `healthcheck` + `diskmon` to roles
+	- `tests/` — one test for the route (auth required, returns 200 with admin creds)
+
+	## Out of Scope
+
+	- Per-wiki metrics (page count, edit frequency) — future work
+	- Historical data / trending — future work
+	- Alerting changes — healthcheck cron already handles this
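
The plan's implementation notes call for `subprocess.run` with `capture_output=True`, and its access-control note mentions an admin DID set taken from `ROBOT_ADMIN_DIDS`. A minimal sketch of both pieces might look like the following. The helper names (`run_cmd`, `parse_admin_dids`, `is_admin`), the 5-second timeout, and the comma-separated format of the env var are assumptions for illustration and are not part of this commit.

```python
import os
import subprocess


def run_cmd(argv, timeout=5):
    """Run a command without a shell; return (exit_code, stripped stdout).

    check=True is deliberately not used: `systemctl is-active` exits
    non-zero for an inactive unit, and the page should show that state
    rather than raise.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary or hung command: report it instead of failing the page.
        return None, f"error: {exc}"
    return proc.returncode, proc.stdout.strip()


def parse_admin_dids(raw):
    """Split a comma-separated value into a set of DIDs (format is assumed)."""
    return {did.strip() for did in (raw or "").split(",") if did.strip()}


def is_admin(did, env=os.environ):
    """Return True if `did` appears in the ROBOT_ADMIN_DIDS allowlist."""
    return did in parse_admin_dids(env.get("ROBOT_ADMIN_DIDS"))
```

A route handler could then call, for example, `run_cmd(["systemctl", "is-active", "robot-api"])` or `run_cmd(["df", "-h", "/srv"])` and pass the results to the template. Passing the arguments as a list avoids shell interpolation, and the timeout keeps a stuck command from blocking the request.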
