---
category: design
tags: [architecture, infrastructure, refactoring]
last_updated: 2026-03-19
confidence: high
---

# Server Consolidation: Merge auth_server + api_server

Merge the auth service (port 8003) and management/API service (port 8002) into a single Flask app on port 8002. Otterwiki (port 8000) and MCP sidecar (port 8001) remain separate.

## Motivation

auth_server and api_server are tightly coupled:
- Same database (robot.db)
- Same models (UserModel, WikiModel)
- Same signing keys (RSA for JWT, EC for ATProto client)
- Same cookie (platform_token on .robot.wtf)
- Same user identity model (DID-based)

The separation creates concrete problems:
- **E2E testing**: Cookies set on port 8003 aren't sent to port 8002. Standing up two Flask servers in test fixtures requires cross-thread SQLite coordination. An implementation agent burned its entire context trying to solve this.
- **Operational overhead**: Two systemd services, two Gunicorn configs, two health checks for what is logically one "platform" service.
- **Code duplication**: Both apps call `_load_keys()`, `get_connection()`, and `init_schema()` independently. Both have their own rate limiting setup.

## Current Architecture

```
Caddy (TLS, port 80/443)
├─ robot.wtf/auth/* → robot-auth (Gunicorn, port 8003, auth_server.py)
├─ robot.wtf/app/* → robot-api (Gunicorn, port 8002, api_server.py)
├─ robot.wtf/api/* → robot-api (port 8002)
├─ {slug}.robot.wtf/mcp → robot-mcp (uvicorn, port 8001)
├─ {slug}.robot.wtf/api/v1/* → robot-otterwiki (Gunicorn, port 8000, wsgi.py)
└─ {slug}.robot.wtf/* → robot-otterwiki (port 8000)
```

Four processes: three WSGI entry points (`auth_server:application`, `api_server:application`, `wsgi:application`) plus the MCP sidecar.

## Target Architecture

```
Caddy (TLS, port 80/443)
├─ robot.wtf/auth/* → robot-platform (Gunicorn, port 8002, platform_server.py)
├─ robot.wtf/app/* → robot-platform (port 8002)
├─ robot.wtf/api/* → robot-platform (port 8002)
├─ {slug}.robot.wtf/mcp → robot-mcp (uvicorn, port 8001)
├─ {slug}.robot.wtf/api/v1/* → robot-otterwiki (Gunicorn, port 8000, wsgi.py)
└─ {slug}.robot.wtf/* → robot-otterwiki (port 8000)
```

Three processes: two WSGI entry points plus the MCP sidecar. Caddy routing barely changes: `/auth/*` and `/app/*` already route to the same IP on different ports, so pointing `/auth/*` at port 8002 is a one-line edit.

## What Changes

### New: `app/platform_server.py`

Single Flask app factory that combines auth and management routes:

```python
from flask import Flask


def create_app(*, db_path=None, client_jwk_path=None, signing_key_path=None):
    # Outline only: `...` marks setup and arguments elided from this plan.
    app = Flask(__name__, template_folder="templates")
    # templates/auth/: login.html, consent.html, error.html, base.html
    # templates/management/: layout.html, wiki_create.html, etc.

    # Shared setup: secret key, keys, DB, models, rate limiter
    ...

    # Auth routes (/auth/*)
    _register_auth_routes(app, platform_jwt, client_secret_jwk, ...)

    # Management UI routes (/app/*)
    _register_management_ui_routes(app, platform_jwt, wiki_model, user_model, ...)

    # Management API routes (/api/*)
    _register_management_api_routes(app, wiki_model, user_model, ...)

    # Well-known routes
    _register_wellknown_routes(app, ...)

    return app
```

The route registration functions hold the existing route definitions, extracted from `auth_server.py` and `api_server.py` into callables that take a Flask app and the shared dependencies as arguments.

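To make the registration pattern concrete, here is a minimal sketch of two such functions. The route paths, the `InMemoryWikiModel` stand-in, the dict used for `platform_jwt`, and the `create_demo_app` name are all hypothetical and chosen only so the example runs without a database or keys; the real functions carry the existing routes.

```python
from flask import Flask, jsonify


class InMemoryWikiModel:
    """Hypothetical stand-in for WikiModel so the sketch needs no database."""

    def __init__(self, slugs):
        self._slugs = list(slugs)

    def list_slugs(self):
        return list(self._slugs)


def _register_auth_routes(app, platform_jwt):
    """Attach /auth/* routes; dependencies arrive as arguments, not module globals."""

    @app.route("/auth/whoami")  # hypothetical route, for illustration only
    def auth_whoami():
        return jsonify(issuer=platform_jwt["issuer"])


def _register_management_ui_routes(app, wiki_model):
    """Attach /app/* routes to the same Flask app."""

    @app.route("/app/wikis")  # hypothetical route, for illustration only
    def app_wikis():
        return jsonify(wikis=wiki_model.list_slugs())


def create_demo_app(*, wiki_model, platform_jwt):
    # One Flask app, shared dependencies built once and handed to each registrar.
    app = Flask(__name__)
    _register_auth_routes(app, platform_jwt)
    _register_management_ui_routes(app, wiki_model)
    return app
```

Because each registrar closes over its arguments, tests can pass in fakes through the factory instead of patching module-level state.
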
### Removed
- `app/auth_server.py`: routes moved to platform_server.py
- `ansible/roles/deploy/templates/robot-auth.service.j2`: systemd service removed
- `ansible/roles/deploy/templates/gunicorn-auth.conf.py.j2`: Gunicorn config removed

### Modified
- `app/api_server.py` → renamed/merged into `platform_server.py`
- `ansible/roles/deploy/templates/Caddyfile.j2`: remove auth port, route `/auth/*` to port 8002
- `ansible/roles/deploy/tasks/main.yml`: remove auth service deployment
- `app/auth/templates/`: move to `app/templates/auth/`
- `app/management/templates/`: move to `app/templates/management/`

### Unchanged
- `app/wsgi.py`: otterwiki entry point, completely independent
- `app/resolver.py`: TenantResolver wraps otterwiki, not the platform service
- `app/management/routes.py`: ManagementMiddleware still wraps the platform Flask app
- All auth logic: unchanged, just relocated
- All management logic: unchanged
- Database schema: unchanged
- MCP sidecar: unchanged

## ManagementMiddleware Handling

Currently, `api_server.py` wraps the Flask app with ManagementMiddleware (a WSGI middleware that intercepts `/api/*` for rate limiting and auth). Auth routes don't go through this middleware.

After consolidation, ManagementMiddleware still wraps the combined Flask app. It already passes through paths it doesn't handle, so `/auth/*` routes reach Flask unchanged. No middleware changes needed.

Verify by reading ManagementMiddleware's `__call__`: it only intercepts paths matching its configured prefixes (`/api/`). All other paths pass to the wrapped app.

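The pass-through behavior described above can be sketched in plain WSGI. This is an illustrative stand-in, not ManagementMiddleware's actual code: the class name `PrefixMiddleware`, its constructor arguments, and the helper functions are assumptions made for the example.

```python
class PrefixMiddleware:
    """Illustrative stand-in for the dispatch rule: intercept some prefixes, pass the rest."""

    def __init__(self, wrapped_app, handler, prefixes=("/api/",)):
        self.wrapped_app = wrapped_app  # e.g. the combined Flask app
        self.handler = handler          # WSGI callable that serves intercepted paths
        self.prefixes = tuple(prefixes)

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.startswith(self.prefixes):
            return self.handler(environ, start_response)
        # Anything else (/auth/*, /app/*, ...) goes to the wrapped app untouched.
        return self.wrapped_app(environ, start_response)


def text_app(body):
    """Build a trivial WSGI app that always answers 200 with `body`."""
    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [body]
    return app


def call(app, path):
    """Invoke a WSGI app for `path` and return (status, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response))
    return captured["status"], body
```

With `app = PrefixMiddleware(text_app(b"flask"), text_app(b"management"))`, a call for `/api/wikis` is answered by the handler, while `/auth/login` and `/app/` fall through to the wrapped app.
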
## Template Directory Structure

Before:
```
app/auth/templates/        base.html, login.html, consent.html, error.html
app/management/templates/  layout.html, wiki_create.html, wiki_settings.html, account.html
```

After:
```
app/templates/
  auth/        base.html, login.html, consent.html, error.html
  management/  layout.html, wiki_create.html, wiki_settings.html, account.html
```

Template references in route code change from `render_template("login.html")` to `render_template("auth/login.html")`. Mechanical find-and-replace.

## Database Connection Strategy

Both apps currently call `get_connection()`, which opens a new SQLite connection per call. The consolidated app holds one connection per request, stored on Flask's `g` object and closed in `teardown_appcontext`.

The auth_server pattern (`_get_db()` storing in `g._database`) is cleaner than api_server's approach (connection at startup). Adopt the per-request pattern throughout.

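A minimal sketch of the per-request pattern, following the standard Flask recipe. The `DB_PATH` config key, the `/ping` route, and the `create_demo_app` name are assumptions for illustration; the real `_get_db()` may differ in detail.

```python
import sqlite3

from flask import Flask, current_app, g


def get_db():
    """Open the connection lazily, at most once per application context."""
    if "_database" not in g:
        g._database = sqlite3.connect(current_app.config["DB_PATH"])
    return g._database


def close_db(exception=None):
    """Close the connection, if one was opened, when the context tears down."""
    db = g.pop("_database", None)
    if db is not None:
        db.close()


def create_demo_app(db_path):
    app = Flask(__name__)
    app.config["DB_PATH"] = db_path  # hypothetical config key
    app.teardown_appcontext(close_db)

    @app.route("/ping")  # hypothetical route that touches the database
    def ping():
        return str(get_db().execute("SELECT 1").fetchone()[0])

    return app
```

Since every connection is opened and closed on the thread that serves the request, no connection is shared across threads.
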
## Rate Limiting

- Auth routes: Flask-Limiter with per-route decorators (`@limiter.limit("1/minute")`)
- Management API routes: WSGIRateLimiter singleton in ManagementMiddleware

Both can coexist: Flask-Limiter operates at the Flask level, WSGIRateLimiter at the WSGI level. No conflict.

## Session and Cookie

One Flask app = one `secret_key` = one session. The platform_token cookie is set with `domain=COOKIE_DOMAIN`, which is the same regardless of which routes set it. No changes needed.

## E2E Testing Impact

The consolidation directly unblocks E2E testing:
- One server fixture instead of two
- Cookies work naturally (same origin)
- No SQLite cross-thread issues
- `authenticated_page` fixture just logs in and the cookie works for all routes
- The 11 planned E2E tests become straightforward

## Implementation Sequence

1. Create `app/platform_server.py` with combined app factory
2. Move templates to `app/templates/{auth,management}/`
3. Update `render_template()` calls with subdirectory prefixes
4. Verify all existing unit tests pass against the new structure
5. Update Ansible: remove auth service, update Caddy routes
6. Deploy and verify
7. Remove old `auth_server.py` and `api_server.py`
8. Resume E2E test implementation with simplified fixtures

## Risks and Review Findings

Plan reviewed 2026-03-19. Core approach confirmed sound. Key findings:

### Template migration (important)

The move affects more than `render_template()` calls. The `{% extends %}` directives inside the templates must also change:
- Auth templates: `{% extends 'base.html' %}` → `{% extends 'auth/base.html' %}`
- Management templates: `{% extends "layout.html" %}` → `{% extends "management/layout.html" %}`

6 auth + 4 management template files affected.

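Both rewrites are mechanical enough to script. The sketch below is one possible approach, not a tool that exists in the repo: the function names are made up, and it assumes template names are string literals ending in `.html`. Names that already contain a `/` are left alone, so running it twice is harmless.

```python
import re

# Match a bare template name (no "/" yet) in render_template(...) or {% extends ... %}.
RENDER_RE = re.compile(r"""(render_template\(\s*["'])([^"'/]+\.html)(["'])""")
EXTENDS_RE = re.compile(r"""(\{%-?\s*extends\s+["'])([^"'/]+\.html)(["'])""")


def _add_prefix(pattern, source, subdir):
    # Re-emit the match with "<subdir>/" inserted in front of the template name.
    return pattern.sub(
        lambda m: f"{m.group(1)}{subdir}/{m.group(2)}{m.group(3)}", source
    )


def prefix_render_calls(source, subdir):
    """Rewrite render_template("x.html") to render_template("<subdir>/x.html")."""
    return _add_prefix(RENDER_RE, source, subdir)


def prefix_extends(source, subdir):
    """Rewrite {% extends "x.html" %} to {% extends "<subdir>/x.html" %}."""
    return _add_prefix(EXTENDS_RE, source, subdir)
```

Apply `prefix_render_calls(..., "auth")` to the auth route code and `prefix_extends(..., "auth")` to each auth template, then repeat with `"management"` for the management side. Calls that build the template name dynamically would still need a manual pass.
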
### App factory interface (resolved)

Single `create_app()` factory returns the fully-wrapped WSGI app. Module-level `application = create_app()` for Gunicorn (`app.platform_server:application`).

- Auth server applies ProxyFix inside Flask; API server applies it outermost. Consolidated app: apply once, outermost (api_server pattern).
- Tests call `create_app(db_path=..., ...)` with overrides as they do today.
- Actual import callsites across tests: ~17 (not ~46 as initially estimated). 3 test files, straightforward find-and-replace.

### Error handlers (resolved)

The 429 handler is duplicated verbatim in both servers (content-negotiated JSON/HTML). The 400/500 handlers only exist in auth_server and render HTML via `error.html`. No conflict: API routes go through ManagementMiddleware (raw WSGI, handles its own errors), so Flask error handlers only fire for `/auth/*` and `/app/*` routes. Keep one set of handlers, move `error.html` to `app/templates/error.html`.

### Deployment strategy (important)

Correct zero-downtime approach:
1. Deploy `robot-platform` on port 8002 (new service)
2. Keep `robot-auth` on 8003 running temporarily
3. Verify platform handles `/app/*` and `/api/*`
4. Update Caddy to route `/auth/*` to 8002
5. Stop and disable `robot-auth`

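The verification in step 3 could be scripted before anyone touches Caddy. This is a rough sketch using only the standard library; the function names are invented here, and the paths you probe depend on which routes are safe to hit unauthenticated, which this document does not specify.

```python
import urllib.error
import urllib.request


def probe(base_url, paths, timeout=5.0):
    """Return {path: HTTP status, or None if the server could not be reached}."""
    results = {}
    for path in paths:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as exc:
            # The server answered with an error status; record it.
            results[path] = exc.code
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, DNS failure, and similar.
            results[path] = None
    return results


def ready_for_cutover(results):
    """True only if every probed path was answered and none returned a 5xx."""
    return all(status is not None and status < 500 for status in results.values())
```

For example, `ready_for_cutover(probe("http://127.0.0.1:8002", ["/app/", "/api/"]))` would gate step 4, with the two paths standing in for whatever routes the team chooses to check. Treating 4xx as acceptable is a deliberate choice in this sketch: a 401 or 404 still shows that the new process is routing requests.
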
### Caddyfile management (resolved)

The Caddyfile lives on proxy-1 and is managed separately from this repo. During cutover, the proxy-1 agent needs one change: route `robot.wtf/auth/*` to port 8002 instead of 8003. Timing: after `robot-platform` is verified on 8002, before `robot-auth` is stopped.

### Hardcoded service references (important)

- `admin_stats()` hardcodes `["robot-otterwiki", "robot-mcp", "robot-api", "robot-auth"]` for systemctl/journalctl; must update
- `smoke-test.sh` hardcodes port 8003 checks; must update
- `healthcheck` role defaults reference port 8003; must update

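A quick scan can confirm that no such reference is left behind after the updates. The sketch below is an assumption about how one might check, not an existing script: the function name, the file suffixes, and the two patterns (the retired unit name and port) are picked from the findings above.

```python
import re
from pathlib import Path

# Patterns taken from the findings above: the retired unit name and its port.
STALE_PATTERNS = [re.compile(r"\brobot-auth\b"), re.compile(r"\b8003\b")]


def find_stale_references(root, suffixes=(".py", ".sh", ".yml", ".j2")):
    """Yield (path, line_number, line) for lines that mention robot-auth or port 8003."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(pattern.search(line) for pattern in STALE_PATTERNS):
                yield path, number, line.strip()
```

Running `list(find_stale_references("."))` from the repo root after step 5 of the deployment should return an empty list; anything it reports is a candidate for the "must update" items above. The bare `8003` pattern can match unrelated numbers, so hits need a human look.
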
### Environment variables (important)

`CLIENT_JWK_PATH` is only in `robot-auth.service` env. Must propagate to consolidated service.

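A fail-fast check at startup would surface a missing variable immediately rather than at the first login attempt. This is a sketch under assumptions: the helper name `resolve_client_jwk_path` is hypothetical, and it presumes the factory's `client_jwk_path` argument should take precedence over the environment.

```python
import os
from pathlib import Path


def resolve_client_jwk_path(explicit=None, environ=None):
    """Return the client JWK path, preferring an explicit argument over the environment."""
    environ = os.environ if environ is None else environ
    raw = explicit or environ.get("CLIENT_JWK_PATH")
    if not raw:
        raise RuntimeError(
            "CLIENT_JWK_PATH is not set; copy it from the robot-auth unit's environment"
        )
    path = Path(raw)
    if not path.is_file():
        raise RuntimeError(f"CLIENT_JWK_PATH points at a missing file: {path}")
    return path
```

Called once inside the app factory, this turns a forgotten `Environment=` line in the new unit file into an immediate startup failure that shows up in the service logs.
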
### ProxyFix (minor)

Apply once at outermost WSGI layer. Don't accidentally apply twice.

### Python packages (resolved)

`app/auth/` and `app/management/` stay as-is. They represent distinct domains (auth infrastructure vs wiki lifecycle CRUD). The consolidation merges entry points, not packages. `platform_server.py` imports from both.

### Entanglement (resolved)

Only 2 systemd service files and 3 test files reference `auth_server`/`api_server`. No other app code imports from them. Ansible handlers for service restarts need updating. Clean separation, low-risk rename.

### Confirmed correct

- ManagementMiddleware passes through `/auth/*` (lines 162-165)
- Flask-Limiter + WSGIRateLimiter coexist without conflict
- Static file serving has no conflicts
