Properties

category: design
tags: [architecture, infrastructure, refactoring]
last_updated: 2026-03-19
confidence: high

Server Consolidation: Merge auth_server + api_server

Merge the auth service (port 8003) and management/API service (port 8002) into a single Flask app on port 8002. Otterwiki (port 8000) and MCP sidecar (port 8001) remain separate.

Motivation

auth_server and api_server are tightly coupled:

Same database (robot.db)
Same models (UserModel, WikiModel)
Same signing keys (RSA for JWT, EC for ATProto client)
Same cookie (platform_token on .robot.wtf)
Same user identity model (DID-based)

The separation creates concrete problems:

E2E testing: Cookies set on port 8003 aren't sent to port 8002. Standing up two Flask servers in test fixtures requires cross-thread SQLite coordination. An implementation agent burned its entire context trying to solve this.
Operational overhead: Two systemd services, two Gunicorn configs, two health checks for what is logically one "platform" service.
Code duplication: Both apps call _load_keys(), get_connection(), init_schema() independently. Both have their own rate limiting setup.

Current Architecture

Caddy (TLS, port 80/443)
├─ robot.wtf/auth/*          → robot-auth (Gunicorn, port 8003, auth_server.py)
├─ robot.wtf/app/*           → robot-api  (Gunicorn, port 8002, api_server.py)
├─ robot.wtf/api/*           → robot-api  (port 8002)
├─ {slug}.robot.wtf/mcp      → robot-mcp  (uvicorn, port 8001)
├─ {slug}.robot.wtf/api/v1/* → robot-otterwiki (Gunicorn, port 8000, wsgi.py)
└─ {slug}.robot.wtf/*        → robot-otterwiki (port 8000)

Four processes: three Gunicorn entry points (auth_server:application, api_server:application, wsgi:application) plus the MCP sidecar.

Target Architecture

Caddy (TLS, port 80/443)
├─ robot.wtf/auth/*          → robot-platform (Gunicorn, port 8002, platform_server.py)
├─ robot.wtf/app/*           → robot-platform (port 8002)
├─ robot.wtf/api/*           → robot-platform (port 8002)
├─ {slug}.robot.wtf/mcp      → robot-mcp  (uvicorn, port 8001)
├─ {slug}.robot.wtf/api/v1/* → robot-otterwiki (Gunicorn, port 8000, wsgi.py)
└─ {slug}.robot.wtf/*        → robot-otterwiki (port 8000)

Three processes, two entry points. The Caddy route structure is unchanged: /auth/* and /app/* already route to the same IP on different ports, so pointing /auth/* at port 8002 is a one-line edit.

What Changes

New: `app/platform_server.py`

Single Flask app factory that combines auth and management routes:

from flask import Flask


def create_app(*, db_path=None, client_jwk_path=None, signing_key_path=None):
    app = Flask(__name__, template_folder="templates")
    # templates/auth/ — login.html, consent.html, error.html, base.html
    # templates/management/ — layout.html, wiki_create.html, etc.

    # Shared setup: secret key, keys, DB, models, rate limiter
    ...

    # Auth routes (/auth/*)
    _register_auth_routes(app, platform_jwt, client_secret_jwk, ...)

    # Management UI routes (/app/*)
    _register_management_ui_routes(app, platform_jwt, wiki_model, user_model, ...)

    # Management API routes (/api/*)
    _register_management_api_routes(app, wiki_model, user_model, ...)

    # Well-known routes
    _register_wellknown_routes(app, ...)

    return app

The route registration functions extract the existing route definitions from auth_server.py and api_server.py into callable functions that take a Flask app and shared dependencies as arguments.
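
A minimal sketch of that extraction pattern, assuming Flask is installed. The route paths, the FakeUserModel class, its get_user method, and the create_demo_app name are hypothetical stand-ins, and the argument lists are shortened compared with the factory above. The only point is that each group of routes is defined inside a function that closes over the app and the shared dependencies it receives.

```python
from flask import Flask, jsonify


class FakeUserModel:
    """Hypothetical stand-in for the real UserModel."""

    def get_user(self, did):
        return {"did": did}


def _register_auth_routes(app, user_model):
    """Attach auth routes to `app`, closing over the shared dependencies."""

    @app.route("/auth/whoami/<did>")
    def auth_whoami(did):
        return jsonify(user_model.get_user(did))


def _register_management_api_routes(app, user_model):
    """Attach management API routes to the same app and the same model."""

    @app.route("/api/users/<did>")
    def api_user(did):
        return jsonify(user_model.get_user(did))


def create_demo_app():
    app = Flask(__name__)
    user_model = FakeUserModel()  # one instance shared by both route groups
    _register_auth_routes(app, user_model)
    _register_management_api_routes(app, user_model)
    return app
```

Because the dependencies are passed in rather than created at import time, tests can hand the registration functions a fake model or a temporary database.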

Removed

app/auth_server.py — routes moved to platform_server.py
ansible/roles/deploy/templates/robot-auth.service.j2 — systemd service removed
ansible/roles/deploy/templates/gunicorn-auth.conf.py.j2 — Gunicorn config removed

Modified

app/api_server.py → renamed/merged into platform_server.py
ansible/roles/deploy/templates/Caddyfile.j2 — remove auth port, route /auth/* to port 8002
ansible/roles/deploy/tasks/main.yml — remove auth service deployment
app/auth/templates/ — move to app/templates/auth/
app/management/templates/ — move to app/templates/management/

Unchanged

app/wsgi.py — otterwiki entry point, completely independent
app/resolver.py — TenantResolver wraps otterwiki, not the platform service
app/management/routes.py — ManagementMiddleware still wraps the platform Flask app
All auth logic — unchanged, just relocated
All management logic — unchanged
Database schema — unchanged
MCP sidecar — unchanged

ManagementMiddleware Handling

Currently, api_server.py wraps the Flask app with ManagementMiddleware (a WSGI middleware that intercepts /api/* for rate limiting and auth). Auth routes don't go through this middleware.

After consolidation, ManagementMiddleware still wraps the combined Flask app. It already passes through paths it doesn't handle — /auth/* routes will pass through to Flask unchanged. No middleware changes needed.

Verify by reading ManagementMiddleware's __call__ — it only intercepts paths matching its configured prefixes (/api/). All other paths pass to the wrapped app.
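
The pass-through behavior described above can be sketched with a plain WSGI middleware. PrefixMiddleware is a hypothetical stand-in, not the real ManagementMiddleware, and the response bodies are placeholders; it only illustrates prefix-based interception with everything else forwarded to the wrapped app.

```python
from wsgiref.util import setup_testing_defaults


class PrefixMiddleware:
    """Hypothetical stand-in: handles configured prefixes, forwards the rest."""

    def __init__(self, app, prefixes=("/api/",)):
        self.app = app
        self.prefixes = tuple(prefixes)

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.startswith(self.prefixes):
            body = b"handled by middleware"
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [body]
        # Any other path (for example /auth/login) reaches the wrapped app.
        return self.app(environ, start_response)


def inner_app(environ, start_response):
    """Placeholder for the wrapped Flask app."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"handled by wrapped app"]


def call(app, path):
    """Invoke a WSGI app for `path` and return the response body."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    return b"".join(app(environ, lambda status, headers: None))


wrapped = PrefixMiddleware(inner_app)
```

Calling call(wrapped, "/auth/login") exercises the forwarding branch, while call(wrapped, "/api/wikis") exercises the intercepting branch.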

Template Directory Structure

Before:

app/auth/templates/     — base.html, login.html, consent.html, error.html
app/management/templates/ — layout.html, wiki_create.html, wiki_settings.html, account.html

After:

app/templates/
  auth/        — base.html, login.html, consent.html, error.html
  management/  — layout.html, wiki_create.html, wiki_settings.html, account.html

Template references in route code change from render_template("login.html") to render_template("auth/login.html"). Mechanical find-and-replace.
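
One way to script that find-and-replace is a small regex rewrite. This is a sketch under assumptions: the prefix_templates helper is hypothetical, and it only handles calls whose first argument is a string literal. Names that already contain a slash are left alone, so running it twice is harmless.

```python
import re

# Matches render_template("name.html" or render_template('name.html'
# where the name has no directory component yet.
_RENDER_CALL = re.compile(r"""render_template\(\s*(["'])([^"'/]+\.html)\1""")


def prefix_templates(source, subdir, names):
    """Prefix the listed template names with `subdir/` in render_template calls."""

    def repl(match):
        quote, name = match.group(1), match.group(2)
        if name not in names:
            return match.group(0)  # belongs to another package, leave as-is
        return f"render_template({quote}{subdir}/{name}{quote}"

    return _RENDER_CALL.sub(repl, source)
```

For example, prefix_templates(code, "auth", {"login.html", "consent.html", "error.html"}) would be run over the auth route code, and a second call with "management" over the management routes.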

Database Connection Strategy

Both apps currently obtain SQLite connections through get_connection(), which opens a new connection on every call. The consolidated app opens one connection per request, stored on Flask's g object and closed in teardown_appcontext.

auth_server already works this way (_get_db() stores the connection in g._database), which is cleaner than api_server's approach of opening a connection at startup. Adopt the per-request pattern throughout.
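
A minimal sketch of the per-request pattern, assuming Flask is installed. The _get_db and g._database names come from the description above; create_db_demo_app, the /ping route, and the direct sqlite3.connect call (in place of the project's get_connection()) are illustrative assumptions.

```python
import sqlite3

from flask import Flask, g


def create_db_demo_app(db_path):
    app = Flask(__name__)

    def _get_db():
        # Open lazily, at most once per application context (one per request).
        if "_database" not in g:
            g._database = sqlite3.connect(db_path)
        return g._database

    @app.teardown_appcontext
    def _close_db(exc):
        # Runs when the context is torn down, whether or not the request failed.
        db = g.pop("_database", None)
        if db is not None:
            db.close()

    @app.route("/ping")
    def ping():
        (value,) = _get_db().execute("SELECT 1").fetchone()
        return str(value)

    return app
```

Since each request opens and closes its own connection on the thread that serves it, no connection object is shared across threads.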

Rate Limiting

Auth routes: Flask-Limiter with per-route decorators (@limiter.limit("1/minute"))
Management API routes: WSGIRateLimiter singleton in ManagementMiddleware

Both can coexist — Flask-Limiter operates at the Flask level, WSGIRateLimiter at the WSGI level. No conflict.

Session and Cookie Handling

One Flask app = one secret_key = one session. The platform_token cookie is set with domain=COOKIE_DOMAIN, which is the same regardless of which routes set it. No changes needed.

E2E Testing Impact

The consolidation directly unblocks E2E testing:

One server fixture instead of two
Cookies work naturally (same origin)
No SQLite cross-thread issues
authenticated_page fixture just logs in and the cookie works for all routes
The 11 planned E2E tests become straightforward
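
The single-server idea can be sketched with the standard library alone. Everything here is a stand-in: demo_platform is a toy WSGI app rather than the consolidated Flask app, the cookie value is a placeholder, and live_server is a hypothetical helper that a pytest fixture could wrap.

```python
import threading
import urllib.request
from contextlib import contextmanager
from http.cookiejar import CookieJar
from wsgiref.simple_server import WSGIRequestHandler, make_server


def demo_platform(environ, start_response):
    """Toy app: /auth/login sets the cookie, every other path checks it."""
    headers = [("Content-Type", "text/plain")]
    if environ["PATH_INFO"] == "/auth/login":
        headers.append(("Set-Cookie", "platform_token=demo; Path=/"))
        body = b"logged in"
    else:
        cookie = environ.get("HTTP_COOKIE", "")
        body = b"authenticated" if "platform_token=demo" in cookie else b"anonymous"
    start_response("200 OK", headers)
    return [body]


class _QuietHandler(WSGIRequestHandler):
    def log_message(self, *args):
        pass  # keep test output clean


@contextmanager
def live_server(app):
    """Serve `app` on an ephemeral local port for the duration of the block."""
    server = make_server("127.0.0.1", 0, app, handler_class=_QuietHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_port}"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
```

A test would open /auth/login and then /app/ through one cookie-aware client against the single base URL, with no second server to coordinate.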

Implementation Sequence

Create app/platform_server.py with combined app factory
Move templates to app/templates/{auth,management}/
Update render_template() calls with subdirectory prefixes
Verify all existing unit tests pass against the new structure
Update Ansible: remove auth service, update Caddy routes
Deploy and verify
Remove old auth_server.py and api_server.py
Resume E2E test implementation with simplified fixtures

Status: COMPLETE — Deployed to production 2026-03-20. Caddy updated on proxy-1. All services healthy.

Risks and Review Findings

Plan reviewed 2026-03-19. Core approach confirmed sound. Key findings:

Template migration (important)

The migration covers more than render_template() calls. The {% extends %} directives inside the templates must also be updated:

Auth templates: {% extends 'base.html' %} → {% extends 'auth/base.html' %}
Management templates: {% extends "layout.html" %} → {% extends "management/layout.html" %}

6 auth + 4 management template files affected.
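
The directive rewrite can be scripted in the same spirit. This is a sketch, with prefix_extends as a hypothetical helper: it accepts either quote style, tolerates the {%- whitespace-control form, and skips names that already contain a slash.

```python
import re

# Captures the directive opening, the quote character, and a template name
# that has no directory component yet.
_EXTENDS = re.compile(r"""(\{%-?\s*extends\s+)(["'])([^"'/]+)\2""")


def prefix_extends(template_source, subdir):
    """Rewrite {% extends 'x.html' %} to {% extends 'subdir/x.html' %}."""

    def repl(match):
        opening, quote, name = match.groups()
        return f"{opening}{quote}{subdir}/{name}{quote}"

    return _EXTENDS.sub(repl, template_source)
```

Each template file's text would be passed through prefix_extends with "auth" or "management", depending on which directory the file moves to.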

App factory interface (resolved)

Single create_app() factory returns the fully-wrapped WSGI app. Module-level application = create_app() for Gunicorn (app.platform_server:application).

Auth server applies ProxyFix inside Flask; API server applies it outermost. Consolidated app: apply once, outermost (api_server pattern).
Tests call create_app(db_path=..., ...) with overrides as they do today.
Actual import callsites across tests: ~17 (not ~46 as initially estimated). 3 test files, straightforward find-and-replace.
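
A minimal sketch of the wrapping order, assuming Flask and Werkzeug are installed. PassthroughMiddleware stands in for ManagementMiddleware, the /auth/scheme route is hypothetical, and the x_for=1 and x_proto=1 settings assume a single trusted proxy in front of the service.

```python
from flask import Flask, request
from werkzeug.middleware.proxy_fix import ProxyFix


class PassthroughMiddleware:
    """Hypothetical stand-in for ManagementMiddleware; forwards every request."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        return self.app(environ, start_response)


def create_wrapped_app():
    flask_app = Flask(__name__)

    @flask_app.route("/auth/scheme")
    def scheme():
        # Reports the scheme Flask sees after ProxyFix has rewritten the environ.
        return request.scheme

    wsgi_app = PassthroughMiddleware(flask_app)
    # ProxyFix is applied exactly once, as the outermost layer.
    return ProxyFix(wsgi_app, x_for=1, x_proto=1)


# Gunicorn would load this as "<module>:application".
application = create_wrapped_app()
```

With ProxyFix outermost, both the middleware and the Flask routes see the corrected scheme and client address.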

Error handlers (resolved)

The 429 handler (content-negotiated JSON/HTML) is duplicated verbatim in both apps. The 400 and 500 handlers exist only in auth_server and render HTML via error.html. There is no conflict: API routes go through ManagementMiddleware (raw WSGI, handles its own errors), so Flask error handlers only fire for /auth/* and /app/* routes. Keep one set of handlers and move error.html to app/templates/error.html.
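
A content-negotiated 429 handler might look roughly like the sketch below, assuming Flask is installed. The JSON payload, the inline HTML (the real handler renders a template), and the /auth/limited route are illustrative assumptions.

```python
from flask import Flask, abort, jsonify, request


def register_error_handlers(app):
    """Register a single 429 handler that answers in JSON or HTML."""

    @app.errorhandler(429)
    def too_many_requests(exc):
        best = request.accept_mimetypes.best_match(["application/json", "text/html"])
        if best == "application/json":
            return jsonify(error="rate_limited"), 429
        # Inline HTML keeps the sketch self-contained.
        return "<h1>Too many requests</h1>", 429


def create_error_demo_app():
    app = Flask(__name__)
    register_error_handlers(app)

    @app.route("/auth/limited")
    def limited():
        abort(429)  # simulate a rate-limit rejection

    return app
```

Registering the handlers once in the factory gives /auth/* and /app/* routes the same error behavior.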

Deployment strategy (important)

Correct zero-downtime approach:

Deploy robot-platform on port 8002 (new service)
Keep robot-auth on 8003 running temporarily
Verify platform handles /app/* and /api/*
Update Caddy to route /auth/* to 8002
Stop and disable robot-auth

Caddyfile management (resolved)

The Caddyfile lives on proxy-1 and is managed separately from this repo. During cutover, the proxy-1 agent needs one change: route robot.wtf/auth/* to port 8002 instead of 8003. Timing: after robot-platform is verified on 8002, before robot-auth is stopped.

Hardcoded service references (important)

admin_stats() hardcodes ["robot-otterwiki", "robot-mcp", "robot-api", "robot-auth"] for systemctl/journalctl — must update
smoke-test.sh hardcodes port 8003 checks — must update
healthcheck role defaults reference port 8003 — must update
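
One way to keep the service list from drifting again is a single constant that the stats code builds its commands from. This is only a sketch: PLATFORM_SERVICES and status_commands are hypothetical names, and the real admin_stats() may be structured differently.

```python
# Post-consolidation service names, taken from the target architecture.
PLATFORM_SERVICES = ["robot-otterwiki", "robot-mcp", "robot-platform"]


def status_commands(services=PLATFORM_SERVICES):
    """Build one `systemctl is-active` argument list per service."""
    return [["systemctl", "is-active", name] for name in services]
```

Each argument list could then be passed to subprocess.run on the host.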

Environment variables (important)

CLIENT_JWK_PATH is set only in the robot-auth.service environment. It must be carried over to the consolidated service.
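
A startup check can surface a missing variable before the first login attempt fails. This is a sketch: missing_env and REQUIRED_ENV are hypothetical names, and only CLIENT_JWK_PATH is listed because it is the only variable named here.

```python
import os

# Variables the consolidated service cannot run without.
REQUIRED_ENV = ("CLIENT_JWK_PATH",)


def missing_env(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV if not environ.get(name)]
```

The app factory could call missing_env() and raise at startup if the list is non-empty, which turns a misconfigured unit into an immediate, visible failure.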

ProxyFix (minor)

Apply once at outermost WSGI layer. Don't accidentally apply twice.

Python packages (resolved)

app/auth/ and app/management/ stay as-is. They represent distinct domains (auth infrastructure vs wiki lifecycle CRUD). The consolidation merges entry points, not packages. platform_server.py imports from both.

Entanglement (resolved)

Only 2 systemd service files and 3 test files reference auth_server/api_server. No other app code imports from them. Ansible handlers for service restarts need updating. Clean separation, low-risk rename.

Confirmed correct

ManagementMiddleware passes through /auth/* (lines 162-165)
Flask-Limiter + WSGIRateLimiter coexist without conflict
Static file serving has no conflicts