Commit 6a6f0e

2026-03-19 05:14:02 Claude (MCP): [mcp] Add custom domain design document
/dev/null .. Design/Custom_Domains.md
@@ 0,0 1,145 @@
+ ---
+ category: design
+ tags: [infrastructure, caddy, auth, multi-tenant]
+ last_updated: 2026-03-19
+ confidence: medium
+ ---
+
+ # Custom Domains
+
+ Allow users to serve their wiki from a domain they own (e.g., `wiki.example.com`) instead of `{slug}.robot.wtf`.
+
+ ## Scope
+
+ - Subdomains only for v1 (e.g., `wiki.example.com`). Apex domains (`example.com`) cannot carry a CNAME and require ALIAS/ANAME records, which are provider-dependent and not universally supported.
+ - One custom domain per wiki. The schema supports multiple, but the UI enforces one. Can relax later.
+ - MCP works unchanged through custom domains (bearer token auth, no cookie dependency).
+
+ ## Database Schema
+
+ New `custom_domains` table in `robot.db`:
+
+ ```sql
+ CREATE TABLE IF NOT EXISTS custom_domains (
+ domain TEXT PRIMARY KEY,
+ wiki_slug TEXT NOT NULL REFERENCES wikis(slug) ON DELETE CASCADE,
+ verification_status TEXT NOT NULL DEFAULT 'pending', -- pending | verified | active
+ verification_token TEXT NOT NULL,
+ verified_at TEXT,
+ created_at TEXT NOT NULL
+ );
+ CREATE INDEX IF NOT EXISTS ix_custom_domains_slug ON custom_domains(wiki_slug);
+ ```
+
+ Separate table (not a column on `wikis`) because domain verification has its own lifecycle and metadata.
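
A minimal sketch of creating the table and inserting a `pending` row, assuming `robot.db` is SQLite (suggested by the file name and the `TEXT` timestamps, but not stated). `open_db` and `add_pending_domain` are hypothetical helper names, and the one-column `wikis` table is a stand-in for the real one.

```python
import secrets
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS wikis (slug TEXT PRIMARY KEY);  -- stand-in for the real table
CREATE TABLE IF NOT EXISTS custom_domains (
    domain TEXT PRIMARY KEY,
    wiki_slug TEXT NOT NULL REFERENCES wikis(slug) ON DELETE CASCADE,
    verification_status TEXT NOT NULL DEFAULT 'pending',
    verification_token TEXT NOT NULL,
    verified_at TEXT,
    created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS ix_custom_domains_slug ON custom_domains(wiki_slug);
"""


def open_db(path=":memory:"):
    """Open the database and make sure the schema exists (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # SQLite only enforces ON DELETE CASCADE when foreign keys are enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_pending_domain(conn, wiki_slug, domain):
    """Store a new domain as 'pending' and return its verification token."""
    token = secrets.token_urlsafe(32)
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO custom_domains (domain, wiki_slug, verification_token, created_at) "
            "VALUES (?, ?, ?, ?)",
            (domain.lower(), wiki_slug, token, created_at),
        )
    return token
```

If the database is SQLite, the `PRAGMA foreign_keys = ON` line matters: without it, deleting a wiki would leave its custom domain rows behind.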
+
+ ## DNS Verification
+
+ User must create two DNS records:
+
+ 1. **CNAME**: `wiki.example.com CNAME {slug}.robot.wtf.` (routes traffic)
+ 2. **TXT**: `_robotwtf-verify.wiki.example.com TXT "robotwtf-verify={verification_token}"` (proves ownership)
+
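+ CNAME alone is insufficient: it only shows where traffic is routed, and anyone who controls a domain's DNS, even temporarily, can point a CNAME at a wiki. The TXT record ties the domain to the account that requested it. The `_robotwtf-verify` name prefix avoids collision with other TXT records.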
+
+ Verification uses `dnspython` (new dependency). Flow:
+ 1. User enters domain in settings UI → server generates token, stores as `pending`
+ 2. UI shows required DNS records
+ 3. User clicks "Verify" → server checks both CNAME and TXT
+ 4. Both pass → status becomes `active`
+
+ Periodic re-verification (cron) to detect removed CNAME records is desirable but not required for v1.
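
The check in step 3 could look roughly like the sketch below. The function names and the injectable `lookup` parameter are assumptions made for illustration (the parameter keeps the logic testable without network access). `dns_lookup` uses `dns.resolver.resolve` from dnspython 2.x and imports it lazily.

```python
def expected_records(domain, slug, token, platform_domain="robot.wtf"):
    """DNS records the user is asked to create (hypothetical helper)."""
    return {
        "cname_target": f"{slug}.{platform_domain}.",
        "txt_name": f"_robotwtf-verify.{domain}",
        "txt_value": f"robotwtf-verify={token}",
    }


def dns_lookup(name, rdtype):
    """Query live DNS with dnspython; returns a list of strings, empty on any DNS error."""
    import dns.exception
    import dns.resolver

    try:
        answer = dns.resolver.resolve(name, rdtype)
    except dns.exception.DNSException:  # NXDOMAIN, NoAnswer, Timeout, ...
        return []
    if rdtype == "TXT":
        # A TXT record may be split into several byte strings; join them.
        return [b"".join(r.strings).decode("ascii", "replace") for r in answer]
    return [r.target.to_text() for r in answer]  # CNAME targets


def verify_domain(domain, slug, token, lookup=dns_lookup):
    """True only when both the CNAME and the TXT record match."""
    want = expected_records(domain, slug, token)
    cnames = [c.lower().rstrip(".") for c in lookup(domain, "CNAME")]
    cname_ok = want["cname_target"].rstrip(".") in cnames
    txt_ok = want["txt_value"] in lookup(want["txt_name"], "TXT")
    return cname_ok and txt_ok
```

Returning an empty list on DNS errors means a record that has not propagated yet simply fails the check, and the user can retry.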
+
+ ## TLS (Caddy)
+
+ Caddy's on-demand TLS with the existing `ask` endpoint handles this. Modify `/api/internal/check-slug` to also accept custom domains:
+
+ 1. If domain ends with `.{PLATFORM_DOMAIN}`, do existing slug lookup
+ 2. Otherwise, look up domain in `custom_domains` where `verification_status = 'active'`
+ 3. Return 200 if found, 404 if not
+
+ Caddy automatically obtains Let's Encrypt certificates for any domain that passes the ask check. No Caddyfile changes are needed beyond ensuring the on-demand TLS block is configured (it may already be).
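
The three-step decision above can be sketched as a pure function. Caddy calls the `ask` endpoint with the hostname in a `domain` query parameter. How the existing slug lookup works is not shown in this document, so the `wikis` query here is an assumption, as are the names `check_domain` and `PLATFORM_DOMAIN`.

```python
import sqlite3

PLATFORM_DOMAIN = "robot.wtf"  # assumed value of the platform setting


def check_domain(conn, domain):
    """Return the HTTP status the ask endpoint would send for `domain`."""
    domain = domain.lower().rstrip(".")
    suffix = "." + PLATFORM_DOMAIN
    if domain.endswith(suffix):
        # Platform subdomain: existing slug lookup (assumed to be a wikis query).
        slug = domain[: -len(suffix)]
        row = conn.execute("SELECT 1 FROM wikis WHERE slug = ?", (slug,)).fetchone()
    else:
        # Anything else must be an active custom domain.
        row = conn.execute(
            "SELECT 1 FROM custom_domains "
            "WHERE domain = ? AND verification_status = 'active'",
            (domain,),
        ).fetchone()
    return 200 if row else 404
```

Pending domains return 404, so Caddy never requests a certificate for a domain that has not passed verification.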
+
+ ## Tenant Resolution
+
+ `TenantResolver.__call__()` gains a fallback path:
+
+ 1. Try `_parse_host(host)` as today → returns slug for `{slug}.robot.wtf`
+ 2. If None, look up host in `custom_domains` where status is `active`
+ 3. If found, use the associated `wiki_slug`
+ 4. If neither, 404
+
+ Performance: in-memory cache (`{domain: slug}` dict) with 60-second TTL. Invalidated on domain add/remove in the worker that handles the change. Each gunicorn worker maintains its own cache, so other workers pick up the change only when their entry expires; the short TTL makes this acceptable.
+
+ Set `environ['CUSTOM_DOMAIN'] = domain` when serving via custom domain so downstream code (auth, link generation) can detect it.
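
A sketch of the fallback path and per-worker TTL cache described above. `DomainCache` and `resolve_slug` are hypothetical names, not the real `TenantResolver` internals, and `load` stands in for the `custom_domains` query. The sketch also caches misses, which is a design choice not specified here.

```python
import time


class DomainCache:
    """Per-process {domain: slug} cache with a fixed TTL."""

    def __init__(self, load, ttl=60.0, clock=time.monotonic):
        self._load = load      # callable: domain -> slug or None (the DB lookup)
        self._ttl = ttl
        self._clock = clock
        self._entries = {}     # domain -> (slug_or_None, expires_at)

    def get(self, domain):
        now = self._clock()
        hit = self._entries.get(domain)
        if hit and hit[1] > now:
            return hit[0]
        slug = self._load(domain)
        # Misses are cached too, so unknown hosts do not hit the DB on every request.
        self._entries[domain] = (slug, now + self._ttl)
        return slug

    def invalidate(self, domain):
        """Drop one entry; only affects the worker that calls it."""
        self._entries.pop(domain, None)


def resolve_slug(host, cache, platform_domain="robot.wtf"):
    """Return (slug, custom_domain); custom_domain is None for platform hosts."""
    host = host.split(":")[0].lower()  # strip an optional port
    suffix = "." + platform_domain
    if host.endswith(suffix):
        return host[: -len(suffix)], None
    slug = cache.get(host)
    return (slug, host) if slug else (None, None)
```

A caller would set `environ['CUSTOM_DOMAIN']` whenever the second element of the result is not `None`, and return 404 when the slug is `None`.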
+
+ ## Authentication on Custom Domains
+
+ This is the hard part. The `platform_token` cookie is set on `.robot.wtf` and won't be sent to `wiki.example.com`.
+
+ ### Solution: Redirect-based auth relay
+
+ Standard pattern used by GitHub Pages, Notion, etc.
+
+ 1. Unauthenticated user visits `wiki.example.com`
+ 2. Wiki requires auth → redirect to `https://robot.wtf/auth/login?return_to=https://wiki.example.com/...`
+ 3. User authenticates on `robot.wtf` (cookie set on `.robot.wtf`)
+ 4. Auth callback detects `return_to` is a custom domain
+ 5. Generates a **relay token**: signed JWT with user claims, `domain` claim, 60-second expiry, single-use nonce
+ 6. Redirects to `https://wiki.example.com/_auth/relay?token={relay_token}`
+ 7. `/_auth/relay` handler validates token (signature, expiry, domain match, nonce), sets `platform_token` cookie scoped to `wiki.example.com`, redirects to original page
+
+ ### Auth changes required
+
+ - `_is_safe_return_url()` must accept verified custom domains (query `custom_domains`)
+ - Auth callback generates relay token when `return_to` is a custom domain
+ - New `/_auth/relay` route in resolver (or dedicated handler)
+ - `TenantResolver._resolve_auth()` checks domain-scoped cookie (same name `platform_token`, browser sends the right one based on domain)
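
The host check for `return_to` might look like the following sketch. It is not the existing `_is_safe_return_url()`, whose current behavior is not shown here; it covers only absolute `https` URLs, and `is_active_custom_domain` is a hypothetical callable wrapping the `custom_domains` query.

```python
from urllib.parse import urlsplit


def is_safe_return_url(url, is_active_custom_domain, platform_domain="robot.wtf"):
    """Accept only https URLs on the platform domain or an active custom domain."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return False
    if parts.scheme != "https" or not host:
        return False
    host = host.lower()
    if host == platform_domain or host.endswith("." + platform_domain):
        return True
    return bool(is_active_custom_domain(host))
```

Comparing the parsed hostname, rather than searching the raw string, avoids being fooled by URLs such as `https://robot.wtf@evil.example/`.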
+
+ ### Relay token security
+
+ - Signed with the platform's RSA key (same as `PlatformJWT`)
+ - 60-second expiry
+ - Single-use: nonce stored in DB, consumed on use
+ - Domain-bound: `domain` claim must match the request's Host header
+ - No open redirect: final redirect path embedded in token, validated
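
The sketch below illustrates the checks listed above (signature, expiry, domain binding, single use, embedded path). To stay within the standard library it signs with HMAC-SHA256 as a stand-in; the design calls for the platform's RSA key, so this is not the intended signing scheme. The function names, the claim names other than `domain`, and the in-memory `used_nonces` set (standing in for the DB nonce table) are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(key, body):
    # HMAC stand-in for the RSA signature described in the design.
    return _b64(hmac.new(key, body.encode(), hashlib.sha256).digest())


def issue_relay_token(key, user, domain, path, now=None):
    """Create a 60-second, domain-bound token carrying the final redirect path."""
    now = time.time() if now is None else now
    claims = {"sub": user, "domain": domain.lower(), "path": path,
              "exp": int(now) + 60, "nonce": secrets.token_urlsafe(16)}
    body = _b64(json.dumps(claims).encode())
    return f"{body}.{_sign(key, body)}"


def redeem_relay_token(key, token, request_host, used_nonces, now=None):
    """Return the claims if the token is valid and unused, else None."""
    now = time.time() if now is None else now
    try:
        body, sig = token.split(".")
    except ValueError:
        return None
    # Verify the signature before parsing anything from the body.
    if not hmac.compare_digest(sig.encode(), _sign(key, body).encode()):
        return None
    claims = json.loads(_unb64(body))
    if claims["exp"] < now or claims["domain"] != request_host.lower():
        return None
    if not claims["path"].startswith("/") or claims["path"].startswith("//"):
        return None  # only same-site paths, never a full or scheme-relative URL
    if claims["nonce"] in used_nonces:
        return None  # replay
    used_nonces.add(claims["nonce"])
    return claims
```

With a real database, the nonce check and insert would need to be one atomic operation (for example a unique-key insert), so two concurrent requests cannot both redeem the same token.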
+
+ ### Logout
+
+ Logging out on `robot.wtf` clears the `.robot.wtf` cookie but not the `wiki.example.com` cookie. Mitigation: set custom domain cookies with a 1-hour max-age (vs 24h for the platform cookie). Stale sessions are short-lived.
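
As an illustration of the shorter lifetime, the relay handler's `Set-Cookie` header could be built as in this sketch. `relay_cookie_header` is a hypothetical helper, and the `Secure`, `HttpOnly`, and `SameSite=Lax` attributes are assumptions not specified in this document.

```python
from http.cookies import SimpleCookie


def relay_cookie_header(token, max_age=3600):
    """Build the Set-Cookie value for a custom-domain session (1 hour by default)."""
    cookie = SimpleCookie()
    cookie["platform_token"] = token
    morsel = cookie["platform_token"]
    morsel["max-age"] = max_age
    morsel["path"] = "/"
    morsel["secure"] = True
    morsel["httponly"] = True
    morsel["samesite"] = "Lax"
    # No Domain attribute: the cookie is host-only, scoped to the custom domain.
    return morsel.OutputString()
```

Omitting the `Domain` attribute keeps the cookie bound to the exact host that served the relay response, which matches the "scoped to `wiki.example.com`" requirement.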
+
+ ## Management UI
+
+ "Custom Domain" card on `wiki_settings.html`:
+
+ **No domain configured:**
+ - Text input + "Add Domain" button
+
+ **Pending verification:**
+ - Show required DNS records (copyable)
+ - "Check DNS" button
+ - "Remove" button
+
+ **Active:**
+ - Domain with green status badge
+ - "Remove" button
+
+ Backend routes:
+ - `POST /app/wiki/<slug>/domain` — add
+ - `POST /app/wiki/<slug>/domain/verify` — check DNS
+ - `POST /app/wiki/<slug>/domain/remove` — remove
+
+ ## Implementation Phases
+
+ 1. Schema + `CustomDomainModel` + DNS verification logic + tests
+ 2. Modify `check-slug` endpoint for Caddy integration
+ 3. Resolver custom domain lookup + cache
+ 4. Auth relay (hardest phase)
+ 5. Management UI
+
+ ## Risks
+
+ - **Auth relay is a new attack surface.** Must be cryptographically signed, time-limited, single-use, domain-bound.
+ - **DNS propagation delays.** Users may verify before records propagate. UI should explain this and allow re-checking.
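+ - **Let's Encrypt rate limits.** 50 certs per registered domain per week, counted against each customer's registered domain (e.g., `example.com`) rather than `robot.wtf`. Unlikely to be hit at current scale.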
+ - **Cache invalidation across workers.** Short TTL (60s) is the simplest correct approach.