---
category: design
tags: [infrastructure, caddy, auth, multi-tenant]
last_updated: 2026-03-19
confidence: medium
---

# Custom Domains

Allow users to serve their wiki from a domain they own (e.g., `wiki.example.com`) instead of `{slug}.robot.wtf`.

## Scope

- Subdomains only for v1 (e.g., `wiki.example.com`). Apex domains (`example.com`) require ALIAS/ANAME records, which are provider-dependent and not universally supported.
- One custom domain per wiki. The schema supports multiple, but the UI enforces one. This can be relaxed later.
- MCP works unchanged through custom domains (bearer token auth, no cookie dependency).

## Database Schema

New `custom_domains` table in `robot.db`:

```sql
CREATE TABLE IF NOT EXISTS custom_domains (
    domain TEXT PRIMARY KEY,
    wiki_slug TEXT NOT NULL REFERENCES wikis(slug) ON DELETE CASCADE,
    verification_status TEXT NOT NULL DEFAULT 'pending', -- pending | verified | active
    verification_token TEXT NOT NULL,
    verified_at TEXT,
    created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS ix_custom_domains_slug ON custom_domains(wiki_slug);
```

Separate table (not a column on `wikis`) because domain verification has its own lifecycle and metadata.

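As a rough illustration of how a row might enter this table, the sketch below inserts a `pending` domain with a freshly generated token. The function name `add_pending_domain`, the 32-byte token length, the hostname normalization, and the ISO-8601 timestamp format are assumptions for illustration, not decisions made in this design.

```python
import secrets
import sqlite3
from datetime import datetime, timezone


def add_pending_domain(conn: sqlite3.Connection, domain: str, wiki_slug: str) -> str:
    """Insert a custom domain in the default 'pending' state; return its token.

    Hypothetical helper: token length and timestamp format are assumptions.
    """
    token = secrets.token_urlsafe(32)
    # Normalize so "Wiki.Example.com." and "wiki.example.com" share one row.
    normalized = domain.strip().rstrip(".").lower()
    conn.execute(
        "INSERT INTO custom_domains (domain, wiki_slug, verification_token, created_at) "
        "VALUES (?, ?, ?, ?)",
        (normalized, wiki_slug, token, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return token
```

Because `domain` is the primary key, a second wiki trying to claim the same hostname would raise `sqlite3.IntegrityError`, which the caller would need to surface as a user-facing error.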
## DNS Verification

The user must create two DNS records:

1. **CNAME**: `wiki.example.com CNAME {slug}.robot.wtf.` (routes traffic)
2. **TXT**: `_robotwtf-verify.wiki.example.com TXT "robotwtf-verify={verification_token}"` (proves ownership)

A CNAME alone is insufficient: anyone could temporarily point a CNAME. The `_robotwtf-verify` prefix on the TXT record avoids collisions with other TXT records.

Verification uses `dnspython` (new dependency). Flow:

1. User enters domain in settings UI → server generates token, stores the row as `pending`
2. UI shows required DNS records
3. User clicks "Verify" → server checks both CNAME and TXT
4. Both pass → status becomes `active`

Periodic re-verification (cron) to detect removed CNAME records is desirable but not required for v1.

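The check in step 3 can be sketched as a pure function that compares looked-up records against the two expected values. To keep the example runnable without network access, the DNS query is injected as a `lookup` callable; in the real service that callable would presumably wrap dnspython's `dns.resolver.resolve(name, rdtype)`. The names `expected_records`, `verify_domain`, and `Lookup` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

PLATFORM_DOMAIN = "robot.wtf"
TXT_PREFIX = "_robotwtf-verify"

# A lookup takes (name, record_type) and returns record values as strings.
# Assumption: production code would back this with dnspython.
Lookup = Callable[[str, str], Iterable[str]]


def _norm(name: str) -> str:
    """Compare hostnames case-insensitively and without the trailing dot."""
    return name.strip().rstrip(".").lower()


def expected_records(domain: str, slug: str, token: str) -> Dict[str, Tuple[str, str]]:
    """Return {record_type: (name, value)} for the records the UI displays."""
    return {
        "CNAME": (domain, f"{slug}.{PLATFORM_DOMAIN}."),
        "TXT": (f"{TXT_PREFIX}.{domain}", f"robotwtf-verify={token}"),
    }


def verify_domain(domain: str, slug: str, token: str, lookup: Lookup) -> bool:
    """True only if both the CNAME and the TXT record match."""
    records = expected_records(domain, slug, token)
    cname_name, cname_target = records["CNAME"]
    txt_name, txt_value = records["TXT"]
    cname_ok = any(
        _norm(value) == _norm(cname_target) for value in lookup(cname_name, "CNAME")
    )
    # TXT values often come back wrapped in double quotes, so strip them.
    txt_ok = any(
        value.strip().strip('"') == txt_value for value in lookup(txt_name, "TXT")
    )
    return cname_ok and txt_ok
```

Injecting the lookup also makes the verification logic straightforward to unit-test with a dictionary of fake records.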
## TLS (Caddy)

Caddy's on-demand TLS with the existing `ask` endpoint handles this. Modify `/api/internal/check-slug` to also accept custom domains:

1. If the domain ends with `.{PLATFORM_DOMAIN}`, do the existing slug lookup
2. Otherwise, look up the domain in `custom_domains` where `verification_status = 'active'`
3. Return 200 if found, 404 if not

Caddy automatically obtains Let's Encrypt certificates for any domain that passes the ask check. No Caddyfile changes are needed beyond ensuring the on-demand TLS block is configured (it may already be).

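The three-step decision above could look roughly like the following, written against plain `sqlite3` so it runs on its own. The function name `ask_status` is hypothetical, and the platform-domain branch assumes the existing slug lookup amounts to an existence check on `wikis.slug`; the real endpoint may do more.

```python
import sqlite3

PLATFORM_DOMAIN = "robot.wtf"


def ask_status(conn: sqlite3.Connection, domain: str) -> int:
    """Return the HTTP status the ask endpoint would send for `domain`."""
    host = domain.strip().rstrip(".").lower()
    suffix = "." + PLATFORM_DOMAIN
    if host.endswith(suffix):
        # Step 1: platform subdomain. Assumed to be a simple slug existence check.
        slug = host[: -len(suffix)]
        row = conn.execute("SELECT 1 FROM wikis WHERE slug = ?", (slug,)).fetchone()
    else:
        # Step 2: custom domain. Only rows that reached 'active' qualify.
        row = conn.execute(
            "SELECT 1 FROM custom_domains "
            "WHERE domain = ? AND verification_status = 'active'",
            (host,),
        ).fetchone()
    # Step 3: 200 lets Caddy request a certificate, 404 refuses.
    return 200 if row else 404
```

Refusing `pending` domains matters here: it keeps Caddy from requesting certificates for hostnames whose ownership has not been proven.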
## Tenant Resolution

`TenantResolver.__call__()` gains a fallback path:

1. Try `_parse_host(host)` as today → returns slug for `{slug}.robot.wtf`
2. If None, look up host in `custom_domains` where status is `active`
3. If found, use the associated `wiki_slug`
4. If neither matches, return 404

Performance: in-memory cache (`{domain: slug}` dict) with a 60-second TTL, invalidated on domain add/remove. Each gunicorn worker maintains its own cache, so an invalidation only reaches the worker that handled the request; the short TTL bounds how long the other workers stay stale, which makes this acceptable.

Set `environ['CUSTOM_DOMAIN'] = domain` when serving via a custom domain so downstream code (auth, link generation) can detect it.

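A per-worker TTL cache along these lines might be as small as the sketch below. The class name `DomainCache`, the injected `loader` (standing in for the `custom_domains` query), and the injected `clock` are assumptions made so the example is testable; only successful lookups are cached, matching the `{domain: slug}` shape described above.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DomainCache:
    """Per-process {domain: slug} cache with a fixed TTL (hypothetical helper)."""

    def __init__(
        self,
        loader: Callable[[str], Optional[str]],
        ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # e.g. a query for active rows in custom_domains
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # domain -> (slug, expiry)

    def get(self, domain: str) -> Optional[str]:
        now = self._clock()
        hit = self._entries.get(domain)
        if hit is not None and hit[1] > now:
            return hit[0]
        slug = self._loader(domain)
        if slug is None:
            # Unknown hosts are not cached, so arbitrary Host headers
            # cannot grow the dict without bound.
            self._entries.pop(domain, None)
        else:
            self._entries[domain] = (slug, now + self._ttl)
        return slug

    def invalidate(self, domain: str) -> None:
        """Drop one entry; only affects the worker that calls it."""
        self._entries.pop(domain, None)
```

Not caching misses means every request for an unknown host reaches the database; whether that trade-off is acceptable depends on how much junk traffic the resolver sees.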
## Authentication on Custom Domains

This is the hard part. The `platform_token` cookie is set on `.robot.wtf` and won't be sent to `wiki.example.com`.

### Solution: Redirect-based auth relay

Standard pattern used by GitHub Pages, Notion, etc.

1. Unauthenticated user visits `wiki.example.com`
2. Wiki requires auth → redirect to `https://robot.wtf/auth/login?return_to=https://wiki.example.com/...`
3. User authenticates on `robot.wtf` (cookie set on `.robot.wtf`)
4. Auth callback detects `return_to` is a custom domain
5. Generates a **relay token**: signed JWT with user claims, `domain` claim, 60-second expiry, single-use nonce
6. Redirects to `https://wiki.example.com/_auth/relay?token={relay_token}`
7. `/_auth/relay` handler validates token (signature, expiry, domain match, nonce), sets `platform_token` cookie scoped to `wiki.example.com`, redirects to original page

### Auth changes required

- `_is_safe_return_url()` must accept verified custom domains (query `custom_domains`)
- Auth callback generates relay token when `return_to` is a custom domain
- New `/_auth/relay` route in resolver (or dedicated handler)
- `TenantResolver._resolve_auth()` checks domain-scoped cookie (same name `platform_token`, browser sends the right one based on domain)

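To make the first bullet concrete, here is one way a return-URL check could be extended. This is a standalone sketch, not the existing `_is_safe_return_url()`: its current rules are not shown in this document, so the HTTPS-only and no-userinfo checks are assumptions, and `is_active_custom_domain` is a hypothetical callback standing in for the `custom_domains` query.

```python
from typing import Callable
from urllib.parse import urlsplit

PLATFORM_DOMAIN = "robot.wtf"


def is_safe_return_url(url: str, is_active_custom_domain: Callable[[str], bool]) -> bool:
    """Accept platform hosts and active custom domains; reject everything else."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. unbalanced IPv6 brackets
    # Assumption: only absolute https URLs are valid return targets.
    if parts.scheme != "https" or not parts.hostname:
        return False
    # Reject userinfo tricks such as https://robot.wtf@evil.example.net/
    if parts.username or parts.password:
        return False
    host = parts.hostname.lower()
    if host == PLATFORM_DOMAIN or host.endswith("." + PLATFORM_DOMAIN):
        return True
    return is_active_custom_domain(host)
```

Matching on the parsed hostname rather than on the raw string is what keeps lookalikes such as `notrobot.wtf` or a platform name hidden in the userinfo part from passing.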
### Relay token security

- Signed with the platform's RSA key (same as `PlatformJWT`)
- 60-second expiry
- Single-use: nonce stored in DB, consumed on use
- Domain-bound: `domain` claim must match the request's Host header
- No open redirect: final redirect path embedded in token, validated

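The properties listed above can be exercised end to end in a small sketch. Note the substitutions: it signs with HMAC-SHA256 from the standard library instead of the platform's RSA key, uses a compact `payload.signature` format rather than a real JWT, and keeps nonces in an in-memory set instead of the database. The function names and the `path` claim name are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional, Set

# Demo-only shared secret. The design calls for the platform RSA key instead.
SECRET = b"demo-only-secret"
RELAY_TTL_SECONDS = 60


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())


def issue_relay_token(user_id: str, domain: str, path: str,
                      nonces: Set[str], now: Optional[float] = None) -> str:
    """Create a short-lived token bound to one domain and one redirect path."""
    now = time.time() if now is None else now
    nonce = secrets.token_urlsafe(16)
    nonces.add(nonce)  # stand-in for storing the nonce in the DB
    claims = {"sub": user_id, "domain": domain.lower(), "path": path,
              "exp": now + RELAY_TTL_SECONDS, "nonce": nonce}
    payload = _b64(json.dumps(claims).encode())
    return f"{payload}.{_sign(payload)}"


def redeem_relay_token(token: str, host: str,
                       nonces: Set[str], now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if every check passes, else None. Consumes the nonce."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
        if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None  # wrong shape, bad base64, or bad JSON
    if claims["exp"] < now or claims["domain"] != host.lower():
        return None
    # Only same-site absolute paths are valid redirect targets.
    if not claims["path"].startswith("/") or claims["path"].startswith("//"):
        return None
    if claims["nonce"] not in nonces:
        return None  # already used, or never issued
    nonces.remove(claims["nonce"])
    return claims
```

In a multi-worker deployment the nonce check and removal would need to be a single atomic database operation, otherwise two workers could both accept the same token.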
### Logout

Logging out on `robot.wtf` clears the `.robot.wtf` cookie but not the `wiki.example.com` cookie. Mitigation: set custom domain cookies with a 1-hour max-age (vs 24h for the platform cookie). Stale sessions are short-lived.

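As an illustration of the shorter lifetime, the relay handler's `Set-Cookie` value could be built as follows with the standard library. The function name is hypothetical, and the `Secure`, `HttpOnly`, `SameSite=Lax`, and `Path=/` attributes are assumptions; only the cookie name and the 1-hour max-age come from this design. Omitting the `Domain` attribute makes the cookie host-only, so it is sent to `wiki.example.com` and nowhere else.

```python
from http.cookies import SimpleCookie

CUSTOM_DOMAIN_MAX_AGE = 60 * 60  # 1 hour, vs 24h for the platform cookie


def relay_cookie_header(token: str) -> str:
    """Build the Set-Cookie header value for a custom-domain session."""
    cookie = SimpleCookie()
    cookie["platform_token"] = token
    morsel = cookie["platform_token"]
    morsel["path"] = "/"
    morsel["max-age"] = CUSTOM_DOMAIN_MAX_AGE
    # Assumed hardening flags; not specified in the design.
    morsel["secure"] = True
    morsel["httponly"] = True
    morsel["samesite"] = "Lax"
    # No Domain attribute: the browser scopes the cookie to the exact host.
    return morsel.OutputString()
```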
## Management UI

"Custom Domain" card on `wiki_settings.html`:

**No domain configured:**
- Text input + "Add Domain" button

**Pending verification:**
- Show required DNS records (copyable)
- "Check DNS" button
- "Remove" button

**Active:**
- Domain with green status badge
- "Remove" button

Backend routes:
- `POST /app/wiki/<slug>/domain`: add
- `POST /app/wiki/<slug>/domain/verify`: check DNS
- `POST /app/wiki/<slug>/domain/remove`: remove

## Implementation Phases

1. Schema + `CustomDomainModel` + DNS verification logic + tests
2. Modify `check-slug` endpoint for Caddy integration
3. Resolver custom domain lookup + cache
4. Auth relay (hardest phase)
5. Management UI

## Risks

- **Auth relay is a new attack surface.** Relay tokens must be cryptographically signed, time-limited, single-use, and domain-bound.
- **DNS propagation delays.** Users may try to verify before records propagate. The UI should explain this and allow re-checking.
- **Let's Encrypt rate limits.** 50 certs per registered domain per week. Unlikely to be hit at current scale.
- **Cache invalidation across workers.** A short TTL (60s) is the simplest correct approach.
