6a6f0e Claude (MCP) 2026-03-19 05:14:02
[mcp] Add custom domain design document

---
category: design
tags: [infrastructure, caddy, auth, multi-tenant]
last_updated: 2026-03-19
confidence: medium
---

# Custom Domains

Allow users to serve their wiki from a domain they own (e.g., `wiki.example.com`) instead of `{slug}.robot.wtf`.

## Scope

- Subdomains only for v1 (e.g., `wiki.example.com`). Apex domains (`example.com`) cannot use a CNAME and would require ALIAS/ANAME records, which are provider-dependent and not universally supported.
- One custom domain per wiki. The schema supports multiple, but the UI enforces one. This can be relaxed later.
- MCP works unchanged through custom domains (bearer token auth, no cookie dependency).

## Database Schema

New `custom_domains` table in `robot.db`:

```sql
CREATE TABLE IF NOT EXISTS custom_domains (
    domain TEXT PRIMARY KEY,
    wiki_slug TEXT NOT NULL REFERENCES wikis(slug) ON DELETE CASCADE,
    verification_status TEXT NOT NULL DEFAULT 'pending', -- pending | verified | active
    verification_token TEXT NOT NULL,
    verified_at TEXT,
    created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS ix_custom_domains_slug ON custom_domains(wiki_slug);
```

This is a separate table (not a column on `wikis`) because domain verification has its own lifecycle and metadata.
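
As a rough illustration of how a row enters this table, the sketch below inserts a `pending` domain with a freshly generated token. The helper names `connect` and `add_pending_domain`, the one-column `wikis` stand-in table, and the use of `secrets.token_urlsafe` for the token are assumptions made for the example, not part of the design. One detail worth noting: SQLite only enforces `ON DELETE CASCADE` when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import secrets
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS wikis (slug TEXT PRIMARY KEY);  -- stand-in for the real table
CREATE TABLE IF NOT EXISTS custom_domains (
    domain TEXT PRIMARY KEY,
    wiki_slug TEXT NOT NULL REFERENCES wikis(slug) ON DELETE CASCADE,
    verification_status TEXT NOT NULL DEFAULT 'pending',
    verification_token TEXT NOT NULL,
    verified_at TEXT,
    created_at TEXT NOT NULL
);
"""


def connect(path=":memory:"):
    """Open a connection with foreign keys enforced (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # Without this pragma SQLite parses REFERENCES but does not enforce it,
    # so ON DELETE CASCADE would silently do nothing.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_pending_domain(conn, wiki_slug, domain):
    """Insert a pending row and return its verification token (hypothetical helper)."""
    domain = domain.strip().rstrip(".").lower()  # normalize before using as the key
    token = secrets.token_urlsafe(24)            # token format is an assumption
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO custom_domains "
            "(domain, wiki_slug, verification_token, created_at) "
            "VALUES (?, ?, ?, ?)",
            (domain, wiki_slug, token, created_at),
        )
    return token
```

Because `domain` is the primary key, a second wiki trying to claim the same domain would fail with `sqlite3.IntegrityError`, which the caller would need to turn into a user-facing message.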

## DNS Verification

The user must create two DNS records:

1. **CNAME**: `wiki.example.com CNAME {slug}.robot.wtf.` (routes traffic)
2. **TXT**: `_robotwtf-verify.wiki.example.com TXT "robotwtf-verify={verification_token}"` (proves ownership)

A CNAME alone is insufficient, since anyone could temporarily point a CNAME at the platform. The `_robotwtf-verify` label avoids collisions with other TXT records on the domain.

Verification uses `dnspython` (new dependency). Flow:
1. User enters a domain in the settings UI → server generates a token and stores the row as `pending`
2. UI shows the required DNS records
3. User clicks "Verify" → server checks both the CNAME and the TXT record
4. Both pass → status becomes `active`

Periodic re-verification (cron) to detect removed CNAME records is desirable but not required for v1.
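
The check in step 3 could look roughly like the sketch below. It uses `dns.resolver.resolve` from dnspython 2.x, imported lazily so the comparison logic can be exercised with a fake lookup. The function names, the `lookup` parameter, and the default `platform_domain` value are illustrative assumptions.

```python
def _dns_lookup(name, rdtype):
    """Return record values as strings, or [] if the lookup fails."""
    import dns.exception  # dnspython; imported here so tests can skip it
    import dns.resolver

    try:
        answer = dns.resolver.resolve(name, rdtype)
    except dns.exception.DNSException:
        # Covers NXDOMAIN, NoAnswer, timeouts, and similar failures.
        return []
    if rdtype == "CNAME":
        return [rdata.target.to_text() for rdata in answer]
    # A TXT record may be split into several byte strings; join them.
    return [b"".join(rdata.strings).decode("utf-8", "replace") for rdata in answer]


def check_domain(domain, slug, token, platform_domain="robot.wtf", lookup=_dns_lookup):
    """True only if both the CNAME and the TXT record match."""
    expected_target = f"{slug}.{platform_domain}".lower()
    # CNAME targets come back fully qualified (trailing dot), so strip it.
    cnames = [value.rstrip(".").lower() for value in lookup(domain, "CNAME")]
    txts = lookup(f"_robotwtf-verify.{domain}", "TXT")
    cname_ok = expected_target in cnames
    txt_ok = f"robotwtf-verify={token}" in txts
    return cname_ok and txt_ok
```

Treating every DNS failure as "not yet verified" keeps the handler simple and fits the re-check flow: the user can click "Verify" again once records have propagated.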

## TLS (Caddy)

Caddy's on-demand TLS with the existing `ask` endpoint handles this. Modify `/api/internal/check-slug` to also accept custom domains:

1. If the domain ends with `.{PLATFORM_DOMAIN}`, do the existing slug lookup
2. Otherwise, look up the domain in `custom_domains` where `verification_status = 'active'`
3. Return 200 if found, 404 if not

Caddy automatically obtains Let's Encrypt certificates for any domain that passes the `ask` check. No Caddyfile changes are needed beyond ensuring the on-demand TLS block is configured (it may already be).
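
The decision logic behind the endpoint might be sketched as follows. Caddy passes the hostname in a `domain` query parameter; how that parameter reaches this function depends on the web framework and is left out. The function name, the `PLATFORM_DOMAIN` value, and the shape of the `wikis` table are assumptions.

```python
import sqlite3

PLATFORM_DOMAIN = "robot.wtf"  # assumed value of PLATFORM_DOMAIN


def ask_status(conn, domain):
    """Return the HTTP status the ask endpoint should send for `domain`."""
    domain = domain.strip().rstrip(".").lower()
    suffix = "." + PLATFORM_DOMAIN
    if domain.endswith(suffix):
        # Existing behaviour: {slug}.robot.wtf must map to a known wiki.
        slug = domain[: -len(suffix)]
        row = conn.execute(
            "SELECT 1 FROM wikis WHERE slug = ?", (slug,)
        ).fetchone()
    else:
        # New behaviour: only active custom domains may get a certificate.
        row = conn.execute(
            "SELECT 1 FROM custom_domains "
            "WHERE domain = ? AND verification_status = 'active'",
            (domain,),
        ).fetchone()
    return 200 if row else 404
```

Restricting the second branch to `active` rows matters because a `pending` domain has not proven ownership, and each approval triggers a certificate request.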

## Tenant Resolution

`TenantResolver.__call__()` gains a fallback path:

1. Try `_parse_host(host)` as today → returns the slug for `{slug}.robot.wtf`
2. If that returns `None`, look up the host in `custom_domains` where the status is `active`
3. If found, use the associated `wiki_slug`
4. If neither matches, return 404

Performance: in-memory cache (`{domain: slug}` dict) with a 60-second TTL, invalidated on domain add/remove. Each gunicorn worker maintains its own cache, so an invalidation only reaches the worker that handled the request; the other workers catch up when their entries expire, which the short TTL makes acceptable.

Set `environ['CUSTOM_DOMAIN'] = domain` when serving via a custom domain so downstream code (auth, link generation) can detect it.
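
A minimal sketch of the fallback path and its cache, assuming a loader callable that performs the `custom_domains` query. `DomainCache`, `resolve_slug`, and the injected `parse_host` and `clock` parameters are hypothetical names chosen for the example; the real `TenantResolver` may be structured differently.

```python
import time


class DomainCache:
    """Per-process {domain: slug} cache whose entries expire after `ttl` seconds."""

    def __init__(self, load, ttl=60.0, clock=time.monotonic):
        self._load = load      # callable: domain -> slug or None (e.g. a DB query)
        self._ttl = ttl
        self._clock = clock    # injectable so expiry can be tested without sleeping
        self._entries = {}     # domain -> (slug_or_None, expires_at)

    def get(self, domain):
        now = self._clock()
        hit = self._entries.get(domain)
        if hit is not None and hit[1] > now:
            return hit[0]
        slug = self._load(domain)
        # Misses (None) are cached too, so unknown hosts do not hit the DB
        # on every request.
        self._entries[domain] = (slug, now + self._ttl)
        return slug

    def invalidate(self, domain):
        """Drop one entry; only affects the current worker process."""
        self._entries.pop(domain, None)


def resolve_slug(host, parse_host, cache):
    """Return (slug, custom_domain); slug is None when the caller should 404."""
    host = host.split(":")[0].lower()  # drop any port; IPv6 literals not handled
    slug = parse_host(host)
    if slug is not None:
        return slug, None              # platform subdomain, no custom domain
    slug = cache.get(host)
    return slug, (host if slug is not None else None)
```

The second element of the returned tuple is what the caller would store in `environ['CUSTOM_DOMAIN']`. Caching misses is a choice made in this sketch: it means a newly activated domain can keep returning 404 on other workers for up to one TTL.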

## Authentication on Custom Domains

This is the hard part. The `platform_token` cookie is set on `.robot.wtf`, so the browser will not send it to `wiki.example.com`.

### Solution: Redirect-based auth relay

Standard pattern used by GitHub Pages, Notion, etc.

1. Unauthenticated user visits `wiki.example.com`
2. Wiki requires auth → redirect to `https://robot.wtf/auth/login?return_to=https://wiki.example.com/...`
3. User authenticates on `robot.wtf` (cookie set on `.robot.wtf`)
4. Auth callback detects that `return_to` is a custom domain
5. Callback generates a **relay token**: a signed JWT with user claims, a `domain` claim, a 60-second expiry, and a single-use nonce
6. Callback redirects to `https://wiki.example.com/_auth/relay?token={relay_token}`
7. The `/_auth/relay` handler validates the token (signature, expiry, domain match, nonce), sets a `platform_token` cookie scoped to `wiki.example.com`, and redirects to the original page
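
The two redirects in steps 2 and 6 carry a full URL and a token in the query string, so both values need percent-encoding. A small sketch of building them with the standard library follows; the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def login_redirect(custom_domain, path):
    """Step 2: send the user to the platform login with an encoded return_to."""
    return_to = f"https://{custom_domain}{path}"
    return "https://robot.wtf/auth/login?" + urlencode({"return_to": return_to})


def relay_redirect(custom_domain, relay_token):
    """Step 6: hand the relay token back to the custom domain."""
    return f"https://{custom_domain}/_auth/relay?" + urlencode({"token": relay_token})
```

Encoding matters when the original path has its own query string: without it, a `return_to` such as `/Home?rev=3&diff=1` would lose everything after the first `&`.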

### Auth changes required

- `_is_safe_return_url()` must accept verified custom domains (query `custom_domains`)
- Auth callback generates a relay token when `return_to` is a custom domain
- New `/_auth/relay` route in the resolver (or a dedicated handler)
- `TenantResolver._resolve_auth()` checks the domain-scoped cookie (same name, `platform_token`; the browser sends the right one based on the domain)
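
The current body of `_is_safe_return_url()` is not shown here, so the following is only a sketch of what the extended check might look like for absolute URLs. The `is_active_custom_domain` callable stands in for the `custom_domains` query, and the HTTPS-only and no-userinfo rules are assumptions added for the example.

```python
from urllib.parse import urlsplit

PLATFORM_DOMAIN = "robot.wtf"  # assumed value


def is_safe_return_url(url, is_active_custom_domain):
    """Accept only https URLs on the platform or on an active custom domain."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme != "https" or not parts.hostname:
        return False
    # Reject userinfo tricks such as https://robot.wtf@evil.example.org/
    if parts.username or parts.password:
        return False
    host = parts.hostname.lower()
    if host == PLATFORM_DOMAIN or host.endswith("." + PLATFORM_DOMAIN):
        return True
    return bool(is_active_custom_domain(host))
```

Comparing against `"." + PLATFORM_DOMAIN` rather than the bare suffix prevents a lookalike host such as `evilrobot.wtf` from passing the platform check.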

### Relay token security

- Signed with the platform's RSA key (same as `PlatformJWT`)
- 60-second expiry
- Single-use: nonce stored in DB, consumed on use
- Domain-bound: `domain` claim must match the request's Host header
- No open redirect: the final redirect path is embedded in the token and validated
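
To make these checks concrete, here is a self-contained sketch of minting and validating a relay token. It deliberately swaps in HMAC-SHA256 from the standard library for the RSA-signed JWT the design calls for, and a Python `set` for the nonce table, so that it runs without extra dependencies. The function names, the claim names other than `domain`, and the token layout are assumptions.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

SECRET = b"demo-only-secret"  # stand-in: the design signs with the platform RSA key


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(body):
    return _b64(hmac.new(SECRET, body.encode(), hashlib.sha256).digest())


def mint_relay_token(user_id, domain, path, nonces, now=None):
    now = time.time() if now is None else now
    nonce = secrets.token_urlsafe(16)
    nonces.add(nonce)  # the design stores nonces in the DB; a set stands in here
    claims = {"sub": user_id, "domain": domain, "path": path,
              "exp": now + 60, "nonce": nonce}
    body = _b64(json.dumps(claims, separators=(",", ":")).encode())
    return f"{body}.{_sign(body)}"


def validate_relay_token(token, request_host, nonces, now=None):
    """Return the claims if every check passes, else None."""
    now = time.time() if now is None else now
    try:
        body, sig = token.split(".")
        if not hmac.compare_digest(sig, _sign(body)):  # signature
            return None
        claims = json.loads(_unb64(body))
    except (ValueError, TypeError):
        return None
    if claims["exp"] < now:                            # 60-second expiry
        return None
    if claims["domain"] != request_host.lower():       # domain-bound
        return None
    path = claims["path"]
    if not path.startswith("/") or path.startswith("//"):  # local path only
        return None
    if claims["nonce"] not in nonces:                  # single-use
        return None
    nonces.discard(claims["nonce"])                    # consume on success
    return claims
```

In a multi-worker deployment the nonce check and removal would need to be one atomic database operation (for example a `DELETE` whose affected-row count is checked), otherwise two workers could both accept the same token.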

### Logout

Logging out on `robot.wtf` clears the `.robot.wtf` cookie but not the `wiki.example.com` cookie. Mitigation: set custom-domain cookies with a 1-hour max-age (vs. 24 hours for the platform cookie), so stale sessions are short-lived.
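
The two cookie lifetimes could be expressed as in the sketch below, built on the standard library's `http.cookies`. The helper name and the `Secure`, `HttpOnly`, and `SameSite=Lax` attributes are assumptions; the source only specifies the cookie name, the scoping, and the max-age values.

```python
from http.cookies import SimpleCookie

CUSTOM_DOMAIN_MAX_AGE = 3600    # 1 hour for custom-domain cookies
PLATFORM_MAX_AGE = 24 * 3600    # 24 hours for the platform cookie


def session_cookie_header(token, custom_domain=None):
    """Build the Set-Cookie value for the platform or for a custom domain."""
    cookie = SimpleCookie()
    cookie["platform_token"] = token
    morsel = cookie["platform_token"]
    morsel["path"] = "/"
    morsel["secure"] = True      # assumed attribute
    morsel["httponly"] = True    # assumed attribute
    morsel["samesite"] = "Lax"   # assumed attribute
    if custom_domain is None:
        morsel["domain"] = ".robot.wtf"
        morsel["max-age"] = PLATFORM_MAX_AGE
    else:
        # No Domain attribute: the browser scopes the cookie to the exact
        # host that set it, i.e. the custom domain.
        morsel["max-age"] = CUSTOM_DOMAIN_MAX_AGE
    return morsel.OutputString()
```

Omitting the `Domain` attribute on the custom-domain cookie keeps it host-only, so it is not shared with sibling subdomains of the user's domain.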

## Management UI

"Custom Domain" card on `wiki_settings.html`:

**No domain configured:**
- Text input + "Add Domain" button

**Pending verification:**
- Show required DNS records (copyable)
- "Check DNS" button
- "Remove" button

**Active:**
- Domain with green status badge
- "Remove" button

Backend routes:
- `POST /app/wiki/<slug>/domain`: add
- `POST /app/wiki/<slug>/domain/verify`: check DNS
- `POST /app/wiki/<slug>/domain/remove`: remove

## Implementation Phases

1. Schema + `CustomDomainModel` + DNS verification logic + tests
2. Modify `check-slug` endpoint for Caddy integration
3. Resolver custom domain lookup + cache
4. Auth relay (hardest phase)
5. Management UI

## Risks

- **Auth relay is a new attack surface.** Relay tokens must be cryptographically signed, time-limited, single-use, and domain-bound.
- **DNS propagation delays.** Users may try to verify before records propagate. The UI should explain this and allow re-checking.
- **Let's Encrypt rate limits.** 50 certificates per registered domain per week. Hitting this is unlikely at current scale.
- **Cache invalidation across workers.** A short TTL (60s) is the simplest correct approach.