Properties
category: reference
tags: [design, prd, architecture, atproto, vps]
last_updated: 2026-03-14
confidence: medium

VPS Architecture (ATProto + robot.wtf)

Status: Active — this is the current plan
Supersedes: Design/Platform_Overview, Design/Auth, Design/Operations, Design/Data_Model (infrastructure and billing sections), Design/Implementation_Phases (phase structure and premium tiers)
Preserves: ACL model, permission headers, MCP tools, Otterwiki multi-tenancy middleware, URL structure, semantic search logic, wiki bootstrap template, REST API surface


Why this exists

The AWS serverless architecture described in Design/Platform_Overview works, but it optimizes for a problem we don't have: elastic scale and zero cost at rest. The tradeoff is complexity — VPC endpoints, Mangum adapters, DynamoDB Streams to avoid SQS endpoint costs, Lambda cold starts, EFS mount latency. All of that machinery exists to make Lambda work, not to make the wiki work.

robot.wtf is a free, volunteer-run wiki service for the ATProto community. No premium tier, no billing, no Stripe. The hosting is a Debian 12 VM on a Proxmox hypervisor with a static IP and generous RAM and disk. The deployment is conventional: persistent processes, local disk, SQLite, Caddy. The application logic — multi-tenant Otterwiki, MCP tools, semantic search, ACL enforcement — ports over from the Lambda implementation with minimal changes. The middleware we already built for Lambda is WSGI middleware with a Mangum wrapper; removing the wrapper gives us back the WSGI middleware.

The ATProto identity system replaces WorkOS as the auth provider. Users sign in with their Bluesky handle (or any ATProto PDS account). Identity is a DID — portable, user-owned, and philosophically aligned with "your wiki is a git repo you can clone." The target audience (developers and researchers using AI agents) overlaps heavily with the ATProto early-adopter community.


Service model

robot.wtf is a free tool, not a business. There is no premium tier and no billing infrastructure.

Every user gets:

  • 1 wiki
  • 500 pages
  • 3 collaborators
  • Full-text search + semantic search
  • MCP access (Claude.ai OAuth + Claude Code bearer token)
  • Read-only git clone
  • Public wiki toggle

These are resource management limits, not a paywall. If someone needs more, they clone their repo and self-host — which is the whole point of git-backed storage.

If paid tiers ever make sense, the architecture supports them — the ACL model and schema have room for tier fields. But the billing infrastructure (Stripe, webhooks, lapse enforcement, upgrade/downgrade flows) doesn't get built until someone is actually asking to pay. That decision and all the commercial design work is preserved in the archived design docs (Design/Implementation_Phases, Design/Operations).


Infrastructure

Server

Debian 12 VM running on a Proxmox hypervisor. Static IP, generous RAM and disk allocation. If the VM ever needs to move, the deployment is portable to any Linux box (Hetzner, DigitalOcean, Fly.io, bare metal, or back to AWS on an EC2 instance) — nothing is host-specific.

Process model

Four persistent processes, managed by systemd or Docker Compose:

┌─────────────────────────────────────────────────────────────────┐
│  Caddy (reverse proxy, TLS)                                     │
│  *.robot.wtf + robot.wtf                                        │
│                                                                 │
│  Routes:                                                        │
│    {slug}.robot.wtf/mcp          → MCP sidecar (port 8001)      │
│    {slug}.robot.wtf/api/v1/*     → REST API (port 8002)         │
│    {slug}.robot.wtf/repo.git/*   → Git smart HTTP (port 8002)   │
│    {slug}.robot.wtf/*            → Otterwiki WSGI (port 8000)   │
│    robot.wtf/auth/*              → Auth service (port 8003)     │
│    robot.wtf/api/*               → Management API (port 8002)   │
│    robot.wtf/app/*               → Static files (SPA)           │
│    robot.wtf                     → Static files (landing page)  │
└─────────┬──────────┬──────────┬──────────┬──────────────────────┘
          │          │          │          │
     ┌────▼────┐ ┌───▼────┐ ┌───▼────┐ ┌───▼────┐
     │Otterwiki│ │  MCP   │ │Platform│ │  Auth  │
     │  WSGI   │ │sidecar │ │  API   │ │service │
     │Gunicorn │ │FastMCP │ │ Flask  │ │ Flask  │
     │  :8000  │ │ :8001  │ │ :8002  │ │ :8003  │
     └────┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
          │          │          │          │
     ┌────▼──────────▼──────────▼──────────▼────┐
     │  Shared resources                        │
     │  /srv/wikis/{slug}/repo.git    (git)     │
     │  /srv/wikis/{slug}/index.faiss (vectors) │
     │  /srv/data/robot.db            (SQLite)  │
     │  /srv/data/embeddings/         (model)   │
     └──────────────────────────────────────────┘

Caddy

Caddy handles TLS termination, automatic Let's Encrypt certificates (including wildcard via DNS challenge), and reverse proxy routing. It replaces API Gateway + CloudFront + ACM.

Wildcard TLS requires a DNS challenge. Caddy supports this natively with plugins for common DNS providers (Cloudflare, Route 53, OVHcloud). The DNS zone for robot.wtf needs API credentials configured in Caddy.

Caddy's routing is order-sensitive and matcher-based. The Caddyfile structure:

robot.wtf {
    handle /auth/* {
        reverse_proxy localhost:8003
    }
    handle /api/* {
        reverse_proxy localhost:8002
    }
    handle /app/* {
        root * /srv/static/app
        try_files {path} /app/index.html
        file_server
    }
    handle {
        root * /srv/static/landing
        file_server
    }
}

*.robot.wtf {
    @mcp path /mcp /mcp/*
    handle @mcp {
        reverse_proxy localhost:8001
    }

    @api path /api/v1/*
    handle @api {
        reverse_proxy localhost:8002
    }

    @git path /repo.git/*
    handle @git {
        reverse_proxy localhost:8002
    }

    handle {
        reverse_proxy localhost:8000
    }
}

The {slug} is extracted from the Host header by the downstream services, not by Caddy. Caddy just routes to the right backend; the backend resolves the tenant.

Why not Nginx

Caddy's automatic TLS (including wildcard via DNS challenge) eliminates certbot, cron renewal, and manual certificate management. For a single-operator deployment where the admin might not be around to fix a cert renewal failure, this matters. Nginx is more configurable but requires more maintenance. If we needed fine-grained caching rules or complex rewrite logic, Nginx would be worth the tradeoff. We don't.


Authentication

Identity model

User identity is an ATProto DID (Decentralized Identifier). A DID is a persistent, portable identifier that survives handle changes and PDS migrations. When a user logs in, we resolve their handle to a DID and store the DID as the primary key.

User {
  did: string,                 // e.g. "did:plc:abc123..." — primary identifier
  handle: string,              // e.g. "sderle.bsky.social" — display name, may change
  display_name: string,        // from ATProto profile
  avatar_url?: string,         // from ATProto profile
  username: string,            // platform username (URL slug)
  created_at: ISO8601,
  wiki_count: number,
}

The did is the stable identity. The handle is refreshed from the PDS on each login (handles can change). The username is the platform-local slug used in URLs — immutable after signup.

Username defaulting

When a new user signs up, the platform username defaults to the local part of their ATProto handle. For a handle like sderle.bsky.social, the default is sderle. For a user with a custom domain handle like schuyler.robot.wtf, the default is schuyler (the domain prefix). The user can override this at signup if they want something different, but the default should be right most of the time.

Validation rules are unchanged from the original design: lowercase alphanumeric + hyphens, 3–30 characters, no leading/trailing hyphens, checked against the reserved name blocklist and existing usernames/wiki slugs.
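
The defaulting and validation rules above can be sketched as follows (function names are illustrative; the `taken` set stands in for the existing-username/wiki-slug check):

```python
import re

# Reserved name blocklist, as listed under "Namespace rules"
RESERVED = {"api", "auth", "app", "www", "admin", "mcp", "docs", "status",
            "blog", "help", "support", "static", "assets", "null",
            "undefined", "wiki", "robot"}

# Lowercase alphanumeric + hyphens, 3-30 chars, no leading/trailing hyphens
USERNAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,28}[a-z0-9]$")

def default_username(handle: str) -> str:
    """Local part of an ATProto handle: 'sderle.bsky.social' -> 'sderle'."""
    return handle.split(".")[0].lower()

def validate_username(name: str, taken: set[str]) -> bool:
    return (USERNAME_RE.fullmatch(name) is not None
            and name not in RESERVED
            and name not in taken)
```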

ATProto OAuth (browser login)

robot.wtf is an ATProto OAuth confidential client. The flow:

1. User enters their handle (e.g. "sderle.bsky.social") on the login page
2. robot.wtf resolves the handle to a DID, then resolves the DID to a PDS URL
3. robot.wtf fetches the PDS's Authorization Server metadata
   (GET {pds}/.well-known/oauth-authorization-server)
4. robot.wtf sends a Pushed Authorization Request (PAR) to the PDS's AS,
   including PKCE code_challenge and DPoP proof
5. User is redirected to their PDS's authorization interface
6. User approves the authorization request
7. PDS redirects back to robot.wtf/auth/callback with an authorization code
8. robot.wtf exchanges the code for tokens (access_token + refresh_token)
   with DPoP binding and client authentication (signed JWT)
9. robot.wtf uses the access token to fetch the user's profile (DID, handle,
   display name) from their PDS
10. robot.wtf mints a platform JWT, sets it as an HttpOnly cookie on .robot.wtf
11. Redirect to robot.wtf/app/

The platform JWT is signed with our own RS256 key (stored on disk). After step 10, the PDS is not in the runtime path — the platform JWT is self-contained and validated locally. ATProto tokens are stored in the session database for potential future use (e.g., posting to Bluesky on behalf of the user), but they're not needed for wiki operations.

Reference implementation

Bluesky maintains a Python Flask OAuth demo in bluesky-social/cookbook/python-oauth-web-app (CC-0 licensed). It implements the full ATProto OAuth flow as a confidential client using authlib for PKCE, DPoP, JWK/JWT, and code challenge. This is the starting point for our auth service. It handles the hard parts: handle-to-DID resolution, PDS Authorization Server discovery, PAR, DPoP nonce management, and token refresh. See Dev/V3_V5_Risk_Research for detailed assessment.

Key libraries from the reference implementation:

  • authlib>=1.3 — PKCE, JWK/JWT, DPoP proof creation, code challenge
  • dnspython>=2.6 — DNS TXT lookups for handle resolution
  • requests>=2.32 + requests-hardened>=1.0.0b3 — HTTP client with SSRF mitigations

MCP OAuth (Claude.ai)

This is the most architecturally significant auth flow. Claude.ai's MCP client implements standard OAuth 2.1 with Dynamic Client Registration (DCR). It discovers the Authorization Server by fetching /.well-known/oauth-protected-resource from the MCP endpoint. The AS must support DCR, PKCE, and standard token endpoints.

ATProto's OAuth profile is not directly compatible with this — ATProto uses per-user Authorization Servers (each user's PDS), whereas Claude.ai expects a single AS URL from the resource metadata endpoint.

Solution: robot.wtf runs its own OAuth 2.1 Authorization Server for MCP.

1. Claude.ai connects to https://{slug}.robot.wtf/mcp
2. Gets 401, fetches /.well-known/oauth-protected-resource
3. Discovers robot.wtf's AS at https://robot.wtf/auth/oauth
4. Performs Dynamic Client Registration at robot.wtf/auth/oauth/register
5. Redirects user to robot.wtf/auth/oauth/authorize
6. User sees robot.wtf's consent page:
   - If already logged in (platform JWT cookie): "Authorize Claude to access {wiki}?"
   - If not logged in: "Sign in with Bluesky" → ATProto OAuth flow → then consent
7. User approves, robot.wtf issues authorization code
8. Claude.ai exchanges code for access token at robot.wtf/auth/oauth/token
9. Claude.ai uses access token to make MCP requests
10. MCP sidecar validates token against robot.wtf's JWKS

robot.wtf's MCP OAuth AS is a thin layer. It delegates authentication to ATProto (step 6) and handles authorization itself (does this user have access to this wiki?). The token it issues is a JWT containing the user's DID and the authorized wiki slug, signed with our RS256 key.

Required OAuth 2.1 AS endpoints:

Endpoint                                  Purpose
/.well-known/oauth-authorization-server   AS metadata (issuer, endpoints, supported grants)
/auth/oauth/register                      Dynamic Client Registration (RFC 7591)
/auth/oauth/authorize                     Authorization endpoint (consent page)
/auth/oauth/token                         Token endpoint (code exchange, refresh)
/.well-known/jwks.json                    Public key for token validation

These can be implemented with authlib's server components or hand-rolled (the spec surface is small — DCR, authorization code grant with PKCE, token issuance, JWKS).

MCP protected resource metadata

Each wiki's MCP endpoint serves its own resource metadata:

// GET https://{slug}.robot.wtf/.well-known/oauth-protected-resource
{
  "resource": "https://{slug}.robot.wtf/mcp",
  "authorization_servers": ["https://robot.wtf/auth/oauth"],
  "scopes_supported": ["wiki:read", "wiki:write"]
}

All wikis point to the same AS. The AS knows which wiki is being authorized because the redirect_uri and resource parameter identify the wiki.

Bearer tokens (Claude Code / API)

Unchanged from the current design. Each wiki gets a bearer token at creation time, stored as a bcrypt hash in the database. The user sees the token once. Claude Code usage:

claude mcp add {slug} \
  --transport http \
  --url https://{slug}.robot.wtf/mcp \
  --header "Authorization: Bearer YOUR_TOKEN"

Cross-subdomain auth

Same approach as Design/Frontend: platform JWT stored as an HttpOnly, Secure, SameSite=Lax cookie on .robot.wtf. Every request to any subdomain includes the cookie. The Otterwiki middleware and MCP sidecar both validate JWTs using the same public key.

Auth convergence

All three paths converge on the same identity and the same ACL check:

Browser       → ATProto OAuth → platform JWT (cookie)  → resolve DID → ACL check
Claude.ai     → MCP OAuth 2.1 → MCP access token (JWT) → resolve DID → ACL check
Claude Code   → Bearer token  → hash lookup in DB      → resolve user → ACL check

All paths → middleware → sets Otterwiki proxy headers (or authorizes MCP/API request)

Migration off ATProto

We store the DID as the primary user identifier, not the handle or PDS URL. If ATProto auth needs to be replaced, the migration path is:

  • Add alternative OAuth providers (Google, GitHub) alongside ATProto
  • Link new provider identities to existing DIDs via an identity_links table
  • Existing users continue to work; new users can sign up with either method

This is simpler than the WorkOS migration path in the original design because we already own the JWT-issuing layer — we're not migrating off a third-party token issuer.


Data Model

SQLite replaces DynamoDB

The dataset is small even at 1000 users. SQLite on local disk is simpler, faster, and free. The application layer uses SQLAlchemy (or raw sqlite3 — the schema is simple enough). If the deployment ever needs Postgres, the migration is straightforward.

The SQLite database lives at /srv/data/robot.db. Write concurrency is handled by SQLite's WAL mode, which supports concurrent reads with serialized writes. For a wiki platform where writes are infrequent relative to reads, this is more than adequate.
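
A minimal sketch of opening the database with these settings (`open_db` is an illustrative helper; in practice this would live in the shared library used by all four services):

```python
import sqlite3

def open_db(path: str = "/srv/data/robot.db") -> sqlite3.Connection:
    """Open the platform database with WAL mode and sane defaults."""
    conn = sqlite3.connect(path, timeout=10)  # wait up to 10s on a write lock
    conn.execute("PRAGMA journal_mode=WAL")   # concurrent reads, serialized writes
    conn.execute("PRAGMA foreign_keys=ON")    # enforce REFERENCES constraints
    return conn
```

WAL mode is persistent (stored in the database file), but setting it on every connection is harmless and keeps the code self-documenting; foreign-key enforcement is per-connection and must be set each time.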

Tables

CREATE TABLE users (
    did TEXT PRIMARY KEY,              -- ATProto DID
    handle TEXT NOT NULL,              -- ATProto handle (may change)
    display_name TEXT,
    avatar_url TEXT,
    username TEXT UNIQUE NOT NULL,     -- platform slug, immutable
    created_at TEXT NOT NULL,          -- ISO8601
    wiki_count INTEGER DEFAULT 0
);

CREATE TABLE wikis (
    slug TEXT PRIMARY KEY,             -- globally unique, URL slug
    owner_did TEXT NOT NULL REFERENCES users(did),
    display_name TEXT NOT NULL,
    repo_path TEXT NOT NULL,           -- /srv/wikis/{slug}/repo.git
    mcp_token_hash TEXT NOT NULL,      -- bcrypt hash
    is_public INTEGER DEFAULT 0,
    created_at TEXT NOT NULL,
    last_accessed TEXT NOT NULL,
    page_count INTEGER DEFAULT 0
);

CREATE TABLE acls (
    wiki_slug TEXT NOT NULL REFERENCES wikis(slug),
    grantee_did TEXT NOT NULL REFERENCES users(did),
    role TEXT NOT NULL,                -- 'owner' | 'editor' | 'viewer'
    granted_by TEXT NOT NULL,
    granted_at TEXT NOT NULL,
    PRIMARY KEY (wiki_slug, grantee_did)
);

CREATE TABLE oauth_sessions (
    id TEXT PRIMARY KEY,               -- session ID
    user_did TEXT NOT NULL REFERENCES users(did),
    dpop_private_jwk TEXT NOT NULL,    -- DPoP key (encrypted at rest)
    access_token TEXT,
    refresh_token TEXT,
    token_expires_at TEXT,
    created_at TEXT NOT NULL
);

CREATE TABLE mcp_oauth_clients (
    client_id TEXT PRIMARY KEY,        -- DCR-issued client ID
    client_name TEXT,
    redirect_uris TEXT NOT NULL,       -- JSON array
    client_secret_hash TEXT,           -- for confidential clients
    created_at TEXT NOT NULL
);

CREATE TABLE reindex_queue (
    wiki_slug TEXT NOT NULL,
    page_path TEXT NOT NULL,
    action TEXT NOT NULL,              -- 'upsert' | 'delete'
    queued_at TEXT NOT NULL,
    PRIMARY KEY (wiki_slug, page_path)
);
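
One detail worth noting: the composite primary key on reindex_queue lets the enqueue step coalesce rapid successive writes to the same page into a single pending entry. A sketch using SQLite's upsert syntax (`enqueue_reindex` is an illustrative name):

```python
import sqlite3
import datetime

def enqueue_reindex(conn: sqlite3.Connection, wiki_slug: str,
                    page_path: str, action: str) -> None:
    """Queue a page for re-embedding. The (wiki_slug, page_path) primary key
    coalesces repeated writes to the same page into one pending entry."""
    conn.execute(
        """INSERT INTO reindex_queue (wiki_slug, page_path, action, queued_at)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(wiki_slug, page_path)
           DO UPDATE SET action = excluded.action, queued_at = excluded.queued_at""",
        (wiki_slug, page_path, action,
         datetime.datetime.now(datetime.timezone.utc).isoformat()),
    )
    conn.commit()
```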

Storage layout

/srv/
  wikis/
    {slug}/
      repo.git/              # bare git repo
      index.faiss            # FAISS vector index
      embeddings.json        # page_path → vector mapping
  data/
    robot.db                 # SQLite database
    signing_key.pem          # RS256 private key for JWT signing
    signing_key.pub          # RS256 public key
    client_jwk.json          # ATProto OAuth confidential client JWK (private)
    client_jwk_pub.json      # ATProto OAuth client JWK (public, served at client_id URL)
  static/
    landing/                 # landing page HTML/CSS/JS
    app/                     # management SPA
  embeddings/
    model/                   # all-MiniLM-L6-v2 model files
  backups/                   # local backup staging

Compute

Otterwiki (WSGI)

Otterwiki runs as a persistent Gunicorn process. The multi-tenant middleware we built for Lambda ports back to WSGI by removing the Mangum wrapper. The middleware:

  1. Extracts the wiki slug from the Host header
  2. Looks up the wiki in SQLite
  3. Resolves the user from the platform JWT (cookie) or bearer token
  4. Checks ACL permissions
  5. Sets Otterwiki proxy headers (x-otterwiki-email, x-otterwiki-name, x-otterwiki-permissions)
  6. Swaps Otterwiki's config to point at the correct repo path
  7. Delegates to Otterwiki's Flask app

The config-swapping is the multi-tenancy mechanism we already built. In Lambda, it happened per-invocation; in WSGI, it happens per-request. The difference is negligible — the config is a handful of in-memory variables, not file I/O.
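
A rough sketch of the middleware's shape, with the SQLite lookups stubbed out as callables. `TenantMiddleware` and the helper names are illustrative, not the existing code, and the config swap (step 6) is elided:

```python
# Hypothetical shape of the multi-tenant WSGI middleware (steps 1-5 above).
# lookup_wiki / resolve_user / check_acl stand in for the SQLite-backed helpers;
# `app` is Otterwiki's Flask app as a WSGI callable.
class TenantMiddleware:
    def __init__(self, app, lookup_wiki, resolve_user, check_acl):
        self.app = app
        self.lookup_wiki = lookup_wiki
        self.resolve_user = resolve_user
        self.check_acl = check_acl

    def __call__(self, environ, start_response):
        slug = environ.get("HTTP_HOST", "").split(".")[0]   # step 1
        wiki = self.lookup_wiki(slug)                       # step 2
        if wiki is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"no such wiki"]
        user = self.resolve_user(environ)                   # step 3: JWT or bearer
        perms = self.check_acl(wiki, user)                  # step 4
        if perms is None:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"forbidden"]
        # Step 5: Otterwiki trusts these proxy headers for identity.
        environ["HTTP_X_OTTERWIKI_EMAIL"] = user["email"]
        environ["HTTP_X_OTTERWIKI_NAME"] = user["name"]
        environ["HTTP_X_OTTERWIKI_PERMISSIONS"] = perms
        # Step 6 (swap config to wiki["repo_path"]) omitted here.
        return self.app(environ, start_response)            # step 7
```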

Gunicorn runs with multiple workers. The Proxmox VM has generous RAM, so worker count is limited by CPU cores, not memory. Git write operations are serialized per-repo by git's own lock file.

MCP sidecar (FastMCP)

FastMCP runs as a separate process serving Streamable HTTP on port 8001. It reads git repos directly from /srv/wikis/{slug}/repo.git — same code as the current MCP server, same tools, same return formats.

The sidecar validates MCP OAuth tokens (JWTs signed by our AS) and bearer tokens (bcrypt hash lookup in SQLite). Token validation is the same logic as the Otterwiki middleware, factored into a shared library.

Why a separate process: Otterwiki is a Flask app designed around page rendering. The MCP server is an async protocol handler. Mixing them in one process would require either making Otterwiki async (large refactor) or running FastMCP synchronously (defeats the purpose). Separate processes, same database, same git repos.

Platform API (Flask)

A lightweight Flask app handling the management API (wiki CRUD, ACL management, token generation) and the Git smart HTTP protocol. This is the same API surface described in the archived Design/Implementation_Phases, with SQLite queries instead of DynamoDB calls.

The Git smart HTTP endpoints (/repo.git/info/refs, /repo.git/git-upload-pack) use dulwich to serve the bare repos on disk. Read-only (upload-pack only) — users can clone and pull their wikis at any time.

Auth service (Flask)

Handles both ATProto OAuth (browser login) and the MCP OAuth 2.1 AS. Runs as its own process because the OAuth flows involve redirects and state management that are cleaner in isolation.

This could be merged into the platform API process. Separating it keeps the auth code (which is security-critical and relatively complex) isolated from the CRUD endpoints. If the separation proves to be operationally annoying, merge them — they're both Flask apps talking to the same SQLite database.



Embedding pipeline

The embedding pipeline simplifies dramatically on a VPS. No DynamoDB Streams, no event source mappings, no separate embedding Lambda. MiniLM loads once at process startup and stays in memory.

Write path

Page write (Otterwiki or MCP)
  → Middleware writes {wiki_slug, page_path, action} to reindex_queue table in SQLite
  → Background worker (in-process thread or separate process) polls the queue:
      1. Read page content from git repo on disk
      2. Chunk page
      3. Embed chunks using MiniLM (already loaded in memory)
      4. Update FAISS index on disk
      5. Delete queue entry

The background worker can be a simple thread in the Otterwiki process (using Python's threading or concurrent.futures), a separate huey or rq worker, or even a cron job that runs every 30 seconds. The latency requirement is loose — research wikis are written by AI agents and searched minutes later.

For simplicity, start with an in-process thread pool. If it causes issues (GIL contention under load, memory pressure from MiniLM in every Gunicorn worker), move to a dedicated worker process that loads MiniLM once and processes the queue.
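
The drain step of such a worker might look like this, whichever process model hosts it. `drain_queue` and the `reindex_page` callback are illustrative names; the chunk/embed/FAISS work (steps 1-4 above) is stubbed behind the callback:

```python
import sqlite3
import time

def drain_queue(conn: sqlite3.Connection, reindex_page) -> int:
    """One polling pass: process pending entries oldest-first, delete as we go.
    reindex_page(wiki_slug, page_path, action) does the chunk/embed/index work."""
    rows = conn.execute(
        "SELECT wiki_slug, page_path, action FROM reindex_queue ORDER BY queued_at"
    ).fetchall()
    for wiki_slug, page_path, action in rows:
        reindex_page(wiki_slug, page_path, action)   # steps 1-4
        conn.execute(
            "DELETE FROM reindex_queue WHERE wiki_slug = ? AND page_path = ?",
            (wiki_slug, page_path),
        )
        conn.commit()                                # step 5
    return len(rows)

def worker_loop(conn, reindex_page, interval=5.0):
    """Run forever — e.g. in a daemon thread or a standalone worker process."""
    while True:
        drain_queue(conn, reindex_page)
        time.sleep(interval)
```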

Search path

Synchronous, handled by the MCP sidecar or REST API:

  1. MiniLM is loaded at process startup (the MCP sidecar and API processes both load it)
  2. Embed the query
  3. Load FAISS index from disk (cached in memory after first load)
  4. Search, deduplicate, return results

On a VPS, loading the FAISS index is a local disk read (<1ms for a typical wiki). No EFS mount latency, no Lambda cold start loading the model.
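
The "cached in memory after first load" behavior can be sketched generically as an mtime-checked cache, with the actual FAISS loader injected so the cache itself stays library-agnostic. `IndexCache` is an illustrative name, not existing code:

```python
import os

class IndexCache:
    """Cache per-wiki search indexes in memory, reloading only when the file
    on disk changes. The mtime check is a cheap stat on local disk."""
    def __init__(self, loader):
        self.loader = loader     # e.g. faiss.read_index
        self._cache = {}         # path -> (mtime_ns, index)

    def get(self, path):
        mtime = os.stat(path).st_mtime_ns
        entry = self._cache.get(path)
        if entry is None or entry[0] != mtime:
            entry = (mtime, self.loader(path))   # (re)load from disk
            self._cache[path] = entry
        return entry[1]
```

The background worker's index writes invalidate the cache implicitly: the next search sees a new mtime and reloads.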

Model loading strategy

MiniLM (~80MB) loads in ~500ms. On a VPS with persistent processes, this happens once at startup. In the Lambda architecture, it happened on every cold start. This is one of the clearest wins of the VPS approach.

The Proxmox VM has plenty of RAM, so loading MiniLM in both the MCP sidecar and a dedicated embedding worker is fine. The Otterwiki process and platform API don't need it — they just write to the reindex queue.


Backup and Disaster Recovery

What we're protecting

Data              Location                            Severity of loss
Git repos         /srv/wikis/*/repo.git               Critical — user data
SQLite database   /srv/data/robot.db                  High — reconstructable from repos but painful
FAISS indexes     /srv/wikis/*/index.faiss            Low — rebuildable from repo content
Signing keys      /srv/data/*.pem, /srv/data/*.json   High — loss invalidates all active sessions

Backup strategy

Git repos: rsync to offsite storage (a second VPS, an S3 bucket, or a Backblaze B2 bucket). Daily, with a cron job. Repos are bare git — rsync handles them efficiently. Also: users can git clone their own repos at any time, which is distributed backup by design.

SQLite: .backup command (online backup, doesn't block writes in WAL mode) to a local snapshot file, then rsync offsite with the git repos. Daily.
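
Python's stdlib sqlite3 module exposes the same online backup API, if the backup script is Python rather than the sqlite3 CLI (`backup_sqlite` is an illustrative helper):

```python
import sqlite3

def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Online snapshot of the live database. Safe under WAL mode without
    blocking writers; equivalent to the CLI's .backup command."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)   # stdlib Connection.backup, Python 3.7+
    finally:
        dest.close()
        src.close()

# e.g. backup_sqlite("/srv/data/robot.db", "/srv/backups/robot.db")
```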

Signing keys: Backed up once at creation time, stored separately from the data backups (e.g., in a password manager or encrypted at rest on a different system). These rarely change.

FAISS indexes: Not backed up. Rebuildable from repo content. Loss triggers a one-time re-embedding — seconds per wiki.

Proxmox snapshots: The Proxmox hypervisor can take VM-level snapshots. These are a useful complement to application-level backups — a snapshot captures the entire VM state for rapid rollback after a bad deploy. Not a substitute for offsite backups (snapshots live on the same hardware).

Recovery

If the VM dies completely, recovery is:

  1. Provision a new VM (on Proxmox or any other host)
  2. Install Debian 12, install dependencies, deploy application code
  3. Restore signing keys
  4. Restore SQLite database from backup
  5. Restore git repos from backup (or users re-push from their clones)
  6. Re-embed all wikis (automated script, runs in minutes)
  7. Update DNS to point to new IP (if it changed)

RTO: hours (mostly limited by repo restore transfer time). RPO: 24 hours (daily backup cycle). This is acceptable for a free community service. If tighter RPO is needed, increase backup frequency or add streaming replication to a standby.


Deployment

Application deployment

Code lives in a Git repo. Deployment is git pull + restart services. No Pulumi, no CloudFormation, no CI/CD pipeline required (though one can be added).

ssh vm
cd /srv/app
git pull
pip install -r requirements.txt --break-system-packages
sudo systemctl restart robot-otterwiki
sudo systemctl restart robot-mcp
sudo systemctl restart robot-api
sudo systemctl restart robot-auth
# Caddy doesn't need restart for app deploys

Or with Docker Compose:

ssh vm
cd /srv/app
git pull
docker compose build
docker compose up -d

Initial setup

  1. Provision Debian 12 VM on Proxmox, assign static IP
  2. Install OS packages: Python 3.11+, git, build-essential
  3. Install Caddy (with DNS challenge plugin for the DNS provider)
  4. Configure DNS: robot.wtf and *.robot.wtf → VM's static IP
  5. Configure Caddy with DNS challenge credentials for wildcard TLS
  6. Generate RS256 signing keypair (/srv/data/signing_key.pem)
  7. Generate ATProto OAuth client JWK (/srv/data/client_jwk.json)
  8. Publish client metadata at https://robot.wtf/auth/client-metadata.json
  9. Initialize SQLite database (run migration script)
  10. Download MiniLM model to /srv/embeddings/model/
  11. Start services, verify health checks

Monitoring

For a volunteer-run service, keep monitoring simple:

  • Health checks: Each service exposes a /health endpoint. An external monitor (UptimeRobot, free tier) pings them.
  • Logs: systemd journal or Docker logs. journalctl -u robot-otterwiki --since "1 hour ago" is sufficient at this scale.
  • Disk space: A cron job that alerts (email or Bluesky DM) when disk usage exceeds 80%.
  • Backups: The backup cron job logs success/failure. Alert on failure.

If the service grows, add Prometheus + Grafana. Not before.


URL Structure

Every wiki gets a subdomain: {slug}.robot.wtf. The slug is the wiki's globally unique identifier.

sderle.robot.wtf/                   → wiki web UI (Otterwiki)
sderle.robot.wtf/api/v1/            → wiki REST API
sderle.robot.wtf/mcp                → wiki MCP endpoint
sderle.robot.wtf/repo.git/*         → git smart HTTP (read-only clone)

For the single-wiki-per-user model, the wiki slug is the username. You sign up as sderle, your wiki lives at sderle.robot.wtf.

The management app and auth live on the root domain:

robot.wtf/                          → landing page
robot.wtf/app/                      → management SPA (dashboard)
robot.wtf/app/settings              → wiki settings
robot.wtf/app/collaborators         → collaborator management
robot.wtf/app/connect               → MCP connection instructions
robot.wtf/app/account               → account settings
robot.wtf/auth/*                    → OAuth flows (ATProto + MCP AS)
robot.wtf/api/*                     → management API

Namespace rules

Slugs and usernames are the same thing (each user gets one wiki, the slug IS the username). Reserved names blocked for signup: api, auth, app, www, admin, mcp, docs, status, blog, help, support, static, assets, null, undefined, wiki, robot.


What changes vs. what stays the same

Stays the same

  • ACL model (owner/editor/viewer roles, same permission matrix)
  • Otterwiki proxy header mechanism (x-otterwiki-email, x-otterwiki-name, x-otterwiki-permissions)
  • Multi-tenant middleware logic (resolve slug → look up wiki → check ACL → set headers → delegate)
  • MCP tools (read_note, write_note, search, semantic_search, list_notes, etc.)
  • REST API surface (same endpoints, same request/response shapes)
  • Wiki bootstrap template
  • FAISS + MiniLM semantic search
  • Otterwiki admin panel disposition (same sections hidden/shown)

Changes

Component           AWS architecture                          VPS architecture
Domain              wikibot.io                                robot.wtf
Business model      Freemium SaaS                             Free volunteer project
Hosting             Lambda + EFS + API Gateway                Gunicorn + local disk + Caddy
Host environment    AWS (managed)                             Debian 12 VM on Proxmox
Database            DynamoDB (on-demand)                      SQLite (WAL mode)
Auth provider       WorkOS AuthKit                            ATProto OAuth (self-hosted)
MCP OAuth AS        WorkOS (standalone connect)               Self-hosted OAuth 2.1 AS
Identity            OAuth provider sub (Google/GitHub/etc.)   ATProto DID
TLS                 ACM + CloudFront                          Caddy + Let's Encrypt
Embedding trigger   DynamoDB Streams → Lambda                 SQLite queue → background worker
Static hosting      S3 + CloudFront                           Caddy file_server
IaC                 Pulumi                                    systemd units or Docker Compose
Secrets             Secrets Manager                           Files on disk
Backups             AWS Backup + DynamoDB PITR                rsync + SQLite .backup + Proxmox snapshots
Billing             Stripe (planned)                          None
Cost                ~$13–18/mo at launch                      $0

What can be reused from existing implementation

  • Multi-tenant middleware — remove Mangum wrapper, the WSGI middleware is underneath
  • MCP server tools — identical, just change the repo path prefix
  • REST API handlers — swap DynamoDB calls for SQLite queries
  • Otterwiki fork — identical, same proxy header auth mode
  • Semantic search plugin — identical
  • FAISS indexing code — identical
  • Frontend SPA — identical (change VITE_API_BASE_URL, remove WorkOS client ID)
  • Wiki bootstrap template — identical
  • ACL checking logic — swap DynamoDB reads for SQLite reads

Open Questions

  1. ATProto Python OAuth library maturity. RESOLVED. The Bluesky Flask demo uses authlib (not joserfc — earlier research was wrong). Dependencies are authlib>=1.3, dnspython, requests, requests-hardened, regex. All mature. The demo is ~600 lines, well-factored, and directly adaptable. See Dev/V3_V5_Risk_Research.

  2. MCP OAuth AS scope. RESOLVED. authlib provides AuthorizationServer, AuthorizationCodeGrant (with PKCE), and ClientRegistrationEndpoint (RFC 7591). The Flask OAuth 2.0 server components handle the heavy lifting. We implement model callbacks (save_client, save_token, query_client) against SQLite. See Dev/V3_V5_Risk_Research for implementation sketch.

  3. Caddy DNS challenge provider. Wildcard TLS requires DNS API access. Which DNS provider hosts the robot.wtf zone? Cloudflare, Route 53, and OVHcloud are all supported by Caddy. The DNS provider choice should be made before deployment.

  4. Claude.ai MCP OAuth compatibility. The self-hosted OAuth 2.1 AS approach should work — Claude.ai's MCP client follows standard OAuth 2.1 discovery. Key finding: Claude.ai uses client_secret_post auth method and does NOT require DPoP. The risk is in underdocumented client quirks. Mitigation: build a minimal stub AS early and test against Claude.ai before building the full thing. See Dev/V3_V5_Risk_Research.

  5. ATProto scopes. RESOLVED. The ATProto spec explicitly says: "A client may include only the atproto scope if they only need account authentication." The sub field in the token response contains the DID. We request scope "atproto" and nothing else. See Dev/V3_V5_Risk_Research.

  6. Docker Compose vs. systemd. Both work. Docker Compose gives you reproducible builds, isolation, and easier migration between hosts. Systemd is lighter, native to Debian, and avoids Docker's overhead. For a Proxmox VM where we control the environment completely, systemd is probably sufficient. Docker adds value if we expect to move the deployment frequently.