Properties
category: reference
tags: [design, prd, architecture, atproto, vps]
last_updated: 2026-03-14
confidence: medium

VPS Architecture (ATProto + OVHcloud)

Status: Draft — proposed alternative to the AWS serverless architecture
Replaces (if adopted): Design/Platform_Overview, Design/Auth, Design/Operations (infrastructure sections)
Preserves: ACL model, permission headers, MCP tools, Otterwiki multi-tenancy middleware, URL structure, semantic search logic, wiki bootstrap template, REST API surface, freemium tiers


Why this exists

The AWS serverless architecture described in Design/Platform_Overview works, but it optimizes for a problem we may not have yet: elastic scale and zero cost at rest. The tradeoff is complexity — VPC endpoints, Mangum adapters, DynamoDB Streams to avoid SQS endpoint costs, Lambda cold starts, EFS mount latency. All of that machinery exists to make Lambda work, not to make the wiki work.

A VPS on an OVHcloud community server for ATProto apps eliminates the hosting bill entirely and replaces the AWS complexity with a conventional deployment: persistent processes, local disk, SQLite, Caddy. The application logic — multi-tenant Otterwiki, MCP tools, semantic search, ACL enforcement — ports over with minimal changes. The middleware we already built for Lambda is WSGI middleware with a Mangum wrapper; removing the wrapper gives us back the WSGI middleware.

The ATProto identity system replaces WorkOS as the auth provider. Users sign in with their Bluesky handle (or any ATProto PDS account). Identity is a DID — portable, user-owned, and philosophically aligned with "your wiki is a git repo you can clone." The target audience (developers and researchers using AI agents) overlaps heavily with the ATProto early-adopter community, and the OVHcloud community server is specifically for ATProto apps.


Infrastructure

Server

OVHcloud community VPS for ATProto applications. Shared infrastructure, zero cost. The VPS runs Linux with Docker or systemd-managed services. If we ever need to leave the community server, the deployment is portable to any VPS provider (Hetzner, DigitalOcean, Fly.io, or back to AWS on an EC2 instance) — nothing is OVHcloud-specific.

Process model

Four persistent processes, managed by systemd or Docker Compose:

┌─────────────────────────────────────────────────────────────────┐
│  Caddy (reverse proxy, TLS)                                     │
│  *.{domain} + {domain}                                          │
│                                                                 │
│  Routes:                                                        │
│    {slug}.{domain}/mcp          → MCP sidecar (port 8001)       │
│    {slug}.{domain}/api/v1/*     → REST API (port 8002)          │
│    {slug}.{domain}/repo.git/*   → Git smart HTTP (port 8002)    │
│    {slug}.{domain}/*            → Otterwiki WSGI (port 8000)    │
│    {domain}/auth/*              → Auth service (port 8003)      │
│    {domain}/api/*               → Management API (port 8002)    │
│    {domain}/app/*               → Static files (SPA)            │
│    {domain}                     → Static files (landing page)   │
└────────┬───────────┬───────────┬───────────┬────────────────────┘
         │           │           │           │
    ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐
    │Otterwiki│ │   MCP   │ │Platform │ │  Auth   │
    │  WSGI   │ │ sidecar │ │   API   │ │ service │
    │Gunicorn │ │ FastMCP │ │  Flask  │ │  Flask  │
    │ :8000   │ │ :8001   │ │ :8002   │ │ :8003   │
    └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
         │           │           │           │
    ┌────▼───────────▼───────────▼───────────▼───┐
    │  Shared resources                          │
    │  /srv/wikis/{slug}/repo.git   (git)        │
    │  /srv/wikis/{slug}/index.faiss (vectors)   │
    │  /srv/data/wikibot.db         (SQLite)     │
    │  /srv/data/embeddings/        (model)      │
    └────────────────────────────────────────────┘

Caddy

Caddy handles TLS termination, automatic Let's Encrypt certificates (including wildcard via DNS challenge), and reverse proxy routing. It replaces API Gateway + CloudFront + ACM.

Wildcard TLS requires a DNS challenge. Caddy supports this natively with plugins for common DNS providers (Cloudflare, Route 53, OVHcloud). The DNS zone for {domain} needs API credentials configured in Caddy.

Caddy's routing is order-sensitive and matcher-based. The Caddyfile structure:

{domain} {
    handle /auth/* {
        reverse_proxy localhost:8003
    }
    handle /api/* {
        reverse_proxy localhost:8002
    }
    handle /app/* {
        root * /srv/static/app
        try_files {path} /app/index.html
        file_server
    }
    handle {
        root * /srv/static/landing
        file_server
    }
}

*.{domain} {
    @mcp path /mcp /mcp/*
    handle @mcp {
        reverse_proxy localhost:8001
    }

    @api path /api/v1/*
    handle @api {
        reverse_proxy localhost:8002
    }

    @git path /repo.git/*
    handle @git {
        reverse_proxy localhost:8002
    }

    handle {
        reverse_proxy localhost:8000
    }
}

The {slug} is extracted from the Host header by the downstream services, not by Caddy. Caddy just routes to the right backend; the backend resolves the tenant.

Why not Nginx

Caddy's automatic TLS (including wildcard via DNS challenge) eliminates certbot, cron renewal, and manual certificate management. For a single-operator deployment where the admin might not be around to fix a cert renewal failure, this matters. Nginx is more configurable but requires more maintenance. If we needed fine-grained caching rules or complex rewrite logic, Nginx would be worth the tradeoff. We don't.


Authentication

Identity model

User identity is an ATProto DID (Decentralized Identifier). A DID is a persistent, portable identifier that survives handle changes and PDS migrations. When a user logs in, we resolve their handle to a DID and store the DID as the primary key.

User {
  did: string,                 // e.g. "did:plc:abc123..." — primary identifier
  handle: string,              // e.g. "sderle.bsky.social" — display name, may change
  display_name: string,        // from ATProto profile
  avatar_url?: string,         // from ATProto profile
  username: string,            // platform username, chosen at signup (URL slug)
  created_at: ISO8601,
  wiki_count: number,
}

The did is the stable identity. The handle is refreshed from the PDS on each login (handles can change). The username is the platform-local slug used in URLs — it's chosen at signup and immutable for MVP, just like the current design.

ATProto OAuth (browser login)

Wikibot is an ATProto OAuth confidential client. The flow:

1. User enters their handle (e.g. "sderle.bsky.social") on the login page
2. Wikibot resolves the handle to a DID, then resolves the DID to a PDS URL
3. Wikibot fetches the PDS's Authorization Server metadata
   (GET {pds}/.well-known/oauth-authorization-server)
4. Wikibot sends a Pushed Authorization Request (PAR) to the PDS's AS,
   including PKCE code_challenge and DPoP proof
5. User is redirected to their PDS's authorization interface
6. User approves the authorization request
7. PDS redirects back to {domain}/auth/callback with an authorization code
8. Wikibot exchanges the code for tokens (access_token + refresh_token)
   with DPoP binding and client authentication (signed JWT)
9. Wikibot uses the access token to fetch the user's profile (DID, handle,
   display name) from their PDS
10. Wikibot mints a platform JWT, sets it as an HttpOnly cookie on .{domain}
11. Redirect to {domain}/app/

The platform JWT is signed with our own RS256 key (stored on disk, not in Secrets Manager). After step 10, the PDS is not in the runtime path — the platform JWT is self-contained and validated locally. ATProto tokens are stored in the session database for potential future use (e.g., posting to Bluesky on behalf of the user), but they're not needed for wiki operations.

Reference implementation

Bluesky maintains a Python Flask OAuth demo in bluesky-social/cookbook/python-oauth-web-app (CC-0 licensed). It implements the full ATProto OAuth flow as a confidential client using authlib for PKCE and DPoP, with joserfc for JWT/JWK handling. This is the starting point for our auth service. It handles the hard parts: handle-to-DID resolution, PDS Authorization Server discovery, PAR, DPoP nonce management, and token refresh.

Key libraries from the reference implementation:

  • authlib — PKCE, code challenge, general OAuth utilities
  • joserfc — JWK generation, JWT signing/verification, DPoP proof creation
  • requests — HTTP client for PDS communication (the demo includes a hardened HTTP client with SSRF mitigations)

MCP OAuth (Claude.ai)

This is the most architecturally significant auth flow. Claude.ai's MCP client implements standard OAuth 2.1 with Dynamic Client Registration (DCR). It discovers the Authorization Server by fetching /.well-known/oauth-protected-resource from the MCP endpoint. The AS must support DCR, PKCE, and standard token endpoints.

ATProto's OAuth profile is not directly compatible with this — ATProto uses per-user Authorization Servers (each user's PDS), whereas Claude.ai expects a single AS URL from the resource metadata endpoint.

Solution: wikibot runs its own OAuth 2.1 Authorization Server for MCP.

1. Claude.ai connects to https://{slug}.{domain}/mcp
2. Gets 401, fetches /.well-known/oauth-protected-resource
3. Discovers wikibot's AS at https://{domain}/auth/oauth
4. Performs Dynamic Client Registration at {domain}/auth/oauth/register
5. Redirects user to {domain}/auth/oauth/authorize
6. User sees wikibot's consent page:
   - If already logged in (platform JWT cookie): "Authorize Claude to access {wiki}?"
   - If not logged in: "Sign in with Bluesky" → ATProto OAuth flow → then consent
7. User approves, wikibot issues authorization code
8. Claude.ai exchanges code for access token at {domain}/auth/oauth/token
9. Claude.ai uses access token to make MCP requests
10. MCP sidecar validates token against wikibot's JWKS

Wikibot's MCP OAuth AS is a thin layer. It delegates authentication to ATProto (step 6) and handles authorization itself (does this user have access to this wiki?). The token it issues is a JWT containing the user's DID and the authorized wiki slug, signed with our RS256 key.

Required OAuth 2.1 AS endpoints:

Endpoint                                 Purpose
/.well-known/oauth-authorization-server  AS metadata (issuer, endpoints, supported grants)
/auth/oauth/register                     Dynamic Client Registration (RFC 7591)
/auth/oauth/authorize                    Authorization endpoint (consent page)
/auth/oauth/token                        Token endpoint (code exchange, refresh)
/.well-known/jwks.json                   Public key for token validation

These can be implemented with authlib's server components or hand-rolled (the spec surface is small — DCR, authorization code grant with PKCE, token issuance, JWKS).

MCP protected resource metadata

Each wiki's MCP endpoint serves its own resource metadata:

// GET https://{slug}.{domain}/.well-known/oauth-protected-resource
{
  "resource": "https://{slug}.{domain}/mcp",
  "authorization_servers": ["https://{domain}/auth/oauth"],
  "scopes_supported": ["wiki:read", "wiki:write"]
}

All wikis point to the same AS. The AS knows which wiki is being authorized because the redirect_uri and resource parameter identify the wiki.

Bearer tokens (Claude Code / API)

Unchanged from the current design. Each wiki gets a bearer token at creation time, stored as a bcrypt hash in the database. The user sees the token once. Claude Code usage:

claude mcp add {slug} \
  --transport http \
  --url https://{slug}.{domain}/mcp \
  --header "Authorization: Bearer YOUR_TOKEN"
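A sketch of the token issuance and verification described above. The design specifies bcrypt; this example uses the stdlib's `hashlib.scrypt` as a dependency-free stand-in, and the function names are illustrative:

```python
# Sketch: issue a bearer token (shown to the user once) and store only a
# salted hash. hashlib.scrypt stands in for bcrypt here to keep the
# example free of third-party dependencies.
import hashlib
import hmac
import secrets

def new_mcp_token() -> tuple[str, str]:
    """Return (plaintext_token, stored_hash)."""
    token = secrets.token_urlsafe(32)
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(token.encode(), salt=salt, n=2**14, r=8, p=1)
    return token, salt.hex() + ":" + digest.hex()

def check_mcp_token(token: str, stored: str) -> bool:
    salt_hex, digest_hex = stored.split(":")
    digest = hashlib.scrypt(token.encode(), salt=bytes.fromhex(salt_hex),
                            n=2**14, r=8, p=1)
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(digest.hex(), digest_hex)
```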

Cross-subdomain auth

Same approach as Design/Frontend: platform JWT stored as an HttpOnly, Secure, SameSite=Lax cookie on .{domain}. Every request to any subdomain includes the cookie. The Otterwiki middleware and MCP sidecar both validate JWTs using the same public key.

Auth convergence

All three paths converge on the same identity and the same ACL check:

Browser       → ATProto OAuth → platform JWT (cookie)   → resolve DID  → ACL check
Claude.ai     → MCP OAuth 2.1 → MCP access token (JWT)  → resolve DID  → ACL check
Claude Code   → Bearer token  → hash lookup in DB       → resolve user → ACL check

All paths → middleware → sets Otterwiki proxy headers (or authorizes MCP/API request)

Migration off ATProto

We store the DID as the primary user identifier, not the handle or PDS URL. If ATProto auth needs to be replaced, the migration path is:

  • Add alternative OAuth providers (Google, GitHub) alongside ATProto
  • Link new provider identities to existing DIDs via an identity_links table
  • Existing users continue to work; new users can sign up with either method

This is simpler than the WorkOS migration path in the original design because we already own the JWT-issuing layer — we're not migrating off a third-party token issuer.


Data Model

SQLite replaces DynamoDB

The dataset is small even at 1000 users. SQLite on local disk is simpler, faster, and free. The application layer uses SQLAlchemy (or raw sqlite3 — the schema is simple enough). If the deployment ever needs Postgres, the migration is straightforward.

The SQLite database lives at /srv/data/wikibot.db. Write concurrency is handled by SQLite's WAL mode, which supports concurrent reads with serialized writes. For a wiki platform where writes are infrequent relative to reads, this is more than adequate.

Tables

CREATE TABLE users (
    did TEXT PRIMARY KEY,              -- ATProto DID
    handle TEXT NOT NULL,              -- ATProto handle (may change)
    display_name TEXT,
    avatar_url TEXT,
    username TEXT UNIQUE NOT NULL,     -- platform slug, immutable
    created_at TEXT NOT NULL,          -- ISO8601
    wiki_count INTEGER DEFAULT 0
);

CREATE TABLE wikis (
    slug TEXT PRIMARY KEY,             -- globally unique, URL slug
    owner_did TEXT NOT NULL REFERENCES users(did),
    display_name TEXT NOT NULL,
    repo_path TEXT NOT NULL,           -- /srv/wikis/{slug}/repo.git
    mcp_token_hash TEXT NOT NULL,      -- bcrypt hash
    is_public INTEGER DEFAULT 0,
    is_paid INTEGER DEFAULT 0,
    payment_status TEXT DEFAULT 'free', -- 'free' | 'active' | 'lapsed'
    created_at TEXT NOT NULL,
    last_accessed TEXT NOT NULL,
    page_count INTEGER DEFAULT 0
);

CREATE TABLE acls (
    wiki_slug TEXT NOT NULL REFERENCES wikis(slug),
    grantee_did TEXT NOT NULL REFERENCES users(did),
    role TEXT NOT NULL,                -- 'owner' | 'editor' | 'viewer'
    granted_by TEXT NOT NULL,
    granted_at TEXT NOT NULL,
    PRIMARY KEY (wiki_slug, grantee_did)
);

CREATE TABLE oauth_sessions (
    id TEXT PRIMARY KEY,               -- session ID
    user_did TEXT NOT NULL REFERENCES users(did),
    dpop_private_jwk TEXT NOT NULL,    -- DPoP key (encrypted at rest)
    access_token TEXT,
    refresh_token TEXT,
    token_expires_at TEXT,
    created_at TEXT NOT NULL
);

CREATE TABLE mcp_oauth_clients (
    client_id TEXT PRIMARY KEY,        -- DCR-issued client ID
    client_name TEXT,
    redirect_uris TEXT NOT NULL,       -- JSON array
    client_secret_hash TEXT,           -- for confidential clients
    created_at TEXT NOT NULL
);

CREATE TABLE reindex_queue (
    wiki_slug TEXT NOT NULL,
    page_path TEXT NOT NULL,
    action TEXT NOT NULL,              -- 'upsert' | 'delete'
    queued_at TEXT NOT NULL,
    PRIMARY KEY (wiki_slug, page_path)
);

Storage layout

/srv/
  wikis/
    {slug}/
      repo.git/              # bare git repo
      index.faiss            # FAISS vector index
      embeddings.json        # page_path → vector mapping
  data/
    wikibot.db               # SQLite database
    signing_key.pem          # RS256 private key for JWT signing
    signing_key.pub          # RS256 public key
    client_jwk.json          # ATProto OAuth confidential client JWK (private)
    client_jwk_pub.json      # ATProto OAuth client JWK (public, served at client_id URL)
  static/
    landing/                 # landing page HTML/CSS/JS
    app/                     # management SPA
  embeddings/
    model/                   # all-MiniLM-L6-v2 model files
  backups/                   # local backup staging

Compute

Otterwiki (WSGI)

Otterwiki runs as a persistent Gunicorn process. The multi-tenant middleware we built for Lambda ports back to WSGI by removing the Mangum wrapper. The middleware:

  1. Extracts the wiki slug from the Host header
  2. Looks up the wiki in SQLite
  3. Resolves the user from the platform JWT (cookie) or bearer token
  4. Checks ACL permissions
  5. Sets Otterwiki proxy headers (x-otterwiki-email, x-otterwiki-name, x-otterwiki-permissions)
  6. Swaps Otterwiki's config to point at the correct repo path
  7. Delegates to Otterwiki's Flask app

The config-swapping is the multi-tenancy mechanism we already built. In Lambda, it happened per-invocation; in WSGI, it happens per-request. The difference is negligible — the config is a handful of in-memory variables, not file I/O.
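The middleware steps can be sketched as a plain WSGI wrapper. Everything here besides the WSGI interface itself (the environ keys, the lookup callable) is an illustrative assumption, not the actual middleware:

```python
# Sketch: per-request multi-tenancy as WSGI middleware. Resolve the tenant
# from the Host header, stash per-request config in environ, delegate to
# the wrapped Flask app. Key names are hypothetical.
class MultiTenantMiddleware:
    def __init__(self, app, lookup_wiki):
        self.app = app              # Otterwiki's WSGI callable
        self.lookup = lookup_wiki   # slug -> wiki record, or None

    def __call__(self, environ, start_response):
        host = environ.get("HTTP_HOST", "")
        slug = host.split(":")[0].split(".")[0]
        wiki = self.lookup(slug)
        if wiki is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"no such wiki\n"]
        # The per-request "config swap": downstream code reads these
        # values instead of a process-global config object.
        environ["wikibot.repo_path"] = wiki["repo_path"]
        environ["HTTP_X_OTTERWIKI_PERMISSIONS"] = wiki.get("perms", "READ")
        return self.app(environ, start_response)
```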

Gunicorn runs with multiple workers (e.g., 4 workers for a small VPS). Each worker handles one request at a time. Git write operations are serialized per-repo by git's own lock file, same as on EFS.

MCP sidecar (FastMCP)

FastMCP runs as a separate process serving Streamable HTTP on port 8001. It reads git repos directly from /srv/wikis/{slug}/repo.git — same code as the current MCP server, same tools, same return formats.

The sidecar validates MCP OAuth tokens (JWTs signed by our AS) and bearer tokens (bcrypt hash lookup in SQLite). Token validation is the same logic as the Otterwiki middleware, factored into a shared library.

Why a separate process: Otterwiki is a Flask app designed around page rendering. The MCP server is an async protocol handler. Mixing them in one process would require either making Otterwiki async (large refactor) or running FastMCP synchronously (defeats the purpose). Separate processes, same database, same git repos.

Platform API (Flask)

A lightweight Flask app handling the management API (wiki CRUD, ACL management, token generation) and the Git smart HTTP protocol. This is the same API surface described in Design/Implementation_Phases, with SQLite queries instead of DynamoDB calls.

The Git smart HTTP endpoints (/repo.git/info/refs, /repo.git/git-upload-pack, /repo.git/git-receive-pack) use dulwich to serve the bare repos on disk. Free tier gets read-only (upload-pack only); premium gets read-write.

Auth service (Flask)

Handles both ATProto OAuth (browser login) and the MCP OAuth 2.1 AS. Runs as its own process because the OAuth flows involve redirects and state management that are cleaner in isolation.

This could be merged into the platform API process. Separating it keeps the auth code (which is security-critical and relatively complex) isolated from the CRUD endpoints. If the separation proves to be operationally annoying, merge them — they're both Flask apps talking to the same SQLite database.


Embedding Pipeline

The embedding pipeline simplifies dramatically on a VPS. No DynamoDB Streams, no event source mappings, no separate embedding Lambda. MiniLM loads once at process startup and stays in memory.

Write path

Page write (Otterwiki or MCP)
  → Middleware writes {wiki_slug, page_path, action} to reindex_queue table in SQLite
  → Background worker (in-process thread or separate process) polls the queue:
      1. Read page content from git repo on disk
      2. Chunk page
      3. Embed chunks using MiniLM (already loaded in memory)
      4. Update FAISS index on disk
      5. Delete queue entry

The background worker can be a simple thread in the Otterwiki process (using Python's threading or concurrent.futures), a separate huey or rq worker, or even a cron job that runs every 30 seconds. The latency requirement is loose — research wikis are written by AI agents and searched minutes later.

For simplicity, start with an in-process thread pool. If it causes issues (GIL contention under load, memory pressure from MiniLM in every Gunicorn worker), move to a dedicated worker process that loads MiniLM once and processes the queue.
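A sketch of the dedicated worker variant. The `process_page` callable (read from git, chunk, embed, update FAISS) is assumed, as are the function names:

```python
# Sketch: queue-draining worker. Polls reindex_queue, processes a batch,
# deletes handled rows. The process_page callable encapsulates the
# chunk -> embed -> FAISS steps and is assumed here.
import sqlite3
import time

def drain_queue(conn: sqlite3.Connection, process_page) -> int:
    """One polling pass; returns the number of entries handled."""
    rows = conn.execute(
        "SELECT wiki_slug, page_path, action FROM reindex_queue "
        "ORDER BY queued_at LIMIT 50").fetchall()
    for wiki_slug, page_path, action in rows:
        process_page(wiki_slug, page_path, action)
        conn.execute(
            "DELETE FROM reindex_queue WHERE wiki_slug=? AND page_path=?",
            (wiki_slug, page_path))
    conn.commit()
    return len(rows)

def run_worker(conn, process_page, interval: float = 30.0) -> None:
    while True:                 # loose latency is fine; sleep when idle
        if drain_queue(conn, process_page) == 0:
            time.sleep(interval)
```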

Search path

Synchronous, handled by the MCP sidecar or REST API:

  1. MiniLM is loaded at process startup (the MCP sidecar and API processes both load it)
  2. Embed the query
  3. Load FAISS index from disk (cached in memory after first load)
  4. Search, deduplicate, return results

On a VPS, loading the FAISS index is a local disk read (<1ms for a typical wiki). No EFS mount latency, no Lambda cold start loading the model.

Model loading strategy

MiniLM (~80MB) loads in ~500ms. On a VPS with persistent processes, this happens once at startup. In the Lambda architecture, it happened on every cold start. This is one of the clearest wins of the VPS approach.

If memory is tight on the shared VPS, only the MCP sidecar needs MiniLM loaded (it handles semantic search). The Otterwiki process and platform API don't need it — they just write to the reindex queue.


Backup and Disaster Recovery

What we're protecting

Data             Location                           Severity of loss
Git repos        /srv/wikis/*/repo.git              Critical — user data
SQLite database  /srv/data/wikibot.db               High — reconstructable from repos but painful
FAISS indexes    /srv/wikis/*/index.faiss           Low — rebuildable from repo content
Signing keys     /srv/data/*.pem, /srv/data/*.json  High — loss invalidates all active sessions

Backup strategy

Git repos: rsync to offsite storage (a second VPS, an S3 bucket, or a Backblaze B2 bucket). Daily, with a cron job. Repos are bare git — rsync handles them efficiently. Also: users can git clone their own repos at any time, which is distributed backup by design.

SQLite: .backup command (online backup, doesn't block writes in WAL mode) to a local snapshot file, then rsync offsite with the git repos. Daily.

Signing keys: Backed up once at creation time, stored separately from the data backups (e.g., in a password manager or encrypted at rest on a different system). These rarely change.

FAISS indexes: Not backed up. Rebuildable from repo content. Loss triggers a one-time re-embedding — seconds per wiki.

Recovery

If the VPS dies completely, recovery is:

  1. Provision a new VPS (any provider)
  2. Install dependencies, deploy application code
  3. Restore signing keys
  4. Restore SQLite database from backup
  5. Restore git repos from backup (or users re-push from their clones)
  6. Re-embed all wikis (automated script, runs in minutes)
  7. Update DNS to point to new VPS

RTO: hours (mostly limited by repo restore transfer time). RPO: 24 hours (daily backup cycle). This is acceptable for a free/community service. If tighter RPO is needed, increase backup frequency or add streaming replication to a standby.


Deployment

Application deployment

Code lives in a Git repo. Deployment is git pull + restart services. No Pulumi, no CloudFormation, no CI/CD pipeline required (though one can be added).

ssh vps
cd /srv/app
git pull
pip install -r requirements.txt --break-system-packages
sudo systemctl restart wikibot-otterwiki
sudo systemctl restart wikibot-mcp
sudo systemctl restart wikibot-api
sudo systemctl restart wikibot-auth
# Caddy doesn't need restart for app deploys

Or with Docker Compose:

ssh vps
cd /srv/app
git pull
docker compose build
docker compose up -d

Initial setup

  1. Provision VPS, install OS packages (Python 3.11+, git, Caddy)
  2. Configure DNS: {domain} and *.{domain} pointing to VPS IP
  3. Configure Caddy with DNS challenge credentials for wildcard TLS
  4. Generate RS256 signing keypair
  5. Generate ATProto OAuth client JWK
  6. Publish client metadata at https://{domain}/auth/client-metadata.json
  7. Initialize SQLite database (run migration script)
  8. Download MiniLM model to /srv/embeddings/model/
  9. Start services

Monitoring

For a community-hosted service, keep monitoring simple:

  • Health checks: Each service exposes a /health endpoint. Caddy or an external monitor (UptimeRobot, free tier) pings them.
  • Logs: systemd journal or Docker logs. No ELK stack, no CloudWatch. journalctl -u wikibot-otterwiki --since "1 hour ago" is sufficient at this scale.
  • Disk space: A cron job that alerts (email or Bluesky DM) when disk usage exceeds 80%.
  • Backups: The backup cron job logs success/failure. Alert on failure.
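The disk-space check above fits in a few lines of stdlib Python. The alert transport (email or Bluesky DM) is stubbed as a callable; the 80% threshold matches the bullet, and the function name is illustrative:

```python
# Sketch: cron-driven disk usage check. Alerts through an injected
# callable when usage crosses the threshold; returns the fraction used
# so the cron job can also log it.
import shutil

def check_disk(path: str = "/srv", threshold: float = 0.80,
               alert=print) -> float:
    usage = shutil.disk_usage(path)
    frac = usage.used / usage.total
    if frac > threshold:
        alert(f"disk usage on {path} at {frac:.0%}")
    return frac
```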

If the service grows, add Prometheus + Grafana. Not before.


What changes vs. what stays the same

Stays the same

  • ACL model (owner/editor/viewer roles, same permission matrix)
  • Otterwiki proxy header mechanism (x-otterwiki-email, x-otterwiki-name, x-otterwiki-permissions)
  • Multi-tenant middleware logic (resolve slug → look up wiki → check ACL → set headers → delegate)
  • MCP tools (read_note, write_note, search, semantic_search, list_notes, etc.)
  • REST API surface (same endpoints, same request/response shapes)
  • URL structure ({slug}.{domain}/ for wikis, {domain}/app/ for management)
  • Wiki bootstrap template
  • FAISS + MiniLM semantic search
  • Freemium tier model and limits
  • Lapse policy (read-only + MCP disabled)
  • Git remote access (read-only free, read-write premium)
  • Frontend SPA (same screens, same Svelte app, served by Caddy instead of CloudFront)
  • Otterwiki admin panel disposition (same sections hidden/shown)

Changes

Component          AWS architecture                              VPS architecture
Hosting            Lambda + EFS + API Gateway                    Gunicorn + local disk + Caddy
Database           DynamoDB (on-demand)                          SQLite (WAL mode)
Auth provider      WorkOS AuthKit                                ATProto OAuth (self-hosted)
MCP OAuth AS       WorkOS (standalone connect)                   Self-hosted OAuth 2.1 AS
Identity           OAuth provider sub (Google/GitHub/etc.)       ATProto DID
TLS                ACM + CloudFront                              Caddy + Let's Encrypt
Embedding trigger  DynamoDB Streams → Lambda                     SQLite queue → background worker
Static hosting     S3 + CloudFront                               Caddy file_server
IaC                Pulumi                                        systemd units or Docker Compose
Secrets            Secrets Manager (Phase 4)                     Files on disk (encrypted at rest via LUKS or similar)
Backups            AWS Backup + DynamoDB PITR                    rsync + SQLite .backup
Cost at rest       ~$0.50/mo (Phase 0–3), ~$13–18/mo (Phase 4+)  $0 (community server)
Cost at 1K users   ~$15–20/mo                                    $0 (community server)

What can be reused from existing implementation

  • Multi-tenant middleware — remove Mangum wrapper, the WSGI middleware is underneath
  • MCP server tools — identical, just change the repo path prefix
  • REST API handlers — swap DynamoDB calls for SQLite queries
  • Otterwiki fork — identical, same proxy header auth mode
  • Semantic search plugin — identical
  • FAISS indexing code — identical
  • Frontend SPA — identical (change VITE_API_BASE_URL, remove WorkOS client ID)
  • Wiki bootstrap template — identical
  • ACL checking logic — swap DynamoDB reads for SQLite reads

Open Questions

  1. ATProto Python OAuth library maturity. The Bluesky Flask demo uses authlib + joserfc and is CC-0 licensed. It's a reference implementation, not a maintained library. We'd be copying and adapting it, not importing a package. Is the DPoP/PAR implementation battle-tested enough, or do we need to audit it carefully?

  2. MCP OAuth AS scope. Building a spec-compliant OAuth 2.1 AS (with DCR, PKCE, token refresh, JWKS) is a meaningful amount of work. authlib has server-side components that can handle some of this. How much can we lean on authlib vs. hand-rolling? The Bluesky Flask demo is client-side only.

  3. Shared VPS resource constraints. A community server has finite RAM and CPU. MiniLM (~80MB in memory per process that loads it), Gunicorn workers, FAISS indexes, and SQLite all compete for resources. What are the actual resource limits on the OVHcloud community server? This determines how many Gunicorn workers we can run and whether the embedding worker should be in-process or separate.

  4. Domain name. The domain appears throughout the architecture (Caddy config, ATProto client metadata, JWT issuer, MCP resource metadata). What domain are we using? The ATProto client metadata URL IS the client_id in the protocol — it needs to be stable. Changing the domain later means re-registering the client and invalidating all active sessions.

  5. Caddy DNS challenge provider. Wildcard TLS requires DNS API access. Which DNS provider hosts the zone, and does Caddy have a plugin for it? Cloudflare, Route 53, and OVHcloud are all supported. The DNS provider choice should be made before deployment.

  6. Account creation UX with ATProto. When a new user arrives, they enter their Bluesky handle and go through the ATProto OAuth flow. When they come back, we need them to pick a platform username (for their wiki slug). The current design has username selection at signup — this still works, but the flow is: enter handle → authorize on PDS → pick username → create wiki. Is that smooth enough, or should we default the username to their handle (minus the .bsky.social suffix) and let them change it?

  7. Claude.ai MCP OAuth compatibility. The self-hosted OAuth 2.1 AS approach should work — Claude.ai's MCP client follows standard OAuth 2.1 discovery. But the actual implementation needs testing against Claude.ai's specific client behavior (which headers it sends, how it handles token refresh, whether it supports DPoP). The GitHub issues around Claude.ai MCP OAuth suggest it can be finicky. Plan for a debugging cycle.

  8. ATProto scopes. The ATProto OAuth spec has "transitional" scopes (transition:generic). We only need authentication (identity), not authorization to act on the user's PDS. Is there a read-only or identity-only scope, or do we request transition:generic and just not use the access token for anything beyond profile fetching?