This page is part of the wikibot.io PRD (Product Requirements Document). See also: Design/Platform_Overview, Design/Data_Model, Design/Auth, Design/Implementation_Phases.
Infrastructure cost model
Fixed monthly costs by phase (dev environment, 1 AZ):
| Phase | What's added | New fixed cost | Cumulative |
|---|---|---|---|
| 0–3 | EFS + VPC + gateway endpoints (DynamoDB, S3) + Route 53 | $0.50 | ~$0.50/mo |
| 4 (pre-launch) | Secrets Manager endpoint (secrets out of env vars), WAF | ~$12-17 | ~$13-18/mo |
| 5+ (premium) | (no new endpoints — semantic search uses DynamoDB Streams + local MiniLM) | ~$0 | ~$13-18/mo |
| When needed | WorkOS custom domain | $99 | +$99/mo |
Phases 0–3 use Pulumi-managed environment variables for secrets (redeploy to rotate). Secrets Manager is introduced pre-launch when rotation without redeployment and audit trails matter.
Production (2 AZs) doubles interface endpoint costs: ~$1/mo (Phases 0–3), ~$26-36/mo (Phase 4), ~$54-64/mo (Phase 5+).
Items that are always free or near-zero:
- EFS storage: $0.016/GB/mo IA — pennies at low data volume
- VPC itself: $0 (subnets, security groups, route tables have no hourly cost)
- DynamoDB on-demand: pay per request, negligible at low traffic
- Lambda: scales to zero
- CloudFront: free tier covers light traffic
- WorkOS: free to 1M MAU
Variable costs scale with usage:
| Item | Cost | At 1K users | At 10K users |
|---|---|---|---|
| Lambda | $0.20/1M requests | ~$0.20 | ~$2 |
| DynamoDB on-demand | $1.25/1M writes, $0.25/1M reads | ~$1 | ~$5 |
| EFS IA storage | $0.016/GB/month | ~$0.03 | ~$0.30 |
| CloudFront | Free tier covers 1TB/mo | ~$0 | ~$0 |
| Bedrock | N/A (eliminated) | $0 | $0 |
| WorkOS | $0 to 1M MAU | $0 | $0 |
Why it's not zero: EFS requires Lambda to run in a VPC. EFS itself is accessed via mount targets in the VPC (no endpoint needed). But VPC Lambda can't reach other AWS services (DynamoDB, S3) over the public internet — it needs either a NAT Gateway ($32/mo — too expensive) or VPC endpoints. Gateway endpoints (DynamoDB, S3) are free. Interface endpoints (Secrets Manager) cost ~$7/mo/AZ — only needed when Secrets Manager is introduced pre-launch (Phase 4). Bedrock and SQS endpoints were originally planned but have been eliminated by switching to DynamoDB Streams and local MiniLM embeddings.
Bottom line: ~$0.50/mo at rest in Phase 0. ~$13-18/mo from Phase 4 (Secrets Manager endpoint + WAF). No further increase for premium features. "Near-zero cost at rest" is accurate.
Wiki Bootstrap Template
When a user creates a new wiki, the repo is initialized with a starter page set that teaches Claude how to use the wiki effectively. This is the onboarding experience — the user connects MCP, starts a conversation, and Claude already knows the conventions.
Initial pages
Home — Landing page with the wiki's name and purpose (user-provided at creation), links to the guide and any starter pages.
Meta/Wiki Usage Guide — Instructions for the AI assistant:
- Available MCP tools and what they do
- Session start protocol (read Home first, then check recent changes)
- Page conventions: frontmatter schema, WikiLink syntax, page size guidance (~250–800 words)
- Commit message format
- When to create new pages vs. update existing ones
- How to use categories and tags
- Gardening responsibilities (orphan detection, stale page review, link maintenance)
Meta/Page Template — A reference page showing the frontmatter schema, section structure, and WikiLink usage. Claude can copy this pattern when creating new pages.
Customization
The bootstrap template is parameterized by:
- Wiki name (provided at creation)
- Wiki purpose/description (optional, provided at creation)
- Category set (default set provided, user can customize later)
The default category set matches the existing schema (actor, event, trend, hypothesis, variable, reference, index) but users can define their own categories for different research domains.
Custom template repos (premium)
Premium users can create a wiki from any public (or authenticated) Git repo URL. The Lambda clones the template repo, strips its git history, and commits the contents as the wiki's initial state. This enables:
- Shared team templates ("our standard research wiki layout")
- Domain-specific starter kits (e.g., a policy analysis template, a technical due diligence template)
- Community-contributed templates (a future marketplace opportunity)
Implementation
The default template is stored on EFS (or bundled with the Lambda deployment). On wiki creation, the Lambda copies it to the new repo directory, performs string substitution for wiki name/purpose, and commits as the initial state. For custom templates (premium), the Lambda clones from the provided Git URL instead.
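The copy-and-substitute step can be sketched as below. This is a minimal illustration, assuming a `{{wiki_name}}`/`{{wiki_purpose}}` token syntax in the template files — the actual placeholder convention is not yet specified. The subsequent `git init` and initial commit (via gitpython or dulwich) are not shown.

```python
from pathlib import Path

def instantiate_template(template_dir: str, repo_dir: str,
                         wiki_name: str, wiki_purpose: str) -> list[str]:
    """Copy the bundled template into the new repo directory, substituting
    the wiki name/purpose placeholders; returns the relative paths written."""
    written = []
    for src in sorted(Path(template_dir).rglob("*.md")):
        rel = src.relative_to(template_dir)
        dst = Path(repo_dir) / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        text = (src.read_text(encoding="utf-8")
                   .replace("{{wiki_name}}", wiki_name)
                   .replace("{{wiki_purpose}}", wiki_purpose))
        dst.write_text(text, encoding="utf-8")
        written.append(str(rel))
    return written
```

For custom templates (premium), the same function runs after cloning the user's repo to a scratch directory and deleting its `.git`.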
Attachment Storage
Otterwiki stores attachments as regular files in the git repo and serves them directly from the working tree. With EFS, this works the same as on a VPS — no clone overhead, attachments are just files on a mounted filesystem.
MVP approach
Store attachments in the git repo as-is. Tier limits (50MB free, 1GB premium) keep repo sizes manageable. EFS serves files directly with no performance concern.
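Enforcing the tier limits can be as simple as a size check before accepting an upload — a sketch under the assumption that the quota applies to the working tree on EFS (function names are illustrative):

```python
import os

# 50MB free / 1GB premium, per the tier limits above
TIER_LIMITS_BYTES = {"free": 50 * 1024**2, "premium": 1024**3}

def repo_size_bytes(repo_dir: str) -> int:
    """Total size of all files under the wiki's working tree."""
    total = 0
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def attachment_allowed(repo_dir: str, tier: str, upload_bytes: int) -> bool:
    """Reject uploads that would push the repo past the tier cap."""
    return repo_size_bytes(repo_dir) + upload_bytes <= TIER_LIMITS_BYTES[tier]
```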
Future optimization: S3-backed attachment serving
If large attachments become a problem (EFS cost, Git remote clone times), decouple attachment storage from the git repo:
- On upload: store the attachment in S3 at a known path (`s3://attachments/{user}/{wiki}/…`), commit only a lightweight reference file to git (similar to Git LFS pointer format)
- On serve: intercept Otterwiki's attachment serving path, resolve the reference, and redirect to S3 (or serve via CloudFront)
This could be implemented as:
- Otterwiki plugin that hooks into the attachment upload/serve lifecycle
- Upstream patch to Otterwiki adding a pluggable storage backend for attachments (local filesystem vs. S3 vs. LFS)
- Lambda middleware that intercepts attachment routes and serves from S3 before Otterwiki sees the request
The plugin or upstream patch approach is preferable — it benefits the broader Otterwiki community and keeps our fork minimal.
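The reference file could look like the sketch below — loosely modeled on the Git LFS pointer format. The field names, version string, and S3 key layout are illustrative assumptions, not a settled spec:

```python
def make_pointer(user: str, wiki: str, sha256: str, size: int) -> str:
    """Build the lightweight reference file committed to git in place
    of the attachment blob (content-addressed by sha256)."""
    return (
        "version wikibot-attachment/v1\n"
        f"s3-key attachments/{user}/{wiki}/{sha256}\n"
        f"oid sha256:{sha256}\n"
        f"size {size}\n"
    )

def parse_pointer(text: str) -> dict:
    """Parse a pointer file; serving code would use the s3-key to build
    a presigned URL or CloudFront redirect."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    if fields.get("version") != "wikibot-attachment/v1":
        raise ValueError("not a wikibot attachment pointer")
    return fields
```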
Git Remote Access
Every wiki's bare repo is directly accessible via Git protocol over HTTPS. This is a core feature, not an afterthought — users should never feel locked in.
Hosted Git remote
https://sderle.wikibot.io/third-gulf-war.git
Authentication: OAuth JWT or MCP bearer token via Git credential helper, or a dedicated Git access token (simpler for CLI usage).
Free tier: read-only. Users can git clone and git pull their wiki at any time. This is a data portability guarantee — your wiki is always yours.
Premium tier: read/write. Users can git push to the hosted remote, enabling workflows like local editing, CI/CD integration, or scripted bulk imports.
Implementation
API Gateway route (/{user}/{wiki}.git/*) → Lambda implementing Git smart HTTP protocol (git-upload-pack for clone/fetch, git-receive-pack for push). The Lambda accesses the same EFS-mounted repo as the wiki handlers.
This is a well-documented protocol — the Lambda needs to handle the handful of Git smart HTTP endpoints (/info/refs, /git-upload-pack, /git-receive-pack). Libraries like dulwich can serve these without shelling out to git.
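For a sense of scale, the wire format is pkt-line framing, which is small enough to sketch directly (per the Git smart HTTP protocol documentation; whether we use dulwich's implementation or our own is open):

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one pkt-line: 4 hex digits giving the total length
    (including the 4-byte header itself), then the payload."""
    return b"%04x%s" % (len(payload) + 4, payload)

FLUSH = b"0000"  # flush-pkt: marks the end of a pkt-line section

def info_refs_prelude(service: str = "git-upload-pack") -> bytes:
    """Opening bytes of a smart-HTTP /info/refs response: the service
    announcement plus a flush, before the ref advertisement."""
    return pkt_line(b"# service=" + service.encode() + b"\n") + FLUSH
```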
External Git sync (premium, future)
Bidirectional sync with an external remote (GitHub, GitLab, etc.). A separate Lambda triggered on schedule (EventBridge) or webhook:
- Open wiki repo on EFS
- `git fetch` from configured external remote
- Attempt fast-forward merge (no conflicts → auto-merge)
- Conflicts → flag for human resolution, do not auto-merge
- Push merged state to external remote
- Trigger re-embedding if semantic search enabled
Credentials for external remotes stored in Secrets Manager (per-wiki secret, auto-rotation support). This is a future feature — the hosted Git remote is the MVP.
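The merge/flag decision above reduces to comparing branch tips against their merge base. A sketch of that decision logic — commit ids are opaque strings here, and the git plumbing calls (`git merge-base`, etc.) that produce them are not shown:

```python
def sync_action(local_head: str, remote_head: str, merge_base: str) -> str:
    """Decide what the sync Lambda should do for one wiki, given the
    local tip, the external remote's tip, and their merge base."""
    if local_head == remote_head:
        return "up-to-date"
    if merge_base == local_head:
        return "fast-forward"   # remote strictly ahead: safe auto-merge
    if merge_base == remote_head:
        return "push"           # local strictly ahead: push to external remote
    return "flag-for-human"     # histories diverged: never auto-merge
```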
Platform: AWS Lambda + EFS
Why AWS + EFS
EFS (Elastic File System) is AWS's managed NFS service. Lambda mounts EFS volumes directly, eliminating the git-on-S3 clone/push cycle. Git repos live on a persistent filesystem — Lambda reads/writes them in place, just like a VPS. Combined with AWS's managed service catalog (Bedrock, SQS, DynamoDB, CloudFront, ACM) and first-class Pulumi support, this is the strongest fit.
Key properties
- Git repos on EFS work like local disk — no clone, no push, no S3 sync, no write locks
- EFS Infrequent Access: $0.016/GB/month — a 2MB wiki costs ~$0.00003/month at rest
- NFS handles concurrent access natively (git's `index.lock` works on NFS)
- Lambda scales to zero; EFS cost is storage-only when idle
- Built-in backup via AWS Backup (to S3)
VPC networking
EFS requires Lambda to run in a VPC. VPC Lambda can't reach AWS services (DynamoDB, SQS, Bedrock, S3) over the public internet — it needs either a NAT Gateway ($32/mo minimum, kills "zero cost at rest") or VPC endpoints.
- Gateway endpoints (free): DynamoDB, S3 — route traffic through the VPC route table
- Interface endpoints (~$7/mo each per AZ): Secrets Manager — ENI-based, billed hourly + per-GB. SQS and Bedrock endpoints are no longer needed (semantic search uses DynamoDB Streams + local MiniLM embeddings; see Design/Async_Embedding_Pipeline)
- Minimize AZs in dev (1 AZ = 1× endpoint cost); prod needs 2 AZs for availability
This is a Phase 0 infrastructure requirement — without endpoints, Lambda can mount EFS but can't reach DynamoDB for ACL checks.
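In Pulumi terms, the endpoint wiring is a few resources. The sketch below assumes `vpc`, `route_table`, `private_subnet`, and `lambda_sg` objects defined elsewhere in `__main__.py`, and a us-east-1 region — names are illustrative, and the Secrets Manager block only lands in Phase 4:

```python
import pulumi_aws as aws

# Free gateway endpoints: DynamoDB and S3 route through the VPC route table.
for svc in ("dynamodb", "s3"):
    aws.ec2.VpcEndpoint(
        f"{svc}-gateway-endpoint",
        vpc_id=vpc.id,
        service_name=f"com.amazonaws.us-east-1.{svc}",
        vpc_endpoint_type="Gateway",
        route_table_ids=[route_table.id],
    )

# Interface endpoint (~$7/mo/AZ) — deferred until Secrets Manager arrives in Phase 4.
aws.ec2.VpcEndpoint(
    "secretsmanager-endpoint",
    vpc_id=vpc.id,
    service_name="com.amazonaws.us-east-1.secretsmanager",
    vpc_endpoint_type="Interface",
    subnet_ids=[private_subnet.id],
    security_group_ids=[lambda_sg.id],
    private_dns_enabled=True,
)
```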
Known trade-offs
- EFS requires a VPC → ~1–2s added to Lambda cold starts (Provisioned Concurrency is available if this proves unacceptable — ~$10-15/mo for 1 warm instance)
- EFS latency (~1–5ms per op) is higher than local disk but adequate for git
- Mangum adapter needed for Flask on Lambda
- API Gateway 29s timeout limits long operations
- VPC interface endpoints add fixed monthly cost once introduced (~$14/mo in prod for the Secrets Manager endpoint across 2 AZs from Phase 4)
S3 fallback
If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with DynamoDB write locks and clone-to-/tmp. Adds significant complexity (locking, cache management, /tmp eviction) — last resort only.
Alternatives considered
Fly.io Machines: Persistent volumes, unmodified Otterwiki, simplest architecture. But weaker IaC, no managed services (embeddings, queues, metadata). Simplest fallback if EFS fails.
Google Cloud Run: Cloud Storage FUSE less proven than EFS for filesystem workloads. No clear advantage over AWS+EFS. Less familiar territory.
Phase 0 validates this decision
Exit criteria: page read <500ms warm, page write <1s warm, cold start <5s total.
Infrastructure as Code
All infrastructure is managed declaratively. No manual console clicking, no snowflake configuration.
Tool: Pulumi (Python)
Pulumi with the Python SDK is the primary IaC tool. Rationale:
- Application is Python, so infrastructure code in the same language reduces context switching
- Pulumi has first-class AWS support (Lambda, EFS, API Gateway, DynamoDB, ACM, CloudFront, VPC)
- Full programming language (loops, conditionals, abstractions) vs. HCL's declarative constraints
- Strong secret management built in (`pulumi config set --secret`)
What's managed by IaC
Everything that isn't application code:
- Lambda functions (compute, async handlers)
- EFS filesystem and mount targets
- VPC, subnets, security groups, VPC endpoints (gateway + interface)
- API Gateway and CloudFront distributions
- WAF web ACLs and rule groups
- DynamoDB tables
- ACM certificates and Route 53 DNS records
- IAM roles and policies
- SQS queues
- Secrets Manager secrets
- EventBridge schedules
- CloudWatch monitoring, alerting, and billing alarms
- X-Ray tracing configuration
- Auth provider configuration (WorkOS AuthKit)
- Stripe webhook endpoints
What's NOT managed by IaC
- Application code (deployed via CI/CD pipeline, not Pulumi)
- User data (wiki repos, DynamoDB records)
- Secrets from external services (Stripe API keys, auth provider secrets) — stored in Pulumi config (`pulumi config set --secret`) and injected as Lambda environment variables during development. Migrated to AWS Secrets Manager pre-launch (Phase 4) for rotation without redeployment and audit trails. The VPC interface endpoint for Secrets Manager is only needed at that point.
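Application code can stay agnostic about which phase it's running in by trying the environment first and falling back to Secrets Manager. A sketch — the env var naming convention is an assumption, and `boto3.client("secretsmanager").get_secret_value` is the standard AWS SDK call:

```python
import os

def get_secret(name: str) -> str:
    """Phases 0-3: secrets arrive as Lambda env vars injected from Pulumi
    config. Phase 4+: fall through to Secrets Manager (reachable via the
    interface endpoint). Env var name is derived from the secret name."""
    env_val = os.environ.get(name.upper().replace("/", "_"))
    if env_val is not None:
        return env_val
    import boto3  # only imported on the Phase 4+ path
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=name)["SecretString"]
```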
Repository structure
wikibot-io/
infra/ # Pulumi project
__main__.py # Infrastructure definition
config/
dev.yaml # Dev stack config
prod.yaml # Prod stack config
app/ # Application code
otterwiki/ # Fork (git submodule or subtree)
api/ # REST API handlers
mcp/ # MCP server handlers
management/ # User/wiki management API
frontend/ # SPA
deploy/ # CI/CD pipeline definitions
Otterwiki Fork Management
The Otterwiki fork is kept as minimal as possible. All customizations are either:
- Plugins (preferred) — no core changes needed
- Small, upstreamable patches — contributed to `schuyler/otterwiki` and submitted as PRs to the upstream `redimp/otterwiki` project
- Platform-specific overrides — admin panel section hiding, template conditionals (kept in a separate branch or patch set)
Merge strategy
- Track upstream `redimp/otterwiki` as a remote
- Periodically rebase or merge upstream changes into the fork
- Keep platform-specific changes isolated (ideally a thin layer on top, not interleaved with upstream code)
- Automated CI check: does the fork still pass upstream's test suite after merge?
Upstream relationship
We want to support Otterwiki as a project. Contributions go upstream where possible. If the product generates revenue, donate a portion to the upstream maintainer.
Backup and Disaster Recovery
What we're protecting
| Data | Source of truth | Severity of loss |
|---|---|---|
| Git repos (wiki content) | EFS | Critical — user data, irreplaceable |
| DynamoDB (users, wikis, ACLs) | DynamoDB | High — reconstructable from repos but painful |
| FAISS indexes | EFS | Low — fully rebuildable from repo content |
| Auth provider state | WorkOS (external) | Low — managed by vendor |
Backup strategy
Git repos (EFS): AWS Backup with daily snapshots, 30-day retention. EFS supports point-in-time recovery via backup. Cost: negligible for small repos.
DynamoDB: Point-in-Time Recovery (PITR) — continuous backups, restore to any second in the last 35 days. Cost: ~$0.20/GB/month (pennies for our data volume).
FAISS indexes: No backup needed. Rebuildable from repo content via local MiniLM re-embedding. Loss means a one-time re-embedding pass per affected wiki (Lambda compute only — embeddings are generated locally, not via a paid API).
Recovery scenarios
| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single wiki repo corrupted | Restore from EFS backup snapshot | 24h (daily backup) | Minutes |
| Bad push overwrites repo | Restore from EFS backup snapshot | 24h | Minutes |
| DynamoDB corruption | PITR restore | Seconds | Minutes |
| DynamoDB total loss | PITR restore; worst case reconstruct from EFS repo inventory | Seconds | Hours |
| FAISS index lost | Re-embed all pages for affected wiki | N/A (rebuildable) | Minutes per wiki |
| Full region outage | Accept downtime | N/A | Depends on provider recovery |
Design principle
Git repos are the source of truth. Everything else (DynamoDB records, FAISS indexes) is either backed up with PITR or rebuildable from the repos. A DynamoDB wipe is painful but survivable — you can walk the EFS filesystem and reconstruct user/wiki records from the repo inventory.
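The "reconstruct from the repo inventory" path is essentially a filesystem walk. A sketch, assuming an EFS layout of `{efs_root}/{user}/{wiki}.git` — both the directory layout and the record shape are illustrative assumptions:

```python
import os

def reconstruct_wiki_records(efs_root: str) -> list[dict]:
    """Walk the EFS filesystem and rebuild minimal user/wiki records
    after a hypothetical DynamoDB total loss (ACLs and other metadata
    would still need manual recovery)."""
    records = []
    for user in sorted(os.listdir(efs_root)):
        user_dir = os.path.join(efs_root, user)
        if not os.path.isdir(user_dir):
            continue
        for entry in sorted(os.listdir(user_dir)):
            if entry.endswith(".git"):
                records.append({
                    "user": user,
                    "wiki": entry[:-len(".git")],
                    "repo_path": os.path.join(user_dir, entry),
                })
    return records
```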
CI/CD
Code lives in a private GitHub repo. Deployment via GitHub Actions.
Pipeline
git push to main
→ GitHub Actions:
1. Run tests (pytest for Python, vitest/jest for frontend)
2. Build artifacts (Lambda zip or container image, SPA bundle)
3. Deploy infrastructure changes (pulumi up)
4. Deploy Lambda code (zip upload or ECR image push)
5. Smoke test (hit health endpoint, create/read/delete a test page)
Environment strategy
- dev: auto-deploy on push to `main`. Separate infrastructure stack (`pulumi stack select dev`). Own domain (dev.wikibot.io).
- prod: manual promotion (GitHub Actions workflow dispatch or tag-based). Separate Pulumi stack. Own domain (wikibot.io).
Account Lifecycle
Data retention
User accounts and wiki data are retained indefinitely regardless of activity. Storage cost for an idle wiki is effectively zero (a few KB in DynamoDB, a few MB of git repo on EFS Infrequent Access). There is no reason to delete inactive accounts — it costs nothing to keep them and deleting user data is irreversible.
Account deletion
Users can delete their account from the dashboard. This:
- Deletes all wikis owned by the user (repo, FAISS index, metadata)
- Removes all ACL grants the user has on other wikis
- Deletes the user record from DynamoDB
- Does NOT delete the auth provider account (Google/GitHub/etc.) — that's the user's own account
Deletion is permanent and irreversible. Require explicit confirmation ("type your username to confirm").
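Computing the deletion set separately from executing it makes the confirmation step auditable. A sketch of the plan step only — the input record shapes (`{"owner": ..., "name": ...}` wikis, `{"grantee": ..., "wiki": ...}` grants) are illustrative assumptions about the data model:

```python
def deletion_plan(username: str, wikis: list[dict], acl_grants: list[dict]) -> dict:
    """Enumerate everything removed when `username` deletes their account:
    owned repos and FAISS indexes, grants they hold on other wikis, and
    the user record itself. Execution (EFS rm, DynamoDB deletes) not shown."""
    owned = [w for w in wikis if w["owner"] == username]
    return {
        "repos_to_delete": [f"{w['owner']}/{w['name']}.git" for w in owned],
        "faiss_indexes_to_delete": [f"{w['owner']}/{w['name']}" for w in owned],
        "acl_grants_to_revoke": [g for g in acl_grants if g["grantee"] == username],
        "user_record": username,
    }
```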
GDPR
If serving EU users: account deletion satisfies right-to-erasure. Add a data export endpoint (download all wikis as a zip of git repos) to satisfy right-to-portability — though the Git remote access feature already provides this.
MCP Discoverability
MCP tool descriptions must be self-documenting — any MCP-capable client (Claude, GPT, Gemini, open-source agents) should be able to use the wiki tools without reading external documentation.
Each tool's MCP description should include:
- What it does
- Parameter semantics (e.g., "`path` is like `Actors/Iran`, not a filesystem path")
- What the return format looks like
- Common next actions ("use `list_notes` to find available pages if you don't know the path")
The bootstrap template's Meta/Wiki Usage Guide provides Claude-specific conventions (session protocol, gardening duties), but the MCP tools themselves should work without it. The guide is optimization, not a prerequisite.
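Concretely, a self-documenting tool description in the MCP wire shape (`name`/`description`/`inputSchema`) might look like the following. The `read_note` tool name and its parameter details are assumed conventions for this platform, not settled API:

```python
# One MCP tool description that packs all four elements above into the
# description and schema, so any MCP client can use it unaided.
READ_NOTE_TOOL = {
    "name": "read_note",
    "description": (
        "Read a wiki page and return its markdown body plus frontmatter. "
        "`path` is a wiki path like 'Actors/Iran', not a filesystem path. "
        "Returns the page's frontmatter fields and markdown body. If you "
        "don't know the path, call list_notes first to see available pages."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Wiki page path, e.g. 'Actors/Iran'",
            },
        },
        "required": ["path"],
    },
}
```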
Rate Limiting and Abuse Prevention
Launch: OAuth-only accounts + tier limits (1 wiki, 500 pages, 3 collaborators) provide sufficient abuse prevention at low traffic. Public wiki routes are the only unauthenticated surface — acceptable risk at launch with near-zero users.
Post-launch (when traffic justifies it): AWS WAF on API Gateway and CloudFront. IP-based rate limiting, geographic blocking, bot control, OWASP Top 10 managed rule sets. Adds ~$5-10/mo. Deploy when there's real traffic to protect.
Per-user rate limiting (premium launch): When premium tier ships, add per-user throttling on API and MCP endpoints via API Gateway usage plans or WAF custom rules. Define specific limits when the need materializes.
Open Questions
EFS + Lambda performance: The key Phase 0 question. Does EFS latency for git operations meet targets (<500ms read, <1s write warm)? Does VPC cold start stay under 5s total?
Otterwiki on Lambda feasibility: Otterwiki has filesystem assumptions beyond the git repo (config files, static assets). How much Mangum adaptation is needed? EFS satisfies most filesystem assumptions, but Flask-on-Lambda via Mangum still requires testing.
Lambda package size: Otterwiki + gitpython + FAISS + FastMCP + Mangum. If over 250MB zip limit, use Lambda container images (up to 10GB).
Git library choice: gitpython shells out to `git` (binary dependency — verify availability in Lambda runtime). dulwich is pure Python (no binary, different API, possibly slower) and avoids the binary question entirely.
MCP Streamable HTTP timeouts: API Gateway caps at 29s. Most MCP operations complete in <5s, but semantic search with embedding generation could approach 10–15s. Verify this isn't a problem.
Platform JWT signing key management: RS256 keypair in Secrets Manager. Need to define key rotation strategy — do we support multiple valid keys (JWKS with `kid` header) for zero-downtime rotation, or is manual rotation with a maintenance window acceptable for MVP?
WorkOS + FastMCP integration on Lambda: The FastMCP WorkOS integration is documented but needs validation in our specific setup (Lambda + API Gateway + VPC). Known friction points: the `client_secret_basic` default may conflict with some MCP clients, and there is no RFC 8707 resource indicator support. Validate in Phase 0.
Apple provider sub retrieval: WorkOS exposes raw OAuth provider `sub` claims via API for Google, GitHub, and Microsoft. Apple is undocumented. If we can't get Apple's raw sub, Apple users can't be migrated off WorkOS without re-authenticating. Verify in Phase 0.
Otterwiki licensing: MIT licensed — permissive, should be fine for commercial use. Confirm no additional contributor agreements or trademark restrictions.
VPC endpoint costs: Interface endpoints cost ~$7/mo each per AZ. With the SQS and Bedrock endpoints eliminated, only Secrets Manager remains — ~$14/mo in a 2-AZ prod setup. Acceptable for a production SaaS, but worth tracking — endpoint cost may influence which managed services we adopt in later phases.