This page is part of the wikibot.io PRD (Product Requirements Document). See also: Design/Platform_Overview, Design/Data_Model, Design/Auth, Design/Implementation_Phases.
Infrastructure cost model
Fixed monthly costs by phase (dev environment, 1 AZ):
| Phase | What's added | New fixed cost | Cumulative |
|---|---|---|---|
| 0–3 | EFS + VPC + gateway endpoints (DynamoDB, S3) + Route 53 | $0.50 | ~$0.50/mo |
| 4 (pre-launch) | Secrets Manager endpoint (secrets out of env vars), WAF | ~$12-17 | ~$13-18/mo |
| 5+ (premium) | (no new endpoints — semantic search uses DynamoDB Streams + local MiniLM) | ~$0 | ~$13-18/mo |
| When needed | WorkOS custom domain | $99 | +$99/mo |
Phases 0–3 use Pulumi-managed environment variables for secrets (redeploy to rotate). Secrets Manager is introduced pre-launch when rotation without redeployment and audit trails matter.
Production (2 AZs) doubles interface endpoint costs: ~$1/mo (Phases 0–3), ~$26-36/mo (Phase 4), ~$54-64/mo (Phase 5+).
Items that are always free or near-zero:
- EFS storage: $0.016/GB/mo IA — pennies at low data volume
- VPC itself: $0 (subnets, security groups, route tables have no hourly cost)
- DynamoDB on-demand: pay per request, negligible at low traffic
- Lambda: scales to zero
- CloudFront: free tier covers light traffic
- WorkOS: free to 1M MAU
Variable costs scale with usage:
| Item | Cost | At 1K users | At 10K users |
|---|---|---|---|
| Lambda | $0.20/1M requests | ~$0.20 | ~$2 |
| DynamoDB on-demand | $1.25/1M writes, $0.25/1M reads | ~$1 | ~$5 |
| EFS IA storage | $0.016/GB/month | ~$0.03 | ~$0.30 |
| CloudFront | Free tier covers 1TB/mo | ~$0 | ~$0 |
| Bedrock | N/A (eliminated) | $0 | $0 |
| WorkOS | $0 to 1M MAU | $0 | $0 |
Why it's not zero: EFS requires Lambda to run in a VPC. EFS itself is accessed via mount targets in the VPC (no endpoint needed). But VPC Lambda can't reach other AWS services (DynamoDB, S3) over the public internet — it needs either a NAT Gateway ($32/mo — too expensive) or VPC endpoints. Gateway endpoints (DynamoDB, S3) are free. Interface endpoints (Secrets Manager) cost ~$7/mo/AZ — only needed when Secrets Manager is introduced pre-launch (Phase 4). Bedrock and SQS endpoints were originally planned but have been eliminated by switching to DynamoDB Streams and local MiniLM embeddings.
Bottom line: ~$0.50/mo at rest in Phase 0. ~$13-18/mo from Phase 4 (Secrets Manager endpoint + WAF). No further increase for premium features. "Near-zero cost at rest" is accurate.
Wiki Bootstrap Template
When a user creates a new wiki, the repo is initialized with a starter page set that teaches Claude how to use the wiki effectively. This is the onboarding experience — the user connects MCP, starts a conversation, and Claude already knows the conventions.
Initial pages
Home — Landing page with the wiki's name and purpose (user-provided at creation), links to the guide and any starter pages.
Meta/Wiki Usage Guide — Instructions for the AI assistant:
- Available MCP tools and what they do
- Session start protocol (read Home first, then check recent changes)
- Page conventions: frontmatter schema, WikiLink syntax, page size guidance (~250–800 words)
- Commit message format
- When to create new pages vs. update existing ones
- How to use categories and tags
- Gardening responsibilities (orphan detection, stale page review, link maintenance)
Meta/Page Template — A reference page showing the frontmatter schema, section structure, and WikiLink usage. Claude can copy this pattern when creating new pages.
Customization
The bootstrap template is parameterized by:
- Wiki name (provided at creation)
- Wiki purpose/description (optional, provided at creation)
- Category set (default set provided, user can customize later)
The default category set matches the existing schema (actor, event, trend, hypothesis, variable, reference, index) but users can define their own categories for different research domains.
Custom template repos (premium)
Premium users can create a wiki from any public (or authenticated) Git repo URL. The Lambda clones the template repo, strips its git history, and commits the contents as the wiki's initial state. This enables:
- Shared team templates ("our standard research wiki layout")
- Domain-specific starter kits (e.g., a policy analysis template, a technical due diligence template)
- Community-contributed templates (a future marketplace opportunity)
Implementation
The default template is stored on EFS (or bundled with the Lambda deployment). On wiki creation, the Lambda copies it to the new repo directory, performs string substitution for wiki name/purpose, and commits as the initial state. For custom templates (premium), the Lambda clones from the provided Git URL instead.
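The copy-and-substitute step can be sketched as below. This is a minimal illustration, assuming a `{{wiki_name}}`/`{{wiki_purpose}}` token syntax in the template files — the actual placeholder convention is not yet specified. The subsequent `git init` and initial commit (via gitpython or dulwich) are not shown.

```python
from pathlib import Path

def instantiate_template(template_dir: str, repo_dir: str,
                         wiki_name: str, wiki_purpose: str) -> list[str]:
    """Copy the bundled template into the new repo directory, substituting
    the wiki name/purpose placeholders; returns the relative paths written."""
    written = []
    for src in sorted(Path(template_dir).rglob("*.md")):
        rel = src.relative_to(template_dir)
        dst = Path(repo_dir) / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        text = (src.read_text(encoding="utf-8")
                   .replace("{{wiki_name}}", wiki_name)
                   .replace("{{wiki_purpose}}", wiki_purpose))
        dst.write_text(text, encoding="utf-8")
        written.append(str(rel))
    return written
```

For custom templates (premium), the same function runs after cloning the user's repo to a scratch directory and deleting its `.git`.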
Attachment Storage
Otterwiki stores attachments as regular files in the git repo and serves them directly from the working tree. With EFS, this works the same as on a VPS — no clone overhead, attachments are just files on a mounted filesystem.
MVP approach
Store attachments in the git repo as-is. Tier limits (50MB free, 1GB premium) keep repo sizes manageable. EFS serves files directly with no performance concern.
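Enforcing the tier limits can be as simple as a size check before accepting an upload — a sketch under the assumption that the quota applies to the working tree on EFS (function names are illustrative):

```python
import os

# 50MB free / 1GB premium, per the tier limits above
TIER_LIMITS_BYTES = {"free": 50 * 1024**2, "premium": 1024**3}

def repo_size_bytes(repo_dir: str) -> int:
    """Total size of all files under the wiki's working tree."""
    total = 0
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def attachment_allowed(repo_dir: str, tier: str, upload_bytes: int) -> bool:
    """Reject uploads that would push the repo past the tier cap."""
    return repo_size_bytes(repo_dir) + upload_bytes <= TIER_LIMITS_BYTES[tier]
```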
Future optimization: S3-backed attachment serving
If large attachments become a problem (EFS cost, Git remote clone times), decouple attachment storage from the git repo:
- On upload: store the attachment in S3 at a known path (`s3://attachments/{user}/{wiki}/…`), commit only a lightweight reference file to git (similar to Git LFS pointer format)
- On serve: intercept Otterwiki's attachment serving path, resolve the reference, and redirect to S3 (or serve via CloudFront)
This could be implemented as:
- Otterwiki plugin that hooks into the attachment upload/serve lifecycle
- Upstream patch to Otterwiki adding a pluggable storage backend for attachments (local filesystem vs. S3 vs. LFS)
- Lambda middleware that intercepts attachment routes and serves from S3 before Otterwiki sees the request
The plugin or upstream patch approach is preferable — it benefits the broader Otterwiki community and keeps our fork minimal.
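The reference file could look like the sketch below — loosely modeled on the Git LFS pointer format. The field names, version string, and S3 key layout are illustrative assumptions, not a settled spec:

```python
def make_pointer(user: str, wiki: str, sha256: str, size: int) -> str:
    """Build the lightweight reference file committed to git in place
    of the attachment blob (content-addressed by sha256)."""
    return (
        "version wikibot-attachment/v1\n"
        f"s3-key attachments/{user}/{wiki}/{sha256}\n"
        f"oid sha256:{sha256}\n"
        f"size {size}\n"
    )

def parse_pointer(text: str) -> dict:
    """Parse a pointer file; serving code would use the s3-key to build
    a presigned URL or CloudFront redirect."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    if fields.get("version") != "wikibot-attachment/v1":
        raise ValueError("not a wikibot attachment pointer")
    return fields
```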
Git Remote Access
Every wiki's bare repo is directly accessible via Git protocol over HTTPS. This is a core feature, not an afterthought — users should never feel locked in.
Hosted Git remote
https://sderle.wikibot.io/third-gulf-war.git
Authentication: OAuth JWT or MCP bearer token via Git credential helper, or a dedicated Git access token (simpler for CLI usage).
Free tier: read-only. Users can git clone and git pull their wiki at any time. This is a data portability guarantee — your wiki is always yours.
Premium tier: read/write. Users can git push to the hosted remote, enabling workflows like local editing, CI/CD integration, or scripted bulk imports.
Implementation
API Gateway route (/{user}/{wiki}.git/*) → Lambda implementing Git smart HTTP protocol (git-upload-pack for clone/fetch, git-receive-pack for push). The Lambda accesses the same EFS-mounted repo as the wiki handlers.
This is a well-documented protocol — the Lambda needs to handle the handful of Git smart HTTP endpoints (/info/refs, /git-upload-pack, /git-receive-pack). Libraries like dulwich can serve these without shelling out to git.
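For a sense of scale, the wire format is pkt-line framing, which is small enough to sketch directly (per the Git smart HTTP protocol documentation; whether we use dulwich's implementation or our own is open):

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one pkt-line: 4 hex digits giving the total length
    (including the 4-byte header itself), then the payload."""
    return b"%04x%s" % (len(payload) + 4, payload)

FLUSH = b"0000"  # flush-pkt: marks the end of a pkt-line section

def info_refs_prelude(service: str = "git-upload-pack") -> bytes:
    """Opening bytes of a smart-HTTP /info/refs response: the service
    announcement plus a flush, before the ref advertisement."""
    return pkt_line(b"# service=" + service.encode() + b"\n") + FLUSH
```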
External Git sync (premium, future)
Bidirectional sync with an external remote (GitHub, GitLab, etc.). A separate Lambda triggered on schedule (EventBridge) or webhook:
- Open wiki repo on EFS
- `git fetch` from configured external remote
- Attempt fast-forward merge (no conflicts → auto-merge)
- Conflicts → flag for human resolution, do not auto-merge
- Push merged state to external remote
- Trigger re-embedding if semantic search enabled
Credentials for external remotes stored in Secrets Manager (per-wiki secret, auto-rotation support). This is a future feature — the hosted Git remote is the MVP.
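The merge/flag decision above reduces to comparing branch tips against their merge base. A sketch of that decision logic — commit ids are opaque strings here, and the git plumbing calls (`git merge-base`, etc.) that produce them are not shown:

```python
def sync_action(local_head: str, remote_head: str, merge_base: str) -> str:
    """Decide what the sync Lambda should do for one wiki, given the
    local tip, the external remote's tip, and their merge base."""
    if local_head == remote_head:
        return "up-to-date"
    if merge_base == local_head:
        return "fast-forward"   # remote strictly ahead: safe auto-merge
    if merge_base == remote_head:
        return "push"           # local strictly ahead: push to external remote
    return "flag-for-human"     # histories diverged: never auto-merge
```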
Platform: AWS Lambda + EFS
Why AWS + EFS
EFS (Elastic File System) is AWS's managed NFS service. Lambda mounts EFS volumes directly, eliminating the git-on-S3 clone/push cycle. Git repos live on a persistent filesystem — Lambda reads/writes them in place, just like a VPS. Combined with AWS's managed service catalog (Bedrock, SQS, DynamoDB, CloudFront, ACM) and first-class Pulumi support, this is the strongest fit.
Key properties
- Git repos on EFS work like local disk — no clone, no push, no S3 sync, no write locks
- EFS Infrequent Access: $0.016/GB/month — a 2MB wiki costs ~$0.00003/month at rest
- NFS handles concurrent access natively (git's `index.lock` works on NFS)
- Lambda scales to zero; EFS cost is storage-only when idle
- Built-in backup via AWS Backup (to S3)
VPC networking
EFS requires Lambda to run in a VPC. VPC Lambda can't reach AWS services (DynamoDB, SQS, Bedrock, S3) over the public internet — it needs either a NAT Gateway ($32/mo minimum, kills "zero cost at rest") or VPC endpoints.
- Gateway endpoints (free): DynamoDB, S3 — route traffic through the VPC route table
- Interface endpoints (~$7/mo each per AZ): Secrets Manager — ENI-based, billed hourly + per-GB. SQS and Bedrock endpoints are no longer needed (semantic search uses DynamoDB Streams + local MiniLM embeddings; see Design/Async_Embedding_Pipeline)
- Minimize AZs in dev (1 AZ = 1× endpoint cost); prod needs 2 AZs for availability
This is a Phase 0 infrastructure requirement — without endpoints, Lambda can mount EFS but can't reach DynamoDB for ACL checks.
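In Pulumi terms, the endpoint wiring is a few resources. The sketch below assumes `vpc`, `route_table`, `private_subnet`, and `lambda_sg` objects defined elsewhere in `__main__.py`, and a us-east-1 region — names are illustrative, and the Secrets Manager block only lands in Phase 4:

```python
import pulumi_aws as aws

# Free gateway endpoints: DynamoDB and S3 route through the VPC route table.
for svc in ("dynamodb", "s3"):
    aws.ec2.VpcEndpoint(
        f"{svc}-gateway-endpoint",
        vpc_id=vpc.id,
        service_name=f"com.amazonaws.us-east-1.{svc}",
        vpc_endpoint_type="Gateway",
        route_table_ids=[route_table.id],
    )

# Interface endpoint (~$7/mo/AZ) — deferred until Secrets Manager arrives in Phase 4.
aws.ec2.VpcEndpoint(
    "secretsmanager-endpoint",
    vpc_id=vpc.id,
    service_name="com.amazonaws.us-east-1.secretsmanager",
    vpc_endpoint_type="Interface",
    subnet_ids=[private_subnet.id],
    security_group_ids=[lambda_sg.id],
    private_dns_enabled=True,
)
```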
Known trade-offs
- EFS requires a VPC → ~1–2s added to Lambda cold starts (Provisioned Concurrency is available if this proves unacceptable — ~$10-15/mo for 1 warm instance)
- EFS latency (~1–5ms per op) is higher than local disk but adequate for git
- Mangum adapter needed for Flask on Lambda
- API Gateway 29s timeout limits long operations
- VPC interface endpoints add fixed monthly cost once introduced (~$14/mo in prod for the Secrets Manager endpoint across 2 AZs from Phase 4)
S3 fallback
If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with DynamoDB write locks and clone-to-/tmp. Adds significant complexity (locking, cache management, /tmp eviction) — last resort only.
Alternatives considered
Fly.io Machines: Persistent volumes, unmodified Otterwiki, simplest architecture. But weaker IaC, no managed services (embeddings, queues, metadata). Simplest fallback if EFS fails.
Google Cloud Run: Cloud Storage FUSE less proven than EFS for filesystem workloads. No clear advantage over AWS+EFS. Less familiar territory.
Phase 0 validates this decision
Exit criteria: page read <500ms warm, page write <1s warm, cold start <5s total.
Infrastructure as Code
All infrastructure is managed declaratively. No manual console clicking, no snowflake configuration.
Tool: Pulumi (Python)
Pulumi with the Python SDK is the primary IaC tool. Rationale:
- Application is Python, so infrastructure code in the same language reduces context switching
- Pulumi has first-class AWS support (Lambda, EFS, API Gateway, DynamoDB, ACM, CloudFront, VPC)
- Full programming language (loops, conditionals, abstractions) vs. HCL's declarative constraints
- Strong secret management built in (`pulumi config set --secret`)
What's managed by IaC
Everything that isn't application code:
- Lambda functions (compute, async handlers)
- EFS filesystem and mount targets
- VPC, subnets, security groups, VPC endpoints (gateway + interface)
- API Gateway and CloudFront distributions
- WAF web ACLs and rule groups
- DynamoDB tables
- ACM certificates and Route 53 DNS records
- IAM roles and policies
- SQS queues
- Secrets Manager secrets
- EventBridge schedules
- CloudWatch monitoring, alerting, and billing alarms
- X-Ray tracing configuration
- Auth provider configuration (WorkOS AuthKit)
- Stripe webhook endpoints
What's NOT managed by IaC
- Application code (deployed via CI/CD pipeline, not Pulumi)
- User data (wiki repos, DynamoDB records)
- Secrets from external services (Stripe API keys, auth provider secrets) — stored in Pulumi config (`pulumi config set --secret`) and injected as Lambda environment variables during development. Migrated to AWS Secrets Manager pre-launch (Phase 4) for rotation without redeployment and audit trails. The VPC interface endpoint for Secrets Manager is only needed at that point.
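Application code can stay agnostic about which phase it's running in by trying the environment first and falling back to Secrets Manager. A sketch — the env var naming convention is an assumption, and `boto3.client("secretsmanager").get_secret_value` is the standard AWS SDK call:

```python
import os

def get_secret(name: str) -> str:
    """Phases 0-3: secrets arrive as Lambda env vars injected from Pulumi
    config. Phase 4+: fall through to Secrets Manager (reachable via the
    interface endpoint). Env var name is derived from the secret name."""
    env_val = os.environ.get(name.upper().replace("/", "_"))
    if env_val is not None:
        return env_val
    import boto3  # only imported on the Phase 4+ path
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=name)["SecretString"]
```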
Repository structure
wikibot-io/
infra/ # Pulumi project
__main__.py # Infrastructure definition
config/
dev.yaml # Dev stack config
prod.yaml # Prod stack config
app/ # Application code
otterwiki/ # Fork (git submodule or subtree)
api/ # REST API handlers
mcp/ # MCP server handlers
management/ # User/wiki management API
frontend/ # SPA
deploy/ # CI/CD pipeline definitions
Otterwiki Fork Management
The Otterwiki fork is kept as minimal as possible. All customizations are either:
- Plugins (preferred) — no core changes needed
- Small, upstreamable patches — contributed to `schuyler/otterwiki` and submitted as PRs to the upstream `redimp/otterwiki` project
- Platform-specific overrides — admin panel section hiding, template conditionals (kept in a separate branch or patch set)
Merge strategy
- Track upstream `redimp/otterwiki` as a remote
- Periodically rebase or merge upstream changes into the fork
- Keep platform-specific changes isolated (ideally a thin layer on top, not interleaved with upstream code)
- Automated CI check: does the fork still pass upstream's test suite after merge?
Upstream relationship
We want to support Otterwiki as a project. Contributions go upstream where possible. If the product generates revenue, donate a portion to the upstream maintainer.
Backup and Disaster Recovery
What we're protecting
| Data | Source of truth | Severity of loss |
|---|---|---|
| Git repos (wiki content) | EFS | Critical — user data, irreplaceable |
| DynamoDB (users, wikis, ACLs) | DynamoDB | High — reconstructable from repos but painful |
| FAISS indexes | EFS | Low — fully rebuildable from repo content |
| Auth provider state | WorkOS (external) | Low — managed by vendor |
Backup strategy
Git repos (EFS): AWS Backup with daily snapshots, 30-day retention. EFS supports point-in-time recovery via backup. Cost: negligible for small repos.
DynamoDB: Point-in-Time Recovery (PITR) — continuous backups, restore to any second in the last 35 days. Cost: ~$0.20/GB/month (pennies for our data volume).
FAISS indexes: No backup needed. Rebuildable from repo content via local MiniLM re-embedding. Loss means a one-time re-embedding pass per affected wiki (Lambda compute only — embeddings are generated locally, not via a paid API).
Recovery scenarios
| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single wiki repo corrupted | Restore from EFS backup snapshot | 24h (daily backup) | Minutes |
| Bad push overwrites repo | Restore from EFS backup snapshot | 24h | Minutes |
| DynamoDB corruption | PITR restore | Seconds | Minutes |
| DynamoDB total loss | PITR restore; worst case reconstruct from EFS repo inventory | Seconds | Hours |
| FAISS index lost | Re-embed all pages for affected wiki | N/A (rebuildable) | Minutes per wiki |
| Full region outage | Accept downtime | N/A | Depends on provider recovery |
Design principle
Git repos are the source of truth. Everything else (DynamoDB records, FAISS indexes) is either backed up with PITR or rebuildable from the repos. A DynamoDB wipe is painful but survivable — you can walk the EFS filesystem and reconstruct user/wiki records from the repo inventory.
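The "reconstruct from the repo inventory" path is essentially a filesystem walk. A sketch, assuming an EFS layout of `{efs_root}/{user}/{wiki}.git` — both the directory layout and the record shape are illustrative assumptions:

```python
import os

def reconstruct_wiki_records(efs_root: str) -> list[dict]:
    """Walk the EFS filesystem and rebuild minimal user/wiki records
    after a hypothetical DynamoDB total loss (ACLs and other metadata
    would still need manual recovery)."""
    records = []
    for user in sorted(os.listdir(efs_root)):
        user_dir = os.path.join(efs_root, user)
        if not os.path.isdir(user_dir):
            continue
        for entry in sorted(os.listdir(user_dir)):
            if entry.endswith(".git"):
                records.append({
                    "user": user,
                    "wiki": entry[:-len(".git")],
                    "repo_path": os.path.join(user_dir, entry),
                })
    return records
```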
CI/CD
Code lives in a private GitHub repo. Deployment via GitHub Actions.
Pipeline
git push to main
→ GitHub Actions:
1. Run tests (pytest for Python, vitest/jest for frontend)
2. Build artifacts (Lambda zip or container image, SPA bundle)
3. Deploy infrastructure changes (pulumi up)
4. Deploy Lambda code (zip upload or ECR image push)
5. Smoke test (hit health endpoint, create/read/delete a test page)
Environment strategy
- dev: auto-deploy on push to `main`. Separate infrastructure stack (`pulumi stack select dev`). Own domain (dev.wikibot.io).
- prod: manual promotion (GitHub Actions workflow dispatch or tag-based). Separate Pulumi stack. Own domain (wikibot.io).
Account Lifecycle
Data retention
User accounts and wiki data are retained indefinitely regardless of activity. Storage cost for an idle wiki is effectively zero (a few KB in DynamoDB, a few MB of git repo on EFS Infrequent Access). There is no reason to delete inactive accounts — it costs nothing to keep them and deleting user data is irreversible.
Account deletion
Users can delete their account from the dashboard. This:
- Deletes all wikis owned by the user (repo, FAISS index, metadata)
- Removes all ACL grants the user has on other wikis
- Deletes the user record from DynamoDB
- Does NOT delete the auth provider account (Google/GitHub/etc.) — that's the user's own account
Deletion is permanent and irreversible. Require explicit confirmation ("type your username to confirm").
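Computing the deletion set separately from executing it makes the confirmation step auditable. A sketch of the plan step only — the input record shapes (`{"owner": ..., "name": ...}` wikis, `{"grantee": ..., "wiki": ...}` grants) are illustrative assumptions about the data model:

```python
def deletion_plan(username: str, wikis: list[dict], acl_grants: list[dict]) -> dict:
    """Enumerate everything removed when `username` deletes their account:
    owned repos and FAISS indexes, grants they hold on other wikis, and
    the user record itself. Execution (EFS rm, DynamoDB deletes) not shown."""
    owned = [w for w in wikis if w["owner"] == username]
    return {
        "repos_to_delete": [f"{w['owner']}/{w['name']}.git" for w in owned],
        "faiss_indexes_to_delete": [f"{w['owner']}/{w['name']}" for w in owned],
        "acl_grants_to_revoke": [g for g in acl_grants if g["grantee"] == username],
        "user_record": username,
    }
```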
GDPR
If serving EU users: account deletion satisfies right-to-erasure. Add a data export endpoint (download all wikis as a zip of git repos) to satisfy right-to-portability — though the Git remote access feature already provides this.
MCP Discoverability
MCP tool descriptions must be self-documenting — any MCP-capable client (Claude, GPT, Gemini, open-source agents) should be able to use the wiki tools without reading external documentation.
Each tool's MCP description should include:
- What it does
- Parameter semantics (e.g., "`path` is like `Actors/Iran`, not a filesystem path")
- What the return format looks like
- Common next actions ("use `list_notes` to find available pages if you don't know the path")
The bootstrap template's Meta/Wiki Usage Guide provides Claude-specific conventions (session protocol, gardening duties), but the MCP tools themselves should work without it. The guide is optimization, not a prerequisite.
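Concretely, a self-documenting tool description in the MCP wire shape (`name`/`description`/`inputSchema`) might look like the following. The `read_note` tool name and its parameter details are assumed conventions for this platform, not settled API:

```python
# One MCP tool description that packs all four elements above into the
# description and schema, so any MCP client can use it unaided.
READ_NOTE_TOOL = {
    "name": "read_note",
    "description": (
        "Read a wiki page and return its markdown body plus frontmatter. "
        "`path` is a wiki path like 'Actors/Iran', not a filesystem path. "
        "Returns the page's frontmatter fields and markdown body. If you "
        "don't know the path, call list_notes first to see available pages."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Wiki page path, e.g. 'Actors/Iran'",
            },
        },
        "required": ["path"],
    },
}
```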
Rate Limiting and Abuse Prevention
Launch: OAuth-only accounts + tier limits (1 wiki, 500 pages, 3 collaborators) provide sufficient abuse prevention at low traffic. Public wiki routes are the only unauthenticated surface — acceptable risk at launch with near-zero users.
Post-launch (when traffic justifies it): AWS WAF on API Gateway and CloudFront. IP-based rate limiting, geographic blocking, bot control, OWASP Top 10 managed rule sets. Adds ~$5-10/mo. Deploy when there's real traffic to protect.
Per-user rate limiting (premium launch): When premium tier ships, add per-user throttling on API and MCP endpoints via API Gateway usage plans or WAF custom rules. Define specific limits when the need materializes.
Open Questions
EFS + Lambda performance: The key Phase 0 question. Does EFS latency for git operations meet targets (<500ms read, <1s write warm)? Does VPC cold start stay under 5s total?
Otterwiki on Lambda feasibility: Otterwiki has filesystem assumptions beyond the git repo (config files, static assets). How much Mangum adaptation is needed? EFS satisfies most filesystem assumptions, but Flask-on-Lambda via Mangum still requires testing.
Lambda package size: Otterwiki + gitpython + FAISS + FastMCP + Mangum. If over 250MB zip limit, use Lambda container images (up to 10GB).
Git library choice: gitpython shells out to `git` (binary dependency — verify availability in Lambda runtime). dulwich is pure Python (no binary, different API, possibly slower) and avoids the binary question entirely.
MCP Streamable HTTP timeouts: API Gateway caps at 29s. Most MCP operations complete in <5s, but semantic search with embedding generation could approach 10–15s. Verify this isn't a problem.
Platform JWT signing key management: RS256 keypair in Secrets Manager. Need to define key rotation strategy — do we support multiple valid keys (JWKS with `kid` header) for zero-downtime rotation, or is manual rotation with a maintenance window acceptable for MVP?
WorkOS + FastMCP integration on Lambda: The FastMCP WorkOS integration is documented but needs validation in our specific setup (Lambda + API Gateway + VPC). Known friction points: the `client_secret_basic` default may conflict with some MCP clients, and there is no RFC 8707 resource indicator support. Validate in Phase 0.
Apple provider sub retrieval: WorkOS exposes raw OAuth provider `sub` claims via API for Google, GitHub, and Microsoft. Apple is undocumented. If we can't get Apple's raw sub, Apple users can't be migrated off WorkOS without re-authenticating. Verify in Phase 0.
Otterwiki licensing: MIT licensed — permissive, should be fine for commercial use. Confirm no additional contributor agreements or trademark restrictions.
VPC endpoint costs: Interface endpoints cost ~$7/mo each per AZ. With the SQS and Bedrock endpoints eliminated, only Secrets Manager remains — ~$14/mo in a 2-AZ prod setup. Acceptable for a production SaaS, but worth tracking — endpoint cost may influence which managed services we adopt in later phases.