
07c84f Claude (MCP) 2026-03-20 19:54:10
[mcp] Archive original AWS-era Design/Operations page
---
archived: true
archived_from: Design/Operations
archived_date: 2026-03-20
note: "Original wikibot.io AWS design. Preserved for historical reference."
---
This page is part of the **wikibot.io PRD** (Product Requirements Document). See also: [[Design/Platform_Overview]], [[Design/Data_Model]], [[Design/Auth]], [[Design/Implementation_Phases]].

---

## Infrastructure cost model

> **Superseded.** This page describes AWS infrastructure costs, CI/CD, and operational procedures for wikibot.io. See [[Design/VPS_Architecture]] for the current plan (Debian 12 VM, systemd, rsync backups, $0 hosting cost). The wiki bootstrap template, attachment storage concepts, and Git remote access design carry forward.

Fixed monthly costs (independent of user count), by phase (dev environment, 1 AZ):

| Phase | What's added | New fixed cost | Cumulative |
|-------|-------------|---------------|------------|
| 0–3 | EFS + VPC + gateway endpoints (DynamoDB, S3) + Route 53 | $0.50 | **~$0.50/mo** |
| 4 (pre-launch) | Secrets Manager endpoint (secrets out of env vars), WAF | ~$12-17 | **~$13-18/mo** |
| 5+ (premium) | (no new endpoints — semantic search uses DynamoDB Streams + local MiniLM) | ~$0 | **~$13-18/mo** |
| When needed | WorkOS custom domain | $99/mo | +$99/mo |

Phases 0–3 use Pulumi-managed environment variables for secrets (redeploy to rotate). Secrets Manager is introduced pre-launch, when rotation without redeployment and audit trails matter.

Production (2 AZs) roughly doubles per-AZ endpoint costs: ~$1/mo (Phases 0–3), ~$26-36/mo (Phase 4+).

Items that are always free or near-zero:
- EFS storage: $0.016/GB/mo (Infrequent Access) — pennies at low data volume
- VPC itself: $0 (subnets, security groups, route tables have no hourly cost)
- DynamoDB on-demand: pay per request, negligible at low traffic
- Lambda: scales to zero
- CloudFront: free tier covers light traffic
- WorkOS: free to 1M MAU

Variable costs scale with usage:

| Item | Cost | At 1K users | At 10K users |
|------|------|------------|-------------|
| Lambda | $0.20/1M requests | ~$0.20 | ~$2 |
| DynamoDB on-demand | $1.25/1M writes, $0.25/1M reads | ~$1 | ~$5 |
| EFS IA storage | $0.016/GB/month | ~$0.03 | ~$0.30 |
| CloudFront | Free tier covers 1TB/mo | ~$0 | ~$0 |
| Bedrock | N/A (eliminated) | $0 | $0 |
| WorkOS | $0 to 1M MAU | $0 | $0 |

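The table's per-item estimates can be reproduced with a quick sketch. The request and storage volumes passed in are illustrative assumptions, not measured traffic:

```python
def monthly_variable_cost(requests: int, writes: int, reads: int, efs_gb: float) -> float:
    """Rough monthly variable-cost estimate using the rates in the table above."""
    lambda_cost = requests / 1e6 * 0.20          # $0.20 per 1M requests
    ddb_cost = writes / 1e6 * 1.25 + reads / 1e6 * 0.25
    efs_cost = efs_gb * 0.016                    # Infrequent Access storage
    return round(lambda_cost + ddb_cost + efs_cost, 2)
```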
**Why it's not zero**: EFS requires Lambda to run in a VPC. EFS itself is accessed via mount targets in the VPC (no endpoint needed). But a VPC-attached Lambda can't reach other AWS services (DynamoDB, S3) over the public internet — it needs either a NAT Gateway ($32/mo — too expensive) or VPC endpoints. Gateway endpoints (DynamoDB, S3) are free. Interface endpoints (Secrets Manager) cost ~$7/mo/AZ — only needed once Secrets Manager is introduced pre-launch (Phase 4). Bedrock and SQS endpoints were originally planned but have been eliminated by switching to DynamoDB Streams and local MiniLM embeddings.

**Bottom line**: ~$0.50/mo at rest in Phase 0. ~$13-18/mo from Phase 4 (Secrets Manager endpoint + WAF). No further increase for premium features. "Near-zero cost at rest" is accurate.

---

## Wiki Bootstrap Template

When a user creates a new wiki, the repo is initialized with a starter page set that teaches Claude how to use the wiki effectively. This is the onboarding experience — the user connects MCP, starts a conversation, and Claude already knows the conventions.

### Initial pages

**Home** — Landing page with the wiki's name and purpose (user-provided at creation), links to the guide and any starter pages.

**Meta/Wiki Usage Guide** — Instructions for the AI assistant:
- Available MCP tools and what they do
- Session start protocol (read Home first, then check recent changes)
- Page conventions: frontmatter schema, WikiLink syntax, page size guidance (~250–800 words)
- Commit message format
- When to create new pages vs. update existing ones
- How to use categories and tags
- Gardening responsibilities (orphan detection, stale page review, link maintenance)

**Meta/Page Template** — A reference page showing the frontmatter schema, section structure, and WikiLink usage. Claude can copy this pattern when creating new pages.

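For illustration, the Meta/Page Template content might look like this sketch — the specific frontmatter fields and section layout shown here are assumptions, not a confirmed spec:

```markdown
---
title: Example Page
category: reference
tags: [example]
---

# Example Page

One-paragraph summary of the topic.

## Details

Body text with [[WikiLinks]] to related pages, roughly 250–800 words.
```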
### Customization

The bootstrap template is parameterized by:
- Wiki name (provided at creation)
- Wiki purpose/description (optional, provided at creation)
- Category set (default set provided, user can customize later)

The default category set matches the existing schema (`actor`, `event`, `trend`, `hypothesis`, `variable`, `reference`, `index`), but users can define their own categories for different research domains.

### Custom template repos (premium)

Premium users can create a wiki from any public (or authenticated) Git repo URL. The Lambda clones the template repo, strips its git history, and commits the contents as the wiki's initial state. This enables:

- Shared team templates ("our standard research wiki layout")
- Domain-specific starter kits (e.g., a policy analysis template, a technical due diligence template)
- Community-contributed templates (a future marketplace opportunity)

### Implementation

The default template is stored on EFS (or bundled with the Lambda deployment). On wiki creation, the Lambda copies it to the new repo directory, performs string substitution for wiki name/purpose, and commits as the initial state. For custom templates (premium), the Lambda clones from the provided Git URL instead.

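The string-substitution step could be as small as this sketch. The `$wiki_name`/`$wiki_purpose` placeholder names are assumptions, not the actual template syntax:

```python
from string import Template

def render_bootstrap_page(text: str, wiki_name: str, purpose: str = "") -> str:
    """Fill bootstrap placeholders; unrecognized placeholders are left intact."""
    return Template(text).safe_substitute(wiki_name=wiki_name, wiki_purpose=purpose)
```

`safe_substitute` (rather than `substitute`) keeps the copy step tolerant of `$`-literals in template pages.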
---

## Attachment Storage

Otterwiki stores attachments as regular files in the git repo and serves them directly from the working tree. With EFS, this works the same as on a VPS — no clone overhead, attachments are just files on a mounted filesystem.

### MVP approach

Store attachments in the git repo as-is. Tier limits (50MB free, 1GB premium) keep repo sizes manageable. EFS serves files directly with no performance concern.

### Future optimization: S3-backed attachment serving

If large attachments become a problem (EFS cost, Git remote clone times), decouple attachment storage from the git repo:

1. On upload: store the attachment in S3 at a known path (`s3://attachments/{user}/{wiki}/(unknown)`), commit only a lightweight reference file to git (similar to Git LFS pointer format)
2. On serve: intercept Otterwiki's attachment serving path, resolve the reference, and redirect to S3 (or serve via CloudFront)

This could be implemented as:
- **Otterwiki plugin** that hooks into the attachment upload/serve lifecycle
- **Upstream patch** to Otterwiki adding a pluggable storage backend for attachments (local filesystem vs. S3 vs. LFS)
- **Lambda middleware** that intercepts attachment routes and serves from S3 before Otterwiki sees the request

The plugin or upstream patch approach is preferable — it benefits the broader Otterwiki community and keeps our fork minimal.

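The reference file can follow the Git LFS pointer style. A minimal sketch — the field set below is modeled on LFS, but treating it as our own format is an assumption:

```python
# LFS-style pointer file for an S3-backed attachment (sketch, not a spec).

def format_pointer(sha256: str, size: int) -> str:
    """Serialize a pointer: version line, content hash, byte size."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256}\n"
        f"size {size}\n"
    )

def parse_pointer(text: str) -> dict:
    """Parse a pointer back into its oid and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {"oid": fields["oid"].split(":", 1)[1], "size": int(fields["size"])}
```

The serve path would hash-address the S3 object by `oid`, so renames in the wiki never move bytes in S3.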
---

## Git Remote Access

Every wiki's bare repo is directly accessible via Git protocol over HTTPS. This is a core feature, not an afterthought — users should never feel locked in.

### Hosted Git remote

```
https://sderle.wikibot.io/third-gulf-war.git
```

Authentication: OAuth JWT or MCP bearer token via a Git credential helper, or a dedicated Git access token (simpler for CLI usage).

**Free tier**: read-only. Users can `git clone` and `git pull` their wiki at any time. This is a data portability guarantee — your wiki is always yours.

**Premium tier**: read/write. Users can `git push` to the hosted remote, enabling workflows like local editing, CI/CD integration, or scripted bulk imports.

### Implementation

API Gateway route (`/{user}/{wiki}.git/*`) → Lambda implementing the Git smart HTTP protocol (`git-upload-pack` for clone/fetch, `git-receive-pack` for push). The Lambda accesses the same EFS-mounted repo as the wiki handlers.

This is a well-documented protocol — the Lambda needs to handle only a handful of smart HTTP endpoints (`/info/refs`, `/git-upload-pack`, `/git-receive-pack`). Libraries like dulwich can serve these without shelling out to git.

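The one spot that's easy to get wrong in the smart HTTP handshake is framing: the `/info/refs` response is pkt-line encoded (a 4-hex-digit length prefix that counts itself plus the payload, with `0000` as a flush packet). A stdlib-only sketch, independent of whichever Git library ends up serving the packs:

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one pkt-line: 4 hex digits of total length, then the payload."""
    return b"%04x" % (len(payload) + 4) + payload

FLUSH = b"0000"  # flush-pkt: terminates a pkt-line section

def info_refs_prelude(service: str) -> bytes:
    """Opening bytes of a smart /info/refs response for the given service."""
    return pkt_line(b"# service=" + service.encode() + b"\n") + FLUSH
```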
### External Git sync (premium, future)

Bidirectional sync with an external remote (GitHub, GitLab, etc.). A separate Lambda triggered on schedule (EventBridge) or webhook:

1. Open wiki repo on EFS
2. `git fetch` from configured external remote
3. Attempt fast-forward merge (no conflicts → auto-merge)
4. Conflicts → flag for human resolution, do not auto-merge
5. Push merged state to external remote
6. Trigger re-embedding if semantic search enabled

Credentials for external remotes stored in Secrets Manager (per-wiki secret, auto-rotation support). This is a future feature — the hosted Git remote is the MVP.

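The steps above can be sketched as a pure command plan (the git subcommands are standard; running them against a specific repo is left to the Lambda):

```python
def sync_plan(remote_url: str, branch: str = "main") -> list[list[str]]:
    """Git commands the sync Lambda would run, in order (sketch only)."""
    return [
        ["git", "fetch", remote_url, branch],
        # exits 0 iff HEAD is an ancestor of FETCH_HEAD, i.e. fast-forward is possible
        ["git", "merge-base", "--is-ancestor", "HEAD", "FETCH_HEAD"],
        ["git", "merge", "--ff-only", "FETCH_HEAD"],
        ["git", "push", remote_url, branch],
    ]
```

If the ancestry check fails, the Lambda skips the merge and flags the wiki for human resolution instead.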
---

## Platform: AWS Lambda + EFS

### Why AWS + EFS

EFS (Elastic File System) is AWS's managed NFS service. Lambda mounts EFS volumes directly, eliminating the git-on-S3 clone/push cycle. Git repos live on a persistent filesystem — Lambda reads/writes them in place, just like a VPS. Combined with AWS's managed service catalog (DynamoDB, DynamoDB Streams, CloudFront, ACM) and first-class Pulumi support, this is the strongest fit.

### Key properties

- **Git repos on EFS work like local disk** — no clone, no push, no S3 sync, no write locks
- EFS Infrequent Access: $0.016/GB/month — a 2MB wiki costs ~$0.00003/month at rest
- NFS handles concurrent access natively (git's `index.lock` works on NFS)
- Lambda scales to zero; EFS cost is storage-only when idle
- Built-in backup via AWS Backup (to S3)

### VPC networking

EFS requires Lambda to run in a VPC. A VPC-attached Lambda can't reach AWS services (DynamoDB, SQS, Bedrock, S3) over the public internet — it needs either a NAT Gateway ($32/mo minimum, which kills "zero cost at rest") or VPC endpoints.

- **Gateway endpoints** (free): DynamoDB, S3 — route traffic through the VPC route table
- **Interface endpoints** (~$7/mo each per AZ): Secrets Manager — ENI-based, billed hourly plus per-GB. SQS and Bedrock endpoints are no longer needed (semantic search uses DynamoDB Streams + local MiniLM embeddings; see [[Archive/AWS_Design/Async_Embedding_Pipeline]])
- Minimize AZs in dev (1 AZ = 1× endpoint cost); prod needs 2 AZs for availability

This is a Phase 0 infrastructure requirement — without endpoints, Lambda can mount EFS but can't reach DynamoDB for ACL checks.

### Known trade-offs

- EFS requires a VPC → ~1–2s added to Lambda cold starts (Provisioned Concurrency is available if this proves unacceptable — ~$10-15/mo for one warm instance)
- EFS latency (~1–5ms per op) is higher than local disk but adequate for git
- The Mangum adapter is needed to run Flask on Lambda
- API Gateway's 29s timeout limits long operations
- VPC interface endpoints add fixed monthly cost (~$7-14/mo in prod for Secrets Manager across 2 AZs)

### S3 fallback

If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with DynamoDB write locks and clone-to-`/tmp`. This adds significant complexity (locking, cache management, `/tmp` eviction) — last resort only.

### Alternatives considered

**Fly.io Machines**: Persistent volumes, unmodified Otterwiki, simplest architecture. But weaker IaC, no managed services (embeddings, queues, metadata). Simplest fallback if EFS fails.

**Google Cloud Run**: Cloud Storage FUSE less proven than EFS for filesystem workloads. No clear advantage over AWS+EFS. Less familiar territory.

### Phase 0 validates this decision

Exit criteria: page read <500ms warm, page write <1s warm, cold start <5s total.

---

## Infrastructure as Code

All infrastructure is managed declaratively. No manual console clicking, no snowflake configuration.

### Tool: Pulumi (Python)

Pulumi with the Python SDK is the primary IaC tool. Rationale:
- The application is Python, so infrastructure code in the same language reduces context switching
- Pulumi has first-class AWS support (Lambda, EFS, API Gateway, DynamoDB, ACM, CloudFront, VPC)
- Full programming language (loops, conditionals, abstractions) vs. HCL's declarative constraints
- Strong secret management built in (`pulumi config set --secret`)

### What's managed by IaC

Everything that isn't application code:
- Lambda functions (compute, async handlers)
- EFS filesystem and mount targets
- VPC, subnets, security groups, VPC endpoints (gateway + interface)
- API Gateway and CloudFront distributions
- WAF web ACLs and rule groups
- DynamoDB tables
- ACM certificates and Route 53 DNS records
- IAM roles and policies
- SQS queues
- Secrets Manager secrets
- EventBridge schedules
- CloudWatch monitoring, alerting, and billing alarms
- X-Ray tracing configuration
- Auth provider configuration (WorkOS AuthKit)
- Stripe webhook endpoints

### What's NOT managed by IaC

- Application code (deployed via the CI/CD pipeline, not Pulumi)
- User data (wiki repos, DynamoDB records)
- Secrets from external services (Stripe API keys, auth provider secrets) — stored in Pulumi config (`pulumi config set --secret`) and injected as Lambda environment variables during development. Migrated to AWS Secrets Manager pre-launch (Phase 4) for rotation without redeployment and audit trails. The VPC interface endpoint for Secrets Manager is only needed at that point.

### Repository structure

```
wikibot-io/
  infra/            # Pulumi project
    __main__.py     # Infrastructure definition
    config/
      dev.yaml      # Dev stack config
      prod.yaml     # Prod stack config
  app/              # Application code
    otterwiki/      # Fork (git submodule or subtree)
    api/            # REST API handlers
    mcp/            # MCP server handlers
    management/     # User/wiki management API
    frontend/       # SPA
  deploy/           # CI/CD pipeline definitions
```

---

## Otterwiki Fork Management

The Otterwiki fork is kept as minimal as possible. All customizations are one of:
1. **Plugins** (preferred) — no core changes needed
2. **Small, upstreamable patches** — landed in `schuyler/otterwiki` and submitted as PRs to the upstream `redimp/otterwiki` project
3. **Platform-specific overrides** — admin panel section hiding, template conditionals (kept in a separate branch or patch set)

### Merge strategy

- Track upstream `redimp/otterwiki` as a remote
- Periodically rebase or merge upstream changes into the fork
- Keep platform-specific changes isolated (ideally a thin layer on top, not interleaved with upstream code)
- Automated CI check: does the fork still pass upstream's test suite after a merge?

### Upstream relationship

We want to support Otterwiki as a project. Contributions go upstream where possible. If the product generates revenue, donate a portion to the upstream maintainer.

---

## Backup and Disaster Recovery

### What we're protecting

| Data | Source of truth | Severity of loss |
|------|----------------|-----------------|
| Git repos (wiki content) | EFS | **Critical** — user data, irreplaceable |
| DynamoDB (users, wikis, ACLs) | DynamoDB | **High** — reconstructable from repos but painful |
| FAISS indexes | EFS | **Low** — fully rebuildable from repo content |
| Auth provider state | WorkOS (external) | **Low** — managed by vendor |

### Backup strategy

**Git repos (EFS)**: AWS Backup with daily snapshots, 30-day retention. EFS supports point-in-time recovery via backup. Cost: negligible for small repos.

**DynamoDB**: Point-in-Time Recovery (PITR) — continuous backups, restore to any second in the last 35 days. Cost: ~$0.20/GB/month (pennies at our data volume).

**FAISS indexes**: No backup needed. Rebuildable from repo content via the embedding Lambda (MiniLM runs locally, no API cost). Loss means a one-time re-embedding of all pages — seconds of Lambda compute per wiki.

### Recovery scenarios

| Scenario | Recovery path | RPO | RTO |
|----------|--------------|-----|-----|
| Single wiki repo corrupted | Restore from EFS backup snapshot | 24h (daily backup) | Minutes |
| Bad push overwrites repo | Restore from EFS backup snapshot | 24h | Minutes |
| DynamoDB corruption | PITR restore | Seconds | Minutes |
| DynamoDB total loss | PITR restore; worst case, reconstruct from EFS repo inventory | Seconds | Hours |
| FAISS index lost | Re-embed all pages for the affected wiki | N/A (rebuildable) | Minutes per wiki |
| Full region outage | Accept downtime | N/A | Depends on provider recovery |

### Design principle

Git repos are the source of truth. Everything else (DynamoDB records, FAISS indexes) is either backed up with PITR or rebuildable from the repos. A DynamoDB wipe is painful but survivable — you can walk the EFS filesystem and reconstruct user/wiki records from the repo inventory.

---

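That reconstruction can be sketched as a pure function over the repo inventory. The `{user}/{wiki}.git` directory layout is an assumption inferred from the hosted Git remote URLs, not a stated spec:

```python
# Sketch: rebuild user→wikis records from an EFS repo inventory.
# Assumes repos live at <user>/<wiki>.git on the filesystem (assumption).

def records_from_paths(repo_paths: list[str]) -> dict[str, list[str]]:
    """Map each user to the wiki names found under their directory."""
    records: dict[str, list[str]] = {}
    for path in repo_paths:
        user, _, repo = path.partition("/")
        if repo.endswith(".git"):
            records.setdefault(user, []).append(repo.removesuffix(".git"))
    return records
```

ACL grants and premium flags are not recoverable this way — only ownership — which is why the loss is "painful but survivable."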
## CI/CD

Code lives in a private GitHub repo. Deployment via GitHub Actions.

### Pipeline

```
git push to main
→ GitHub Actions:
  1. Run tests (pytest for Python, vitest/jest for frontend)
  2. Build artifacts (Lambda zip or container image, SPA bundle)
  3. Deploy infrastructure changes (pulumi up)
  4. Deploy Lambda code (zip upload or ECR image push)
  5. Smoke test (hit health endpoint, create/read/delete a test page)
```

### Environment strategy

- **dev**: auto-deploy on push to `main`. Separate infrastructure stack (`pulumi stack select dev`). Own domain (`dev.wikibot.io`).
- **prod**: manual promotion (GitHub Actions workflow dispatch or tag-based). Separate Pulumi stack. `wikibot.io`.

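A dev-deploy workflow along these lines — the file name, action versions, and secret names here are illustrative assumptions, not the actual pipeline:

```yaml
# .github/workflows/deploy.yml — illustrative sketch
name: deploy-dev
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r requirements.txt && pytest
      - uses: pulumi/actions@v5
        with:
          command: up
          stack-name: dev
        env:
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}
```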
---

## Account Lifecycle

### Data retention

User accounts and wiki data are retained indefinitely regardless of activity. Storage cost for an idle wiki is effectively zero (a few KB in DynamoDB, a few MB of git repo on EFS Infrequent Access). There is no reason to delete inactive accounts — keeping them costs nothing, and deleting user data is irreversible.

### Account deletion

Users can delete their account from the dashboard. This:
1. Deletes all wikis owned by the user (repo, FAISS index, metadata)
2. Removes all ACL grants the user holds on other wikis
3. Deletes the user record from DynamoDB
4. Does NOT delete the auth provider account (Google/GitHub/etc.) — that's the user's own account

Deletion is permanent and irreversible. Require explicit confirmation ("type your username to confirm").

### GDPR

If serving EU users: account deletion satisfies the right to erasure. Add a data export endpoint (download all wikis as a zip of git repos) to satisfy the right to portability — though the Git remote access feature already provides this.

---

## MCP Discoverability

MCP tool descriptions must be self-documenting — any MCP-capable client (Claude, GPT, Gemini, open-source agents) should be able to use the wiki tools without reading external documentation.

Each tool's MCP description should include:
- What it does
- Parameter semantics (e.g., "path is like `Actors/Iran`, not a filesystem path")
- What the return format looks like
- Common next actions ("use `list_notes` to find available pages if you don't know the path")

The bootstrap template's Meta/Wiki Usage Guide provides Claude-specific conventions (session protocol, gardening duties), but the MCP tools themselves should work without it. The guide is an optimization, not a prerequisite.

---

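Concretely, a tool definition meeting those criteria might look like this sketch for a hypothetical `read_note` tool (the tool name and description text are illustrative, not the shipped definitions; `inputSchema` is MCP's field name for the JSON Schema of parameters):

```python
# Self-documenting MCP tool definition (illustrative sketch).
READ_NOTE_TOOL = {
    "name": "read_note",
    "description": (
        "Read a wiki page. `path` is a wiki path like 'Actors/Iran', not a "
        "filesystem path. Returns the page's markdown, including frontmatter. "
        "If you don't know the path, call list_notes first."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Wiki path, e.g. 'Actors/Iran'"},
        },
        "required": ["path"],
    },
}
```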
## Rate Limiting and Abuse Prevention

**Launch**: OAuth-only accounts + tier limits (1 wiki, 500 pages, 3 collaborators) provide sufficient abuse prevention at low traffic. Public wiki routes are the only unauthenticated surface — an acceptable risk at launch with near-zero users.

**Post-launch (when traffic justifies it)**: AWS WAF on API Gateway and CloudFront. IP-based rate limiting, geographic blocking, bot control, OWASP Top 10 managed rule sets. Adds ~$5-10/mo. Deploy when there's real traffic to protect.

**Per-user rate limiting (premium launch)**: When the premium tier ships, add per-user throttling on API and MCP endpoints via API Gateway usage plans or WAF custom rules. Define specific limits when the need materializes.

---

## Open Questions

1. **EFS + Lambda performance**: The key Phase 0 question. Does EFS latency for git operations meet targets (<500ms read, <1s write warm)? Does VPC cold start stay under 5s total?

2. **Otterwiki on Lambda feasibility**: Otterwiki has filesystem assumptions beyond the git repo (config files, static assets). How much Mangum adaptation is needed? EFS satisfies most filesystem assumptions, but Flask-on-Lambda via Mangum still requires testing.

3. **Lambda package size**: Otterwiki + gitpython + FAISS + FastMCP + Mangum. If over the 250MB zip limit, use Lambda container images (up to 10GB).

4. **Git library choice**: gitpython shells out to `git` (a binary dependency — verify availability in the Lambda runtime). dulwich is pure Python (no binary, different API, possibly slower) and avoids the binary question entirely.

5. **MCP Streamable HTTP timeouts**: API Gateway caps at 29s. Most MCP operations complete in <5s, but semantic search with embedding generation could approach 10–15s. Verify this isn't a problem.

6. **Platform JWT signing key management**: RS256 keypair in Secrets Manager. Need to define a key rotation strategy — do we support multiple valid keys (JWKS with `kid` header) for zero-downtime rotation, or is manual rotation with a maintenance window acceptable for MVP?

7. **WorkOS + FastMCP integration on Lambda**: The FastMCP WorkOS integration is documented but needs validation in our specific setup (Lambda + API Gateway + VPC). Known friction points: the `client_secret_basic` default may conflict with some MCP clients, and there is no RFC 8707 resource indicator support. Validate in Phase 0.

8. **Apple provider sub retrieval**: WorkOS exposes raw OAuth provider `sub` claims via API for Google, GitHub, and Microsoft. Apple is undocumented. If we can't get Apple's raw `sub`, Apple users can't be migrated off WorkOS without re-authenticating. Verify in Phase 0.

9. **Otterwiki licensing**: MIT licensed — permissive, should be fine for commercial use. Confirm there are no additional contributor agreements or trademark restrictions.

10. **VPC endpoint costs** (resolved): SQS and Bedrock interface endpoints have been eliminated by switching to DynamoDB Streams (for async reindexing) and local MiniLM embeddings (for semantic search). The only remaining interface endpoint is Secrets Manager (~$7/mo/AZ, introduced in Phase 4). See [[Archive/AWS_Design/Async_Embedding_Pipeline]].