commit 5f53cb


2026-03-13 01:49:23 Claude (Dev): [mcp] Port PRD data model to wiki


`/dev/null` .. `design/prd data model.md`
@@ 0,0 1,184 @@
+	This page is part of the wikibot.io PRD (Product Requirements Document). See also: [[Design/PRD Overview]], [[Design/PRD Auth]], [[Design/PRD Phases]], [[Design/PRD Operations]].
+
+	---
+
+	## Data Model
+
+	DynamoDB tables. Partition keys noted in comments.
+
+	#### Users
+
+	```
+	User {
+	  id: string, // platform-generated (UUID)
+	  email: string,
+	  display_name: string,
+	  oauth_provider: string, // "google" | "github" | "microsoft" | "apple"
+	  oauth_provider_sub: string, // provider-native subject ID (e.g., Google sub claim)
+	  // GSI on (oauth_provider, oauth_provider_sub) for login lookup
+	  // Critical: enables migration off WorkOS or any auth provider
+	  tier: "free" | "premium",
+	  created_at: ISO8601,
+	  wiki_count: number,
+	  wiki_limit: number, // 1 for free, 10 for premium
+	  stripe_customer_id?: string
+	}
+	```
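The interaction between `tier`, `wiki_count`, and `wiki_limit` can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `User` dataclass, `WIKI_LIMITS`, `login_key`, and `can_create_wiki` are hypothetical names, and deriving `wiki_limit` from `tier` (rather than storing it as a field, as the schema does) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Limits taken from the schema comment: 1 wiki for free, 10 for premium.
WIKI_LIMITS = {"free": 1, "premium": 10}


@dataclass
class User:
    id: str
    email: str
    display_name: str
    oauth_provider: str
    oauth_provider_sub: str
    tier: str
    created_at: str
    wiki_count: int = 0
    stripe_customer_id: Optional[str] = None

    @property
    def wiki_limit(self) -> int:
        # Assumption: derived from tier instead of stored on the record.
        return WIKI_LIMITS[self.tier]

    def login_key(self) -> tuple:
        # The attribute pair the GSI is described as covering for login lookup.
        return (self.oauth_provider, self.oauth_provider_sub)

    def can_create_wiki(self) -> bool:
        return self.wiki_count < self.wiki_limit
```

Keeping the login key independent of any auth vendor's own user ID is what the "migration off WorkOS" comment relies on.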
+
+	#### Wikis
+
+	```
+	Wiki {
+	  owner_id: string, // User.id
+	  wiki_slug: string, // URL-safe identifier
+	  display_name: string,
+	  repo_path: string, // EFS path: /mnt/efs/{user_id}/{wiki_slug}/repo.git
+	  index_path?: string, // FAISS index location (premium only)
+	  mcp_token_hash: string, // bcrypt hash of MCP bearer token
+	  is_public: boolean, // read-only public access
+	  created_at: ISO8601,
+	  last_accessed: ISO8601,
+	  page_count: number,
+	  semantic_search_enabled: boolean,
+	  custom_domain?: string, // premium: CNAME target
+	  custom_css?: string, // premium: custom styling
+	  external_git_remote?: string // premium: sync target
+	}
+	```
+
+	#### ACLs
+
+	```
+	ACL {
+	  wiki_id: string, // owner_id + wiki_slug
+	  grantee_id: string, // User.id
+	  role: "owner" | "editor" | "viewer",
+	  granted_by: string,
+	  granted_at: ISO8601
+	}
+	```
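A permission check over these ACL records might look like the sketch below. Several things here are assumptions rather than PRD decisions: the `#` separator in `make_wiki_id` (the schema only says `owner_id + wiki_slug`), the role ordering in `ROLE_RANK`, and the rule that `is_public` grants viewer access to anyone.

```python
# Assumed ordering: each role includes the permissions of the ones below it.
ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}


def make_wiki_id(owner_id: str, wiki_slug: str) -> str:
    # The "#" separator is hypothetical; the schema does not specify one.
    return f"{owner_id}#{wiki_slug}"


def is_allowed(acls, wiki_id, user_id, required_role, is_public=False) -> bool:
    """Return True if user_id holds at least required_role on wiki_id."""
    if is_public and required_role == "viewer":
        return True  # public wikis are readable by anyone
    for acl in acls:
        if acl["wiki_id"] == wiki_id and acl["grantee_id"] == user_id:
            return ROLE_RANK[acl["role"]] >= ROLE_RANK[required_role]
    return False
```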
+
+	### Storage layout (EFS)
+
+	```
+	/mnt/efs/
+	  {user_id}/
+	    {wiki_slug}/
+	      repo.git/        # bare git repo on the persistent filesystem
+	      index.faiss      # FAISS vector index (premium only)
+	      embeddings.json  # page_path → vector mapping
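The layout above can be expressed as a small path helper. This is a sketch under assumptions: `wiki_paths` and `EFS_ROOT` are illustrative names, and the segment check is a suggested safeguard that the PRD does not specify.

```python
from pathlib import PurePosixPath

EFS_ROOT = PurePosixPath("/mnt/efs")


def wiki_paths(user_id: str, wiki_slug: str) -> dict:
    """Build the per-wiki paths from the storage layout."""
    for segment in (user_id, wiki_slug):
        # Reject anything that could escape the per-wiki directory.
        if not segment or "/" in segment or segment in (".", ".."):
            raise ValueError(f"unsafe path segment: {segment!r}")
    base = EFS_ROOT / user_id / wiki_slug
    return {
        "repo": base / "repo.git",
        "index": base / "index.faiss",
        "sidecar": base / "embeddings.json",
    }
```

For example, `wiki_paths("u1", "notes")["repo"]` yields `/mnt/efs/u1/notes/repo.git`, matching the `repo_path` comment in the Wiki record.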
+
+	---
+
+	## Git Storage Mechanics
+
+	### EFS-backed git repos
+
+	Each wiki's bare git repo lives on a persistent filesystem mounted by the compute layer. No clone/push cycle, no caching, and no application-level locks: git operations happen directly on disk.
+
+	Read path:
+	```
+	1. Lambda mounts EFS (already attached in VPC)
+	2. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
+	3. Read page from repo
+	```
+
+	Write path:
+	```
+	1. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
+	2. Commit page change
+	3. If semantic search enabled: enqueue SQS message for reindex
+	```
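Step 3 of the write path could be sketched as below. Only the message fields (`user`, `wiki`, `page_path`, `action`) come from the embedding pipeline section; the function name `reindex_message` and the choice of JSON serialization are assumptions, and the actual SQS send is left out.

```python
import json
from typing import Optional

VALID_ACTIONS = ("upsert", "delete")


def reindex_message(user: str, wiki: str, page_path: str, action: str,
                    semantic_search_enabled: bool) -> Optional[str]:
    """Return the SQS message body for a page change, or None if no reindex is needed."""
    if not semantic_search_enabled:
        return None  # wikis without semantic search skip the queue entirely
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return json.dumps(
        {"user": user, "wiki": wiki, "page_path": page_path, "action": action}
    )
```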
+
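+	Concurrency: NFS handles file-level locking natively, and git's own lock files work correctly on NFS. Because these are bare repos, the relevant locks are git's per-ref `.lock` files rather than `index.lock`, which normally exists only in repos with a working tree. Concurrent reads are unlimited. Concurrent writes to the same ref are serialized by the lock file: a writer that finds the lock held fails (or briefly retries) rather than corrupting the repo. No application-level locking is needed.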
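Git's lock files rely on exclusive file creation: whoever creates the `.lock` file first holds the lock. The sketch below shows that general pattern on a local temporary directory. It does not call git or touch NFS, and `try_lock` / `unlock` are hypothetical helper names.

```python
import os
import tempfile


def try_lock(path: str):
    """Atomically create `path`; return a file descriptor, or None if it already exists."""
    try:
        return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None


def unlock(path: str, fd: int) -> None:
    os.close(fd)
    os.unlink(path)


with tempfile.TemporaryDirectory() as d:
    lock = os.path.join(d, "main.lock")
    first = try_lock(lock)    # first writer acquires the lock
    second = try_lock(lock)   # second writer is refused while it is held
    unlock(lock, first)
    third = try_lock(lock)    # free again once released
    unlock(lock, third)
```

Note that the refused writer gets an error instead of waiting, so a caller that wants queue-like behavior would need its own retry.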
+
+	Consistency: Writes are immediately visible to all Lambda invocations mounting the same EFS filesystem. No eventual consistency concerns.
+
+	### Fallback: S3 clone-on-demand
+
+	If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with a DynamoDB write lock + clone-to-/tmp pattern. This adds significant complexity (locking, cache management, /tmp eviction) and is only worth pursuing if EFS fails testing.
+
+	---
+
+	## Semantic Search (Premium)
+
+	### Embedding pipeline
+
+	```
+	Page write (Lambda)
+	  → SQS message: {user, wiki, page_path, action: "upsert" | "delete"}
+	  → Embedding Lambda (triggered by SQS):
+	      1. Read page content from EFS repo
+	      2. Chunk page (same algorithm as existing otterwiki-semantic-search)
+	      3. Call Bedrock titan-embed-text-v2 for each chunk
+	      4. Load current FAISS index from EFS
+	      5. Update index (remove old vectors for page, add new ones)
+	      6. Write updated index to EFS
+	      7. Update embeddings.json sidecar (page_path → chunk vectors mapping)
+	```
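Steps 5 and 7 (replace a page's vectors and keep the sidecar aligned) can be illustrated without FAISS. In this sketch a plain list of vectors stands in for the index, and `apply_update` is a hypothetical name. The sidecar entry fields follow the ones named in the FAISS details section.

```python
def apply_update(vectors, sidecar, page_path, action, new_chunks=()):
    """Return (vectors, sidecar) with page_path's old entries removed and,
    for "upsert", the new chunks appended.

    new_chunks is an iterable of (vector, text_preview) pairs.
    Position i in `vectors` always corresponds to sidecar[i].
    """
    keep = [i for i, meta in enumerate(sidecar) if meta["page_path"] != page_path]
    vectors = [vectors[i] for i in keep]
    sidecar = [sidecar[i] for i in keep]
    if action == "upsert":
        for n, (vec, preview) in enumerate(new_chunks):
            vectors.append(vec)
            sidecar.append(
                {"page_path": page_path, "chunk_index": n, "chunk_text_preview": preview}
            )
    return vectors, sidecar
```

Rebuilding both lists together is what keeps integer positions and sidecar entries in step after a removal.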
+
+	### FAISS details
+
+	FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for nearest-neighbor search over dense vectors.
+
+	Index type: `IndexFlatIP` (flat index, inner product similarity). For wikis under ~1000 pages, brute-force search is fast enough (<1ms) and requires no training or tuning. The index is just a matrix of vectors.
+
+	Index size: Each vector is 1024 floats (the default output dimension of titan-embed-text-v2) × 4 bytes = 4KB. A 200-page wiki with ~3 chunks per page = 600 vectors = ~2.4MB index. Trivial to store on EFS and load into Lambda memory.
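The size estimate is simple enough to check directly. In this sketch the vector dimension is a parameter, since it depends on the embedding model and how it is configured; `flat_index_bytes` is an illustrative name, and the figure covers only the raw float32 vectors, not file headers or the sidecar.

```python
def flat_index_bytes(pages: int, chunks_per_page: int, dim: int,
                     bytes_per_float: int = 4) -> int:
    """Approximate size of a flat float32 index: one dim-sized vector per chunk."""
    return pages * chunks_per_page * dim * bytes_per_float


size_1024 = flat_index_bytes(200, 3, 1024)  # 2_457_600 bytes, about 2.4 MB
size_1536 = flat_index_bytes(200, 3, 1536)  # 3_686_400 bytes, about 3.6 MB
```

Either way, the index for a 200-page wiki stays in the low single-digit megabytes.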
+
+	Sidecar metadata: FAISS stores only vectors and returns integer indices. The `embeddings.json` sidecar maps index positions back to `{page_path, chunk_index, chunk_text_preview}`. This file is loaded alongside the FAISS index.
+
+	Search flow:
+	1. Embed query via Bedrock (~100ms)
+	2. Load FAISS index + sidecar from EFS (~5ms, already mounted)
+	3. Search top K×3 vectors (<1ms)
+	4. Deduplicate by page_path, keep best chunk per page
+	5. Return top K results with page paths and matching chunk snippets
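Steps 3 to 5 of the search flow can be sketched in pure Python. A brute-force inner product stands in for `IndexFlatIP` here, and the query is assumed to be embedded already. The `search` function and the result field names `score` and `snippet` are illustrative.

```python
def search(query_vec, vectors, sidecar, k):
    """Over-fetch k*3 chunks, keep the best chunk per page, return the top k pages."""
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)

    best_per_page = {}  # insertion order follows descending score
    for score, i in scored[: k * 3]:
        page = sidecar[i]["page_path"]
        if page not in best_per_page:
            best_per_page[page] = {
                "page_path": page,
                "score": score,
                "snippet": sidecar[i]["chunk_text_preview"],
            }
    return list(best_per_page.values())[:k]
```

Over-fetching by a factor of three leaves room for several chunks of the same page to collapse into one result while still filling K slots.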
+
+	### Cost estimate
+
+	- Embedding a 200-page wiki: ~$0.02 (one-time)
+	- Per search query: ~$0.0001 (embed the query)
+	- 100 queries/day: ~$0.30/month
+	- Re-embedding on page edits: negligible
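The monthly query figure follows from the per-query estimate. A quick check, assuming a 30-day month (`monthly_query_cost` is an illustrative name, and the per-query cost is the estimate from the list above, not a quoted price):

```python
COST_PER_QUERY_USD = 0.0001  # per-query estimate from the list above


def monthly_query_cost(queries_per_day: int, days: int = 30,
                       cost_per_query: float = COST_PER_QUERY_USD) -> float:
    return queries_per_day * days * cost_per_query


monthly = round(monthly_query_cost(100), 2)  # 100 queries/day for 30 days
```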
+
+	---
+
+	## URL Structure
+
+	Each user gets a subdomain: `{username}.wikibot.io`
+
+	```
+	sderle.wikibot.io/ → user's wiki list (dashboard)
+	sderle.wikibot.io/third-gulf-war/ → wiki web UI
+	sderle.wikibot.io/third-gulf-war/api/v1/ → wiki REST API
+	sderle.wikibot.io/third-gulf-war/mcp → wiki MCP endpoint
+	```
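The URL scheme above maps to a small routing function. This is a sketch only: `parse_request`, the returned dictionary shape, and the `target` labels are assumptions, and real routing would sit behind API Gateway as described elsewhere in the PRD.

```python
from typing import Optional

BASE_DOMAIN = "wikibot.io"


def parse_request(host: str, path: str) -> Optional[dict]:
    """Split {username}.wikibot.io/{wiki}/... into username, wiki, and target."""
    suffix = "." + BASE_DOMAIN
    host = host.lower()
    if not host.endswith(suffix):
        return None
    username = host[: -len(suffix)]
    if not username or "." in username:
        return None  # only single-label subdomains are user subdomains

    parts = [p for p in path.split("/") if p]
    if not parts:
        return {"username": username, "wiki": None, "target": "dashboard"}

    wiki, rest = parts[0], parts[1:]
    if rest[:2] == ["api", "v1"]:
        target = "api"
    elif rest == ["mcp"]:
        target = "mcp"
    else:
        target = "web"
    return {"username": username, "wiki": wiki, "target": target}
```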
+
+	### Custom domains (premium)
+
+	Premium users can CNAME their own domain to their `{username}.wikibot.io` subdomain. Implementation: API Gateway custom domain + ACM certificate (free via AWS). The Lambda resolver checks DynamoDB for custom domain → user mapping.
+
+	```
+	research.mysite.com → CNAME → sderle.wikibot.io
+	```
+
+	This requires wildcard routing at the API Gateway level (`*.wikibot.io`) and TLS cert provisioning per custom domain. ACM's default quota is 2,500 certificates per account per Region, which is fine for early scale. At larger scale, CloudFront with SNI handles this better.
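The resolver step (custom domain → user) might look like the sketch below. A plain dictionary stands in for the DynamoDB lookup, and `resolve_username` is a hypothetical name.

```python
from typing import Optional

BASE_SUFFIX = ".wikibot.io"


def resolve_username(host: str, custom_domains: dict) -> Optional[str]:
    """Map a request host to a username.

    custom_domains stands in for the DynamoDB custom domain → user mapping.
    """
    host = host.lower().rstrip(".")
    if host.endswith(BASE_SUFFIX):
        return host[: -len(BASE_SUFFIX)] or None
    return custom_domains.get(host)
```

With `{"research.mysite.com": "sderle"}` as the mapping, both `sderle.wikibot.io` and `research.mysite.com` resolve to `sderle`.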
+
+	---
+
+	## Usernames
+
+	Each user chooses a username at signup (after OAuth). Usernames are URL-critical (`{username}.wikibot.io`) so they must be:
+
+	- URL-safe: lowercase alphanumeric + hyphens, 3–30 characters, no leading/trailing hyphens
+	- Unique: enforced in DynamoDB
+	- Immutable (MVP): changing a username changes its URLs, which breaks MCP connections, Git remotes, and bookmarks. Defer username changes (with redirect support) to a future iteration.
+	- Reserved: block names that conflict with platform routes or look official: `admin`, `www`, `api`, `auth`, `mcp`, `app`, `help`, `support`, `billing`, `status`, `blog`, `docs`, `robot`, `wiki`, `static`, `assets`, `null`, `undefined`, etc. Maintain a blocklist.
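The format and reserved-name rules above can be sketched as a validator. `username_error` and the error strings are illustrative. The `RESERVED` set contains only the names listed above, which the PRD treats as a partial list, and uniqueness is not checked here because it is enforced in DynamoDB.

```python
import re
from typing import Optional

# 3-30 chars, lowercase alphanumeric or hyphen, no leading/trailing hyphen.
USERNAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,28}[a-z0-9]")

# Only the names listed in the PRD; the real blocklist is expected to be longer.
RESERVED = {
    "admin", "www", "api", "auth", "mcp", "app", "help", "support",
    "billing", "status", "blog", "docs", "robot", "wiki", "static",
    "assets", "null", "undefined",
}


def username_error(name: str) -> Optional[str]:
    """Return a reason the username is rejected, or None if it passes."""
    if not USERNAME_RE.fullmatch(name):
        return "invalid format"
    if name in RESERVED:
        return "reserved"
    return None
```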
+
+	### Username squatting
+
+	Free accounts cost nothing to create, so squatting is possible. Mitigations:
+	- Require at least one wiki with at least one page edit within 90 days of signup, or the username is released
+	- Trademark disputes handled case-by-case (standard UDRP-like process, documented in ToS)
+	- Not a launch concern — address when it becomes a real problem
\	No newline at end of file
