Commit 5f53cb

2026-03-13 01:49:23 Claude (Dev): [mcp] Port PRD data model to wiki
/dev/null .. design/prd data model.md
@@ 0,0 1,184 @@
+ This page is part of the **wikibot.io PRD** (Product Requirements Document). See also: [[Design/PRD Overview]], [[Design/PRD Auth]], [[Design/PRD Phases]], [[Design/PRD Operations]].
+
+ ---
+
+ ## Data Model
+
+ DynamoDB tables. Partition keys noted in comments.
+
+ #### Users
+
+ ```
+ User {
+ id: string, // platform-generated (UUID)
+ email: string,
+ display_name: string,
+ oauth_provider: string, // "google" | "github" | "microsoft" | "apple"
+ oauth_provider_sub: string, // provider-native subject ID (e.g., Google sub claim)
+ // GSI on (oauth_provider, oauth_provider_sub) for login lookup
+ // Critical: enables migration off WorkOS or any auth provider
+ tier: "free" | "premium",
+ created_at: ISO8601,
+ wiki_count: number,
+ wiki_limit: number, // 1 for free, 10 for premium
+ stripe_customer_id?: string
+ }
+ ```
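
The record above can be sketched as a Python dataclass to make the tier limits and the login lookup key concrete. This is an illustration only: `TIER_LIMITS`, `login_key`, and `can_create_wiki` are hypothetical names, and deriving `wiki_limit` from `tier` (rather than storing it, as the schema does) is a simplifying assumption.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Limits taken from the wiki_limit comment in the schema above.
TIER_LIMITS = {"free": 1, "premium": 10}


@dataclass
class User:
    email: str
    display_name: str
    oauth_provider: str        # "google" | "github" | "microsoft" | "apple"
    oauth_provider_sub: str    # provider-native subject ID
    tier: str = "free"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    wiki_count: int = 0
    stripe_customer_id: Optional[str] = None

    @property
    def wiki_limit(self) -> int:
        # Assumption: derived from tier here instead of being stored.
        return TIER_LIMITS[self.tier]

    def login_key(self) -> tuple:
        """Hypothetical helper: the (provider, sub) pair the GSI is keyed on."""
        return (self.oauth_provider, self.oauth_provider_sub)

    def can_create_wiki(self) -> bool:
        return self.wiki_count < self.wiki_limit
```

Keying login on `(oauth_provider, oauth_provider_sub)` rather than on an identifier issued by the auth vendor is what keeps the user table portable across auth providers.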
+
+ #### Wikis
+
+ ```
+ Wiki {
+ owner_id: string, // User.id
+ wiki_slug: string, // URL-safe identifier
+ display_name: string,
+ repo_path: string, // EFS path: /mnt/efs/{user_id}/{wiki_slug}/repo.git
+ index_path?: string, // FAISS index location (premium only)
+ mcp_token_hash: string, // bcrypt hash of MCP bearer token
+ is_public: boolean, // read-only public access
+ created_at: ISO8601,
+ last_accessed: ISO8601,
+ page_count: number,
+ semantic_search_enabled: boolean,
+ custom_domain?: string, // premium: CNAME target
+ custom_css?: string, // premium: custom styling
+ external_git_remote?: string // premium: sync target
+ }
+ ```
+
+ #### ACLs
+
+ ```
+ ACL {
+ wiki_id: string, // owner_id + wiki_slug
+ grantee_id: string, // User.id
+ role: "owner" | "editor" | "viewer",
+ granted_by: string,
+ granted_at: ISO8601
+ }
+ ```
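
A minimal sketch of how an ACL check over these records might look. The role ordering (viewer < editor < owner), the `#` separator in `make_wiki_id`, and the function names are assumptions for illustration; the schema only states that `wiki_id` is `owner_id + wiki_slug` and that `is_public` grants read-only access.

```python
# Assumption: roles are ordered viewer < editor < owner.
ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}


def make_wiki_id(owner_id: str, wiki_slug: str) -> str:
    # The schema says "owner_id + wiki_slug"; the "#" separator is an assumption.
    return f"{owner_id}#{wiki_slug}"


def has_role(acls, wiki_id, user_id, required, is_public=False):
    """Return True if user_id holds at least `required` on wiki_id."""
    if is_public and required == "viewer":
        return True  # is_public grants read-only access to everyone
    for entry in acls:
        if entry["wiki_id"] == wiki_id and entry["grantee_id"] == user_id:
            return ROLE_RANK[entry["role"]] >= ROLE_RANK[required]
    return False
```

For example, with a single entry granting `editor` to user `u2` on `u1#notes`, `has_role(acls, "u1#notes", "u2", "viewer")` is true and `has_role(acls, "u1#notes", "u2", "owner")` is false.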
+
+ ### Storage layout (EFS)
+
+ ```
+ /mnt/efs/
+ {user_id}/
+ {wiki_slug}/
+ repo.git/ # bare git repo — persistent filesystem
+ index.faiss # FAISS vector index (premium only)
+ embeddings.json # sidecar: index position → page_path/chunk metadata
+ ```
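
The layout above maps directly to a small path helper. The sketch below is one way to build the three paths; `wiki_paths` and `EFS_ROOT` are hypothetical names, and the check that rejects unsafe path components is an added assumption, not something the PRD specifies.

```python
from pathlib import PurePosixPath

EFS_ROOT = PurePosixPath("/mnt/efs")


def wiki_paths(user_id: str, wiki_slug: str) -> dict:
    """Return the repo, index, and sidecar paths for one wiki."""
    for part in (user_id, wiki_slug):
        # Assumption: refuse anything that could escape the wiki directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    base = EFS_ROOT / user_id / wiki_slug
    return {
        "repo": str(base / "repo.git"),
        "index": str(base / "index.faiss"),
        "embeddings": str(base / "embeddings.json"),
    }
```

Calling `wiki_paths("u1", "notes")["repo"]` yields `/mnt/efs/u1/notes/repo.git`.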
+
+ ---
+
+ ## Git Storage Mechanics
+
+ ### EFS-backed git repos
+
+ Each wiki's bare git repo lives on a persistent filesystem mounted by the compute layer. There is no clone/push cycle, no caching layer, and no application-level locking: git operations happen directly on disk.
+
+ **Read path:**
+ ```
+ 1. Lambda mounts EFS (already attached in VPC)
+ 2. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
+ 3. Read page from repo
+ ```
+
+ **Write path:**
+ ```
+ 1. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
+ 2. Commit page change
+ 3. If semantic search enabled: enqueue SQS message for reindex
+ ```
+
+ **Concurrency**: EFS is accessed over NFSv4, which supports the exclusive file creation that git's lock files (`*.lock`, e.g. on a ref being updated) depend on. Concurrent reads are unlimited. Concurrent writes to the same repo are arbitrated by git's lock files: one writer takes the lock and a competing writer fails and can retry. No application-level locking is needed.
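
To illustrate the lock-file pattern git relies on, the sketch below creates `<name>.lock` with exclusive-create semantics, writes the new content into it, and renames it over the target. It is a simplified stand-in, not git's actual implementation; `locked_update` is a hypothetical name, and the demo runs against a local temp directory rather than EFS.

```python
import os
import tempfile


def locked_update(path: str, new_content: bytes) -> bool:
    """Replace `path` atomically; return False if another writer holds the lock."""
    lock_path = path + ".lock"
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_content)
        os.replace(lock_path, path)  # atomic rename also releases the lock
    except Exception:
        os.unlink(lock_path)
        raise
    return True


# Usage: the second call loses because a lock file is already present.
demo_dir = tempfile.mkdtemp()
ref = os.path.join(demo_dir, "main")
first = locked_update(ref, b"commit-1\n")
open(ref + ".lock", "wb").close()  # simulate a concurrent writer
second = locked_update(ref, b"commit-2\n")
```

The losing writer sees a clean failure and the target file is never partially written, which is why a retry at the call site is enough.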
+
+ **Consistency**: Writes are immediately visible to all Lambda invocations mounting the same EFS filesystem. No eventual consistency concerns.
+
+ ### Fallback: S3 clone-on-demand
+
+ If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with a DynamoDB write lock + clone-to-/tmp pattern. This adds significant complexity (locking, cache management, /tmp eviction) and is only worth pursuing if EFS fails testing.
+
+ ---
+
+ ## Semantic Search (Premium)
+
+ ### Embedding pipeline
+
+ ```
+ Page write (Lambda)
+ → SQS message: {user, wiki, page_path, action: "upsert" | "delete"}
+ → Embedding Lambda (triggered by SQS):
+ 1. Read page content from EFS repo
+ 2. Chunk page (same algorithm as existing otterwiki-semantic-search)
+ 3. Call Bedrock titan-embed-text-v2 for each chunk
+ 4. Load current FAISS index from EFS
+ 5. Update index (remove old vectors for page, add new ones)
+ 6. Write updated index to EFS
+ 7. Update embeddings.json sidecar (index position → page_path and chunk metadata)
+ ```
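
The message shape in the pipeline above can be sketched as a producer/consumer pair. The field names come from the diagram; `build_reindex_message` and `parse_sqs_event` are hypothetical helpers, and the consumer assumes the standard Lambda SQS event shape in which each record carries the message in its `body` field.

```python
import json

VALID_ACTIONS = {"upsert", "delete"}


def build_reindex_message(user: str, wiki: str, page_path: str, action: str) -> str:
    """Serialize the reindex request that the page-write Lambda enqueues."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return json.dumps(
        {"user": user, "wiki": wiki, "page_path": page_path, "action": action},
        sort_keys=True,
    )


def parse_sqs_event(event: dict) -> list:
    """Decode every message body in a Lambda SQS event."""
    return [json.loads(record["body"]) for record in event.get("Records", [])]
```

Validating `action` on the producer side keeps malformed messages out of the queue, so the embedding Lambda only has to handle the two cases the diagram names.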
+
+ ### FAISS details
+
+ FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for nearest-neighbor search over dense vectors.
+
+ **Index type**: `IndexFlatIP` (flat index, inner product similarity). For wikis under ~1000 pages, brute-force search is fast enough (<1ms) and requires no training or tuning. The index is just a matrix of vectors.
+
+ **Index size**: Titan Text Embeddings V2 returns 1024-dimensional vectors by default, so each vector is 1024 floats × 4 bytes = 4KB. A 200-page wiki with ~3 chunks per page = 600 vectors = ~2.4MB index. Trivial to store on EFS and load into Lambda memory.
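
The size estimate is vector count × dimension × 4 bytes. Since the dimension depends on the embedding model and how it is configured, the sketch below takes it as a parameter; `flat_index_bytes` is a hypothetical helper and it ignores any small fixed overhead of the index file.

```python
BYTES_PER_FLOAT32 = 4


def flat_index_bytes(pages: int, chunks_per_page: int, dim: int) -> int:
    """Raw vector storage for a flat index: one float32 row per chunk."""
    return pages * chunks_per_page * dim * BYTES_PER_FLOAT32


# A 200-page wiki with ~3 chunks per page, at two common dimensions.
for dim in (1024, 1536):
    size = flat_index_bytes(pages=200, chunks_per_page=3, dim=dim)
    print(f"dim={dim}: {size / 1024:.0f} KiB")
```

Either way the index stays in the low single-digit megabytes for a wiki of this size.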
+
+ **Sidecar metadata**: FAISS stores only vectors and returns integer indices. The `embeddings.json` sidecar maps index positions back to `{page_path, chunk_index, chunk_text_preview}`. This file is loaded alongside the FAISS index.
+
+ **Search flow**:
+ 1. Embed query via Bedrock (~100ms)
+ 2. Load FAISS index + sidecar from EFS (~5ms, already mounted)
+ 3. Search top K×3 vectors (<1ms)
+ 4. Deduplicate by page_path, keep best chunk per page
+ 5. Return top K results with page paths and matching chunk snippets
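
Steps 3 to 5 of the search flow can be sketched in plain Python. The brute-force inner product here stands in for what a flat inner-product FAISS index computes; `search` and the in-memory `vectors` list are illustrative assumptions, while the sidecar fields follow the description above.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, vectors, sidecar, k):
    """Over-fetch k*3 chunks, then keep the best chunk per page."""
    scored = sorted(
        ((inner_product(query_vec, v), i) for i, v in enumerate(vectors)),
        reverse=True,
    )[: k * 3]
    best = {}
    for score, i in scored:
        meta = sidecar[i]  # sidecar is indexed by vector position
        page = meta["page_path"]
        if page not in best:  # scored is sorted, so the first hit per page is its best
            best[page] = {
                "page_path": page,
                "score": score,
                "snippet": meta["chunk_text_preview"],
            }
    return list(best.values())[:k]
```

Over-fetching by a factor of three leaves room for several chunks of the same page to collapse into one result while still returning K distinct pages in most cases.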
+
+ ### Cost estimate
+
+ - Embedding a 200-page wiki: ~$0.02 (one-time)
+ - Per search query: ~$0.0001 (embed the query)
+ - 100 queries/day: ~$0.30/month
+ - Re-embedding on page edits: negligible
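
The monthly figure follows from the per-query cost. A small sketch of the arithmetic, assuming a 30-day month and the per-query estimate listed above (`monthly_query_cost` is a hypothetical helper):

```python
def monthly_query_cost(queries_per_day: int,
                       cost_per_query: float = 0.0001,
                       days: int = 30) -> float:
    """Estimated monthly spend on query embeddings, in dollars."""
    return queries_per_day * days * cost_per_query


print(f"${monthly_query_cost(100):.2f}/month")
```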
+
+ ---
+
+ ## URL Structure
+
+ Each user gets a subdomain: `{username}.wikibot.io`
+
+ ```
+ sderle.wikibot.io/ → user's wiki list (dashboard)
+ sderle.wikibot.io/third-gulf-war/ → wiki web UI
+ sderle.wikibot.io/third-gulf-war/api/v1/ → wiki REST API
+ sderle.wikibot.io/third-gulf-war/mcp → wiki MCP endpoint
+ ```
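
The routing table above can be sketched as a parser that turns a host and path into a username, wiki slug, and endpoint type. `parse_request` and the endpoint labels are hypothetical; the sketch covers only the four routes listed.

```python
BASE_DOMAIN = "wikibot.io"


def parse_request(host: str, path: str) -> dict:
    """Map {username}.wikibot.io/{wiki}/... to its routing target."""
    host = host.lower().split(":")[0]  # drop any port
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        raise ValueError(f"not a {BASE_DOMAIN} subdomain: {host!r}")
    username = host[: -len(suffix)]
    if not username or "." in username:
        raise ValueError(f"unexpected subdomain: {host!r}")
    parts = [p for p in path.split("/") if p]
    if not parts:
        return {"username": username, "wiki": None, "endpoint": "dashboard"}
    wiki, rest = parts[0], parts[1:]
    if rest[:2] == ["api", "v1"]:
        endpoint = "api"
    elif rest == ["mcp"]:
        endpoint = "mcp"
    else:
        endpoint = "web"
    return {"username": username, "wiki": wiki, "endpoint": endpoint}
```

For example, `parse_request("sderle.wikibot.io", "/third-gulf-war/mcp")` resolves to the `mcp` endpoint of the `third-gulf-war` wiki.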
+
+ ### Custom domains (premium)
+
+ Premium users can CNAME their own domain to their `{username}.wikibot.io` subdomain. Implementation: API Gateway custom domain + ACM certificate (free via AWS). The Lambda resolver checks DynamoDB for custom domain → user mapping.
+
+ ```
+ research.mysite.com → CNAME → sderle.wikibot.io
+ ```
+
+ This requires wildcard routing at the API Gateway level (`*.wikibot.io`) and TLS cert provisioning per custom domain. ACM's default quota is 2,500 certificates per account per Region, which is fine for early scale. At larger scale, CloudFront with SNI handles this better.
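
A minimal sketch of the resolver step: platform subdomains resolve directly, and any other host is looked up in the custom domain mapping. Here a plain dict stands in for the DynamoDB lookup, and `resolve_username` is a hypothetical name.

```python
def resolve_username(host: str, custom_domains: dict):
    """Return the username for a host, or None if it is unknown."""
    host = host.lower().rstrip(".")
    suffix = ".wikibot.io"
    if host.endswith(suffix):
        return host[: -len(suffix)]
    # Assumption: custom_domains mirrors the custom domain → user table.
    return custom_domains.get(host)
```

With `{"research.mysite.com": "sderle"}` as the mapping, both `research.mysite.com` and `sderle.wikibot.io` resolve to `sderle`.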
+
+ ---
+
+ ## Usernames
+
+ Each user chooses a username at signup (after OAuth). Usernames are URL-critical (`{username}.wikibot.io`) so they must be:
+
+ - **URL-safe**: lowercase alphanumeric + hyphens, 3–30 characters, no leading/trailing hyphens
+ - **Unique**: enforced in DynamoDB
+ - **Immutable** (MVP): changing a username changes its URLs, which breaks MCP connections, Git remotes, and bookmarks. Defer username changes (with redirect support) to a future iteration.
+ - **Reserved**: block names that conflict with platform routes or look official: `admin`, `www`, `api`, `auth`, `mcp`, `app`, `help`, `support`, `billing`, `status`, `blog`, `docs`, `robot`, `wiki`, `static`, `assets`, `null`, `undefined`, etc. Maintain a blocklist.
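
The format and blocklist rules can be sketched as a single validator. The regular expression encodes the 3 to 30 character, lowercase alphanumeric plus hyphen, no leading or trailing hyphen rule; the reserved set is just the names listed above, and `validate_username` is a hypothetical name. Uniqueness is not covered here because it has to be enforced at write time in DynamoDB.

```python
import re

# First and last characters are alphanumeric; 1 to 28 characters in between.
USERNAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,28}[a-z0-9]")

RESERVED = frozenset({
    "admin", "www", "api", "auth", "mcp", "app", "help", "support",
    "billing", "status", "blog", "docs", "robot", "wiki", "static",
    "assets", "null", "undefined",
})


def validate_username(name: str):
    """Return None if the name is acceptable, else a short reason."""
    if not USERNAME_RE.fullmatch(name):
        return "invalid format"
    if name in RESERVED:
        return "reserved"
    return None
```

Returning a reason string rather than a boolean lets the signup form tell the user whether to pick a different spelling or a different name entirely.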
+
+ ### Username squatting
+
+ Free accounts cost nothing to create, so squatting is possible. Mitigations:
+ - Require at least one wiki with at least one page edit within 90 days of signup, or the username is released
+ - Trademark disputes handled case-by-case (standard UDRP-like process, documented in ToS)
+ - Not a launch concern — address when it becomes a real problem
\ No newline at end of file