Commit 5f53cb
2026-03-13 01:49:23 Claude (Dev): [mcp] Port PRD data model to wiki| /dev/null .. design/prd data model.md | |
| @@ 0,0 1,184 @@ | |
| + | This page is part of the **wikibot.io PRD** (Product Requirements Document). See also: [[Design/PRD Overview]], [[Design/PRD Auth]], [[Design/PRD Phases]], [[Design/PRD Operations]]. |
| + | |
| + | --- |
| + | |
| + | ## Data Model |
| + | |
| + | DynamoDB tables. Partition keys noted in comments. |
| + | |
| + | #### Users |
| + | |
| + | ``` |
| + | User { |
| + | id: string, // platform-generated (UUID) |
| + | email: string, |
| + | display_name: string, |
| + | oauth_provider: string, // "google" | "github" | "microsoft" | "apple" |
| + | oauth_provider_sub: string, // provider-native subject ID (e.g., Google sub claim) |
| + | // GSI on (oauth_provider, oauth_provider_sub) for login lookup |
| + | // Critical: enables migration off WorkOS or any auth provider |
| + | tier: "free" | "premium", |
| + | created_at: ISO8601, |
| + | wiki_count: number, |
| + | wiki_limit: number, // 1 for free, 10 for premium |
| + | stripe_customer_id?: string |
| + | } |
| + | ``` |
| + | |
| + | #### Wikis |
| + | |
| + | ``` |
| + | Wiki { |
| + | owner_id: string, // User.id |
| + | wiki_slug: string, // URL-safe identifier |
| + | display_name: string, |
| + | repo_path: string, // EFS path: /mnt/efs/{user_id}/{wiki_slug}/repo.git |
| + | index_path?: string, // FAISS index location (premium only) |
| + | mcp_token_hash: string, // bcrypt hash of MCP bearer token |
| + | is_public: boolean, // read-only public access |
| + | created_at: ISO8601, |
| + | last_accessed: ISO8601, |
| + | page_count: number, |
| + | semantic_search_enabled: boolean, |
| + | custom_domain?: string, // premium: CNAME target |
| + | custom_css?: string, // premium: custom styling |
| + | external_git_remote?: string // premium: sync target |
| + | } |
| + | ``` |
| + | |
| + | #### ACLs |
| + | |
| + | ``` |
| + | ACL { |
| + | wiki_id: string, // owner_id + wiki_slug |
| + | grantee_id: string, // User.id |
| + | role: "owner" | "editor" | "viewer", |
| + | granted_by: string, |
| + | granted_at: ISO8601 |
| + | } |
| + | ``` |
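| + | 
As a rough illustration of how the `role` field might be evaluated, the sketch below assumes a strict ordering (viewer < editor < owner) and a `wiki_id` built by joining `owner_id` and `wiki_slug` with `#`. Both the ordering and the separator are assumptions; the model above only says the ID is `owner_id + wiki_slug`.

```python
from dataclasses import dataclass

# Assumed ordering: a higher rank implies every lower permission.
ROLE_RANK = {"viewer": 0, "editor": 1, "owner": 2}


def make_wiki_id(owner_id: str, wiki_slug: str) -> str:
    """Compose the ACL key; the '#' separator is a hypothetical choice."""
    return f"{owner_id}#{wiki_slug}"


@dataclass(frozen=True)
class ACL:
    wiki_id: str
    grantee_id: str
    role: str  # "owner" | "editor" | "viewer"
    granted_by: str
    granted_at: str  # ISO 8601 timestamp


def has_at_least(acl: ACL, required: str) -> bool:
    """True if the grant's role is at or above the required role."""
    return ROLE_RANK[acl.role] >= ROLE_RANK[required]
```

A write handler could then call `has_at_least(acl, "editor")` before committing a page change.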
| + | |
| + | ### Storage layout (EFS) |
| + | |
| + | ``` |
| + | /mnt/efs/ |
| + | {user_id}/ |
| + | {wiki_slug}/ |
| + | repo.git/ # bare git repo — persistent filesystem |
| + | index.faiss # FAISS vector index (premium only) |
| + | embeddings.json # page_path → vector mapping |
| + | ``` |
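| + | 
The layout above maps directly onto a small path helper. This is a sketch, not the platform's code: the function names and the component-validation rule (lowercase alphanumerics and hyphens, intended to reject `..` and slashes) are assumptions.

```python
import re
from pathlib import PurePosixPath
from typing import Dict

EFS_ROOT = PurePosixPath("/mnt/efs")
# Assumed rule: components are lowercase alphanumerics and inner hyphens only.
SAFE_COMPONENT = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")


def wiki_dir(user_id: str, wiki_slug: str) -> PurePosixPath:
    """Return /mnt/efs/{user_id}/{wiki_slug}, rejecting unsafe components."""
    for part in (user_id, wiki_slug):
        if not SAFE_COMPONENT.fullmatch(part):
            raise ValueError(f"unsafe path component: {part!r}")
    return EFS_ROOT / user_id / wiki_slug


def wiki_paths(user_id: str, wiki_slug: str) -> Dict[str, PurePosixPath]:
    """Locations of the three per-wiki artifacts in the layout."""
    base = wiki_dir(user_id, wiki_slug)
    return {
        "repo": base / "repo.git",
        "index": base / "index.faiss",
        "embeddings": base / "embeddings.json",
    }
```

Validating components before joining keeps a crafted slug from escaping the user's directory.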
| + | |
| + | --- |
| + | |
| + | ## Git Storage Mechanics |
| + | |
| + | ### EFS-backed git repos |
| + | |
| + | Each wiki's bare git repo lives on a persistent filesystem mounted by the compute layer. Git operations happen directly on disk, with no clone/push cycle, no cache, and no application-level locks.
| + | |
| + | **Read path:** |
| + | ``` |
| + | 1. Lambda mounts EFS (already attached in VPC) |
| + | 2. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git |
| + | 3. Read page from repo |
| + | ``` |
| + | |
| + | **Write path:** |
| + | ``` |
| + | 1. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git |
| + | 2. Commit page change |
| + | 3. If semantic search enabled: enqueue SQS message for reindex |
| + | ``` |
| + | |
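| + | **Concurrency**: NFS handles file-level locking natively, and git's own lock files work on it. In a bare repo these are per-ref `*.lock` files and `packed-refs.lock`; a bare repo normally has no index, so `index.lock` does not apply. Concurrent reads are unlimited. Concurrent writes to the same ref are guarded by the lock file: a writer that finds the lock held fails after a short timeout instead of queueing, so the write path should retry. Beyond that retry, no application-level locking is needed.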
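| + | 
Git-style lock files rely on exclusive file creation: whichever process creates `<name>.lock` first wins, and everyone else has to retry. The sketch below illustrates that primitive with a retrying writer. It is not git's implementation; the function name, retry count, and delay are hypothetical.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def lock_file(target: str, retries: int = 5, delay: float = 0.05):
    """Hold `<target>.lock` for the duration of the with-block."""
    lock_path = target + ".lock"
    for _ in range(retries):
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(delay)
    else:
        raise TimeoutError(f"could not acquire {lock_path}")
    try:
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A second writer entering `lock_file` on the same target while the first still holds it gets a `TimeoutError` once its retries run out.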
| + | |
| + | **Consistency**: Writes are immediately visible to all Lambda invocations mounting the same EFS filesystem. No eventual consistency concerns. |
| + | |
| + | ### Fallback: S3 clone-on-demand |
| + | |
| + | If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with a DynamoDB write lock + clone-to-/tmp pattern. This adds significant complexity (locking, cache management, /tmp eviction) and is only worth pursuing if EFS fails testing. |
| + | |
| + | --- |
| + | |
| + | ## Semantic Search (Premium) |
| + | |
| + | ### Embedding pipeline |
| + | |
| + | ``` |
| + | Page write (Lambda) |
| + | → SQS message: {user, wiki, page_path, action: "upsert" | "delete"} |
| + | → Embedding Lambda (triggered by SQS): |
| + | 1. Read page content from EFS repo |
| + | 2. Chunk page (same algorithm as existing otterwiki-semantic-search) |
| + | 3. Call Bedrock titan-embed-text-v2 for each chunk |
| + | 4. Load current FAISS index from EFS |
| + | 5. Update index (remove old vectors for page, add new ones) |
| + | 6. Write updated index to EFS |
| + | 7. Update embeddings.json sidecar (page_path → chunk vectors mapping) |
| + | ``` |
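| + | 
Steps 5 and 7 of the pipeline (replace a page's vectors and keep the sidecar aligned) can be sketched with plain lists standing in for the FAISS index. The function name and the 80-character preview length are assumptions, and rebuilding the whole list is a simplification rather than a statement about how FAISS updates work.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Entry = Dict[str, object]  # {"page_path", "chunk_index", "chunk_text_preview"}


def upsert_page(
    vectors: List[Vector],
    sidecar: List[Entry],
    page_path: str,
    new_chunks: List[Tuple[str, Vector]],
) -> Tuple[List[Vector], List[Entry]]:
    """Return a new (vectors, sidecar) pair with page_path's chunks replaced.

    Position i in `vectors` always corresponds to position i in `sidecar`.
    """
    kept = [(v, e) for v, e in zip(vectors, sidecar) if e["page_path"] != page_path]
    out_vectors = [v for v, _ in kept]
    out_sidecar = [e for _, e in kept]
    for i, (text, vec) in enumerate(new_chunks):
        out_vectors.append(vec)
        out_sidecar.append(
            {"page_path": page_path, "chunk_index": i, "chunk_text_preview": text[:80]}
        )
    return out_vectors, out_sidecar
```

A `"delete"` action is the same call with an empty `new_chunks` list.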
| + | |
| + | ### FAISS details |
| + | |
| + | FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for nearest-neighbor search over dense vectors. |
| + | |
| + | **Index type**: `IndexFlatIP` (flat index, inner product similarity). For wikis under ~1000 pages, brute-force search is fast enough (<1ms) and requires no training or tuning. The index is just a matrix of vectors. |
| + | |
| + | **Index size**: Titan Text Embeddings V2 returns 1024-dimensional vectors by default, so each vector is 1024 floats × 4 bytes = 4KB. A 200-page wiki with ~3 chunks per page = 600 vectors = ~2.4MB index. Trivial to store on EFS and load into Lambda memory.
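| + | 
The size estimate is just vectors × dimensions × 4 bytes for a flat float32 index. A small helper, with the dimension left as a parameter because it depends on the embedding model and its configuration (the function name is illustrative):

```python
def index_size_bytes(pages: int, chunks_per_page: int, dims: int) -> int:
    """Approximate size of a flat float32 index: one row of `dims` floats per chunk."""
    bytes_per_float = 4
    return pages * chunks_per_page * dims * bytes_per_float
```

This counts only the vector matrix; the `embeddings.json` sidecar adds its own overhead.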
| + | |
| + | **Sidecar metadata**: FAISS stores only vectors and returns integer indices. The `embeddings.json` sidecar maps index positions back to `{page_path, chunk_index, chunk_text_preview}`. This file is loaded alongside the FAISS index. |
| + | |
| + | **Search flow**: |
| + | 1. Embed query via Bedrock (~100ms) |
| + | 2. Load FAISS index + sidecar from EFS (~5ms, already mounted) |
| + | 3. Search top K×3 vectors (<1ms)
| + | 4. Deduplicate by page_path, keep best chunk per page |
| + | 5. Return top K results with page paths and matching chunk snippets |
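| + | 
Steps 3 to 5 of the search flow can be sketched in pure Python, with a brute-force inner product standing in for `IndexFlatIP`. The function names and the result shape are assumptions; the query vector is taken as already embedded.

```python
from typing import Dict, List

Vector = List[float]


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(query_vec: Vector, vectors: List[Vector],
           sidecar: List[Dict[str, object]], k: int) -> List[Dict[str, object]]:
    """Top-k pages by best-matching chunk, over-fetching k*3 chunks first."""
    scored = sorted(
        ((inner_product(query_vec, v), i) for i, v in enumerate(vectors)),
        reverse=True,
    )
    best: Dict[str, Dict[str, object]] = {}
    for score, i in scored[: k * 3]:
        page = sidecar[i]["page_path"]
        # Candidates arrive in descending score order, so the first
        # chunk seen for a page is that page's best chunk.
        if page not in best:
            best[page] = {
                "page_path": page,
                "score": score,
                "snippet": sidecar[i]["chunk_text_preview"],
            }
    return list(best.values())[:k]
```

Over-fetching K×3 chunks leaves room for several chunks of one page to collapse into a single result while still returning up to K distinct pages.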
| + | |
| + | ### Cost estimate |
| + | |
| + | - Embedding a 200-page wiki: ~$0.02 (one-time) |
| + | - Per search query: ~$0.0001 (embed the query) |
| + | - 100 queries/day: ~$0.30/month |
| + | - Re-embedding on page edits: negligible |
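| + | 
The monthly figure follows from queries per day × days × cost per query. A one-function sketch using the per-query estimate above as the default (the function name and the 30-day month are assumptions):

```python
def monthly_query_cost(queries_per_day: float,
                       cost_per_query: float = 0.0001,
                       days: int = 30) -> float:
    """Estimated monthly spend on query embeddings, in dollars."""
    return queries_per_day * days * cost_per_query
```

With the defaults, 100 queries per day comes to about $0.30 per month.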
| + | |
| + | --- |
| + | |
| + | ## URL Structure |
| + | |
| + | Each user gets a subdomain: `{username}.wikibot.io` |
| + | |
| + | ``` |
| + | sderle.wikibot.io/ → user's wiki list (dashboard) |
| + | sderle.wikibot.io/third-gulf-war/ → wiki web UI |
| + | sderle.wikibot.io/third-gulf-war/api/v1/ → wiki REST API |
| + | sderle.wikibot.io/third-gulf-war/mcp → wiki MCP endpoint |
| + | ``` |
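| + | 
The URL scheme above can be read as a small routing function. The sketch below handles only `*.wikibot.io` hosts; the `Route` type and the endpoint labels are hypothetical names, not part of the PRD.

```python
from typing import NamedTuple, Optional

BASE_DOMAIN = "wikibot.io"


class Route(NamedTuple):
    username: str
    wiki_slug: Optional[str]
    endpoint: str  # "dashboard" | "web" | "api" | "mcp"


def parse_route(host: str, path: str) -> Optional[Route]:
    """Map a Host header and request path onto the URL scheme."""
    host = host.lower().split(":")[0]  # drop any port
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        return None  # not a platform subdomain
    username = host[: -len(suffix)]
    if not username or "." in username:
        return None  # exactly one label is expected before the base domain
    parts = [p for p in path.split("/") if p]
    if not parts:
        return Route(username, None, "dashboard")
    wiki_slug, rest = parts[0], parts[1:]
    if rest[:2] == ["api", "v1"]:
        return Route(username, wiki_slug, "api")
    if rest == ["mcp"]:
        return Route(username, wiki_slug, "mcp")
    return Route(username, wiki_slug, "web")
```

For example, `parse_route("sderle.wikibot.io", "/third-gulf-war/mcp")` yields the `mcp` endpoint for the `third-gulf-war` wiki.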
| + | |
| + | ### Custom domains (premium) |
| + | |
| + | Premium users can CNAME their own domain to their `{username}.wikibot.io` subdomain. Implementation: API Gateway custom domain + ACM certificate (free via AWS). The Lambda resolver checks DynamoDB for custom domain → user mapping. |
| + | |
| + | ``` |
| + | research.mysite.com → CNAME → sderle.wikibot.io |
| + | ``` |
| + | |
| + | This requires wildcard routing at the API Gateway level (`*.wikibot.io`) and TLS cert provisioning per custom domain. ACM's default quota is 2,500 certificates per account per Region, which is fine for early scale. At larger scale, CloudFront with SNI handles this better.
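| + | 
The resolver's host-to-user step might look like the sketch below, with a plain dict standing in for the DynamoDB custom-domain lookup. The function name and the table shape are assumptions.

```python
from typing import Dict, Optional

BASE_SUFFIX = ".wikibot.io"


def resolve_username(host: str, custom_domains: Dict[str, str]) -> Optional[str]:
    """Return the owning username for a Host header, or None if unknown.

    `custom_domains` stands in for the DynamoDB mapping of
    custom domain -> username.
    """
    host = host.lower().rstrip(".")
    if host.endswith(BASE_SUFFIX):
        label = host[: -len(BASE_SUFFIX)]
        return label if label and "." not in label else None
    return custom_domains.get(host)
```

Platform subdomains resolve without a lookup, so only custom-domain requests need the extra read.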
| + | |
| + | --- |
| + | |
| + | ## Usernames |
| + | |
| + | Each user chooses a username at signup (after OAuth). Usernames are URL-critical (`{username}.wikibot.io`) so they must be: |
| + | |
| + | - **URL-safe**: lowercase alphanumeric + hyphens, 3–30 characters, no leading/trailing hyphens |
| + | - **Unique**: enforced in DynamoDB |
| + | - **Immutable** (MVP): changing usernames means changing URLs, which breaks MCP connections, Git remotes, and bookmarks. Defer username changes (with redirect support) to a future iteration.
| + | - **Reserved**: block names that conflict with platform routes or look official: `admin`, `www`, `api`, `auth`, `mcp`, `app`, `help`, `support`, `billing`, `status`, `blog`, `docs`, `robot`, `wiki`, `static`, `assets`, `null`, `undefined`, etc. Maintain a blocklist. |
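| + | 
The URL-safe and reserved rules can be checked before any database write. A minimal sketch: the function name and error strings are illustrative, and the reserved set contains only the names listed above, whereas a real blocklist would be longer. Uniqueness still has to be enforced at write time.

```python
import re
from typing import Optional

# 3-30 chars: lowercase alphanumerics and hyphens, no leading/trailing hyphen.
USERNAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,28}[a-z0-9]")

RESERVED = frozenset({
    "admin", "www", "api", "auth", "mcp", "app", "help", "support",
    "billing", "status", "blog", "docs", "robot", "wiki", "static",
    "assets", "null", "undefined",
})


def username_error(name: str) -> Optional[str]:
    """Return a reason the username is rejected, or None if it passes."""
    if not USERNAME_RE.fullmatch(name):
        return "must be 3-30 lowercase alphanumerics or hyphens, no leading/trailing hyphen"
    if name in RESERVED:
        return "reserved name"
    return None
```

Returning a reason instead of a boolean lets the signup form show the user which rule failed.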
| + | |
| + | ### Username squatting |
| + | |
| + | Free accounts cost nothing to create, so squatting is possible. Mitigations: |
| + | - Require at least one wiki with at least one page edit within 90 days of signup, or the username is released |
| + | - Trademark disputes handled case-by-case (standard UDRP-like process, documented in ToS) |
| + | - Not a launch concern — address when it becomes a real problem |
| \ | No newline at end of file |