This page is part of the **wikibot.io PRD** (Product Requirements Document). See also: [[Design/Platform_Overview]], [[Design/Auth]], [[Design/Implementation_Phases]], [[Design/Operations]].

---

## Data Model

> **Superseded.** This page describes the DynamoDB/EFS data model for wikibot.io. See [[Design/VPS_Architecture]] for the current plan (SQLite, local disk). The ACL model and storage layout concepts carry forward; the DynamoDB-specific schema does not.

DynamoDB tables. Partition keys noted in comments.

#### Users

```
User {
  id: string,                    // platform-generated (UUID)
  email: string,
  display_name: string,
  oauth_provider: string,        // "google" | "github" | "microsoft" | "apple"
  oauth_provider_sub: string,    // provider-native subject ID (e.g., Google sub claim)
                                 // GSI on (oauth_provider, oauth_provider_sub) for login lookup
                                 // Critical: enables migration off WorkOS or any auth provider
  created_at: ISO8601,
  wiki_count: number,
  stripe_customer_id?: string
}
```

Note: the User model is deliberately thin on pricing fields. Under Option A (flat tier), add `tier: "free" | "premium"` and `wiki_limit: number`. Under Option B (per-wiki), no tier field is needed; billing state lives on each Wiki record. See [[Design/Implementation_Phases]] for pricing options.
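The login lookup that the GSI on `(oauth_provider, oauth_provider_sub)` is meant to serve can be illustrated with a small in-memory sketch. This is not the DynamoDB implementation: `UserStore`, `put`, and `find_by_identity` are hypothetical names, and a plain dictionary stands in for the index. Field names follow the User schema above.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class User:
    """Mirror of the User record; field names follow the schema above."""
    email: str
    display_name: str
    oauth_provider: str      # "google" | "github" | "microsoft" | "apple"
    oauth_provider_sub: str  # provider-native subject ID
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    wiki_count: int = 0
    stripe_customer_id: Optional[str] = None


class UserStore:
    """Hypothetical in-memory stand-in for the Users table and its GSI."""

    def __init__(self) -> None:
        self._by_id: Dict[str, User] = {}
        # Plays the role of the GSI on (oauth_provider, oauth_provider_sub).
        self._by_identity: Dict[Tuple[str, str], str] = {}

    def put(self, user: User) -> None:
        key = (user.oauth_provider, user.oauth_provider_sub)
        existing = self._by_identity.get(key)
        if existing is not None and existing != user.id:
            raise ValueError(f"identity {key} is already linked to another user")
        self._by_id[user.id] = user
        self._by_identity[key] = user.id

    def find_by_identity(self, provider: str, sub: str) -> Optional[User]:
        """Login lookup: resolve a provider identity to the platform user."""
        user_id = self._by_identity.get((provider, sub))
        return self._by_id.get(user_id) if user_id is not None else None
```

Because the lookup key is the provider's own subject ID rather than anything issued by an auth broker, the same records remain resolvable if the broker in front of the providers changes, which is the migration property the schema comment calls critical.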

#### Wikis

```
Wiki {
  owner_id: string,              // User.id
  wiki_slug: string,             // URL-safe identifier (under user namespace)
  custom_slug?: string,          // paid wikis: top-level slug for {slug}.wikibot.io
  display_name: string,
  repo_path: string,             // EFS path: /mnt/efs/{user_id}/{wiki_slug}/repo.git
  index_path?: string,           // FAISS index location (on EFS alongside repo)
  mcp_token_hash: string,        // bcrypt hash of MCP bearer token
  is_public: boolean,            // read-only public access
  is_paid: boolean,              // whether this wiki requires payment (i.e., not the free wiki)
  payment_status: "active" | "lapsed" | "free",
                                 // free = the user's one free wiki
                                 // active = paid and current
                                 // lapsed = payment failed/canceled → read-only, MCP disabled
  created_at: ISO8601,
  last_accessed: ISO8601,
  page_count: number,
}
```

#### ACLs

```
ACL {
  wiki_id: string,               // owner_id + wiki_slug
  grantee_id: string,            // User.id
  role: "owner" | "editor" | "viewer",
  granted_by: string,
  granted_at: ISO8601
}
```

### Storage layout (EFS)

```
/mnt/efs/
  {user_id}/
    {wiki_slug}/
      repo.git/          # bare git repo on the persistent filesystem
      index.faiss        # FAISS vector index
      embeddings.json    # sidecar: index position → page_path + chunk metadata
```

---

## Git Storage Mechanics

### EFS-backed git repos

Each wiki's bare git repo lives on a persistent filesystem mounted by the compute layer. No clone/push cycle, no caching, no application-level locks: git operations happen directly on disk.

**Read path:**

```
1. Lambda mounts EFS (already attached in VPC)
2. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
3. Read page from repo
```

**Write path:**

```
1. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
2. Commit page change
3. Write reindex record to DynamoDB ReindexQueue table
   (triggers embedding Lambda via DynamoDB Streams; see Semantic Search section)
```

**Concurrency**: NFS handles file-level locking natively. Git's own lock files (such as `index.lock` and the per-ref `.lock` files used in a bare repo) work correctly on NFS. Concurrent reads are unlimited. Concurrent writes to the same repo are serialized by git's lock files.
No application-level locking needed.

**Consistency**: Writes are immediately visible to all Lambda invocations mounting the same EFS filesystem. No eventual consistency concerns.

### Fallback: S3 clone-on-demand

If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with a DynamoDB write lock + clone-to-/tmp pattern. This adds significant complexity (locking, cache management, /tmp eviction) and is only worth pursuing if EFS fails testing.

---

## Semantic Search

Semantic search is available to all users (not tier-gated). See [[Design/Async_Embedding_Pipeline]] for the full architecture.

### Embedding pipeline (summary)

```
Page write (wiki Lambda, VPC)
  → DynamoDB write to ReindexQueue table (free gateway endpoint, already deployed)
  → DynamoDB Streams captures the change
  → Lambda service polls the stream (outside function's VPC context)
  → Embedding Lambda (VPC, EFS mount):
      1. Read page content from EFS repo
      2. Chunk page (same algorithm as otterwiki-semantic-search)
      3. Embed chunks using all-MiniLM-L6-v2 (runs locally, no external API)
      4. Update FAISS index + sidecar metadata on EFS
```

No Bedrock, no SQS, no new VPC endpoints. Total fixed cost: $0.

### FAISS details

FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for nearest-neighbor search over dense vectors.

**Index type**: `IndexFlatIP` (flat index, inner product similarity). For wikis under ~1000 pages, brute-force search is fast enough (<1ms) and requires no training or tuning. The index is just a matrix of vectors.

**Index size**: Each MiniLM vector is 384 floats × 4 bytes = 1.5KB. A 200-page wiki with ~3 chunks per page = 600 vectors = ~900KB index. Trivial to store on EFS and load into Lambda memory.

**Sidecar metadata**: FAISS stores only vectors and returns integer indices. The `embeddings.json` sidecar maps index positions back to `{page_path, chunk_index, chunk_text_preview}`. This file is loaded alongside the FAISS index.
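The index-size arithmetic and the sidecar lookup described above can be checked with a short sketch. It avoids importing FAISS so it runs anywhere; the `(position, score)` pairs stand in for what a flat-index search would return. The function names and the assumption that `embeddings.json` is a JSON array ordered by index position are illustrative, not taken from the design.

```python
import json
from typing import Dict, List, Tuple

DIM = 384            # all-MiniLM-L6-v2 embedding width
BYTES_PER_FLOAT = 4  # float32


def flat_index_bytes(pages: int, chunks_per_page: int) -> int:
    """Raw vector payload of a flat index (ignores the small file header)."""
    return pages * chunks_per_page * DIM * BYTES_PER_FLOAT


def resolve_hits(sidecar: List[Dict], hits: List[Tuple[int, float]]) -> List[Dict]:
    """Map (index position, score) pairs back to sidecar entries.

    Assumes the sidecar is a list whose i-th entry describes vector i.
    """
    results = []
    for position, score in hits:
        if position < 0:  # FAISS reports -1 when fewer than k neighbors exist
            continue
        entry = dict(sidecar[position])
        entry["score"] = score
        results.append(entry)
    return results


# Hypothetical sidecar content, in the shape the text describes.
SIDECAR_JSON = """[
  {"page_path": "history.md", "chunk_index": 0, "chunk_text_preview": "The war began"},
  {"page_path": "history.md", "chunk_index": 1, "chunk_text_preview": "By the second year"},
  {"page_path": "actors.md",  "chunk_index": 0, "chunk_text_preview": "Key participants"}
]"""

if __name__ == "__main__":
    size = flat_index_bytes(pages=200, chunks_per_page=3)
    print(size, "bytes =", size / 1024, "KiB")  # 921600 bytes = 900.0 KiB
    sidecar = json.loads(SIDECAR_JSON)
    for hit in resolve_hits(sidecar, [(2, 0.81), (0, 0.64), (-1, 0.0)]):
        print(hit["page_path"], hit["chunk_index"], hit["score"])
```

The 921,600-byte result matches the ~900KB figure quoted for a 200-page wiki.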

**Search flow**:

1. Embed query using MiniLM (loaded at Lambda init)
2. Load FAISS index + sidecar from EFS (~5ms, already mounted)
3. Search top K×3 vectors (<1ms)
4. Deduplicate by page_path, keep best chunk per page
5. Return top K results with page paths and matching chunk snippets

### Cost estimate

- Embedding a 200-page wiki: effectively $0 (Lambda compute only, a few seconds)
- Per search query: $0 (MiniLM runs locally)
- Re-embedding on page edits: negligible (DynamoDB write + Lambda invocation)
- VPC endpoints: $0 (uses existing DynamoDB gateway endpoint)

---

## URL Structure

Each user gets a subdomain: `{username}.wikibot.io`

```
sderle.wikibot.io/                          → user's wiki list (dashboard)
sderle.wikibot.io/third-gulf-war/           → wiki web UI (free wiki, under user namespace)
sderle.wikibot.io/third-gulf-war/api/v1/    → wiki REST API
sderle.wikibot.io/third-gulf-war/mcp        → wiki MCP endpoint
```

### Custom slugs (paid wikis)

Paid wikis get a top-level slug: `{slug}.wikibot.io`. This is a vanity URL that routes directly to the wiki without the username prefix. The slug is chosen at wiki creation time and must be globally unique (same validation rules as usernames: lowercase alphanumeric + hyphens, 3–30 characters, drawn from the same namespace/blocklist).

```
third-gulf-war.wikibot.io/                  → wiki web UI (paid wiki, top-level slug)
third-gulf-war.wikibot.io/api/v1/           → wiki REST API
third-gulf-war.wikibot.io/mcp               → wiki MCP endpoint
```

The user-namespace URL (`sderle.wikibot.io/third-gulf-war/`) continues to work as a redirect. This means existing MCP connections and bookmarks survive if a free wiki is later upgraded to paid.

Implementation: the Lambda resolver checks the subdomain against the Wikis table's `custom_slug` GSI first, then falls back to username resolution.

---

## Usernames

Each user chooses a username at signup (after OAuth).
Usernames are URL-critical (`{username}.wikibot.io`) so they must be:

- **URL-safe**: lowercase alphanumeric + hyphens, 3–30 characters, no leading/trailing hyphens
- **Unique**: enforced in DynamoDB
- **Immutable** (MVP): changing usernames means changing URLs, which breaks MCP connections, Git remotes, and bookmarks. Defer username changes (with redirect support) to a future iteration.
- **Reserved**: block names that conflict with platform routes or look official: `admin`, `www`, `api`, `auth`, `mcp`, `app`, `help`, `support`, `billing`, `status`, `blog`, `docs`, `robot`, `wiki`, `static`, `assets`, `null`, `undefined`, etc. Maintain a blocklist.

### Username squatting

Free accounts cost nothing to create, so squatting is possible. Mitigations:

- Require at least one wiki with at least one page edit within 90 days of signup, or the username is released
- Trademark disputes are handled case-by-case (standard UDRP-like process, documented in ToS)
- Not a launch concern; address when it becomes a real problem
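
The username rules listed under Usernames (URL-safe format, reserved names, uniqueness) could be enforced with a validator along these lines. This is a sketch, not the platform's code: `validate_username` and its return values are hypothetical, the `RESERVED` set contains only the names spelled out above (the real blocklist is expected to be longer), and the `taken` set stands in for the DynamoDB uniqueness check.

```python
import re
from typing import Optional, Set

# 3-30 characters: lowercase alphanumeric plus hyphens,
# with no leading or trailing hyphen.
USERNAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,28}[a-z0-9]")

# Only the names listed in this document; the real blocklist would be longer.
RESERVED = frozenset({
    "admin", "www", "api", "auth", "mcp", "app", "help", "support",
    "billing", "status", "blog", "docs", "robot", "wiki", "static",
    "assets", "null", "undefined",
})


def validate_username(name: str, taken: Set[str]) -> Optional[str]:
    """Return None if the name is acceptable, else a short rejection reason.

    `taken` should hold every name already claimed in the shared
    namespace (usernames and custom slugs).
    """
    if USERNAME_RE.fullmatch(name) is None:
        return "invalid-format"
    if name in RESERVED:
        return "reserved"
    if name in taken:
        return "taken"
    return None


if __name__ == "__main__":
    claimed = {"sderle"}
    for candidate in ["third-gulf-war", "sderle", "admin", "-bad", "ab"]:
        print(candidate, "->", validate_username(candidate, claimed))
```

Since custom slugs draw from the same namespace and blocklist, the same function could plausibly validate them as well.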