This page is part of the **wikibot.io PRD** (Product Requirements Document). See also: [[Design/Platform_Overview]], [[Design/Auth]], [[Design/Implementation_Phases]], [[Design/Operations]].

---

## Data Model

> **Superseded.** This page describes the DynamoDB/EFS data model for wikibot.io. See [[Design/VPS_Architecture]] for the current plan (SQLite, local disk). The ACL model and storage layout concepts carry forward; the DynamoDB-specific schema does not.

The schemas below describe the DynamoDB tables. Keys and indexes are noted in comments.

#### Users

```
User {
  id: string,                   // platform-generated (UUID)
  email: string,
  display_name: string,
  oauth_provider: string,       // "google" | "github" | "microsoft" | "apple"
  oauth_provider_sub: string,   // provider-native subject ID (e.g., Google sub claim)
                                // GSI on (oauth_provider, oauth_provider_sub) for login lookup
                                // Critical: enables migration off WorkOS or any auth provider
  created_at: ISO8601,
  wiki_count: number,
  stripe_customer_id?: string
}
```

Note: the User model is deliberately thin on pricing fields. Under Option A (flat tier), add `tier: "free" | "premium"` and `wiki_limit: number`. Under Option B (per-wiki), no tier field is needed; billing state lives on each Wiki record. See [[Design/Implementation_Phases]] for pricing options.

#### Wikis

```
Wiki {
  owner_id: string,             // User.id
  wiki_slug: string,            // URL-safe identifier (under user namespace)
  custom_slug?: string,         // paid wikis: top-level slug for {slug}.wikibot.io
  display_name: string,
  repo_path: string,            // EFS path: /mnt/efs/{user_id}/{wiki_slug}/repo.git
  index_path?: string,          // FAISS index location (on EFS alongside repo)
  mcp_token_hash: string,       // bcrypt hash of MCP bearer token
  is_public: boolean,           // read-only public access
  is_paid: boolean,             // whether this wiki requires payment (i.e., not the free wiki)
  payment_status: "active" | "lapsed" | "free",
                                // free = the user's one free wiki
                                // active = paid and current
                                // lapsed = payment failed/canceled → read-only, MCP disabled
  created_at: ISO8601,
  last_accessed: ISO8601,
  page_count: number,
}
```

#### ACLs

```
ACL {
  wiki_id: string,              // owner_id + wiki_slug
  grantee_id: string,           // User.id
  role: "owner" | "editor" | "viewer",
  granted_by: string,
  granted_at: ISO8601
}
```
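
The schemas above imply a small permission model. The following is a minimal sketch of how the `role`, `is_public`, and `payment_status` fields might combine. The role ordering (owner above editor above viewer), the `WikiState` class, and the three function names are assumptions made for illustration; the source does not define them.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering: a higher rank includes every capability of the lower ones.
ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}


@dataclass(frozen=True)
class WikiState:
    """Hypothetical subset of the Wiki record needed for access checks."""
    is_public: bool
    payment_status: str  # "active" | "lapsed" | "free"


def can_read(wiki: WikiState, role: Optional[str]) -> bool:
    """Any grantee can read; anyone can read a public wiki."""
    return wiki.is_public or role in ROLE_RANK


def can_write(wiki: WikiState, role: Optional[str]) -> bool:
    """Editors and owners can write unless payment has lapsed (read-only)."""
    if wiki.payment_status == "lapsed":
        return False
    return ROLE_RANK.get(role, 0) >= ROLE_RANK["editor"]


def mcp_enabled(wiki: WikiState) -> bool:
    """A lapsed wiki has its MCP endpoint disabled."""
    return wiki.payment_status != "lapsed"
```

Passing `role=None` represents a caller with no ACL entry, so a private wiki is unreadable and a public one is read-only for that caller.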

### Storage layout (EFS)

```
/mnt/efs/
  {user_id}/
    {wiki_slug}/
      repo.git/          # bare git repo on the persistent filesystem
      index.faiss        # FAISS vector index
      embeddings.json    # sidecar: index position → page/chunk metadata
```
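
As an illustration of the layout above, here is a small sketch that derives the three per-wiki paths from a user ID and wiki slug. The function name `wiki_paths` and the safety check on path components are assumptions, not part of the design.

```python
from pathlib import PurePosixPath

EFS_ROOT = PurePosixPath("/mnt/efs")


def wiki_paths(user_id: str, wiki_slug: str) -> dict:
    """Return the repo, index, and sidecar paths for one wiki."""
    for part in (user_id, wiki_slug):
        # Reject anything that could escape the wiki's directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    base = EFS_ROOT / user_id / wiki_slug
    return {
        "repo": base / "repo.git",
        "index": base / "index.faiss",
        "sidecar": base / "embeddings.json",
    }
```

Deriving paths from the two identifiers keeps the stored `repo_path` and `index_path` fields checkable against a single rule.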

---

## Git Storage Mechanics

### EFS-backed git repos

Each wiki's bare git repo lives on a persistent filesystem mounted by the compute layer. There is no clone/push cycle, no caching, and no application-level locking; git operations happen directly on disk.

**Read path:**
```
1. Lambda mounts EFS (already attached in VPC)
2. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
3. Read page from repo
```

**Write path:**
```
1. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
2. Commit page change
3. Write reindex record to DynamoDB ReindexQueue table
   (triggers embedding Lambda via DynamoDB Streams; see Semantic Search section)
```

**Concurrency**: NFS handles file-level locking natively. Git's own locking (`index.lock`) works correctly on NFS. Concurrent reads are unlimited. Concurrent writes to the same repo are serialized by git's lock file. No application-level locking needed.

**Consistency**: Writes are immediately visible to all Lambda invocations mounting the same EFS filesystem. No eventual consistency concerns.

### Fallback: S3 clone-on-demand

If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with a DynamoDB write lock + clone-to-/tmp pattern. This adds significant complexity (locking, cache management, /tmp eviction) and is only worth pursuing if EFS fails testing.

---

## Semantic Search

Semantic search is available to all users (not tier-gated). See [[Design/Async_Embedding_Pipeline]] for the full architecture.

### Embedding pipeline (summary)

```
Page write (wiki Lambda, VPC)
  → DynamoDB write to ReindexQueue table (free gateway endpoint, already deployed)
  → DynamoDB Streams captures the change
  → Lambda service polls the stream (outside function's VPC context)
  → Embedding Lambda (VPC, EFS mount):
      1. Read page content from EFS repo
      2. Chunk page (same algorithm as otterwiki-semantic-search)
      3. Embed chunks using all-MiniLM-L6-v2 (runs locally, no external API)
      4. Update FAISS index + sidecar metadata on EFS
```

No Bedrock, no SQS, no new VPC endpoints. Total fixed cost: $0.

### FAISS details

FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for nearest-neighbor search over dense vectors.

**Index type**: `IndexFlatIP` (flat index, inner product similarity). For wikis under ~1000 pages, brute-force search is fast enough (<1 ms) and requires no training or tuning. The index is just a matrix of vectors.

**Index size**: Each MiniLM vector is 384 floats × 4 bytes = 1,536 bytes (1.5 KB). A 200-page wiki with ~3 chunks per page has 600 vectors, for an index of about 900 KB. That is trivial to store on EFS and load into Lambda memory.
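
The size estimate can be checked with a few lines of arithmetic. This sketch counts only the raw float32 vectors and ignores the small fixed overhead of the index file; the function name is illustrative.

```python
DIM = 384             # MiniLM embedding dimension
BYTES_PER_FLOAT = 4   # float32


def flat_index_bytes(pages: int, chunks_per_page: int) -> int:
    """Raw vector storage for a flat index, excluding file overhead."""
    return pages * chunks_per_page * DIM * BYTES_PER_FLOAT


size = flat_index_bytes(pages=200, chunks_per_page=3)
print(size, size / 1024)  # 921600 bytes, i.e. 900.0 KiB
```

By the same formula, a 1,000-page wiki at 3 chunks per page needs about 4.4 MiB of vector storage.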

**Sidecar metadata**: FAISS stores only vectors and returns integer indices. The `embeddings.json` sidecar maps index positions back to `{page_path, chunk_index, chunk_text_preview}`. This file is loaded alongside the FAISS index.

**Search flow**:
1. Embed query using MiniLM (loaded at Lambda init)
2. Load FAISS index + sidecar from EFS (~5 ms, already mounted)
3. Search top K×3 vectors (<1 ms)
4. Deduplicate by page_path, keep best chunk per page
5. Return top K results with page paths and matching chunk snippets
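
Steps 3 to 5 of the search flow can be sketched in plain Python. This is not FAISS code: a brute-force inner product stands in for the `IndexFlatIP` search so the example runs without dependencies, and the query is assumed to be embedded already. The function name `top_pages` and the in-memory `sidecar` list are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def top_pages(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    sidecar: List[Dict[str, object]],
    k: int,
) -> List[Tuple[str, int, float]]:
    """Return up to k (page_path, chunk_index, score), best chunk per page."""
    # Brute-force inner product, standing in for a flat index search.
    scored = [
        (sum(q * v for q, v in zip(query, vec)), pos)
        for pos, vec in enumerate(vectors)
    ]
    # Over-fetch K×3 chunks so deduplication still leaves about K pages.
    scored.sort(key=lambda item: item[0], reverse=True)
    candidates = scored[: k * 3]

    best: Dict[str, Tuple[str, int, float]] = {}
    for score, pos in candidates:
        meta = sidecar[pos]
        path = str(meta["page_path"])
        # Candidates arrive best-first, so the first hit per page is its best chunk.
        if path not in best:
            best[path] = (path, int(meta["chunk_index"]), score)
    return list(best.values())[:k]
```

The K×3 over-fetch matters because several top-scoring chunks often come from the same page; without it, deduplication could return fewer than K pages.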

### Cost estimate

- Embedding a 200-page wiki: effectively $0 (Lambda compute only, on the order of seconds)
- Per search query: $0 (MiniLM runs locally)
- Re-embedding on page edits: negligible (DynamoDB write + Lambda invocation)
- VPC endpoints: $0 (uses existing DynamoDB gateway endpoint)

---

## URL Structure

Each user gets a subdomain: `{username}.wikibot.io`

```
sderle.wikibot.io/                          → user's wiki list (dashboard)
sderle.wikibot.io/third-gulf-war/           → wiki web UI (free wiki, under user namespace)
sderle.wikibot.io/third-gulf-war/api/v1/    → wiki REST API
sderle.wikibot.io/third-gulf-war/mcp        → wiki MCP endpoint
```

### Custom slugs (paid wikis)

Paid wikis get a top-level slug: `{slug}.wikibot.io`. This is a vanity URL that routes directly to the wiki without the username prefix. The slug is chosen at wiki creation time and must be globally unique (same validation rules as usernames: lowercase alphanumeric + hyphens, 3–30 characters, drawn from the same namespace/blocklist).

```
third-gulf-war.wikibot.io/           → wiki web UI (paid wiki, top-level slug)
third-gulf-war.wikibot.io/api/v1/    → wiki REST API
third-gulf-war.wikibot.io/mcp        → wiki MCP endpoint
```

The user-namespace URL (`sderle.wikibot.io/third-gulf-war/`) continues to work as a redirect. This means existing MCP connections and bookmarks survive if a free wiki is later upgraded to paid.

Implementation: the Lambda resolver checks the subdomain against the Wikis table's `custom_slug` GSI first, then falls back to username resolution.
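
The resolution order described above might look like the following sketch. The two in-memory lookups stand in for the `custom_slug` GSI query and the username lookup; the function name `resolve` and the shape of the returned dict are hypothetical, and the redirect from the user-namespace URL to the custom slug is left out.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-ins for the custom_slug GSI and the username lookup.
CUSTOM_SLUGS: Dict[str, Tuple[str, str]] = {
    "third-gulf-war": ("sderle", "third-gulf-war"),
}
USERNAMES = {"sderle"}


def resolve(host: str, path: str) -> Optional[dict]:
    """Map a request host and path to a dashboard or wiki target, or None."""
    suffix = ".wikibot.io"
    if not host.endswith(suffix):
        return None
    sub = host[: -len(suffix)]
    # 1. Custom slugs take priority: the whole subdomain names one wiki.
    if sub in CUSTOM_SLUGS:
        owner, slug = CUSTOM_SLUGS[sub]
        return {"kind": "wiki", "owner": owner, "wiki_slug": slug, "rest": path}
    # 2. Otherwise treat the subdomain as a username.
    if sub in USERNAMES:
        first, _, rest = path.lstrip("/").partition("/")
        if not first:
            return {"kind": "dashboard", "owner": sub}
        return {"kind": "wiki", "owner": sub, "wiki_slug": first, "rest": "/" + rest}
    return None
```

Because both lookups share one subdomain space, drawing slugs and usernames from the same namespace is what keeps this two-step order unambiguous.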

---

## Usernames

Each user chooses a username at signup (after OAuth). Usernames are URL-critical (`{username}.wikibot.io`) so they must be:

- **URL-safe**: lowercase alphanumeric + hyphens, 3–30 characters, no leading/trailing hyphens
- **Unique**: enforced in DynamoDB
- **Immutable** (MVP): changing usernames means changing URLs, which breaks MCP connections, Git remotes, bookmarks. Defer username changes (with redirect support) to a future iteration.
- **Reserved**: block names that conflict with platform routes or look official: `admin`, `www`, `api`, `auth`, `mcp`, `app`, `help`, `support`, `billing`, `status`, `blog`, `docs`, `robot`, `wiki`, `static`, `assets`, `null`, `undefined`, etc. Maintain a blocklist.
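
The URL-safety and reserved-name rules can be expressed as a short validator. This is a sketch: the blocklist holds only the names listed above (the real list is open-ended), uniqueness is not checked here, and the function name is illustrative. Whether consecutive hyphens should be allowed is not specified, so this version permits them.

```python
import re

# 3-30 characters, lowercase alphanumeric or hyphen, no leading/trailing hyphen.
USERNAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,28}[a-z0-9]")

# Only the names given in the text; the real blocklist would be longer.
RESERVED = frozenset({
    "admin", "www", "api", "auth", "mcp", "app", "help", "support", "billing",
    "status", "blog", "docs", "robot", "wiki", "static", "assets", "null",
    "undefined",
})


def is_valid_username(name: str) -> bool:
    """Check format and blocklist; uniqueness must be enforced by the datastore."""
    return USERNAME_RE.fullmatch(name) is not None and name not in RESERVED
```

Since custom slugs share the same rules and namespace, the same check could be applied to `custom_slug` values.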

### Username squatting

Free accounts cost nothing to create, so squatting is possible. Mitigations:
- Require at least one wiki with at least one page edit within 90 days of signup, or the username is released
- Trademark disputes handled case-by-case (standard UDRP-like process, documented in ToS)
- Not a launch concern; address it when it becomes a real problem