Data Model (8c343d)

This page is part of the wikibot.io PRD (Product Requirements Document). See also: Design/Platform_Overview, Design/Auth, Design/Implementation_Phases, Design/Operations.

Superseded. This page describes the DynamoDB/EFS data model for wikibot.io. See Design/VPS_Architecture for the current plan (SQLite, local disk). The ACL model and storage layout concepts carry forward; the DynamoDB-specific schema does not.

DynamoDB tables. Partition keys noted in comments.

Users

User {
  id: string,                  // platform-generated (UUID)
  email: string,
  display_name: string,
  oauth_provider: string,      // "google" | "github" | "microsoft" | "apple"
  oauth_provider_sub: string,  // provider-native subject ID (e.g., Google sub claim)
                               // GSI on (oauth_provider, oauth_provider_sub) for login lookup
                               // Critical: enables migration off WorkOS or any auth provider
  created_at: ISO8601,
  wiki_count: number,
  stripe_customer_id?: string
}
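
The login lookup described in the comments above can be sketched as follows. This is a minimal in-memory illustration, not the platform's code: `UserDirectory` and `find_by_login` are hypothetical names, and a plain dictionary stands in for the GSI on (oauth_provider, oauth_provider_sub).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    id: str                     # platform-generated (UUID)
    email: str
    display_name: str
    oauth_provider: str         # "google" | "github" | "microsoft" | "apple"
    oauth_provider_sub: str     # provider-native subject ID
    created_at: str             # ISO 8601 timestamp
    wiki_count: int = 0
    stripe_customer_id: Optional[str] = None


class UserDirectory:
    """Hypothetical in-memory stand-in for the Users table and its login GSI."""

    def __init__(self) -> None:
        self._by_id: dict[str, User] = {}
        # Plays the role of the GSI: (provider, sub) -> User.id
        self._by_login: dict[tuple[str, str], str] = {}

    def put(self, user: User) -> None:
        self._by_id[user.id] = user
        self._by_login[(user.oauth_provider, user.oauth_provider_sub)] = user.id

    def find_by_login(self, provider: str, sub: str) -> Optional[User]:
        user_id = self._by_login.get((provider, sub))
        return self._by_id.get(user_id) if user_id is not None else None
```

Because the key is the provider's own subject ID rather than anything issued by an auth broker, the same lookup keeps working if the broker is replaced.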

Note: the User model is deliberately thin on pricing fields. Under Option A (flat tier), add tier: "free" | "premium" and wiki_limit: number. Under Option B (per-wiki), no tier field is needed — billing state lives on each Wiki record. See Design/Implementation_Phases for pricing options.

Wikis

Wiki {
  owner_id: string,            // User.id
  wiki_slug: string,           // URL-safe identifier (under user namespace)
  custom_slug?: string,        // paid wikis: top-level slug for {slug}.wikibot.io
  display_name: string,
  repo_path: string,           // EFS path: /mnt/efs/{user_id}/{wiki_slug}/repo.git
  index_path?: string,         // FAISS index location (on EFS alongside repo)
  mcp_token_hash: string,      // bcrypt hash of MCP bearer token
  is_public: boolean,          // read-only public access
  is_paid: boolean,            // whether this wiki requires payment (i.e., not the free wiki)
  payment_status: "active" | "lapsed" | "free",
                               // free = the user's one free wiki
                               // active = paid and current
                               // lapsed = payment failed/canceled → read-only, MCP disabled
  created_at: ISO8601,
  last_accessed: ISO8601,
  page_count: number,
}
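
The payment_status comments above imply a small capability table. The sketch below is one way to encode it; `WikiCapabilities` and `capabilities_for` are hypothetical names, and treating "free" and "active" identically is an assumption read from those comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiCapabilities:
    can_read: bool
    can_write: bool
    mcp_enabled: bool


def capabilities_for(payment_status: str) -> WikiCapabilities:
    """Map a Wiki.payment_status value to what the wiki currently allows."""
    if payment_status in ("free", "active"):
        # The one free wiki and paid-and-current wikis are fully usable.
        return WikiCapabilities(can_read=True, can_write=True, mcp_enabled=True)
    if payment_status == "lapsed":
        # Payment failed or canceled: read-only, MCP disabled.
        return WikiCapabilities(can_read=True, can_write=False, mcp_enabled=False)
    raise ValueError(f"unknown payment_status: {payment_status!r}")
```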

ACLs

ACL {
  wiki_id: string,             // owner_id + wiki_slug
  grantee_id: string,          // User.id
  role: "owner" | "editor" | "viewer",
  granted_by: string,
  granted_at: ISO8601
}
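
A permission check over ACL records might look like the sketch below. Several details are assumptions not stated on this page: the "#" separator in the composite wiki_id, the action names, and the rule that each role includes the rights of the roles beneath it.

```python
# Assumed ordering: owner > editor > viewer.
ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}

# Hypothetical action names and the minimum role each one requires.
REQUIRED_ROLE = {"read": "viewer", "write": "editor", "manage_acl": "owner"}


def make_wiki_id(owner_id: str, wiki_slug: str) -> str:
    """Build the composite wiki_id; the '#' separator is an assumption."""
    return f"{owner_id}#{wiki_slug}"


def is_allowed(acls: list[dict], wiki_id: str, grantee_id: str, action: str) -> bool:
    """Return True if any ACL record grants grantee_id enough rights for action."""
    needed = ROLE_RANK[REQUIRED_ROLE[action]]
    return any(
        acl["wiki_id"] == wiki_id
        and acl["grantee_id"] == grantee_id
        and ROLE_RANK[acl["role"]] >= needed
        for acl in acls
    )
```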

Storage layout (EFS)

/mnt/efs/
  {user_id}/
    {wiki_slug}/
      repo.git/              # bare git repo — persistent filesystem
      index.faiss            # FAISS vector index
      embeddings.json        # page_path → vector mapping
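
The layout above maps directly to a small path helper. This is an illustrative sketch; `wiki_paths` is a hypothetical name, and the rejection of path separators in the inputs is an added safety assumption rather than a documented rule.

```python
from pathlib import PurePosixPath

EFS_ROOT = PurePosixPath("/mnt/efs")


def wiki_paths(user_id: str, wiki_slug: str) -> dict[str, PurePosixPath]:
    """Return the repo, index, and sidecar paths for one wiki."""
    for part in (user_id, wiki_slug):
        # Refuse anything that could escape the per-wiki directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    base = EFS_ROOT / user_id / wiki_slug
    return {
        "repo": base / "repo.git",
        "index": base / "index.faiss",
        "embeddings": base / "embeddings.json",
    }
```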

Git Storage Mechanics

EFS-backed git repos

Each wiki's bare git repo lives on a persistent filesystem mounted by the compute layer. No clone/push cycle, no caching, no application-level locks: git operations happen directly on disk.

Read path:

1. Lambda mounts EFS (already attached in VPC)
2. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
3. Read page from repo

Write path:

1. Open bare repo at /mnt/efs/{user}/{wiki}/repo.git
2. Commit page change
3. Write reindex record to DynamoDB ReindexQueue table
   (triggers embedding Lambda via DynamoDB Streams — see Semantic Search section)

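Concurrency: NFS handles file-level locking natively, and git's own lock files (index.lock, plus the per-ref .lock files a bare repo relies on) work correctly on NFS. Concurrent reads are unlimited. Concurrent writes to the same repo are guarded by git's lock files; a writer that cannot take a lock may fail instead of waiting, so the write path should be ready to retry. No application-level locking needed.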

Consistency: Writes are immediately visible to all Lambda invocations mounting the same EFS filesystem. No eventual consistency concerns.

Fallback: S3 clone-on-demand

If Phase 0 testing shows EFS latency or VPC cold starts are unacceptable, fall back to S3-based repos with a DynamoDB write lock + clone-to-/tmp pattern. This adds significant complexity (locking, cache management, /tmp eviction) and is only worth pursuing if EFS fails testing.

Semantic Search

Semantic search is available to all users (not tier-gated). See Design/Async_Embedding_Pipeline for the full architecture.

Embedding pipeline (summary)

Page write (wiki Lambda, VPC)
  → DynamoDB write to ReindexQueue table (free gateway endpoint, already deployed)
  → DynamoDB Streams captures the change
  → Lambda service polls the stream (outside the function's VPC context)
  → Embedding Lambda (VPC, EFS mount):
      1. Read page content from EFS repo
      2. Chunk page (same algorithm as otterwiki-semantic-search)
      3. Embed chunks using all-MiniLM-L6-v2 (runs locally, no external API)
      4. Update FAISS index + sidecar metadata on EFS

No Bedrock, no SQS, no new VPC endpoints. Total fixed cost: $0.

FAISS details

FAISS (Facebook AI Similarity Search) is a C++ library with Python bindings for nearest-neighbor search over dense vectors.

Index type: IndexFlatIP (flat index, inner product similarity). For wikis under ~1000 pages, brute-force search is fast enough (<1ms) and requires no training or tuning. The index is just a matrix of vectors.

Index size: Each MiniLM vector is 384 floats × 4 bytes = 1.5KB. A 200-page wiki with ~3 chunks per page = 600 vectors = ~900KB index. Trivial to store on EFS and load into Lambda memory.
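
The arithmetic above can be checked with a few lines. This sketch counts raw vector payload only and ignores any file header; `index_size_bytes` is a hypothetical helper, and 3 chunks per page is the estimate from the paragraph above.

```python
EMBEDDING_DIM = 384        # all-MiniLM-L6-v2 output dimension
BYTES_PER_FLOAT32 = 4


def index_size_bytes(pages: int, chunks_per_page: int = 3) -> int:
    """Approximate size of a flat index: one float32 vector per chunk."""
    vectors = pages * chunks_per_page
    return vectors * EMBEDDING_DIM * BYTES_PER_FLOAT32


# 200 pages x 3 chunks = 600 vectors x 1536 bytes = 921600 bytes = 900 KiB
size_kib = index_size_bytes(200) / 1024
```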

Sidecar metadata: FAISS stores only vectors and returns integer indices. The embeddings.json sidecar maps index positions back to {page_path, chunk_index, chunk_text_preview}. This file is loaded alongside the FAISS index.

Search flow:

1. Embed query using MiniLM (loaded at Lambda init)
2. Load FAISS index + sidecar from EFS (~5ms, already mounted)
3. Search top K×3 vectors (<1ms)
4. Deduplicate by page_path, keep best chunk per page
5. Return top K results with page paths and matching chunk snippets
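
The over-fetch and deduplication steps can be sketched in plain Python. To stay dependency-free, a brute-force inner product stands in for the FAISS IndexFlatIP search, and the query is assumed to be embedded already; `search` and the result field names are hypothetical. The sidecar is modeled as a list whose position matches the vector position.

```python
def search(query_vec: list[float], vectors: list[list[float]],
           sidecar: list[dict], k: int) -> list[dict]:
    """Return up to k pages, each represented by its best-scoring chunk."""
    # Score every stored vector by inner product, as a flat IP index would.
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Over-fetch K x 3 chunks so that deduplication still leaves K pages.
    candidates = scored[: k * 3]

    best_per_page: dict[str, dict] = {}
    for score, i in candidates:
        meta = sidecar[i]
        path = meta["page_path"]
        # Candidates are in descending score order, so the first chunk
        # seen for a page is that page's best chunk.
        if path not in best_per_page:
            best_per_page[path] = {
                "page_path": path,
                "score": score,
                "snippet": meta["chunk_text_preview"],
            }
    return list(best_per_page.values())[:k]
```

Fetching K×3 rather than K matters because several top-scoring chunks often come from the same page; without the margin, deduplication could return fewer than K pages.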

Cost estimate

Embedding a 200-page wiki: effectively $0 (Lambda compute only, ~seconds)
Per search query: $0 (MiniLM runs locally)
Re-embedding on page edits: negligible (DynamoDB write + Lambda invocation)
VPC endpoints: $0 (uses existing DynamoDB gateway endpoint)

URL Structure

Each user gets a subdomain: {username}.wikibot.io

sderle.wikibot.io/                          → user's wiki list (dashboard)
sderle.wikibot.io/third-gulf-war/           → wiki web UI (free wiki, under user namespace)
sderle.wikibot.io/third-gulf-war/api/v1/    → wiki REST API
sderle.wikibot.io/third-gulf-war/mcp        → wiki MCP endpoint

Custom slugs (paid wikis)

Paid wikis get a top-level slug: {slug}.wikibot.io. This is a vanity URL that routes directly to the wiki without the username prefix. The slug is chosen at wiki creation time and must be globally unique (same validation rules as usernames: lowercase alphanumeric + hyphens, 3–30 characters, drawn from the same namespace/blocklist).

third-gulf-war.wikibot.io/                  → wiki web UI (paid wiki, top-level slug)
third-gulf-war.wikibot.io/api/v1/           → wiki REST API
third-gulf-war.wikibot.io/mcp               → wiki MCP endpoint

The user-namespace URL (sderle.wikibot.io/third-gulf-war/) continues to work as a redirect. This means existing MCP connections and bookmarks survive if a free wiki is later upgraded to paid.

Implementation: the Lambda resolver checks the subdomain against the Wikis table's custom_slug GSI first, then falls back to username resolution.
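
The resolution order can be illustrated with dictionaries standing in for the custom_slug GSI and the username lookup. This is a sketch under assumptions: `resolve_subdomain`, the return shape, and the handling of multi-label hosts are all hypothetical.

```python
from typing import Optional


def resolve_subdomain(host: str, custom_slugs: dict[str, str],
                      usernames: dict[str, str],
                      base_domain: str = "wikibot.io") -> Optional[dict]:
    """Resolve {label}.wikibot.io to a wiki (custom slug) or a user."""
    suffix = "." + base_domain
    if not host.endswith(suffix):
        return None
    label = host[: -len(suffix)].lower()
    if not label or "." in label:
        return None  # only single-label subdomains are handled here

    # 1. Custom slugs win: paid wikis route straight to the wiki.
    if label in custom_slugs:
        return {"kind": "wiki", "wiki_id": custom_slugs[label]}
    # 2. Otherwise fall back to username resolution.
    if label in usernames:
        return {"kind": "user", "user_id": usernames[label]}
    return None
```

Since slugs and usernames are drawn from the same namespace, a label should never match both; checking custom slugs first only fixes the lookup order.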

Usernames

Each user chooses a username at signup (after OAuth). Usernames are URL-critical ({username}.wikibot.io) so they must be:

URL-safe: lowercase alphanumeric + hyphens, 3–30 characters, no leading/trailing hyphens
Unique: enforced in DynamoDB
Immutable (MVP): changing usernames means changing URLs, which breaks MCP connections, Git remotes, and bookmarks. Defer username changes (with redirect support) to a future iteration.
Reserved: block names that conflict with platform routes or look official: admin, www, api, auth, mcp, app, help, support, billing, status, blog, docs, robot, wiki, static, assets, null, undefined, etc. Maintain a blocklist.
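
The format and reserved-name rules above can be sketched as a single validator. The blocklist below contains only the names listed on this page and is not exhaustive; `validate_username` is a hypothetical name, and uniqueness is left to the datastore.

```python
import re

# 3-30 characters, lowercase alphanumeric plus hyphens,
# with no leading or trailing hyphen.
USERNAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,28}[a-z0-9]")

# Names taken from the list above; the real blocklist would be longer.
RESERVED = frozenset({
    "admin", "www", "api", "auth", "mcp", "app", "help", "support",
    "billing", "status", "blog", "docs", "robot", "wiki", "static",
    "assets", "null", "undefined",
})


def validate_username(name: str) -> bool:
    """Check format and reserved names; uniqueness is checked elsewhere."""
    if not USERNAME_RE.fullmatch(name):
        return False
    return name not in RESERVED
```

The same check would apply to custom slugs, since they share the username validation rules and blocklist.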

Username squatting

Free accounts cost nothing to create, so squatting is possible. Mitigations:

Require at least one wiki with at least one page edit within 90 days of signup, or the username is released
Trademark disputes handled case-by-case (standard UDRP-like process, documented in ToS)
Not a launch concern — address when it becomes a real problem