---
category: reference
tags: [tasks, semantic-search, chromadb, multi-tenant]
last_updated: 2026-03-15
confidence: high
---

# Semantic Search Multi-Tenant Fix

## Problem

The otterwiki-semantic-search plugin is single-tenant. On the robot.wtf VPS:

1. **One shared ChromaDB collection** (`otterwiki_pages`) for all wikis. Page paths have no wiki slug prefix, so two wikis that each have a page named "Home" would collide.

2. **One sync thread** started at boot, tied to the default wiki's storage object. `TenantResolver._swap_storage()` patches `_state["storage"]` per request, but the sync thread holds a reference to the original storage and never sees other wikis.

3. **`reindex_all` wipes everything.** It calls `backend.reset()`, which drops and recreates the entire shared collection. Reindexing one wiki destroys every other wiki's index.

4. **No auto-index for new wikis.** Nothing triggers indexing when a wiki is first accessed. The `page_saved` hook catches future saves but does not back-fill existing pages.

5. **Embedding model download.** ChromaDB's default ONNX MiniLM embedding function downloads its model on first use, so the `robot` service user needs a writable cache directory (`HOME=/srv`). Even with that in place, embedding fails silently and no documents are indexed.

## Immediate status

- The ChromaDB server is running on port 8004
- numpy is importable (pinned <2.4.0)
- The plugin initializes and connects to ChromaDB
- But zero documents are indexed for the dev wiki
- A manual `reindex_all` run from Python reports "complete", but the collection count stays at 0
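Because `reindex_all` reports success while the count stays at 0, a post-reindex check that fails loudly may help. This is a minimal sketch, not part of the plugin: `verify_reindex` and `check_live_collection` are hypothetical helpers, and the host name `localhost` is an assumption (only port 8004 and the collection name come from this page).

```python
def verify_reindex(collection, expected_pages):
    """Raise if the collection holds fewer documents than expected.

    `collection` is anything with a ChromaDB-style `count()` method.
    """
    actual = collection.count()
    if actual < expected_pages:
        raise RuntimeError(
            f"reindex reported success but only {actual} of "
            f"{expected_pages} pages are indexed"
        )
    return actual


def check_live_collection(expected_pages, name="otterwiki_pages"):
    # Assumes the chromadb client package is installed and the server is
    # reachable on localhost:8004 (the host is an assumption).
    import chromadb

    client = chromadb.HttpClient(host="localhost", port=8004)
    return verify_reindex(client.get_collection(name), expected_pages)
```

Calling `check_live_collection(n)` right after a reindex, with `n` set to the number of pages in the wiki, would turn the silent zero into an exception.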

## Options

### Option A: Fix ChromaDB multi-tenant (more work)
- Per-wiki collections: `otterwiki_pages_{slug}`
- Per-wiki sync state: `chroma_sync_state_{slug}.json`
- `reindex_all` scoped to one collection
- Sync thread needs per-wiki awareness or one thread per wiki
- Requires changes to the otterwiki-semantic-search plugin
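The naming scheme and a scoped reset for Option A could look roughly like this. All helper names are hypothetical, and the 63-character limit and allowed character set are assumptions about ChromaDB's collection-name rules that should be checked against the installed version. Note that sanitizing or truncating slugs can map two different slugs to the same name, so a real implementation would need to guard against that.

```python
import re
from pathlib import Path

COLLECTION_PREFIX = "otterwiki_pages_"
MAX_NAME_LENGTH = 63  # assumed limit; verify for the installed ChromaDB


def safe_slug(slug):
    """Reduce a wiki slug to lowercase letters, digits, '_' and '-'."""
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", slug.lower()).strip("-_")
    if not cleaned:
        raise ValueError(f"unusable wiki slug: {slug!r}")
    return cleaned


def collection_name(slug):
    name = COLLECTION_PREFIX + safe_slug(slug)
    # Trim to the limit and strip any separator left at the cut point.
    return name[:MAX_NAME_LENGTH].rstrip("-_")


def sync_state_path(data_dir, slug):
    return Path(data_dir) / f"chroma_sync_state_{safe_slug(slug)}.json"


def reset_wiki_index(client, slug):
    """Drop and recreate one wiki's collection, leaving the others intact."""
    name = collection_name(slug)
    try:
        client.delete_collection(name)
    except Exception:
        # Collection did not exist yet; the exact exception type varies
        # between ChromaDB versions, hence the broad catch in this sketch.
        pass
    return client.get_or_create_collection(name)
```

With per-wiki collections, page IDs no longer need a slug prefix, and a reindex of one wiki only touches its own collection.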

### Option B: Switch back to FAISS (different tradeoffs)
- FAISS indexes are per-directory, which gives natural per-wiki isolation
- Local MiniLM embedding (no model download issue; the model is bundled)
- The wikibot.io Lambda deployment already used FAISS + MiniLM
- The otterwiki-semantic-search plugin already has a FAISS backend (`VECTOR_BACKEND=faiss`)
- But FAISS needs explicit index management (build, save, load)
- And the Lambda deployment used Bedrock for embedding, so local MiniLM needs the model on disk
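The "explicit index management" that FAISS requires amounts to a build, save, and load cycle per wiki directory. The sketch below shows that cycle in general terms; it is not the plugin's existing FAISS backend. The file name `semantic.faiss`, the helper names, and the choice of a flat inner-product index are assumptions.

```python
from pathlib import Path


def index_path(wiki_dir):
    """Hypothetical layout: one index file inside each wiki's directory."""
    return Path(wiki_dir) / "semantic.faiss"


def build_and_save(wiki_dir, vectors):
    # Lazy imports: faiss and numpy are only needed when an index is built.
    import faiss
    import numpy as np

    # FAISS expects a contiguous float32 matrix of shape (n_pages, dim).
    matrix = np.ascontiguousarray(vectors, dtype="float32")
    index = faiss.IndexFlatIP(matrix.shape[1])  # exact inner-product search
    index.add(matrix)
    faiss.write_index(index, str(index_path(wiki_dir)))
    return index


def load_or_none(wiki_dir):
    path = index_path(wiki_dir)
    if not path.exists():
        return None  # caller decides whether to rebuild
    import faiss

    return faiss.read_index(str(path))
```

A flat index stores only vectors, so the mapping from row position to page path would have to be saved alongside it, for example in a small JSON file in the same directory.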

### Option C: Hybrid (ChromaDB with explicit embedding function)
- Use ChromaDB but provide our own embedding function (MiniLM loaded locally) instead of relying on ChromaDB's default ONNX embedding
- Solves the model download issue
- Still needs multi-tenant collection fix
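An explicit embedding function for Option C might be sketched as follows. ChromaDB calls an embedding function with a list of texts (parameter name `input`) and expects one vector per text. `LocalEmbeddingFunction` and `local_minilm` are hypothetical names, the use of sentence-transformers is an assumption, and the model directory is whatever path holds a MiniLM model that was downloaded ahead of time.

```python
class LocalEmbeddingFunction:
    """Callable in the shape ChromaDB expects: texts in, vectors out."""

    def __init__(self, encode):
        # `encode` maps a list of strings to a sequence of vectors, e.g. the
        # `encode` method of a sentence-transformers model loaded from disk.
        self._encode = encode

    def __call__(self, input):
        texts = list(input)
        vectors = [[float(x) for x in vec] for vec in self._encode(texts)]
        # Fail loudly instead of indexing nothing.
        if len(vectors) != len(texts) or any(not v for v in vectors):
            raise RuntimeError(
                f"embedding returned {len(vectors)} usable vectors "
                f"for {len(texts)} texts"
            )
        return vectors


def local_minilm(model_dir):
    # Assumes sentence-transformers is installed and `model_dir` already
    # contains the model, so nothing is downloaded at runtime.
    from sentence_transformers import SentenceTransformer

    return LocalEmbeddingFunction(SentenceTransformer(model_dir).encode)
```

The resulting object can be passed as `embedding_function=` when the collection is created, or called directly with its output handed to the collection as `embeddings=`. Which route works best depends on the installed ChromaDB version, since newer releases may expect custom functions to subclass the library's own `EmbeddingFunction` type.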

## Decision needed

We need to choose an approach. The answer depends on:
- Is the ChromaDB embedding function the only reason reindex produces 0 results? (Debug this first)
- Is per-wiki FAISS simpler than per-wiki ChromaDB collections?
- Do we want to maintain two backends or pick one?
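For the first question, one way to separate an embedding failure from a write failure is to run the two stages independently on a single probe document. This is a sketch under assumptions: `diagnose` and the probe ID are hypothetical, `embed` is any callable that maps a list of texts to vectors, and `collection` is a ChromaDB-style collection exposing `count`, `upsert`, and `delete`.

```python
def diagnose(embed, collection, probe_id="diag-probe", text="hello wiki"):
    """Report which stage fails: embedding, or writing to the collection."""
    report = {}

    # Stage 1: embedding only, with no collection involved.
    try:
        vector = [float(x) for x in embed([text])[0]]
        report["embedding_dim"] = len(vector)
    except Exception as exc:
        report["embedding_error"] = repr(exc)
        return report

    # Stage 2: write one document with a precomputed vector, then check
    # whether the count moved.
    before = collection.count()
    collection.upsert(ids=[probe_id], documents=[text], embeddings=[vector])
    report["count_delta"] = collection.count() - before

    # Remove the probe so the index is left as it was found.
    collection.delete(ids=[probe_id])
    return report
```

An `embedding_error` entry points at the model or cache directory, while a `count_delta` of 0 with a valid `embedding_dim` points at the write path instead. The probe briefly writes to the live collection, so it is best run against the dev wiki.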

## Related
- [[Dev/Proxmox_CPU_Type]]: numpy X86_V2 issue (workaround in place)
- [[Design/Async_Embedding_Pipeline]]: original FAISS + MiniLM design (AWS, archived)