Properties
category: reference
tags: [tasks, semantic-search, chromadb, multi-tenant]
last_updated: 2026-03-15
confidence: high

Semantic Search Multi-Tenant Fix

Problem

The otterwiki-semantic-search plugin is single-tenant. On the robot.wtf VPS:

  1. One shared ChromaDB collection (otterwiki_pages) for all wikis. Page paths have no wiki slug prefix — two wikis with a page named "Home" would collide.

  2. One sync thread started at boot, tied to the default wiki's storage object. TenantResolver._swap_storage() patches _state["storage"] per request, but the sync thread holds a reference to the original storage and never sees other wikis.

  3. reindex_all wipes everything — it calls backend.reset() which drops and recreates the entire shared collection. Reindexing one wiki destroys the other's index.

  4. No auto-index for new wikis — there's no trigger when a wiki is first accessed. The page_saved hook catches future saves, but won't back-fill existing pages.

  5. Embedding model download — ChromaDB's default ONNX MiniLM embedding function needs to download the model on first use. The robot service user needs a writable cache directory (HOME=/srv). Even with this, the embedding silently fails and produces zero indexed documents.

Immediate status

  • ChromaDB server is running on port 8004
  • numpy is importable (pinned <2.4.0)
  • The plugin initializes and connects to ChromaDB
  • But zero documents are indexed for the dev wiki
  • Attempted manual reindex_all via Python — says "complete" but collection count stays 0
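Rather than trusting the reindex's "complete" message, the collection count can be checked directly against the server. A minimal probe, assuming the server on localhost:8004 and the shared collection name otterwiki_pages from above:

```python
def diagnose(count: int) -> str:
    """Pure helper: turn a raw document count into a one-line status."""
    if count > 0:
        return f"ok: {count} documents indexed"
    return "empty: reindex reported success but nothing was stored"

def collection_count(host: str = "localhost", port: int = 8004,
                     name: str = "otterwiki_pages") -> int:
    """Ask the ChromaDB server how many documents the collection holds."""
    import chromadb  # third-party; imported lazily so diagnose() stands alone
    client = chromadb.HttpClient(host=host, port=port)
    return client.get_or_create_collection(name).count()

# Usage: print(diagnose(collection_count()))
```

If this also reports 0 right after a reindex, the documents are being dropped before they reach the server, which points at the embedding step.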

Options

Option A: Fix ChromaDB multi-tenant (more work)

  • Per-wiki collections: otterwiki_pages_{slug}
  • Per-wiki sync state: chroma_sync_state_{slug}.json
  • reindex_all scoped to one collection
  • Sync thread needs per-wiki awareness or one thread per wiki
  • Changes to: otterwiki-semantic-search plugin
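A rough shape for Option A. The helper names and the pages dict are assumptions for illustration, not the plugin's actual API; the point is that reset and reindex only ever touch one wiki's collection:

```python
def collection_name(slug: str) -> str:
    # One collection per wiki; assumes slug is already identifier-safe
    return f"otterwiki_pages_{slug}"

def sync_state_path(slug: str) -> str:
    # Matching per-wiki sync state file
    return f"chroma_sync_state_{slug}.json"

def reindex_wiki(client, slug: str, pages: dict) -> int:
    """Rebuild only this wiki's collection; other wikis are untouched."""
    name = collection_name(slug)
    try:
        # Scoped reset: drop just this collection, never backend.reset()
        client.delete_collection(name)
    except Exception:
        pass  # collection may not exist yet
    coll = client.create_collection(name)
    for path, text in pages.items():
        coll.add(ids=[path], documents=[text])
    return coll.count()
```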

Option B: Switch back to FAISS (different tradeoffs)

  • FAISS indexes are per-directory — natural per-wiki isolation
  • Local MiniLM embedding (no model download issue — bundled)
  • The wikibot.io Lambda deployment already used FAISS + MiniLM
  • The otterwiki-semantic-search plugin already has a FAISS backend (VECTOR_BACKEND=faiss)
  • But FAISS needs explicit index management (build, save, load)
  • And the Lambda deployment used Bedrock for embedding — local MiniLM needs the model on disk
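The explicit index management in Option B looks roughly like this. The on-disk layout, embed step via sentence-transformers, and function names are assumptions, not the plugin's FAISS backend:

```python
import os

def index_path(data_dir: str, slug: str) -> str:
    # Per-wiki index file under that wiki's data directory:
    # directory-per-wiki is what gives FAISS its natural isolation
    return os.path.join(data_dir, slug, "semantic.faiss")

def build_index(texts, model_name: str = "all-MiniLM-L6-v2"):
    """Build a flat inner-product index over normalized MiniLM embeddings."""
    import faiss  # third-party; lazy so index_path() stands alone
    from sentence_transformers import SentenceTransformer  # third-party
    model = SentenceTransformer(model_name)  # needs the model on disk
    vecs = model.encode(texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])  # cosine via normalized IP
    index.add(vecs)
    return index

def save_index(index, path: str) -> None:
    import faiss
    os.makedirs(os.path.dirname(path), exist_ok=True)
    faiss.write_index(index, path)
```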

Option C: Hybrid — ChromaDB with explicit embedding function

  • Use ChromaDB but provide our own embedding function (MiniLM loaded locally) instead of relying on ChromaDB's default ONNX embedding
  • Solves the model download issue
  • Still needs multi-tenant collection fix
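A sketch of Option C, combined with the per-wiki collection naming from Option A. SentenceTransformerEmbeddingFunction is chromadb's wrapper around sentence-transformers, which sidesteps the default ONNX download path; the slug handling is an assumption:

```python
def collection_for(slug: str) -> str:
    # Per-wiki collection name (assumes slug is identifier-safe)
    return f"otterwiki_pages_{slug}"

def make_collection(client, slug: str):
    """Per-wiki collection with an explicit local embedding function."""
    # Third-party import kept local; this replaces ChromaDB's default
    # ONNX MiniLM embedding with a sentence-transformers-backed one
    from chromadb.utils import embedding_functions
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2")
    return client.get_or_create_collection(
        name=collection_for(slug), embedding_function=ef)
```

The model still has to be loadable from disk by the robot service user, so the writable-cache requirement from the Problem section applies here too.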

Decision needed

Which approach to take depends on:

  • Is the ChromaDB embedding function the only reason reindex produces 0 results? (Debug this first)
  • Is per-wiki FAISS simpler than per-wiki ChromaDB collections?
  • Do we want to maintain two backends or pick one?
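For the first bullet, the default embedding function can be exercised directly: if the ONNX model download or cache path is the problem, calling it outside the sync thread should surface the exception instead of failing silently. A minimal probe, assuming chromadb's DefaultEmbeddingFunction (the ONNX MiniLM wrapper the plugin currently relies on):

```python
def report(ok: bool, detail: str) -> str:
    # Pure helper: one-line verdict for the embedding probe
    return ("embedding works: " if ok else "embedding broken: ") + detail

def probe_default_embedding() -> str:
    """Run the default embedder on one string and report success or the error."""
    try:
        from chromadb.utils.embedding_functions import DefaultEmbeddingFunction
        vecs = DefaultEmbeddingFunction()(["hello world"])  # triggers model load
        return report(True, f"{len(vecs[0])}-dim vector")
    except Exception as exc:  # surface whatever the sync thread swallows
        return report(False, repr(exc))
```

Run as the robot service user (with the same HOME=/srv environment) so the probe sees the same cache directory the broken sync does.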