---
category: reference
tags: [tasks, semantic-search, architecture]
last_updated: 2026-03-15
confidence: high
---

# Semantic Search Architecture Issues

## Current state
FAISS + ONNX MiniLM embeddings, running in-process in the gunicorn worker. Works for single-tenant; 65 pages indexed for the dev wiki.

## Issues to address

### 1. Multi-tenant indexing (blocking)
The sync thread watches one wiki (whichever storage was set at startup). TenantResolver swaps storage per-request, but the sync thread holds the original reference. Each wiki needs its own FAISS index directory and its own sync state. The reindex_all function also wipes and rebuilds the entire shared index.

**Needed:** Per-wiki FAISS directories (`/srv/data/faiss/{slug}/`), per-wiki sync state, and a sync thread that iterates over all wikis (or one thread per wiki).

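A minimal sketch of the per-wiki layout and a single sync thread that iterates all wikis instead of holding one storage reference. Everything except the `/srv/data/faiss/{slug}/` layout is an assumption: the file names (`index.faiss`, `sync_state.json`) and the `list_wikis`/`sync_one` callables are illustrative, not the existing code.

```python
import threading
from pathlib import Path

FAISS_ROOT = Path("/srv/data/faiss")  # per-wiki layout: /srv/data/faiss/{slug}/

def wiki_paths(slug: str) -> dict[str, Path]:
    """Per-wiki index directory, binary index, sidecar, and sync state.
    File names are assumptions; only the directory layout comes from the notes."""
    base = FAISS_ROOT / slug
    return {
        "dir": base,
        "index": base / "index.faiss",
        "sidecar": base / "embeddings.json",
        "sync_state": base / "sync_state.json",
    }

def sync_all(list_wikis, sync_one, stop: threading.Event, interval: float = 60.0) -> None:
    """One sync loop over every wiki, instead of a thread bound to whichever
    storage was active at startup. list_wikis() yields slugs; sync_one(slug)
    syncs that wiki against its own directory and its own state."""
    while not stop.wait(interval):
        for slug in list_wikis():
            sync_one(slug)
```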
### 2. In-process embedding risks
The ONNX model (~80MB) loads in the gunicorn worker. The sync thread is a daemon thread, so it is killed without cleanup on SIGTERM. If killed mid-write to the FAISS index, the index could corrupt (recoverable by a full reindex on next start, but that's slow).

**Options:**
- Separate embedding worker process (like ChromaDB was, but lighter)
- Queue-based: page saves write to a queue (a reindex_queue table is already in the SQLite schema); a worker process reads and embeds
- Graceful shutdown handler in the sync thread

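One way the graceful-shutdown option could look: a non-daemon sync thread driven by a `threading.Event` that a SIGTERM handler sets, so an in-flight FAISS write finishes its pass before the process exits. `sync_once` stands in for the existing sync logic; this is a sketch, not the current implementation.

```python
import signal
import threading

stop = threading.Event()

def _handle_sigterm(signum, frame):
    # Ask the loop to finish its current pass and exit; no mid-write kill.
    stop.set()

def sync_loop(sync_once, interval: float = 60.0) -> None:
    # Event.wait doubles as the poll sleep and wakes immediately on stop.set().
    while not stop.wait(interval):
        sync_once()  # all FAISS index writes happen inside one pass

def start_sync(sync_once, interval: float = 60.0) -> threading.Thread:
    signal.signal(signal.SIGTERM, _handle_sigterm)
    t = threading.Thread(target=sync_loop, args=(sync_once, interval), daemon=False)
    t.start()  # non-daemon: the interpreter waits for the loop to return
    return t
```

Because the thread is non-daemon, gunicorn's worker process won't exit until `sync_loop` returns, which only happens between passes.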
### 3. Sync frequency
Currently every 60 seconds, by polling the git HEAD SHA. For a multi-tenant setup with many wikis, polling every wiki every 60 seconds doesn't scale. A queue (the reindex_queue table, populated by a page_saved hook) would be more efficient.

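A sketch of that queue flow: the page_saved hook enqueues a row and a worker drains the table, so nothing polls git per wiki. The reindex_queue table exists per the schema note above, but the column names used here (`wiki_slug`, `page_path`) are assumptions.

```python
import sqlite3

# Assumed shape of the existing reindex_queue table; real columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reindex_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    wiki_slug TEXT NOT NULL,
    page_path TEXT NOT NULL
)
"""

def on_page_saved(db: sqlite3.Connection, slug: str, path: str) -> None:
    """Called from the page_saved hook: enqueue instead of reindexing inline."""
    with db:
        db.execute(
            "INSERT INTO reindex_queue (wiki_slug, page_path) VALUES (?, ?)",
            (slug, path),
        )

def drain(db: sqlite3.Connection, embed_page) -> int:
    """Worker side: embed each queued page in order; returns pages processed."""
    rows = db.execute(
        "SELECT id, wiki_slug, page_path FROM reindex_queue ORDER BY id"
    ).fetchall()
    for row_id, slug, path in rows:
        embed_page(slug, path)  # delete only after a successful embed
        with db:
            db.execute("DELETE FROM reindex_queue WHERE id = ?", (row_id,))
    return len(rows)
```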
### 4. FAISS sidecar scalability
The FAISS backend stores all chunk metadata in a JSON sidecar file (`embeddings.json`) alongside the binary index. The sidecar is loaded fully into memory on startup and re-serialized on every upsert/delete. With Semantic Search V2, new metadata fields (`section`, `section_path`, `page_word_count`, `total_chunks`) add ~160 bytes per chunk, roughly doubling the sidecar size (~140 → ~300 bytes/chunk).

**Investigate:**
- At what corpus size does sidecar I/O become a bottleneck? (Estimated threshold: ~10K chunks / ~3MB sidecar)
- For multi-tenant deployments where each wiki loads its own sidecar at startup, what is the aggregate memory and startup-time cost?
- Should chunk text be stored in the sidecar at all? It duplicates the embedded data; removing it would cut sidecar size significantly.
- Alternative: move metadata to SQLite (already part of the stack; the schema has a reindex_queue table) for indexed access instead of full-file load/save.

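The SQLite alternative could be sketched like this, assuming a hypothetical `chunk_meta` table keyed by chunk id: an upsert or delete touches individual rows, so a page save no longer re-serializes the whole sidecar. Table and column names are illustrative, not the existing schema.

```python
import sqlite3

# Hypothetical replacement for the embeddings.json sidecar: one row per chunk,
# carrying the V2 metadata fields named in the notes above.
META_SCHEMA = """
CREATE TABLE IF NOT EXISTS chunk_meta (
    chunk_id TEXT PRIMARY KEY,   -- maps to the FAISS vector id
    page_path TEXT NOT NULL,
    section TEXT,
    section_path TEXT,
    page_word_count INTEGER,
    total_chunks INTEGER
)
"""

def upsert_chunk(db: sqlite3.Connection, chunk_id: str, page_path: str, meta: dict) -> None:
    """Write one chunk's metadata; no full-file load/save involved."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO chunk_meta VALUES (?, ?, ?, ?, ?, ?)",
            (chunk_id, page_path, meta.get("section"), meta.get("section_path"),
             meta.get("page_word_count"), meta.get("total_chunks")),
        )

def delete_page(db: sqlite3.Connection, page_path: str) -> None:
    """Drop all chunk rows for a deleted or re-chunked page."""
    with db:
        db.execute("DELETE FROM chunk_meta WHERE page_path = ?", (page_path,))
```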
## Not blocking launch

Semantic search works for the dev wiki. Multi-tenant indexing is needed before opening to users with multiple wikis. The in-process risks, sync frequency, and sidecar scalability are optimization concerns for later.