---
category: reference
tags: [tasks, semantic-search, architecture]
last_updated: 2026-03-15
confidence: high
---

# Semantic Search Architecture Issues

## Current state
FAISS index with ONNX MiniLM embeddings, running in-process in the gunicorn worker. Works for a single tenant. The dev wiki has 65 pages indexed.

## Issues to address

### 1. Multi-tenant indexing (blocking)
The sync thread watches one wiki (whichever storage was set at startup). TenantResolver swaps storage per request, but the sync thread holds the original reference, so it never sees the other wikis. Each wiki needs its own FAISS index directory and its own sync state. The `reindex_all` function also wipes and rebuilds the entire shared index rather than a single wiki's portion.

**Needed:** Per-wiki FAISS directories (`/srv/data/faiss/{slug}/`), per-wiki sync state, and either a single sync thread that iterates over all wikis or one thread per wiki.
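
One way the per-wiki layout could look is sketched below. It is only an illustration: `WikiSyncState`, `SyncRegistry`, `index_dir_for`, and the `last_indexed_sha` field are hypothetical names, and only the `/srv/data/faiss/{slug}/` pattern comes from this page.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterator, Optional

FAISS_ROOT = Path("/srv/data/faiss")  # root of the {slug}/ pattern above


@dataclass
class WikiSyncState:
    """Per-wiki sync bookkeeping (hypothetical structure)."""
    slug: str
    index_dir: Path
    last_indexed_sha: Optional[str] = None


def index_dir_for(slug: str, root: Path = FAISS_ROOT) -> Path:
    """Return the FAISS directory for one wiki, rejecting unsafe slugs."""
    if not slug or "/" in slug or slug in {".", ".."}:
        raise ValueError(f"unsafe wiki slug: {slug!r}")
    return root / slug


class SyncRegistry:
    """Keeps one WikiSyncState per slug so wikis never share index or state."""

    def __init__(self, root: Path = FAISS_ROOT) -> None:
        self._root = root
        self._states: Dict[str, WikiSyncState] = {}

    def state_for(self, slug: str) -> WikiSyncState:
        if slug not in self._states:
            self._states[slug] = WikiSyncState(slug, index_dir_for(slug, self._root))
        return self._states[slug]

    def wikis_needing_sync(self, heads: Dict[str, str]) -> Iterator[str]:
        """Yield slugs whose current HEAD SHA differs from the last indexed one."""
        for slug, sha in heads.items():
            if self.state_for(slug).last_indexed_sha != sha:
                yield slug
```

A single sync thread could call `wikis_needing_sync` with the current HEAD SHA of every wiki and reindex only the slugs it yields. A thread-per-wiki design would hold one `WikiSyncState` per thread instead.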

### 2. In-process embedding risks
The ONNX model (~80 MB) loads in the gunicorn worker. The sync thread is a daemon thread, so it is killed without cleanup on SIGTERM. If it is killed mid-write, the FAISS index could be corrupted (a full reindex on the next start recovers it, but that is slow).

**Options:**
- Separate embedding worker process (like ChromaDB was, but lighter)
- Queue-based: page saves write to a queue (the SQLite `reindex_queue` table is already in the schema), and a worker process reads from it and embeds
- Graceful shutdown handler in the sync thread
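
The queue-based and graceful-shutdown options can be combined, as in the rough sketch below. The real `reindex_queue` columns are not shown on this page, so the schema here is an assumption, and `drain_queue` and `embed_page` are hypothetical names.

```python
import sqlite3
import threading
from typing import Callable

# Assumed schema; the actual reindex_queue columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reindex_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    wiki_slug TEXT NOT NULL,
    page_path TEXT NOT NULL,
    UNIQUE (wiki_slug, page_path)
)
"""


def drain_queue(
    conn: sqlite3.Connection,
    embed_page: Callable[[str, str], None],
    stop: threading.Event,
) -> int:
    """Process queued pages until the queue is empty or stop is set.

    The stop flag is checked only between items, so an embed/index write
    that has started always finishes before the worker exits.
    """
    processed = 0
    while not stop.is_set():
        row = conn.execute(
            "SELECT id, wiki_slug, page_path FROM reindex_queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break  # queue is empty
        row_id, slug, path = row
        embed_page(slug, path)  # embed the page and update that wiki's index
        conn.execute("DELETE FROM reindex_queue WHERE id = ?", (row_id,))
        conn.commit()  # dequeue only after the work succeeded
        processed += 1
    return processed
```

In a separate worker process, the main thread could register `signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())` so that SIGTERM ends the loop after the current item instead of interrupting a write.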

### 3. Sync frequency
Sync currently runs every 60 seconds by polling the git HEAD SHA. For a multi-tenant setup with many wikis, polling every wiki every 60 seconds doesn't scale. A queue (the `reindex_queue` table, filled by a `page_saved` hook) would be more efficient because work happens only when a page changes.
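
The producer side of such a queue might look like the sketch below. The hook signature, the column names, and the `UNIQUE` constraint are all assumptions made for illustration; the constraint is there so that repeated saves of one page collapse into a single pending item.

```python
import sqlite3

# Assumed schema; the actual reindex_queue columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reindex_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    wiki_slug TEXT NOT NULL,
    page_path TEXT NOT NULL,
    UNIQUE (wiki_slug, page_path)
)
"""


def on_page_saved(conn: sqlite3.Connection, wiki_slug: str, page_path: str) -> None:
    """Hypothetical page_saved hook body: mark one page as needing re-embedding.

    INSERT OR IGNORE makes this a no-op when the page is already queued.
    """
    conn.execute(
        "INSERT OR IGNORE INTO reindex_queue (wiki_slug, page_path) VALUES (?, ?)",
        (wiki_slug, page_path),
    )
    conn.commit()


def pending_count(conn: sqlite3.Connection) -> int:
    """Number of pages currently waiting to be re-embedded."""
    return conn.execute("SELECT COUNT(*) FROM reindex_queue").fetchone()[0]
```

With this shape, idle wikis cost nothing: no row is written until a page is saved, whereas polling touches every wiki on every interval.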

### 4. FAISS sidecar scalability
The FAISS backend stores all chunk metadata in a JSON sidecar file (`embeddings.json`) alongside the binary index. The sidecar is loaded fully into memory on startup and re-serialized on every upsert/delete. With Semantic Search V2, new metadata fields (`section`, `section_path`, `page_word_count`, `total_chunks`) add ~160 bytes per chunk, roughly doubling the sidecar size (~140 → ~300 bytes/chunk).

**Investigate:**
- At what corpus size does sidecar I/O become a bottleneck? (Estimated threshold: ~10K chunks / ~3 MB sidecar)
- For multi-tenant with many wikis, each loading its own sidecar at startup, what is the aggregate memory and startup time cost?
- Should chunk text be stored in the sidecar at all? (It duplicates the text that was embedded; removing it would cut sidecar size significantly)
- Alternative: move metadata to SQLite (already in use; its schema includes the `reindex_queue` table) for indexed access instead of full-file load/save
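
To make the SQLite alternative concrete, here is a minimal sketch of a metadata table that is updated row by row instead of rewriting the whole sidecar. The table name `chunk_meta`, the `chunk_id` and `page_path` columns, and the helper functions are hypothetical; only the four V2 field names come from this page.

```python
import sqlite3
from typing import Optional

# Hypothetical table; chunk_id is assumed to match the FAISS vector id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunk_meta (
    chunk_id        INTEGER PRIMARY KEY,
    page_path       TEXT NOT NULL,
    section         TEXT,
    section_path    TEXT,
    page_word_count INTEGER,
    total_chunks    INTEGER
);
CREATE INDEX IF NOT EXISTS idx_chunk_meta_page ON chunk_meta (page_path);
"""


def open_meta_db(path: str) -> sqlite3.Connection:
    """Open (or create) the metadata database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_chunk(
    conn: sqlite3.Connection,
    chunk_id: int,
    page_path: str,
    section: Optional[str],
    section_path: Optional[str],
    page_word_count: int,
    total_chunks: int,
) -> None:
    """Insert or overwrite one chunk's metadata without touching other rows."""
    conn.execute(
        "INSERT OR REPLACE INTO chunk_meta "
        "(chunk_id, page_path, section, section_path, page_word_count, total_chunks) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (chunk_id, page_path, section, section_path, page_word_count, total_chunks),
    )
    conn.commit()


def delete_page(conn: sqlite3.Connection, page_path: str) -> int:
    """Remove every chunk of one page via the page_path index; return rows removed."""
    cur = conn.execute("DELETE FROM chunk_meta WHERE page_path = ?", (page_path,))
    conn.commit()
    return cur.rowcount
```

Each upsert or delete then costs roughly one row write, and nothing has to be loaded into memory at startup. Whether that beats the JSON sidecar at the current corpus size is exactly what the investigation items above would need to measure.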

## Not blocking launch
Semantic search works for the dev wiki. Multi-tenant indexing is needed before opening to users with multiple wikis. The in-process risks, sync frequency, and sidecar scalability are optimization concerns for later.