Commit 3b4f8b

2026-03-16 18:13:15 Claude (MCP): [mcp] [tasks] Add FAISS sidecar scalability investigation item from V2 review findings
Tasks/Semantic_Search_Architecture.md ..
@@ 28,5 28,14 @@
### 3. Sync frequency
Currently every 60 seconds by polling git HEAD SHA. For a multi-tenant setup with many wikis, polling every wiki every 60 seconds doesn't scale. A queue (reindex_queue table triggered by page_saved hook) would be more efficient.
+ ### 4. FAISS sidecar scalability
+ The FAISS backend stores all chunk metadata in a JSON sidecar file (`embeddings.json`) alongside the binary index. The sidecar is loaded fully into memory on startup and re-serialized on every upsert/delete. With Semantic Search V2, new metadata fields (`section`, `section_path`, `page_word_count`, `total_chunks`) add ~160 bytes per chunk, roughly doubling the sidecar size (~140 → ~300 bytes/chunk).
+
+ **Investigate:**
+ - At what corpus size does sidecar I/O become a bottleneck? (Estimated threshold: ~10K chunks, i.e. a ~3 MB sidecar at ~300 bytes/chunk)
+ - For multi-tenant with many wikis, each loading its own sidecar at startup, what is the aggregate memory and startup time cost?
+ - Should chunk text be stored in the sidecar at all? (It duplicates embedded data; removing it would cut sidecar size significantly)
+ - Alternative: move metadata to SQLite (the schema already includes the reindex_queue table) for indexed access instead of a full-file load/save
+
## Not blocking launch
- Semantic search works for the dev wiki. Multi-tenant indexing is needed before opening to users with multiple wikis. The in-process risks and sync frequency are optimization concerns for later.
+ Semantic search works for the dev wiki. Multi-tenant indexing is needed before opening to users with multiple wikis. The in-process risks, sync frequency, and sidecar scalability are optimization concerns for later.
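
To make the SQLite alternative from item 4 concrete, here is a minimal sketch of what indexed metadata access could look like in place of rewriting `embeddings.json` on every upsert/delete. It is an illustration under assumptions, not the project's implementation: the table name `chunk_metadata`, the `chunk_id` and `page_path` columns, and all function names are hypothetical. Only the four V2 field names (`section`, `section_path`, `page_word_count`, `total_chunks`) come from the note above.

```python
import json
import sqlite3

# Hypothetical schema: only section, section_path, page_word_count and
# total_chunks are named in the task note; everything else is assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunk_metadata (
    chunk_id        TEXT PRIMARY KEY,
    page_path       TEXT NOT NULL,
    section         TEXT,
    section_path    TEXT,      -- JSON-encoded list of headings
    page_word_count INTEGER,
    total_chunks    INTEGER
);
CREATE INDEX IF NOT EXISTS idx_chunk_page ON chunk_metadata(page_path);
"""


def open_store(path=":memory:"):
    """Open (or create) the metadata database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_chunk(conn, chunk_id, page_path, section, section_path,
                 page_word_count, total_chunks):
    """Write a single row; no other chunk's metadata is touched."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO chunk_metadata VALUES (?, ?, ?, ?, ?, ?)",
            (chunk_id, page_path, section, json.dumps(section_path),
             page_word_count, total_chunks),
        )


def delete_page(conn, page_path):
    """Remove all chunks of one page via the page_path index; returns rows deleted."""
    with conn:
        cur = conn.execute(
            "DELETE FROM chunk_metadata WHERE page_path = ?", (page_path,)
        )
    return cur.rowcount


def get_chunk(conn, chunk_id):
    """Fetch one chunk's metadata by primary key, or None if absent."""
    row = conn.execute(
        "SELECT page_path, section, section_path, page_word_count, total_chunks "
        "FROM chunk_metadata WHERE chunk_id = ?",
        (chunk_id,),
    ).fetchone()
    if row is None:
        return None
    return {
        "page_path": row[0],
        "section": row[1],
        "section_path": json.loads(row[2]),
        "page_word_count": row[3],
        "total_chunks": row[4],
    }
```

With this shape, an upsert or delete costs one indexed write rather than a full re-serialization, and startup does not need to load every chunk's metadata into memory. Whether that outweighs the added complexity at the estimated ~10K-chunk threshold is exactly what the investigation would need to measure.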