Commit 44df18

2026-03-13 01:50:04 Claude (Dev): [mcp] Port original PRD semantic search section to wiki
/dev/null .. design/original prd semantic search.md
@@ 0,0 1,102 @@
+ ---
+ category: reference
+ tags: [meta, design, prd, semantic-search]
+ last_updated: 2026-03-12
+ confidence: high
+ ---
+
+ # Original PRD Semantic Search
+
+ > This page is part of the original single-tenant PRD, split across five wiki pages:
+ > [[Design/Original PRD Overview]] | [[Design/Original PRD API]] | [[Design/Original PRD Semantic Search]] | [[Design/Original PRD MCP]] | [[Design/Original PRD Note Schema]]
+
+ ---
+
+ ## Component 2: Chroma Semantic Search Plugin
+
+ ### Goal
+
+ Maintain a vector index of all wiki pages in ChromaDB, enabling semantic/similarity search. When a page is created, updated, or deleted, the index is updated automatically.
+
+ ### Implementation approach
+
+ This should hook into Otterwiki's page save/delete lifecycle. As with the API plugin, investigate whether the plugin hook system supports `after_save` / `after_delete` style hooks.
+
+ - **If hooks exist for page lifecycle events:** Build as a plugin.
+ - **If not:** Add hook calls into the `Page.save()` and `Page.delete()` methods in `wiki.py`, and build the Chroma logic as a plugin that registers for those hooks. Alternatively, have the API plugin handle indexing as a side effect of PUT/DELETE operations, and add a `/api/v1/reindex` endpoint for bulk rebuild.
+
+ ### ChromaDB configuration
+
+ - Collection name: `otterwiki_pages`
+ - Embedding: Use Chroma's default `all-MiniLM-L6-v2` sentence-transformer (runs locally, no API key needed, small footprint). Note: verify the exact max sequence length at implementation time; even if it is 256 tokens, the ~200-token chunks produced below stay within it.
+ - Metadata stored per chunk: `{page_path, page_name, category, tags, last_updated, chunk_index}` — extracted from YAML frontmatter. The `page_path` and `chunk_index` fields are used for deduplication and reassembly.
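+
+ A minimal sketch of the collection setup, assuming Chroma's `PersistentClient` with an on-disk path under `/app-data` (the exact path is an assumption, not part of the spec):
+
+ ```python
+ import chromadb
+
+ # Assumed persistence location; anything under /app-data survives container restarts.
+ client = chromadb.PersistentClient(path="/app-data/chroma")
+
+ # Omitting embedding_function makes Chroma use its default all-MiniLM-L6-v2
+ # sentence-transformer, as specified above.
+ collection = client.get_or_create_collection(name="otterwiki_pages")
+ ```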
+
+ ### Chunking strategy
+
+ Pages are split into overlapping chunks for embedding. Each chunk is stored as a separate Chroma document. This ensures semantic search quality is independent of page length — a 300-word note and a 1500-word note are both fully indexed.
+
+ **Chunking algorithm:**
+
+ 1. Strip YAML frontmatter from content (metadata is stored separately, not embedded).
+ 2. Split on paragraph boundaries (double newline `\n\n`).
+ 3. Accumulate paragraphs into chunks of **~200 tokens** (~150 words). If a single paragraph exceeds 200 tokens, split it at sentence boundaries (`. ` followed by a capital letter or newline).
+ 4. Add **~50 tokens of overlap** between adjacent chunks — repeat the last 1–2 sentences of the previous chunk at the start of the next. This prevents concepts spanning a boundary from being lost.
+ 5. Assign each chunk an ID: `{page_path}::chunk_{index}` (e.g., `Trends/Iran Attrition Strategy::chunk_0`).
+
+ **Short pages:** If the entire page body (after frontmatter) is under 200 tokens, store it as a single chunk. No need to split.
+
+ **Example:** A 600-word page might produce about 5 chunks of ~150 words each, with ~35 words of overlap between adjacent chunks.
+
+ ```python
+ def chunk_page(content: str, target_tokens: int = 200, overlap_tokens: int = 50) -> list[str]:
+ """Split page content into overlapping chunks for embedding.
+
+ Args:
+ content: Page body text (frontmatter already stripped)
+ target_tokens: Approximate tokens per chunk (~0.75 words per token)
+ overlap_tokens: Approximate overlap between adjacent chunks
+
+ Returns:
+ List of chunk strings
+ """
+ # Implementation: split on paragraphs, accumulate to target size,
+ # carry overlap from previous chunk. Fall back to sentence splitting
+ # for oversized paragraphs.
+ ```
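+
+ A minimal sketch of how this function could be implemented, treating ~0.75 words per token as exact (no real tokenizer) and using a simple regex for the sentence-boundary fallback; both simplifications are assumptions:
+
+ ```python
+ import re
+
+
+ def chunk_page(content: str, target_tokens: int = 200, overlap_tokens: int = 50) -> list[str]:
+     """Sketch: approximates token counts by word counts (~0.75 words per token)."""
+     target_words = int(target_tokens * 0.75)
+     overlap_words = int(overlap_tokens * 0.75)
+
+     # Split on paragraph boundaries; break oversized paragraphs at sentence boundaries.
+     units: list[str] = []
+     for para in (p.strip() for p in content.split("\n\n") if p.strip()):
+         if len(para.split()) > target_words:
+             units.extend(s for s in re.split(r"(?<=\.)\s+(?=[A-Z])", para) if s)
+         else:
+             units.append(para)
+
+     chunks: list[str] = []
+     current: list[str] = []
+     for unit in units:
+         current.append(unit)
+         if sum(len(u.split()) for u in current) >= target_words:
+             chunks.append("\n\n".join(current))
+             # Carry the last sentences/paragraphs of this chunk into the next as overlap.
+             tail, words = [], 0
+             for u in reversed(current):
+                 tail.insert(0, u)
+                 words += len(u.split())
+                 if words >= overlap_words:
+                     break
+             current = tail
+
+     # Flush the remainder (this also covers short pages that never reach target size),
+     # unless it is only the overlap already contained in the last chunk.
+     remainder = "\n\n".join(current)
+     if remainder and (not chunks or not chunks[-1].endswith(remainder)):
+         chunks.append(remainder)
+     return chunks
+ ```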
+
+ ### Search result deduplication
+
+ Semantic search queries Chroma for the top `n * 3` chunks (to account for multiple chunks from the same page), then deduplicates by `page_path`, keeping the best-matching (lowest distance) chunk per page, and returns the top `n` unique pages.
+
+ The `snippet` in the search response is the **text of the best-matching chunk** for that page, truncated to ~150 characters. This means the snippet is contextually relevant to the query, not just the page's opening paragraph.
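+
+ A sketch of the deduplication step, assuming a `collection` handle to the `otterwiki_pages` collection; the function name `semantic_search` is illustrative:
+
+ ```python
+ def semantic_search(collection, query: str, n: int = 5) -> list[dict]:
+     """Query n*3 chunks, keep the best-matching chunk per page, return the top n pages."""
+     res = collection.query(
+         query_texts=[query],
+         n_results=n * 3,
+         include=["documents", "metadatas", "distances"],
+     )
+     best: dict[str, dict] = {}
+     for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
+         path = meta["page_path"]
+         if path not in best or dist < best[path]["distance"]:
+             best[path] = {
+                 "name": meta["page_name"],
+                 "path": path,
+                 "snippet": doc[:150],  # text of the best-matching chunk, truncated
+                 "distance": dist,
+             }
+     return sorted(best.values(), key=lambda r: r["distance"])[:n]
+ ```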
+
+ ### API endpoints (added to the REST API)
+
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | `GET` | `/api/v1/semantic-search?q=<query>&n=5` | Semantic similarity search. Returns top N results as `{name, path, snippet, distance}`. Results are deduplicated by page. |
+ | `POST` | `/api/v1/reindex` | Rebuild the entire Chroma index from the Git repo. Deletes all existing chunks and re-indexes all pages. For initial population and recovery. |
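+
+ A sketch of how these two routes might be wired up as a Flask blueprint; `otterwiki_semantic` is a hypothetical module exposing the helpers sketched on this page (`semantic_search`, `full_reindex`, `collection`):
+
+ ```python
+ from flask import Blueprint, jsonify, request
+
+ # Hypothetical module holding the collection handle and helpers sketched on this page.
+ from otterwiki_semantic import collection, full_reindex, semantic_search
+
+ bp = Blueprint("semantic", __name__, url_prefix="/api/v1")
+
+
+ @bp.get("/semantic-search")
+ def semantic_search_endpoint():
+     query = request.args.get("q", "")
+     n = request.args.get("n", default=5, type=int)
+     return jsonify(semantic_search(collection, query, n))
+
+
+ @bp.post("/reindex")
+ def reindex_endpoint():
+     pages = full_reindex()  # wipes the collection and re-chunks every page from Git
+     return jsonify({"status": "ok", "pages_indexed": pages})
+ ```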
+
+ ### Index maintenance
+
+ - On `PUT /api/v1/pages/<path>` (create/update): **delete all existing chunks** for that page path, then re-chunk and insert. This is simpler and safer than trying to diff chunks.
+ - On `DELETE /api/v1/pages/<path>`: delete all chunks for that page path.
+ - On page save via Otterwiki web UI: if hooks are available, also update Chroma. If not, run a periodic sync (see below).
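+
+ A sketch of the delete-then-reinsert update, reusing `chunk_page` and the `collection` handle from above; the helper names are illustrative:
+
+ ```python
+ def reindex_page(collection, page_path: str, page_name: str, body: str, frontmatter: dict) -> None:
+     """Replace a page's chunks: delete everything for page_path, then insert fresh chunks."""
+     collection.delete(where={"page_path": page_path})
+     chunks = chunk_page(body)
+     if not chunks:
+         return
+     collection.upsert(
+         ids=[f"{page_path}::chunk_{i}" for i in range(len(chunks))],
+         documents=chunks,
+         metadatas=[
+             {
+                 "page_path": page_path,
+                 "page_name": page_name,
+                 "category": frontmatter.get("category", ""),
+                 # Chroma metadata values must be scalars, so the tags list is joined.
+                 "tags": ", ".join(frontmatter.get("tags", [])),
+                 "last_updated": str(frontmatter.get("last_updated", "")),
+                 "chunk_index": i,
+             }
+             for i in range(len(chunks))
+         ],
+     )
+
+
+ def remove_page(collection, page_path: str) -> None:
+     """Handle DELETE: drop all chunks for the page."""
+     collection.delete(where={"page_path": page_path})
+ ```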
+
+ ### Fallback: periodic sync
+
+ If lifecycle hooks are unavailable or unreliable, implement a background sync that runs every 60 seconds:
+
+ 1. `git log --since=<last_sync_time> --name-only` to find changed files
+ 2. Re-index only those files in Chroma
+ 3. Update `last_sync_time`
+
+ **State persistence:** `last_sync_time` is stored in a small file at `/app-data/chroma_sync_state.json` containing `{"last_sync": "2026-03-09T14:22:00Z"}`. This persists across container restarts.
+
+ **First boot / missing state:** If the state file doesn't exist, or if the Chroma collection is empty, perform a full reindex of all pages. This is the same operation as `POST /api/v1/reindex`.
+
+ **Race condition mitigation:** If a page is saved via the web UI and queried via semantic search within the sync window (up to 60 seconds), the search may return stale results. This is acceptable — the full-text search endpoint (`/api/v1/search`) reads directly from Git and is always current. The MCP server can fall back to full-text search when recency matters.
+
+ **Implementation:** Use a background thread started on Flask app initialization (e.g., a re-armed `threading.Timer` or a daemon thread that sleeps between passes), NOT a cron job. This keeps everything in one process and avoids external dependencies.
+
+ This ensures edits made via the web UI are reflected in semantic search even without hooks.
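+
+ A sketch of the sync loop, using a plain daemon thread rather than a literal `threading.Timer`; `WIKI_REPO_DIR` is an assumption standing in for the wiki's Git checkout path, and `full_reindex` / `reindex_changed_file` are the hypothetical indexing helpers from the sketches above:
+
+ ```python
+ import json
+ import subprocess
+ import threading
+ import time
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ from otterwiki_semantic import full_reindex, reindex_changed_file  # hypothetical helpers
+
+ STATE_FILE = Path("/app-data/chroma_sync_state.json")
+ WIKI_REPO_DIR = Path("/app-data/repository")  # assumption: the wiki's Git checkout
+
+
+ def sync_pass() -> None:
+     """One pass: reindex pages changed in Git since the last recorded sync."""
+     if not STATE_FILE.exists():
+         # Per the spec, an empty Chroma collection should also trigger this path.
+         full_reindex()  # same operation as POST /api/v1/reindex
+     else:
+         last_sync = json.loads(STATE_FILE.read_text())["last_sync"]
+         out = subprocess.run(
+             ["git", "log", f"--since={last_sync}", "--name-only", "--pretty=format:"],
+             cwd=WIKI_REPO_DIR, capture_output=True, text=True, check=True,
+         ).stdout
+         for path in {line for line in out.splitlines() if line.endswith(".md")}:
+             reindex_changed_file(path)  # re-chunk, or delete if the file is gone
+     now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+     STATE_FILE.write_text(json.dumps({"last_sync": now}))
+
+
+ def start_background_sync(interval_seconds: int = 60) -> None:
+     """Start the recurring sync on a daemon thread during Flask app initialization."""
+     def loop() -> None:
+         while True:
+             sync_pass()
+             time.sleep(interval_seconds)
+
+     threading.Thread(target=loop, daemon=True, name="chroma-sync").start()
+ ```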