---
category: reference
tags: [meta, design, prd, semantic-search]
last_updated: 2026-03-12
confidence: high
---

# Original PRD Semantic Search

> This page is part of the original single-tenant PRD, split across five wiki pages:
> [[Design/Research_Wiki]] | [[Design/Rest Api]] | [[Design/Semantic_Search]] | [[Design/Mcp Server]] | [[Design/Note_Schema]]

---

## Component 2: Chroma Semantic Search Plugin

### Goal

Maintain a vector index of all wiki pages in ChromaDB, enabling semantic/similarity search. When a page is created, updated, or deleted, the index is updated automatically.

### Implementation approach

This should hook into Otterwiki's page save/delete lifecycle. Again, investigate whether the plugin hook system supports `after_save` / `after_delete` style hooks.

- **If hooks exist for page lifecycle events:** Build as a plugin.
- **If not:** Add hook calls into the `Page.save()` and `Page.delete()` methods in `wiki.py`, and build the Chroma logic as a plugin that registers for those hooks. Alternatively, have the API plugin handle indexing as a side effect of PUT/DELETE operations, and add a `/api/v1/reindex` endpoint for bulk rebuild.
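
If hook calls do need to be patched in, one low-touch option is a generic post-call wrapper rather than editing each method body. This is a sketch only; `Page`, `save`, and the callback wiring below stand in for Otterwiki's actual internals, which must be verified first:

```python
from functools import wraps

def add_post_hook(cls, method_name, callback):
    """Wrap cls.<method_name> so `callback` fires after each successful call.

    The callback receives the instance plus the original call's arguments,
    and only runs if the wrapped method did not raise.
    """
    original = getattr(cls, method_name)

    @wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        callback(self, *args, **kwargs)  # e.g. re-index this page in Chroma
        return result

    setattr(cls, method_name, wrapper)

# Hypothetical wiring -- the real class/module names need checking:
# add_post_hook(Page, "save", lambda page, *a, **kw: reindex_page(page))
# add_post_hook(Page, "delete", lambda page, *a, **kw: remove_page(page))
```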

### ChromaDB configuration

- Collection name: `otterwiki_pages`
- Embedding: Use Chroma's default `all-MiniLM-L6-v2` sentence-transformer (runs locally, no API key needed, small footprint). Note: verify exact max sequence length at implementation time — if it's 256 tokens, the chunking approach below handles it correctly regardless.
- Metadata stored per chunk: `{page_path, page_name, category, tags, last_updated, chunk_index}` — extracted from YAML frontmatter. The `page_path` and `chunk_index` fields are used for deduplication and reassembly.
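
Building that per-chunk record is mostly plumbing; one detail worth noting is that Chroma metadata values must be scalars (str/int/float/bool), so the `tags` list has to be flattened. A minimal sketch (the function name is illustrative):

```python
def chunk_metadata(frontmatter: dict, page_path: str, chunk_index: int) -> dict:
    """Build the per-chunk metadata record from parsed YAML frontmatter.

    Chroma metadata values must be scalar, so `tags` is flattened to a
    comma-separated string for storage.
    """
    return {
        "page_path": page_path,
        "page_name": page_path.rsplit("/", 1)[-1],
        "category": frontmatter.get("category", ""),
        "tags": ",".join(frontmatter.get("tags", [])),
        "last_updated": str(frontmatter.get("last_updated", "")),
        "chunk_index": chunk_index,
    }
```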

### Chunking strategy

Pages are split into overlapping chunks for embedding. Each chunk is stored as a separate Chroma document. This ensures semantic search quality is independent of page length — a 300-word note and a 1500-word note are both fully indexed.

**Chunking algorithm:**

1. Strip YAML frontmatter from content (metadata is stored separately, not embedded).
2. Split on paragraph boundaries (double newline `\n\n`).
3. Accumulate paragraphs into chunks of **~200 tokens** (~150 words). If a single paragraph exceeds 200 tokens, split it at sentence boundaries (`. ` followed by a capital letter or newline).
4. Add **~50 tokens of overlap** between adjacent chunks — repeat the last 1–2 sentences of the previous chunk at the start of the next. This prevents concepts spanning a boundary from being lost.
5. Assign each chunk an ID: `{page_path}::chunk_{index}` (e.g., `Trends/Iran Attrition Strategy::chunk_0`).

**Short pages:** If the entire page body (after frontmatter) is under 200 tokens, store it as a single chunk. No need to split.

**Example:** A 600-word page might produce 4 chunks of ~150 words each, with ~35 words of overlap between adjacent chunks.

```python
import re

def chunk_page(content: str, target_tokens: int = 200, overlap_tokens: int = 50) -> list[str]:
    """Split page content into overlapping chunks for embedding.

    Args:
        content: Page body text (frontmatter already stripped)
        target_tokens: Approximate tokens per chunk (~0.75 words per token)
        overlap_tokens: Approximate overlap between adjacent chunks

    Returns:
        List of chunk strings
    """
    target_words = int(target_tokens * 0.75)
    overlap_words = int(overlap_tokens * 0.75)
    units: list[str] = []
    for para in re.split(r"\n\s*\n", content.strip()):
        # Oversized paragraphs fall back to sentence-boundary splitting.
        parts = re.split(r"(?<=\.)\s+(?=[A-Z])", para) if len(para.split()) > target_words else [para]
        units.extend(p for p in parts if p.strip())
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # words added since the last flush (excludes carried overlap)
    for unit in units:
        current.append(unit)
        fresh += len(unit.split())
        if sum(len(u.split()) for u in current) >= target_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of this chunk forward as overlap for the next.
            current = [" ".join(chunks[-1].split()[-overlap_words:])]
            fresh = 0
    if fresh:  # flush the remainder; short pages become a single chunk
        chunks.append("\n\n".join(current))
    return chunks
```

### Search result deduplication

Semantic search queries Chroma for the top `n * 3` chunks (to account for multiple chunks from the same page), then deduplicates by `page_path`, keeping the best-matching (lowest distance) chunk per page, and returns the top `n` unique pages.

The `snippet` in the search response is the **text of the best-matching chunk** for that page, truncated to ~150 characters. This means the snippet is contextually relevant to the query, not just the page's opening paragraph.
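
The dedup-and-snippet step can be sketched as a pure function over the parallel `ids`/`documents`/`distances`/`metadatas` lists that a single Chroma query returns, after the caller has requested `n * 3` chunks (the function name is illustrative):

```python
def dedupe_results(ids, documents, distances, metadatas, n=5):
    """Collapse chunk-level hits into page-level search results.

    Keeps the lowest-distance chunk per page_path, then returns the top n
    pages ranked by that best distance.
    """
    best = {}  # page_path -> (distance, document, metadata)
    for _id, doc, dist, meta in zip(ids, documents, distances, metadatas):
        path = meta["page_path"]
        if path not in best or dist < best[path][0]:
            best[path] = (dist, doc, meta)
    ranked = sorted(best.items(), key=lambda item: item[1][0])[:n]
    return [
        {
            "name": meta["page_name"],
            "path": path,
            "snippet": doc[:150],  # best-matching chunk, truncated
            "distance": dist,
        }
        for path, (dist, doc, meta) in ranked
    ]
```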

### API endpoints (added to the REST API)

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/api/v1/semantic-search?q=<query>&n=5` | Semantic similarity search. Returns top N results as `{name, path, snippet, distance}`. Results are deduplicated by page. |
| `POST` | `/api/v1/reindex` | Rebuild the entire Chroma index from the Git repo. Deletes all existing chunks and re-indexes all pages. For initial population and recovery. |

### Index maintenance

- On `PUT /api/v1/pages/<path>` (create/update): **delete all existing chunks** for that page path, then re-chunk and insert. This is simpler and safer than trying to diff chunks.
- On `DELETE /api/v1/pages/<path>`: delete all chunks for that page path.
- On page save via Otterwiki web UI: if hooks are available, also update Chroma. If not, run a periodic sync (see below).
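
The delete-then-reinsert update can be sketched against Chroma's collection interface (`delete(where=...)` / `upsert(...)`). The collection is passed in so the same functions serve both hooks and API handlers; `chunker` stands in for the `chunk_page` function above:

```python
def reindex_page(collection, page_path: str, content: str, metadata: dict, chunker=None) -> None:
    """Replace all indexed chunks for one page: delete, re-chunk, insert."""
    chunker = chunker or (lambda text: [text])  # pass chunk_page in practice
    # Drop every existing chunk for this page -- simpler than diffing chunks.
    collection.delete(where={"page_path": page_path})
    chunks = chunker(content)
    collection.upsert(
        ids=[f"{page_path}::chunk_{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[{**metadata, "page_path": page_path, "chunk_index": i}
                   for i in range(len(chunks))],
    )

def remove_page(collection, page_path: str) -> None:
    """Handle page deletion: drop all chunks for the path."""
    collection.delete(where={"page_path": page_path})
```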

### Fallback: periodic sync

If lifecycle hooks are unavailable or unreliable, implement a background sync that runs every 60 seconds:

1. `git log --since=<last_sync_time> --name-only` to find changed files
2. Re-index only those files in Chroma
3. Update `last_sync_time`
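
Step 1 can be sketched as follows. An empty `--pretty=format:` suppresses the commit headers so the output is essentially just file paths; the `.md` filter assumes wiki pages are stored as Markdown files, which should be confirmed against the repo layout:

```python
import subprocess

def parse_name_only(git_output: str) -> set[str]:
    """Extract unique .md paths from `git log --name-only` output."""
    return {
        line.strip()
        for line in git_output.splitlines()
        if line.strip().endswith(".md")
    }

def changed_markdown_paths(since: str, repo_dir: str = ".") -> set[str]:
    """Return wiki page files touched by commits after `since` (ISO timestamp)."""
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return parse_name_only(out)
```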

**State persistence:** `last_sync_time` is stored in a small file at `/app-data/chroma_sync_state.json` containing `{"last_sync": "2026-03-09T14:22:00Z"}`. This persists across container restarts.

**First boot / missing state:** If the state file doesn't exist, or if the Chroma collection is empty, perform a full reindex of all pages. This is the same operation as `POST /api/v1/reindex`.
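
Reading and writing the watermark is a small amount of JSON plumbing; a `None` return covers both the missing-file and corrupt-file cases and signals the full-reindex path (the file location is taken from the text above):

```python
import json
from pathlib import Path

STATE_FILE = Path("/app-data/chroma_sync_state.json")

def load_last_sync(state_file: Path = STATE_FILE):
    """Return the stored ISO timestamp, or None to trigger a full reindex."""
    try:
        return json.loads(state_file.read_text())["last_sync"]
    except (FileNotFoundError, KeyError, json.JSONDecodeError):
        return None

def save_last_sync(timestamp: str, state_file: Path = STATE_FILE) -> None:
    """Persist the sync watermark so it survives container restarts."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({"last_sync": timestamp}))
```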

**Race condition mitigation:** If a page is saved via the web UI and queried via semantic search within the sync window (up to 60 seconds), the search may return stale results. This is acceptable — the full-text search endpoint (`/api/v1/search`) reads directly from Git and is always current. The MCP server can fall back to full-text search when recency matters.

**Implementation:** Use a background thread started on Flask app initialization (e.g., `threading.Timer` with a recurring callback), NOT a cron job. This keeps everything in one process and avoids external dependencies.
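
One way to realize the recurring `threading.Timer` callback is a small wrapper class started once during app initialization (the class name is illustrative; daemon timers avoid blocking process shutdown):

```python
import threading

class PeriodicSync:
    """Run `task` every `interval` seconds on daemon Timer threads."""

    def __init__(self, task, interval: float = 60.0):
        self.task = task
        self.interval = interval
        self._timer = None
        self._stopped = threading.Event()

    def start(self) -> None:
        self._stopped.clear()
        self._schedule()

    def stop(self) -> None:
        self._stopped.set()
        if self._timer is not None:
            self._timer.cancel()

    def _schedule(self) -> None:
        self._timer = threading.Timer(self.interval, self._run)
        self._timer.daemon = True  # don't block interpreter shutdown
        self._timer.start()

    def _run(self) -> None:
        if self._stopped.is_set():
            return
        try:
            self.task()  # e.g. the git-log incremental sync
        finally:
            if not self._stopped.is_set():
                self._schedule()
```

Usage would be a single call at Flask app startup, e.g. `PeriodicSync(sync_task, interval=60).start()`, with `stop()` hooked into shutdown.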

This ensures edits made via the web UI are reflected in semantic search even without hooks.