---
category: spec
tags: [design, semantic-search, mcp]
last_updated: 2026-03-16
confidence: high
---

# Semantic Search V2: Section-Aware Chunking and Targeted Reads

This page describes planned improvements to `otterwiki-semantic-search` and `otterwiki-mcp` to make semantic search more useful to agent consumers (Claude.ai, Claude Code). It supersedes the chunking and result format sections of [[Design/Semantic_Search]].

See also: [[Tasks/Semantic_Search_Architecture]] (multi-tenant issues) and the empirical findings on the 3GW wiki at `Meta/Page Size And Search Quality`.

## Problem

The current semantic search pipeline has three compounding weaknesses for agent use:

1. **Chunks straddle topic boundaries.** The chunker splits on paragraph breaks (`\n\n`) with no awareness of markdown headings. A 500-word section under `## Russian Substitution Constraints` gets split into 3-4 chunks, and chunks near section boundaries blend content from adjacent, unrelated sections. The resulting embeddings are diluted and match poorly.

2. **Search results are truncated to 150 characters.** Chunks are ~150 words, but the search API truncates the returned snippet to 150 *characters* — discarding ~80% of the retrieved content. The agent can't evaluate relevance from a sentence fragment, so it almost always follows up with `read_note`, which loads the *entire* page.

3. **`read_note` is all-or-nothing.** Once the agent decides it needs more context than the snippet provides, the only option is loading the full page. A 4,000-word page costs ~5,000 tokens of context window to retrieve a 500-word section.

The net effect: semantic search routes to the right page, but the agent pays full context cost anyway. The search step adds latency without saving tokens.

## Constraints

**MiniLM-L6-v2 has a 256 wordpiece token context window.** Input beyond 256 tokens is silently truncated — it does not contribute to the embedding. At ~1.3 tokens per word, this means effective content per chunk is capped at ~190 words. The current `TARGET_WORDS = 150` fits within this limit with room for metadata prefixes.

Any chunk size increase requires switching to a model with a longer context window (e.g., E5-small-v2 or BGE-small at 512 tokens). This design keeps MiniLM and works within the 256-token budget.
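
The budget arithmetic can be expressed as a pre-indexing guard. A minimal sketch, assuming the ~1.3 tokens-per-word estimate from this document; exact counts would require MiniLM's actual wordpiece tokenizer, and the function names here are illustrative:

```python
# Rough chunk-budget check using the ~1.3 wordpiece-tokens-per-word estimate.
# Exact counts require the model's real tokenizer; this is an approximation.
MODEL_TOKEN_LIMIT = 256
TOKENS_PER_WORD = 1.3


def estimated_tokens(text: str) -> int:
    """Approximate the wordpiece token count of a chunk."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def fits_model_window(chunk_text: str, limit: int = MODEL_TOKEN_LIMIT) -> bool:
    """True if the chunk is unlikely to be silently truncated by the model."""
    return estimated_tokens(chunk_text) <= limit
```

At this ratio, 256 tokens correspond to roughly 197 words of capacity, so `TARGET_WORDS = 150` leaves about 60 tokens of headroom for metadata prefixes.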

## Design

### Change 1: Section-aware chunking

**In:** `otterwiki-semantic-search`, `chunking.py`

Replace paragraph-only splitting with heading-aware splitting:

1. Strip YAML frontmatter (unchanged).
2. Parse the markdown into sections by splitting on heading lines (`^#{1,6}\s`). Track a **header stack** — the path of headings from the page title down to the current section (e.g., `["Fertilizer Supply Crisis", "Russian Substitution"]`).
3. Within each section, apply the existing paragraph-accumulation algorithm (target ~150 words, sentence-boundary fallback for oversized paragraphs).
4. **Hard rule:** Never merge content from different sections into the same chunk. A section boundary is always a chunk boundary, even if the preceding chunk is short.
5. **Floor:** If a section is under ~50 words, merge it with the next section at the same or deeper heading level. This prevents stub headings from producing uselessly small chunks.
6. **Overlap:** Continue the 35-word overlap between chunks *within* the same section. Do not carry overlap across section boundaries — the header prefix (below) provides sufficient context bridging.

**Header prefix:** Prepend the header path to each chunk's text before embedding:

```
[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...
```

The bracketed prefix costs ~10-20 wordpiece tokens, reducing effective content to ~130 words per chunk. This is acceptable — a topically coherent 130-word chunk with a descriptive prefix produces a sharper embedding than a topically mixed 150-word chunk without one.
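
The header-stack parsing in steps 2-4 can be sketched as follows. This is an illustrative simplification, not the real `chunking.py`: the paragraph-accumulation step, the ~50-word floor, and the overlap logic are omitted, and headings inside fenced code blocks are not handled:

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str, page_title: str):
    """Return (header_path, text) pairs, one per heading-delimited section.

    header_path is the header stack from the page title down to the
    current section; it later becomes the bracketed embedding prefix.
    """
    stack = [(0, page_title)]  # (heading_level, heading_text)
    body: list[str] = []
    sections = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append(([title for _, title in stack], text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # a section boundary is always a chunk boundary
            level, heading = len(match.group(1)), match.group(2).strip()
            while stack[-1][0] >= level:  # pop back to the parent heading
                stack.pop()
            stack.append((level, heading))
        else:
            body.append(line)
    flush()
    return sections


def embed_prefix(header_path: list[str]) -> str:
    """Build the '[A > B] ' prefix prepended to chunk text before embedding."""
    return "[" + " > ".join(header_path) + "] "
```

For a page titled `Fertilizer Supply Crisis`, text under `## Russian Substitution` would get the path `["Fertilizer Supply Crisis", "Russian Substitution"]` and the prefix shown above.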

**Chunk metadata** gains new fields (`section`, `section_path`):

```json
{
  "page_path": "Trends/Fertilizer Supply Crisis",
  "chunk_index": 2,
  "section": "Russian Substitution",
  "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
  "title": "Fertilizer Supply Crisis",
  "category": "trend",
  "tags": "economics, agriculture"
}
```

### Change 2: Return full chunk text in search results

**In:** `otterwiki-semantic-search`, `index.py`

Remove the 150-character snippet truncation. The search API returns the full chunk text (~150 words) as the `snippet` field. This is small enough to be cheap in context and large enough to evaluate relevance without a follow-up read.

**Response format** (new fields described below):

```json
{
  "query": "Russian fertilizer substitution",
  "results": [
    {
      "name": "Trends/Fertilizer Supply Crisis",
      "path": "Trends/Fertilizer Supply Crisis",
      "snippet": "[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...(full ~150 words)...",
      "distance": 0.42,
      "section": "Russian Substitution",
      "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
      "chunk_index": 2,
      "total_chunks": 12,
      "page_word_count": 4188
    }
  ],
  "total": 1
}
```

New fields:

- **`section`** / **`section_path`** — where in the page this chunk lives. Gives the agent a handle for targeted reads (Change 4).
- **`chunk_index`** / **`total_chunks`** — positional context.
- **`page_word_count`** — lets the agent estimate the context cost of a full `read_note` and decide whether it's worth it.
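
As an illustration of how an agent-side consumer might act on `page_word_count` — the threshold and helper name here are hypothetical, not part of the API:

```python
def choose_read(result: dict, full_read_threshold: int = 800) -> dict:
    """Pick read_note arguments from one search result.

    Hypothetical heuristic: pages under the threshold are cheap enough to
    read whole; for larger pages, use the section handle from the result.
    """
    if result.get("page_word_count", 0) <= full_read_threshold:
        return {"path": result["path"]}  # full-page read_note
    return {"path": result["path"], "section": result["section"]}
```

For the 4,188-word example above, this would yield a targeted section read rather than a full-page load.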

### Change 3: Configurable per-page deduplication

**In:** `otterwiki-semantic-search`, `index.py`

Currently the search deduplicates to one chunk per page. This is too aggressive — if three sections of a page are relevant, the agent sees only one.

Add a `max_chunks_per_page` parameter (default 2, max 5) to the search API:

```
GET /api/v1/semantic-search?q=economic+transmission&n=5&max_chunks_per_page=3
```

The deduplication logic changes from "keep best chunk per page" to "keep best N chunks per page." Total results are still capped at `n`.

Default of 2 balances breadth (seeing multiple pages) against depth (seeing multiple sections of the most relevant page).
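
The "best N chunks per page" selection can be sketched as follows, assuming (as a simplification) that candidates arrive sorted best-first by ascending distance; the function name is illustrative:

```python
from collections import defaultdict


def dedup_per_page(results: list[dict], n: int, max_chunks_per_page: int = 2) -> list[dict]:
    """Keep at most max_chunks_per_page hits per page, capped at n total.

    Assumes `results` is sorted best-first (ascending distance), so the
    first hits seen for a page are its best chunks.
    """
    seen = defaultdict(int)  # page path -> chunks kept so far
    kept = []
    for hit in results:
        if seen[hit["path"]] < max_chunks_per_page:
            seen[hit["path"]] += 1
            kept.append(hit)
            if len(kept) == n:
                break
    return kept
```

With `max_chunks_per_page=1` this degenerates to the current behavior; with 2 (the default), a second strong section from the best page survives dedup.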

### Change 4: Section-level read via MCP

**In:** `otterwiki-mcp` (or `otterwiki-api` REST plugin)

Add a `section` parameter to the `read_note` MCP tool:

```
read_note(path="Trends/Fertilizer Supply Crisis", section="Russian Substitution")
```

**Behavior:**

1. Load the full page content.
2. Parse markdown headings into a tree.
3. Find the section matching the `section` parameter. Match against heading text, case-insensitive. If ambiguous (multiple headings with the same text), accept a `/`-delimited path: `"Country Dependencies/Pakistan"`.
4. Return everything from the matched heading to the next heading at the same or higher level.
5. Include the heading itself in the returned content.
6. If no match, return an error listing available sections (so the agent can retry with the correct name).
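
The steps above can be sketched as follows. This is a simplified illustration: it scans headings flat rather than building a full tree, omits the `/`-delimited disambiguation path from step 3, and would need extra handling for `#` lines inside fenced code blocks:

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)


def read_section(content: str, section: str) -> str:
    """Return a heading plus its body, up to the next heading at the same
    or higher level. Raises ValueError listing the available sections when
    nothing matches (comparison is case-insensitive)."""
    headings = [(m.start(), len(m.group(1)), m.group(2).strip())
                for m in HEADING_RE.finditer(content)]
    for i, (start, level, title) in enumerate(headings):
        if title.lower() == section.lower():
            end = len(content)
            for next_start, next_level, _ in headings[i + 1:]:
                if next_level <= level:  # same-or-higher level ends the section
                    end = next_start
                    break
            return content[start:end].rstrip()
    names = ", ".join(title for _, _, title in headings)
    raise ValueError(f"Section {section!r} not found. Available sections: {names}")
```

Note that the returned slice starts at the matched heading (step 5) and the error message doubles as the retry hint from step 6.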

**Why in the MCP layer, not the REST API:** The REST API serves multiple consumers. Section-level reads are an agent UX optimization — the MCP tool can implement it by fetching the full page from the REST API and slicing locally. This avoids adding complexity to the API surface.

**Alternative considered:** Returning multiple sections in one call (e.g., `sections=["Russian Substitution", "Planting Window"]`). Deferred — the common case is one section per call, and multiple calls are cheap.

## Agent workflow after these changes

1. **`semantic_search("Russian fertilizer constraints")`** — returns 5 results with full chunk text, section paths, and page word counts.
2. Agent reads the snippets. Two chunks from `Fertilizer Supply Crisis` are relevant (sections "Russian Substitution" and "Priority Queue"). One chunk from `P4 Economic Transmission` is relevant.
3. For the 500-word sections where the 150-word snippet isn't enough: **`read_note("Trends/Fertilizer Supply Crisis", section="Russian Substitution")`** — returns ~500 words instead of ~4,200.
4. Agent has the context it needs. Total cost: ~1,500 tokens (5 snippets + 1 section read) vs. ~6,500 tokens today (5 truncated snippets + 1 full page load that's mostly irrelevant).

## Implementation scope

| Change | Repo | Files | Complexity |
|--------|------|-------|------------|
| Section-aware chunking | otterwiki-semantic-search | `chunking.py`, tests | Medium — new heading parser, preserve existing paragraph logic within sections |
| Full chunk text + metadata | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — remove truncation, add fields to response |
| Configurable dedup | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — parameterize existing logic |
| Section-level read | otterwiki-mcp | MCP tool definition, markdown parser | Medium — heading tree parser, error handling for ambiguous matches |

All changes are backward-compatible. Existing consumers see richer results but don't break. The `section` parameter on `read_note` is optional.

**Reindexing:** Changes 1-3 require a full reindex after deployment. The new chunk boundaries and metadata fields are only populated for newly indexed content. `POST /api/v1/reindex` handles this.

## Deployment notes

**Reindex is mandatory.** Deploy the new `otterwiki-semantic-search` code, then immediately `POST /api/v1/reindex` on each wiki instance. Until the reindex completes, old-format chunks (missing `section`, `section_path`, `total_chunks`, `page_word_count`) will return `None` for those fields in search results. The search layer uses `.get()` so it won't crash, but the enriched MCP formatting will degrade silently.

**FAISS sidecar growth.** New metadata fields add ~160 bytes per chunk to `embeddings.json`. For a 10,000-chunk index, the sidecar grows from ~1.4MB to ~2.9MB. The FAISS backend loads the full sidecar into memory on startup and re-serializes it on every upsert. This is acceptable for current corpus sizes but worth monitoring for large multi-tenant deployments.

**Heading content in results.** `section`, `section_path`, and the `[prefix]` in `text`/`snippet` contain raw heading text from wiki pages. Consumers rendering these fields as HTML must escape them. The API returns JSON (`Content-Type: application/json`), so the API layer itself is safe.

**Deploy order.** `otterwiki-semantic-search` must deploy and reindex before the `otterwiki-mcp` changes are useful. The MCP `section` parameter on `read_note` is independent (it parses content client-side), but `format_semantic_results` expects the new result fields, which only appear after reindex.

## What this design does NOT address

- **Embedding model upgrade.** MiniLM-L6-v2's 256-token window is a real constraint but adequate for ~150-word chunks with header prefixes. A model upgrade (to a 512-token context) would allow larger chunks and is worth evaluating separately.
- **Multi-tenant indexing.** Tracked in [[Tasks/Semantic_Search_Architecture]] and [[Tasks/Semantic_Search_Multi_Tenant]]. Orthogonal to this work.
- **In-process embedding risks.** The ONNX model in the gunicorn worker and daemon thread shutdown are operational concerns, not search quality concerns. Tracked in [[Tasks/Semantic_Search_Architecture]].
|||||||