Commit 21f5dc
2026-03-16 17:36:36 Claude (MCP): [mcp] [design] Add Semantic Search V2 design: section-aware chunking, full chunk text in results, section-level reads
| /dev/null .. Design/Semantic_Search_V2.md |
| @@ -0,0 +1,165 @@ |
| + | --- |
| + | category: spec |
| + | tags: [design, semantic-search, mcp] |
| + | last_updated: 2026-03-16 |
| + | confidence: high |
| + | --- |
| + | |
| + | # Semantic Search V2: Section-Aware Chunking and Targeted Reads |
| + | |
| + | This page describes planned improvements to `otterwiki-semantic-search` and `otterwiki-mcp` to make semantic search more useful to agent consumers (Claude.ai, Claude Code). It supersedes the chunking and result format sections of [[Design/Semantic_Search]]. |
| + | |
| + | See also: [[Tasks/Semantic_Search_Architecture]] (multi-tenant issues) and the empirical findings on the 3GW wiki at `Meta/Page Size And Search Quality`. |
| + | |
| + | ## Problem |
| + | |
| + | The current semantic search pipeline has three compounding weaknesses for agent use: |
| + | |
| + | 1. **Chunks straddle topic boundaries.** The chunker splits on paragraph breaks (`\n\n`) with no awareness of markdown headings. A 500-word section under `## Russian Substitution Constraints` gets split into 3-4 chunks, and chunks near section boundaries blend content from adjacent, unrelated sections. The resulting embeddings are diluted and match poorly. |
| + | |
| + | 2. **Search results are truncated to 150 characters.** Chunks are ~150 words, but the search API truncates the returned snippet to 150 *characters* — discarding ~80% of the retrieved content. The agent can't evaluate relevance from a sentence fragment, so it almost always follows up with `read_note`, which loads the *entire* page. |
| + | |
| + | 3. **`read_note` is all-or-nothing.** Once the agent decides it needs more context than the snippet provides, the only option is loading the full page. A 4,000-word page costs ~5,000 tokens of context window to retrieve a 500-word section. |
| + | |
| + | The net effect: semantic search routes to the right page but the agent pays full context cost anyway. The search step adds latency without saving tokens. |
| + | |
| + | ## Constraints |
| + | |
| + | **MiniLM-L6-v2 has a 256-wordpiece-token context window.** Input beyond 256 tokens is silently truncated — it does not contribute to the embedding. At ~1.3 wordpiece tokens per word, effective content per chunk is capped at ~190 words. The current `TARGET_WORDS = 150` fits within this limit with room to spare for metadata prefixes. |
| + | |
| + | Any chunk size increase requires switching to a model with a longer context window (e.g., E5-small-v2 or BGE-small at 512 tokens). This design keeps MiniLM and works within the 256-token budget. |
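The budget arithmetic above can be sketched as a quick helper. The names and the flat tokens-per-word factor are illustrative, not part of the actual codebase:

```python
# Rough feasibility check against MiniLM-L6-v2's 256-wordpiece window,
# using the ~1.3 tokens-per-word heuristic from this page. Illustrative
# only; the real pipeline relies on TARGET_WORDS, not a runtime check.
MODEL_TOKEN_LIMIT = 256
TOKENS_PER_WORD = 1.3

def fits_model_window(text: str, reserved_tokens: int = 0) -> bool:
    """Estimate whether `text`, plus `reserved_tokens` of metadata
    prefix, stays within the embedding model's context window."""
    estimated = int(len(text.split()) * TOKENS_PER_WORD) + reserved_tokens
    return estimated <= MODEL_TOKEN_LIMIT
```

At `TARGET_WORDS = 150` plus a 20-token prefix the estimate is 215 tokens, comfortably inside the 256 limit, which is the headroom the paragraph above refers to.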
| + | |
| + | ## Design |
| + | |
| + | ### Change 1: Section-aware chunking |
| + | |
| + | **In:** `otterwiki-semantic-search`, `chunking.py` |
| + | |
| + | Replace paragraph-only splitting with heading-aware splitting: |
| + | |
| + | 1. Strip YAML frontmatter (unchanged). |
| + | 2. Parse the markdown into sections by splitting on heading lines (`^#{1,6}\s`). Track a **header stack** — the path of headings from the page title down to the current section (e.g., `["Fertilizer Supply Crisis", "Russian Substitution"]`). |
| + | 3. Within each section, apply the existing paragraph-accumulation algorithm (target ~150 words, sentence-boundary fallback for oversized paragraphs). |
| + | 4. **Hard rule:** Never merge content from different sections into the same chunk. A section boundary is always a chunk boundary, even if the preceding chunk is short. |
| + | 5. **Floor:** If a section is under ~50 words, merge it with the next section at the same or deeper heading level. This prevents stub headings from producing uselessly small chunks. |
| + | 6. **Overlap:** Continue the 35-word overlap between chunks *within* the same section. Do not carry overlap across section boundaries — the header prefix (below) provides sufficient context bridging. |
| + | |
| + | **Header prefix:** Prepend the header path to each chunk's text before embedding: |
| + | |
| + | ``` |
| + | [Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute... |
| + | ``` |
| + | |
| + | The bracketed prefix costs ~10-20 wordpiece tokens, reducing effective content to ~130 words per chunk. This is acceptable — a topically coherent 130-word chunk with a descriptive prefix produces a sharper embedding than a topically mixed 150-word chunk without one. |
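A minimal sketch of the section split and header prefix (only these two steps; the short-section merge, overlap, and paragraph-accumulation rules are omitted, and all names are illustrative rather than the actual `chunking.py` API):

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_sections(markdown: str):
    """Split markdown into (header_path, body) pairs, maintaining the
    header stack from the page title down to the current section."""
    sections = []
    stack = []        # list of (level, heading_text)
    body_lines = []

    def flush():
        body = "\n".join(body_lines).strip()
        body_lines.clear()
        if body:
            sections.append(([h for _, h in stack], body))

    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # Pop headings at the same or deeper level before pushing.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, m.group(2).strip()))
        else:
            body_lines.append(line)
    flush()
    return sections

def prefixed_chunk(header_path, chunk_text):
    """Prepend the bracketed header path used for embedding."""
    return f"[{' > '.join(header_path)}] {chunk_text}"
```

Because a heading always triggers a flush, the hard rule above (a section boundary is always a chunk boundary) falls out of the structure for free.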
| + | |
| + | **Chunk metadata** gains a new field: |
| + | |
| + | ```json |
| + | { |
| + | "page_path": "Trends/Fertilizer Supply Crisis", |
| + | "chunk_index": 2, |
| + | "section": "Russian Substitution", |
| + | "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"], |
| + | "title": "Fertilizer Supply Crisis", |
| + | "category": "trend", |
| + | "tags": "economics, agriculture" |
| + | } |
| + | ``` |
| + | |
| + | ### Change 2: Return full chunk text in search results |
| + | |
| + | **In:** `otterwiki-semantic-search`, `index.py` |
| + | |
| + | Remove the 150-character snippet truncation. The search API returns the full chunk text (~150 words) as the `snippet` field. This is small enough to be cheap in context and large enough to evaluate relevance without a follow-up read. |
| + | |
| + | **Response format** (additions in bold): |
| + | |
| + | ```json |
| + | { |
| + | "query": "Russian fertilizer substitution", |
| + | "results": [ |
| + | { |
| + | "name": "Trends/Fertilizer Supply Crisis", |
| + | "path": "Trends/Fertilizer Supply Crisis", |
| + | "snippet": "[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...(full ~150 words)...", |
| + | "distance": 0.42, |
| + | "section": "Russian Substitution", |
| + | "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"], |
| + | "chunk_index": 2, |
| + | "total_chunks": 12, |
| + | "page_word_count": 4188 |
| + | } |
| + | ], |
| + | "total": 1 |
| + | } |
| + | ``` |
| + | |
| + | New fields: |
| + | - **`section`** / **`section_path`** — where in the page this chunk lives. Gives the agent a handle for targeted reads (Change 4). |
| + | - **`chunk_index`** / **`total_chunks`** — positional context. |
| + | - **`page_word_count`** — lets the agent estimate the context cost of a full `read_note` and decide whether it's worth it. |
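As an illustration of how an agent-side consumer might use `page_word_count`, a minimal cost heuristic (the function names and the 2,000-token budget are invented for this example; the ~1.25 tokens-per-word factor matches the 4,000-words to ~5,000-tokens estimate above):

```python
# Hypothetical agent-side heuristic, not part of any shipped tool:
# decide whether to fetch a full page or just the matched section.
def estimated_read_tokens(word_count: int) -> int:
    # ~1.25 tokens per word, per the estimate in this design.
    return int(word_count * 1.25)

def prefer_section_read(result: dict, budget_tokens: int = 2000) -> bool:
    """True when loading the full page would exceed the token budget
    and the search result names a section we could fetch instead."""
    return (estimated_read_tokens(result["page_word_count"]) > budget_tokens
            and bool(result.get("section")))
```

For the example result above (4,188 words, section "Russian Substitution"), the full-page estimate is ~5,235 tokens, so a section read wins.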
| + | |
| + | ### Change 3: Configurable per-page deduplication |
| + | |
| + | **In:** `otterwiki-semantic-search`, `index.py` |
| + | |
| + | Currently the search deduplicates to one chunk per page. This is too aggressive — if three sections of a page are relevant, the agent sees only one. |
| + | |
| + | Add a `max_chunks_per_page` parameter (default 2, max 5) to the search API: |
| + | |
| + | ``` |
| + | GET /api/v1/semantic-search?q=economic+transmission&n=5&max_chunks_per_page=3 |
| + | ``` |
| + | |
| + | The deduplication logic changes from "keep best chunk per page" to "keep best N chunks per page." Total results are still capped at `n`. |
| + | |
| + | Default of 2 balances breadth (seeing multiple pages) against depth (seeing multiple sections of the most relevant page). |
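A minimal sketch of the revised deduplication, assuming results carry `path` and `distance` fields with lower distance meaning a better match (not the actual `index.py` code):

```python
from collections import defaultdict

def dedupe_results(hits, n=5, max_chunks_per_page=2):
    """Keep the best `max_chunks_per_page` chunks per page, then cap
    total results at `n`. Sorting defensively by distance keeps the
    logic correct even if `hits` arrives unordered."""
    per_page = defaultdict(int)
    kept = []
    for hit in sorted(hits, key=lambda h: h["distance"]):
        if per_page[hit["path"]] < max_chunks_per_page:
            per_page[hit["path"]] += 1
            kept.append(hit)
        if len(kept) == n:
            break
    return kept
```

With `max_chunks_per_page=1` this degenerates to the current one-chunk-per-page behavior, which keeps the change backward-compatible.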
| + | |
| + | ### Change 4: Section-level read via MCP |
| + | |
| + | **In:** `otterwiki-mcp` (or `otterwiki-api` REST plugin) |
| + | |
| + | Add a `section` parameter to the `read_note` MCP tool: |
| + | |
| + | ``` |
| + | read_note(path="Trends/Fertilizer Supply Crisis", section="Russian Substitution") |
| + | ``` |
| + | |
| + | **Behavior:** |
| + | |
| + | 1. Load the full page content. |
| + | 2. Parse markdown headings into a tree. |
| + | 3. Find the section matching the `section` parameter. Match against heading text, case-insensitive. If ambiguous (multiple headings with the same text), accept a `/`-delimited path: `"Country Dependencies/Pakistan"`. |
| + | 4. Return everything from the matched heading to the next heading at the same or higher level. |
| + | 5. Include the heading itself in the returned content. |
| + | 6. If no match, return an error listing available sections (so the agent can retry with the correct name). |
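The matching and slicing behavior above can be sketched as follows. This simplified version matches single heading names only and omits the `/`-delimited path disambiguation:

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def read_section(page_text: str, section: str) -> str:
    """Return content from the heading matching `section`
    (case-insensitive) through to the next heading at the same or
    higher level, including the heading line itself. Raises ValueError
    listing available sections when nothing matches."""
    lines = page_text.splitlines()
    headings = [(i, len(m.group(1)), m.group(2).strip())
                for i, line in enumerate(lines)
                if (m := HEADING_RE.match(line))]
    for idx, (start, level, text) in enumerate(headings):
        if text.lower() == section.lower():
            end = len(lines)
            for later_start, later_level, _ in headings[idx + 1:]:
                if later_level <= level:
                    end = later_start
                    break
            return "\n".join(lines[start:end]).rstrip()
    available = ", ".join(t for _, _, t in headings)
    raise ValueError(f"No section {section!r}; available: {available}")
```

Note that deeper subsections (e.g. an `###` under a matched `##`) are included in the returned slice, which matches step 4 above.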
| + | |
| + | **Why in the MCP layer, not the REST API:** The REST API serves multiple consumers. Section-level reads are an agent UX optimization — the MCP tool can implement it by fetching the full page from the REST API and slicing locally. This avoids adding complexity to the API surface. |
| + | |
| + | **Alternative considered:** Returning multiple sections in one call (e.g., `sections=["Russian Substitution", "Planting Window"]`). Deferred — the common case is one section per call, and multiple calls are cheap. |
| + | |
| + | ## Agent workflow after these changes |
| + | |
| + | 1. **`semantic_search("Russian fertilizer constraints")`** — returns 5 results with full chunk text, section paths, and page word counts. |
| + | 2. Agent reads the snippets. Two chunks from `Fertilizer Supply Crisis` are relevant (sections "Russian Substitution" and "Priority Queue"). One chunk from `P4 Economic Transmission` is relevant. |
| + | 3. For the 500-word sections where the 150-word snippet isn't enough: **`read_note("Trends/Fertilizer Supply Crisis", section="Russian Substitution")`** — returns ~500 words instead of ~4,200. |
| + | 4. Agent has the context it needs. Total cost: ~1,500 tokens (5 snippets + 1 section read) vs. ~6,500 tokens today (5 truncated snippets + 1 full page load that's mostly irrelevant). |
| + | |
| + | ## Implementation scope |
| + | |
| + | | Change | Repo | Files | Complexity | |
| + | |--------|------|-------|------------| |
| + | | Section-aware chunking | otterwiki-semantic-search | `chunking.py`, tests | Medium — new heading parser, preserve existing paragraph logic within sections | |
| + | | Full chunk text + metadata | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — remove truncation, add fields to response | |
| + | | Configurable dedup | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — parameterize existing logic | |
| + | | Section-level read | otterwiki-mcp | MCP tool definition, markdown parser | Medium — heading tree parser, error handling for ambiguous matches | |
| + | |
| + | All changes are backward-compatible. Existing consumers see richer results but don't break. The `section` parameter on `read_note` is optional. |
| + | |
| + | **Reindexing:** Changes 1-3 require a full reindex after deployment. The new chunk boundaries and metadata fields are only populated for newly indexed content. `POST /api/v1/reindex` handles this. |
| + | |
| + | ## What this design does NOT address |
| + | |
| + | - **Embedding model upgrade.** MiniLM-L6-v2's 256-token window is a real constraint but adequate for ~150-word chunks with header prefixes. A model upgrade (to 512-token context) would allow larger chunks and is worth evaluating separately. |
| + | - **Multi-tenant indexing.** Tracked in [[Tasks/Semantic_Search_Architecture]] and [[Tasks/Semantic_Search_Multi_Tenant]]. Orthogonal to this work. |
| + | - **In-process embedding risks.** The ONNX model in the gunicorn worker and daemon thread shutdown are operational concerns, not search quality concerns. Tracked in [[Tasks/Semantic_Search_Architecture]]. |