Commit 21f5dc

2026-03-16 17:36:36 Claude (MCP): [mcp] [design] Add Semantic Search V2 design: section-aware chunking, full chunk text in results, section-level reads
/dev/null .. Design/Semantic_Search_V2.md
@@ 0,0 1,165 @@
+ ---
+ category: spec
+ tags: [design, semantic-search, mcp]
+ last_updated: 2026-03-16
+ confidence: high
+ ---
+
+ # Semantic Search V2: Section-Aware Chunking and Targeted Reads
+
+ This page describes planned improvements to `otterwiki-semantic-search` and `otterwiki-mcp` to make semantic search more useful to agent consumers (Claude.ai, Claude Code). It supersedes the chunking and result format sections of [[Design/Semantic_Search]].
+
+ See also: [[Tasks/Semantic_Search_Architecture]] (multi-tenant issues) and the empirical findings on the 3GW wiki at `Meta/Page Size And Search Quality`.
+
+ ## Problem
+
+ The current semantic search pipeline has three compounding weaknesses for agent use:
+
+ 1. **Chunks straddle topic boundaries.** The chunker splits on paragraph breaks (`\n\n`) with no awareness of markdown headings. A 500-word section under `## Russian Substitution Constraints` gets split into 3-4 chunks, and chunks near section boundaries blend content from adjacent, unrelated sections. The resulting embeddings are diluted and match poorly.
+
+ 2. **Search results are truncated to 150 characters.** Chunks are ~150 words, but the search API truncates the returned snippet to 150 *characters* — discarding ~80% of the retrieved content. The agent can't evaluate relevance from a sentence fragment, so it almost always follows up with `read_note`, which loads the *entire* page.
+
+ 3. **`read_note` is all-or-nothing.** Once the agent decides it needs more context than the snippet provides, the only option is loading the full page. A 4,000-word page costs ~5,000 tokens of context window to retrieve a 500-word section.
+
+ The net effect: semantic search routes to the right page but the agent pays full context cost anyway. The search step adds latency without saving tokens.
+
+ ## Constraints
+
+ **MiniLM-L6-v2 has a 256 wordpiece token context window.** Input beyond 256 tokens is silently truncated — it does not contribute to the embedding. At ~1.3 tokens per word, this means effective content per chunk is capped at ~190 words. The current `TARGET_WORDS = 150` fits within this limit with room for metadata prefixes.
+
+ Any chunk size increase requires switching to a model with a longer context window (e.g., E5-small-v2 or BGE-small at 512 tokens). This design keeps MiniLM and works within the 256-token budget.
+
+ ## Design
+
+ ### Change 1: Section-aware chunking
+
+ **In:** `otterwiki-semantic-search`, `chunking.py`
+
+ Replace paragraph-only splitting with heading-aware splitting:
+
+ 1. Strip YAML frontmatter (unchanged).
+ 2. Parse the markdown into sections by splitting on heading lines (`^#{1,6}\s`). Track a **header stack** — the path of headings from the page title down to the current section (e.g., `["Fertilizer Supply Crisis", "Russian Substitution"]`).
+ 3. Within each section, apply the existing paragraph-accumulation algorithm (target ~150 words, sentence-boundary fallback for oversized paragraphs).
+ 4. **Hard rule:** Never merge content from different sections into the same chunk. A section boundary is always a chunk boundary, even if the preceding chunk is short.
+ 5. **Floor:** If a section is under ~50 words, merge it with the next section at the same or deeper heading level. This prevents stub headings from producing uselessly small chunks.
+ 6. **Overlap:** Continue the 35-word overlap between chunks *within* the same section. Do not carry overlap across section boundaries — the header prefix (below) provides sufficient context bridging.
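
The steps above can be sketched roughly as follows. This is a simplification, not the existing `chunking.py` API: the ~50-word floor merge and the 35-word overlap are omitted for brevity, and `Section`, `split_sections`, and `chunk_section` are hypothetical names.

```python
import re
from dataclasses import dataclass

TARGET_WORDS = 150  # existing target from the current chunker

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

@dataclass
class Section:
    path: list   # header stack: page title down to this heading
    level: int
    text: str

def split_sections(markdown: str, page_title: str) -> list:
    """Split markdown into sections, tracking the header stack."""
    sections, buf = [], []
    stack = [(0, page_title)]  # (heading level, heading text)

    def flush():
        body = "\n".join(buf).strip()
        if body:
            sections.append(Section(path=[h for _, h in stack],
                                    level=stack[-1][0], text=body))
        buf.clear()

    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            flush()  # a section boundary is always a chunk boundary
            level = len(m.group(1))
            while len(stack) > 1 and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, m.group(2).strip()))
        else:
            buf.append(line)
    flush()
    return sections

def chunk_section(section: Section) -> list:
    """Paragraph accumulation within ONE section; never crosses sections."""
    chunks, current, count = [], [], 0
    for para in re.split(r"\n\n+", section.text):
        words = para.split()
        if count and count + len(words) > TARGET_WORDS:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    prefix = "[" + " > ".join(section.path) + "] "
    return [prefix + c for c in chunks]
```

Because `flush()` runs at every heading, the hard rule in step 4 falls out of the structure: no accumulator survives across a section boundary.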
+
+ **Header prefix:** Prepend the header path to each chunk's text before embedding:
+
+ ```
+ [Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...
+ ```
+
+ The bracketed prefix costs ~10-20 wordpiece tokens, reducing effective content to ~130 words per chunk. This is acceptable — a topically coherent 130-word chunk with a descriptive prefix produces a sharper embedding than a topically mixed 150-word chunk without one.
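
Using the ~1.3 tokens-per-word heuristic from the Constraints section, the budget arithmetic can be approximated as below. This is illustrative only — real wordpiece counts require the model's tokenizer — and the helper names are hypothetical.

```python
TOKENS_PER_WORD = 1.3  # rough wordpiece ratio (see Constraints)
MODEL_WINDOW = 256     # MiniLM-L6-v2 context window

def header_prefix(section_path):
    """Bracketed header path prepended to the chunk before embedding."""
    return "[" + " > ".join(section_path) + "] "

def remaining_word_budget(section_path):
    """Approximate words of section text that still fit in the window."""
    prefix_tokens = len(header_prefix(section_path).split()) * TOKENS_PER_WORD
    return int((MODEL_WINDOW - prefix_tokens) / TOKENS_PER_WORD)
```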
+
+ **Chunk metadata** gains a new field:
+
+ ```json
+ {
+ "page_path": "Trends/Fertilizer Supply Crisis",
+ "chunk_index": 2,
+ "section": "Russian Substitution",
+ "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
+ "title": "Fertilizer Supply Crisis",
+ "category": "trend",
+ "tags": "economics, agriculture"
+ }
+ ```
+
+ ### Change 2: Return full chunk text in search results
+
+ **In:** `otterwiki-semantic-search`, `index.py`
+
+ Remove the 150-character snippet truncation. The search API returns the full chunk text (~150 words) as the `snippet` field. This is small enough to be cheap in context and large enough to evaluate relevance without a follow-up read.
+
+ **Response format** (new fields are described below):
+
+ ```json
+ {
+ "query": "Russian fertilizer substitution",
+ "results": [
+ {
+ "name": "Trends/Fertilizer Supply Crisis",
+ "path": "Trends/Fertilizer Supply Crisis",
+ "snippet": "[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...(full ~150 words)...",
+ "distance": 0.42,
+ "section": "Russian Substitution",
+ "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
+ "chunk_index": 2,
+ "total_chunks": 12,
+ "page_word_count": 4188
+ }
+ ],
+ "total": 1
+ }
+ ```
+
+ New fields:
+ - **`section`** / **`section_path`** — where in the page this chunk lives. Gives the agent a handle for targeted reads (Change 4).
+ - **`chunk_index`** / **`total_chunks`** — positional context.
+ - **`page_word_count`** — lets the agent estimate the context cost of a full `read_note` and decide whether it's worth it.
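
As an illustration of how a consumer might use `page_word_count`, a possible agent-side heuristic is sketched below. The 1.25 tokens-per-word ratio (from the "~5,000 tokens for a 4,000-word page" estimate above) and the budget threshold are assumptions, not part of the API.

```python
PROSE_TOKENS_PER_WORD = 1.25  # rough ratio; an assumption, not part of the API

def full_read_cost(page_word_count: int) -> int:
    """Estimated context tokens for a full read_note of the page."""
    return int(page_word_count * PROSE_TOKENS_PER_WORD)

def prefer_section_read(result: dict, budget: int = 2000) -> bool:
    """True when the matched section is known and a full read would blow the budget."""
    return (result.get("section") is not None
            and full_read_cost(result["page_word_count"]) > budget)
```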
+
+ ### Change 3: Configurable per-page deduplication
+
+ **In:** `otterwiki-semantic-search`, `index.py`
+
+ Currently the search deduplicates to one chunk per page. This is too aggressive — if three sections of a page are relevant, the agent sees only one.
+
+ Add a `max_chunks_per_page` parameter (default 2, max 5) to the search API:
+
+ ```
+ GET /api/v1/semantic-search?q=economic+transmission&n=5&max_chunks_per_page=3
+ ```
+
+ The deduplication logic changes from "keep best chunk per page" to "keep best N chunks per page." Total results are still capped at `n`.
+
+ Default of 2 balances breadth (seeing multiple pages) against depth (seeing multiple sections of the most relevant page).
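
A minimal sketch of the relaxed deduplication, assuming results arrive sorted by ascending distance (best first); the function name is illustrative, not the existing `index.py` API.

```python
from collections import defaultdict

def dedupe(results, n, max_chunks_per_page=2):
    """Keep the best `max_chunks_per_page` chunks per page; cap total at `n`."""
    kept, per_page = [], defaultdict(int)
    for r in results:  # already sorted by ascending distance
        if per_page[r["page_path"]] < max_chunks_per_page:
            kept.append(r)
            per_page[r["page_path"]] += 1
        if len(kept) == n:
            break
    return kept
```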
+
+ ### Change 4: Section-level read via MCP
+
+ **In:** `otterwiki-mcp` (or `otterwiki-api` REST plugin)
+
+ Add a `section` parameter to the `read_note` MCP tool:
+
+ ```
+ read_note(path="Trends/Fertilizer Supply Crisis", section="Russian Substitution")
+ ```
+
+ **Behavior:**
+
+ 1. Load the full page content.
+ 2. Parse markdown headings into a tree.
+ 3. Find the section matching the `section` parameter. Match against heading text, case-insensitive. If ambiguous (multiple headings with the same text), accept a `/`-delimited path: `"Country Dependencies/Pakistan"`.
+ 4. Return everything from the matched heading to the next heading at the same or higher level.
+ 5. Include the heading itself in the returned content.
+ 6. If no match, return an error listing available sections (so the agent can retry with the correct name).
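
The behavior above can be sketched as follows. `read_section` is a hypothetical helper showing the slicing logic, not the MCP tool itself, and the `/`-delimited disambiguation path from step 3 is omitted for brevity.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def read_section(markdown: str, section: str) -> str:
    """Return content from the matched heading to the next heading at the
    same or higher level; raise ValueError listing sections on no match."""
    lines = markdown.splitlines()
    headings = [(i, len(m.group(1)), m.group(2).strip())
                for i, line in enumerate(lines)
                if (m := HEADING_RE.match(line))]
    matches = [(i, lvl) for i, lvl, text in headings
               if text.lower() == section.lower()]  # case-insensitive
    if not matches:
        available = ", ".join(text for _, _, text in headings)
        raise ValueError(f"section not found; available sections: {available}")
    start, level = matches[0]
    end = next((i for i, lvl, _ in headings
                if i > start and lvl <= level), len(lines))
    return "\n".join(lines[start:end]).rstrip()  # heading included
```

Note that subsections nested under the matched heading are returned with it, since the slice only stops at a heading of the same or higher level.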
+
+ **Why in the MCP layer, not the REST API:** The REST API serves multiple consumers. Section-level reads are an agent UX optimization — the MCP tool can implement it by fetching the full page from the REST API and slicing locally. This avoids adding complexity to the API surface.
+
+ **Alternative considered:** Returning multiple sections in one call (e.g., `sections=["Russian Substitution", "Planting Window"]`). Deferred — the common case is one section per call, and multiple calls are cheap.
+
+ ## Agent workflow after these changes
+
+ 1. **`semantic_search("Russian fertilizer constraints")`** — returns 5 results with full chunk text, section paths, and page word counts.
+ 2. Agent reads the snippets. Two chunks from `Fertilizer Supply Crisis` are relevant (sections "Russian Substitution" and "Priority Queue"). One chunk from `P4 Economic Transmission` is relevant.
+ 3. For any section where the ~150-word snippet isn't enough: **`read_note("Trends/Fertilizer Supply Crisis", section="Russian Substitution")`** — returns ~500 words instead of ~4,200.
+ 4. Agent has the context it needs. Total cost: ~1,500 tokens (5 snippets + 1 section read) vs. ~6,500 tokens today (5 truncated snippets + 1 full page load that's mostly irrelevant).
+
+ ## Implementation scope
+
+ | Change | Repo | Files | Complexity |
+ |--------|------|-------|------------|
+ | Section-aware chunking | otterwiki-semantic-search | `chunking.py`, tests | Medium — new heading parser, preserve existing paragraph logic within sections |
+ | Full chunk text + metadata | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — remove truncation, add fields to response |
+ | Configurable dedup | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — parameterize existing logic |
+ | Section-level read | otterwiki-mcp | MCP tool definition, markdown parser | Medium — heading tree parser, error handling for ambiguous matches |
+
+ All changes are backward-compatible. Existing consumers see richer results but don't break. The `section` parameter on `read_note` is optional.
+
+ **Reindexing:** Changes 1-3 require a full reindex after deployment. The new chunk boundaries and metadata fields are only populated for newly indexed content. `POST /api/v1/reindex` handles this.
+
+ ## What this design does NOT address
+
+ - **Embedding model upgrade.** MiniLM-L6-v2's 256-token window is a real constraint but adequate for ~150-word chunks with header prefixes. A model upgrade (to 512-token context) would allow larger chunks and is worth evaluating separately.
+ - **Multi-tenant indexing.** Tracked in [[Tasks/Semantic_Search_Architecture]] and [[Tasks/Semantic_Search_Multi_Tenant]]. Orthogonal to this work.
+ - **In-process embedding risks.** The ONNX model in the gunicorn worker and daemon thread shutdown are operational concerns, not search quality concerns. Tracked in [[Tasks/Semantic_Search_Architecture]].