Commit 21f5dc
2026-03-16 17:36:36 Claude (MCP): [mcp] [design] Add Semantic Search V2 design: section-aware chunking, full chunk text in results, section-level reads
| /dev/null .. Design/Semantic_Search_V2.md |
| @@ -0,0 +1,165 @@ |
| + | --- |
| + | category: spec |
| + | tags: [design, semantic-search, mcp] |
| + | last_updated: 2026-03-16 |
| + | confidence: high |
| + | --- |
| + | |
| + | # Semantic Search V2: Section-Aware Chunking and Targeted Reads |
| + | |
| + | This page describes planned improvements to `otterwiki-semantic-search` and `otterwiki-mcp` to make semantic search more useful to agent consumers (Claude.ai, Claude Code). It supersedes the chunking and result format sections of [[Design/Semantic_Search]]. |
| + | |
| + | See also: [[Tasks/Semantic_Search_Architecture]] (multi-tenant issues) and the empirical findings on the 3GW wiki at `Meta/Page Size And Search Quality`. |
| + | |
| + | ## Problem |
| + | |
| + | The current semantic search pipeline has three compounding weaknesses for agent use: |
| + | |
| + | 1. **Chunks straddle topic boundaries.** The chunker splits on paragraph breaks (`\n\n`) with no awareness of markdown headings. A 500-word section under `## Russian Substitution Constraints` gets split into 3-4 chunks, and chunks near section boundaries blend content from adjacent, unrelated sections. The resulting embeddings are diluted and match poorly. |
| + | |
| + | 2. **Search results are truncated to 150 characters.** Chunks are ~150 words, but the search API truncates the returned snippet to 150 *characters* — discarding ~80% of the retrieved content. The agent can't evaluate relevance from a sentence fragment, so it almost always follows up with `read_note`, which loads the *entire* page. |
| + | |
| + | 3. **`read_note` is all-or-nothing.** Once the agent decides it needs more context than the snippet provides, the only option is loading the full page. A 4,000-word page costs ~5,000 tokens of context window to retrieve a 500-word section. |
| + | |
| + | The net effect: semantic search routes to the right page but the agent pays full context cost anyway. The search step adds latency without saving tokens. |
| + | |
| + | ## Constraints |
| + | |
| + | **MiniLM-L6-v2 has a 256-wordpiece-token context window.** Input beyond 256 tokens is silently truncated — it does not contribute to the embedding. At ~1.3 wordpiece tokens per word, effective content per chunk is capped at ~190 words. The current `TARGET_WORDS = 150` fits within this limit with room to spare for metadata prefixes. |
| + | |
| + | Any chunk size increase requires switching to a model with a longer context window (e.g., E5-small-v2 or BGE-small at 512 tokens). This design keeps MiniLM and works within the 256-token budget. |
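The budget arithmetic above can be sketched as a quick helper. The names and the flat tokens-per-word factor are illustrative, not part of the actual codebase:

```python
# Rough feasibility check against MiniLM-L6-v2's 256-wordpiece window,
# using the ~1.3 tokens-per-word heuristic from this page. Illustrative
# only; the real pipeline relies on TARGET_WORDS, not a runtime check.
MODEL_TOKEN_LIMIT = 256
TOKENS_PER_WORD = 1.3

def fits_model_window(text: str, reserved_tokens: int = 0) -> bool:
    """Estimate whether `text`, plus `reserved_tokens` of metadata
    prefix, stays within the embedding model's context window."""
    estimated = int(len(text.split()) * TOKENS_PER_WORD) + reserved_tokens
    return estimated <= MODEL_TOKEN_LIMIT
```

At `TARGET_WORDS = 150` plus a 20-token prefix the estimate is 215 tokens, comfortably inside the 256 limit, which is the headroom the paragraph above refers to.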
| + | |
| + | ## Design |
| + | |
| + | ### Change 1: Section-aware chunking |
| + | |
| + | **In:** `otterwiki-semantic-search`, `chunking.py` |
| + | |
| + | Replace paragraph-only splitting with heading-aware splitting: |
| + | |
| + | 1. Strip YAML frontmatter (unchanged). |
| + | 2. Parse the markdown into sections by splitting on heading lines (`^#{1,6}\s`). Track a **header stack** — the path of headings from the page title down to the current section (e.g., `["Fertilizer Supply Crisis", "Russian Substitution"]`). |
| + | 3. Within each section, apply the existing paragraph-accumulation algorithm (target ~150 words, sentence-boundary fallback for oversized paragraphs). |
| + | 4. **Hard rule:** Never merge content from different sections into the same chunk. A section boundary is always a chunk boundary, even if the preceding chunk is short. |
| + | 5. **Floor:** If a section is under ~50 words, merge it with the next section at the same or deeper heading level. This prevents stub headings from producing uselessly small chunks. |
| + | 6. **Overlap:** Continue the 35-word overlap between chunks *within* the same section. Do not carry overlap across section boundaries — the header prefix (below) provides sufficient context bridging. |
| + | |
| + | **Header prefix:** Prepend the header path to each chunk's text before embedding: |
| + | |
| + | ``` |
| + | [Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute... |
| + | ``` |
| + | |
| + | The bracketed prefix costs ~10-20 wordpiece tokens, reducing effective content to ~130 words per chunk. This is acceptable — a topically coherent 130-word chunk with a descriptive prefix produces a sharper embedding than a topically mixed 150-word chunk without one. |
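A minimal sketch of the section split and header prefix (only these two steps; the short-section merge, overlap, and paragraph-accumulation rules are omitted, and all names are illustrative rather than the actual `chunking.py` API):

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def split_sections(markdown: str):
    """Split markdown into (header_path, body) pairs, maintaining the
    header stack from the page title down to the current section."""
    sections = []
    stack = []        # list of (level, heading_text)
    body_lines = []

    def flush():
        body = "\n".join(body_lines).strip()
        body_lines.clear()
        if body:
            sections.append(([h for _, h in stack], body))

    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # Pop headings at the same or deeper level before pushing.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, m.group(2).strip()))
        else:
            body_lines.append(line)
    flush()
    return sections

def prefixed_chunk(header_path, chunk_text):
    """Prepend the bracketed header path used for embedding."""
    return f"[{' > '.join(header_path)}] {chunk_text}"
```

Because a heading always triggers a flush, the hard rule above (a section boundary is always a chunk boundary) falls out of the structure for free.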
| + | |
| + | **Chunk metadata** gains a new field: |
| + | |
| + | ```json |
| + | { |
| + | "page_path": "Trends/Fertilizer Supply Crisis", |
| + | "chunk_index": 2, |
| + | "section": "Russian Substitution", |
| + | "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"], |
| + | "title": "Fertilizer Supply Crisis", |
| + | "category": "trend", |
| + | "tags": "economics, agriculture" |
| + | } |
| + | ``` |
| + | |
| + | ### Change 2: Return full chunk text in search results |
| + | |
| + | **In:** `otterwiki-semantic-search`, `index.py` |
| + | |
| + | Remove the 150-character snippet truncation. The search API returns the full chunk text (~150 words) as the `snippet` field. This is small enough to be cheap in context and large enough to evaluate relevance without a follow-up read. |
| + | |
| + | **Response format** (additions in bold): |
| + | |
| + | ```json |
| + | { |
| + | "query": "Russian fertilizer substitution", |
| + | "results": [ |
| + | { |
| + | "name": "Trends/Fertilizer Supply Crisis", |
| + | "path": "Trends/Fertilizer Supply Crisis", |
| + | "snippet": "[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...(full ~150 words)...", |
| + | "distance": 0.42, |
| + | "section": "Russian Substitution", |
| + | "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"], |
| + | "chunk_index": 2, |
| + | "total_chunks": 12, |
| + | "page_word_count": 4188 |
| + | } |
| + | ], |
| + | "total": 1 |
| + | } |
| + | ``` |
| + | |
| + | New fields: |
| + | - **`section`** / **`section_path`** — where in the page this chunk lives. Gives the agent a handle for targeted reads (Change 4). |
| + | - **`chunk_index`** / **`total_chunks`** — positional context. |
| + | - **`page_word_count`** — lets the agent estimate the context cost of a full `read_note` and decide whether it's worth it. |
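As an illustration of how an agent-side consumer might use `page_word_count`, a minimal cost heuristic (the function names and the 2,000-token budget are invented for this example; the ~1.25 tokens-per-word factor matches the 4,000-words to ~5,000-tokens estimate above):

```python
# Hypothetical agent-side heuristic, not part of any shipped tool:
# decide whether to fetch a full page or just the matched section.
def estimated_read_tokens(word_count: int) -> int:
    # ~1.25 tokens per word, per the estimate in this design.
    return int(word_count * 1.25)

def prefer_section_read(result: dict, budget_tokens: int = 2000) -> bool:
    """True when loading the full page would exceed the token budget
    and the search result names a section we could fetch instead."""
    return (estimated_read_tokens(result["page_word_count"]) > budget_tokens
            and bool(result.get("section")))
```

For the example result above (4,188 words, section "Russian Substitution"), the full-page estimate is ~5,235 tokens, so a section read wins.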
| + | |
| + | ### Change 3: Configurable per-page deduplication |
| + | |
| + | **In:** `otterwiki-semantic-search`, `index.py` |
| + | |
| + | Currently the search deduplicates to one chunk per page. This is too aggressive — if three sections of a page are relevant, the agent sees only one. |
| + | |
| + | Add a `max_chunks_per_page` parameter (default 2, max 5) to the search API: |
| + | |
| + | ``` |
| + | GET /api/v1/semantic-search?q=economic+transmission&n=5&max_chunks_per_page=3 |
| + | ``` |
| + | |
| + | The deduplication logic changes from "keep best chunk per page" to "keep best N chunks per page." Total results are still capped at `n`. |
| + | |
| + | Default of 2 balances breadth (seeing multiple pages) against depth (seeing multiple sections of the most relevant page). |
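A minimal sketch of the revised deduplication, assuming results carry `path` and `distance` fields with lower distance meaning a better match (not the actual `index.py` code):

```python
from collections import defaultdict

def dedupe_results(hits, n=5, max_chunks_per_page=2):
    """Keep the best `max_chunks_per_page` chunks per page, then cap
    total results at `n`. Sorting defensively by distance keeps the
    logic correct even if `hits` arrives unordered."""
    per_page = defaultdict(int)
    kept = []
    for hit in sorted(hits, key=lambda h: h["distance"]):
        if per_page[hit["path"]] < max_chunks_per_page:
            per_page[hit["path"]] += 1
            kept.append(hit)
        if len(kept) == n:
            break
    return kept
```

With `max_chunks_per_page=1` this degenerates to the current one-chunk-per-page behavior, which keeps the change backward-compatible.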
| + | |
| + | ### Change 4: Section-level read via MCP |
| + | |
| + | **In:** `otterwiki-mcp` (or `otterwiki-api` REST plugin) |
| + | |
| + | Add a `section` parameter to the `read_note` MCP tool: |
| + | |
| + | ``` |
| + | read_note(path="Trends/Fertilizer Supply Crisis", section="Russian Substitution") |
| + | ``` |
| + | |
| + | **Behavior:** |
| + | |
| + | 1. Load the full page content. |
| + | 2. Parse markdown headings into a tree. |
| + | 3. Find the section matching the `section` parameter. Match against heading text, case-insensitive. If ambiguous (multiple headings with the same text), accept a `/`-delimited path: `"Country Dependencies/Pakistan"`. |
| + | 4. Return everything from the matched heading to the next heading at the same or higher level. |
| + | 5. Include the heading itself in the returned content. |
| + | 6. If no match, return an error listing available sections (so the agent can retry with the correct name). |
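The matching and slicing behavior above can be sketched as follows. This simplified version matches single heading names only and omits the `/`-delimited path disambiguation:

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def read_section(page_text: str, section: str) -> str:
    """Return content from the heading matching `section`
    (case-insensitive) through to the next heading at the same or
    higher level, including the heading line itself. Raises ValueError
    listing available sections when nothing matches."""
    lines = page_text.splitlines()
    headings = [(i, len(m.group(1)), m.group(2).strip())
                for i, line in enumerate(lines)
                if (m := HEADING_RE.match(line))]
    for idx, (start, level, text) in enumerate(headings):
        if text.lower() == section.lower():
            end = len(lines)
            for later_start, later_level, _ in headings[idx + 1:]:
                if later_level <= level:
                    end = later_start
                    break
            return "\n".join(lines[start:end]).rstrip()
    available = ", ".join(t for _, _, t in headings)
    raise ValueError(f"No section {section!r}; available: {available}")
```

Note that deeper subsections (e.g. an `###` under a matched `##`) are included in the returned slice, which matches step 4 above.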
| + | |
| + | **Why in the MCP layer, not the REST API:** The REST API serves multiple consumers. Section-level reads are an agent UX optimization — the MCP tool can implement it by fetching the full page from the REST API and slicing locally. This avoids adding complexity to the API surface. |
| + | |
| + | **Alternative considered:** Returning multiple sections in one call (e.g., `sections=["Russian Substitution", "Planting Window"]`). Deferred — the common case is one section per call, and multiple calls are cheap. |
| + | |
| + | ## Agent workflow after these changes |
| + | |
| + | 1. **`semantic_search("Russian fertilizer constraints")`** — returns 5 results with full chunk text, section paths, and page word counts. |
| + | 2. Agent reads the snippets. Two chunks from `Fertilizer Supply Crisis` are relevant (sections "Russian Substitution" and "Priority Queue"). One chunk from `P4 Economic Transmission` is relevant. |
| + | 3. For the 500-word sections where the 150-word snippet isn't enough: **`read_note("Trends/Fertilizer Supply Crisis", section="Russian Substitution")`** — returns ~500 words instead of ~4,200. |
| + | 4. Agent has the context it needs. Total cost: ~1,500 tokens (5 snippets + 1 section read) vs. ~6,500 tokens today (5 truncated snippets + 1 full page load that's mostly irrelevant). |
| + | |
| + | ## Implementation scope |
| + | |
| + | | Change | Repo | Files | Complexity | |
| + | |--------|------|-------|------------| |
| + | | Section-aware chunking | otterwiki-semantic-search | `chunking.py`, tests | Medium — new heading parser, preserve existing paragraph logic within sections | |
| + | | Full chunk text + metadata | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — remove truncation, add fields to response | |
| + | | Configurable dedup | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — parameterize existing logic | |
| + | | Section-level read | otterwiki-mcp | MCP tool definition, markdown parser | Medium — heading tree parser, error handling for ambiguous matches | |
| + | |
| + | All changes are backward-compatible. Existing consumers see richer results but don't break. The `section` parameter on `read_note` is optional. |
| + | |
| + | **Reindexing:** Changes 1-3 require a full reindex after deployment. The new chunk boundaries and metadata fields are only populated for newly indexed content. `POST /api/v1/reindex` handles this. |
| + | |
| + | ## What this design does NOT address |
| + | |
| + | - **Embedding model upgrade.** MiniLM-L6-v2's 256-token window is a real constraint but adequate for ~150-word chunks with header prefixes. A model upgrade (to 512-token context) would allow larger chunks and is worth evaluating separately. |
| + | - **Multi-tenant indexing.** Tracked in [[Tasks/Semantic_Search_Architecture]] and [[Tasks/Semantic_Search_Multi_Tenant]]. Orthogonal to this work. |
| + | - **In-process embedding risks.** The ONNX model in the gunicorn worker and daemon thread shutdown are operational concerns, not search quality concerns. Tracked in [[Tasks/Semantic_Search_Architecture]]. |