
21f5dc Claude (MCP) 2026-03-16 17:36:36
[mcp] [design] Add Semantic Search V2 design: section-aware chunking, full chunk text in results, section-level reads
---
category: spec
tags: [design, semantic-search, mcp]
last_updated: 2026-03-16
confidence: high
---

# Semantic Search V2: Section-Aware Chunking and Targeted Reads

This page describes planned improvements to `otterwiki-semantic-search` and `otterwiki-mcp` to make semantic search more useful to agent consumers (Claude.ai, Claude Code). It supersedes the chunking and result format sections of [[Design/Semantic_Search]].

See also: [[Tasks/Semantic_Search_Architecture]] (multi-tenant issues) and the empirical findings on the 3GW wiki at `Meta/Page Size And Search Quality`.

## Problem

The current semantic search pipeline has three compounding weaknesses for agent use:

1. **Chunks straddle topic boundaries.** The chunker splits on paragraph breaks (`\n\n`) with no awareness of markdown headings. A 500-word section under `## Russian Substitution Constraints` gets split into 3-4 chunks, and chunks near section boundaries blend content from adjacent, unrelated sections. The resulting embeddings are diluted and match poorly.

2. **Search results are truncated to 150 characters.** Chunks are ~150 words, but the search API truncates the returned snippet to 150 *characters* — discarding ~80% of the retrieved content. The agent can't evaluate relevance from a sentence fragment, so it almost always follows up with `read_note`, which loads the *entire* page.

3. **`read_note` is all-or-nothing.** Once the agent decides it needs more context than the snippet provides, the only option is loading the full page. A 4,000-word page costs ~5,000 tokens of context window to retrieve a 500-word section.

The net effect: semantic search routes to the right page but the agent pays full context cost anyway. The search step adds latency without saving tokens.

## Constraints

**MiniLM-L6-v2 has a 256 wordpiece-token context window.** Input beyond 256 tokens is silently truncated — it does not contribute to the embedding. At ~1.3 tokens per word, the effective content per chunk is capped at ~190 words. The current `TARGET_WORDS = 150` fits within this limit with room for metadata prefixes.

Any chunk size increase requires switching to a model with a longer context window (e.g., E5-small-v2 or BGE-small at 512 tokens). This design keeps MiniLM and works within the 256-token budget.

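As a back-of-the-envelope check on that budget, the arithmetic can be sketched in a few lines. This is illustrative only: it uses the ~1.3 tokens-per-word estimate above rather than the model's actual wordpiece tokenizer, and `fits_budget`/`PREFIX_TOKENS` are made-up names, not existing code.

```python
# Rough fit check against MiniLM-L6-v2's 256-token window, using the
# ~1.3 wordpiece-tokens-per-word estimate from this page. Illustrative,
# a real check would run the model's own tokenizer.
TOKEN_BUDGET = 256
TOKENS_PER_WORD = 1.3
PREFIX_TOKENS = 20  # worst-case header-prefix cost (see Change 1)

def fits_budget(chunk_text: str) -> bool:
    """True if the chunk plus its prefix should embed without truncation."""
    est = len(chunk_text.split()) * TOKENS_PER_WORD + PREFIX_TOKENS
    return est <= TOKEN_BUDGET

# Effective content cap with a prefix: (256 - 20) / 1.3, about 181 words,
# so the 150-word target leaves headroom.
```

A 150-word chunk estimates to 150 × 1.3 + 20 ≈ 215 tokens, comfortably under the window; a 200-word chunk (≈ 280 tokens) would be silently truncated.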
## Design

### Change 1: Section-aware chunking

**In:** `otterwiki-semantic-search`, `chunking.py`

Replace paragraph-only splitting with heading-aware splitting:

1. Strip YAML frontmatter (unchanged).
2. Parse the markdown into sections by splitting on heading lines (`^#{1,6}\s`). Track a **header stack** — the path of headings from the page title down to the current section (e.g., `["Fertilizer Supply Crisis", "Russian Substitution"]`).
3. Within each section, apply the existing paragraph-accumulation algorithm (target ~150 words, sentence-boundary fallback for oversized paragraphs).
4. **Hard rule:** Never merge content from different sections into the same chunk. A section boundary is always a chunk boundary, even if the preceding chunk is short.
5. **Floor:** If a section is under ~50 words, merge it with the next section at the same or deeper heading level. This prevents stub headings from producing uselessly small chunks.
6. **Overlap:** Continue the 35-word overlap between chunks *within* the same section. Do not carry overlap across section boundaries — the header prefix (below) provides sufficient context bridging.
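Steps 2 and 4 (heading parse plus hard section boundaries) can be sketched as follows. Function and variable names here are illustrative, not the actual `chunking.py` API, and the paragraph accumulation, ~50-word floor, and intra-section overlap steps are elided.

```python
import re

# Illustrative sketch of heading-aware section splitting (steps 2 and 4).
# Not the real chunking.py; paragraph accumulation, the ~50-word floor,
# and intra-section overlap are omitted.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)")

def split_sections(markdown: str, page_title: str):
    """Return (header_path, section_text) pairs; every heading starts a new section."""
    stack = [(0, page_title)]  # (level, heading text); level 0 is the page title
    sections, current = [], []

    def flush():
        if current:
            sections.append(([h for _, h in stack], "\n".join(current).strip()))
            current.clear()

    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            flush()  # a section boundary is always a chunk boundary
            level = len(m.group(1))
            while stack[-1][0] >= level:  # pop back to the parent heading
                stack.pop()
            stack.append((level, m.group(2).strip()))
        else:
            current.append(line)
    flush()
    return sections
```

Each returned `header_path` already contains exactly what the header prefix and `section_path` metadata below need.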

**Header prefix:** Prepend the header path to each chunk's text before embedding:

```
[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...
```

The bracketed prefix costs ~10-20 wordpiece tokens, reducing effective content to ~130 words per chunk. This is acceptable — a topically coherent 130-word chunk with a descriptive prefix produces a sharper embedding than a topically mixed 150-word chunk without one.

**Chunk metadata** gains new fields (`section`, `section_path`):

```json
{
  "page_path": "Trends/Fertilizer Supply Crisis",
  "chunk_index": 2,
  "section": "Russian Substitution",
  "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
  "title": "Fertilizer Supply Crisis",
  "category": "trend",
  "tags": "economics, agriculture"
}
```

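Assembling the embedding input and chunk metadata from a section's header path might look like this sketch. The helper names are hypothetical, not actual indexer code; the field names mirror the JSON above.

```python
# Hypothetical helpers, not the actual indexer code. They show how the
# header path from chunking feeds both the embedding input and metadata.
def embed_input(header_path: list[str], chunk_text: str) -> str:
    """Prepend the bracketed header path, e.g. '[Title > Section] text...'."""
    return f"[{' > '.join(header_path)}] {chunk_text}"

def chunk_metadata(page_path: str, title: str, chunk_index: int,
                   header_path: list[str], **extra) -> dict:
    return {
        "page_path": page_path,
        "chunk_index": chunk_index,
        "section": header_path[-1],        # innermost heading
        "section_path": list(header_path),
        "title": title,
        **extra,                           # category, tags, ...
    }
```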
### Change 2: Return full chunk text in search results

**In:** `otterwiki-semantic-search`, `index.py`

Remove the 150-character snippet truncation. The search API returns the full chunk text (~150 words) as the `snippet` field. This is small enough to be cheap in context and large enough to evaluate relevance without a follow-up read.

**Response format** (new fields listed after the example):

```json
{
  "query": "Russian fertilizer substitution",
  "results": [
    {
      "name": "Trends/Fertilizer Supply Crisis",
      "path": "Trends/Fertilizer Supply Crisis",
      "snippet": "[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...(full ~150 words)...",
      "distance": 0.42,
      "section": "Russian Substitution",
      "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
      "chunk_index": 2,
      "total_chunks": 12,
      "page_word_count": 4188
    }
  ],
  "total": 1
}
```

New fields:
- **`section`** / **`section_path`** — where in the page this chunk lives. Gives the agent a handle for targeted reads (Change 4).
- **`chunk_index`** / **`total_chunks`** — positional context.
- **`page_word_count`** — lets the agent estimate the context cost of a full `read_note` and decide whether it's worth it.

### Change 3: Configurable per-page deduplication

**In:** `otterwiki-semantic-search`, `index.py`

Currently the search deduplicates to one chunk per page. This is too aggressive — if three sections of a page are relevant, the agent sees only one.

Add a `max_chunks_per_page` parameter (default 2, max 5) to the search API:

```
GET /api/v1/semantic-search?q=economic+transmission&n=5&max_chunks_per_page=3
```

The deduplication logic changes from "keep best chunk per page" to "keep best N chunks per page." Total results are still capped at `n`.

Default of 2 balances breadth (seeing multiple pages) against depth (seeing multiple sections of the most relevant page).

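The relaxed deduplication can be sketched as below. This is illustrative rather than the actual `index.py` logic, and it assumes `hits` arrives already sorted by ascending distance.

```python
from collections import defaultdict

# Sketch of "keep best N chunks per page" (illustrative, not index.py).
# `hits` is a list of (page_path, distance) sorted by ascending distance.
def dedupe(hits, n, max_chunks_per_page=2):
    per_page = defaultdict(int)
    kept = []
    for page, distance in hits:
        if per_page[page] < max_chunks_per_page:
            per_page[page] += 1
            kept.append((page, distance))
            if len(kept) == n:  # total results are still capped at n
                break
    return kept
```

Setting `max_chunks_per_page=1` reproduces the current behavior, so the old default is a special case of the new parameter.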
### Change 4: Section-level read via MCP

**In:** `otterwiki-mcp` (or `otterwiki-api` REST plugin)

Add a `section` parameter to the `read_note` MCP tool:

```
read_note(path="Trends/Fertilizer Supply Crisis", section="Russian Substitution")
```

**Behavior:**

1. Load the full page content.
2. Parse markdown headings into a tree.
3. Find the section matching the `section` parameter. Match against heading text, case-insensitive. If ambiguous (multiple headings with the same text), accept a `/`-delimited path: `"Country Dependencies/Pakistan"`.
4. Return everything from the matched heading to the next heading at the same or higher level.
5. Include the heading itself in the returned content.
6. If no match, return an error listing available sections (so the agent can retry with the correct name).

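A sketch of that slicing behavior follows. This is hypothetical code, not the actual `otterwiki-mcp` implementation; it covers only exact-name matching, and the `/`-delimited disambiguation path is omitted.

```python
import re

# Hypothetical sketch of section-level read (steps 1-6, minus the
# '/'-delimited path form). Not the actual otterwiki-mcp code.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)")

def read_section(markdown: str, section: str) -> str:
    lines = markdown.splitlines()
    start = level = None
    for i, line in enumerate(lines):
        m = HEADING_RE.match(line)
        if m and m.group(2).strip().lower() == section.strip().lower():
            start, level = i, len(m.group(1))
            break
    if start is None:  # step 6: list available sections so the agent can retry
        available = [m.group(2) for l in lines if (m := HEADING_RE.match(l))]
        raise ValueError(f"Section not found. Available: {available}")
    end = len(lines)
    for i in range(start + 1, len(lines)):
        m = HEADING_RE.match(lines[i])
        if m and len(m.group(1)) <= level:  # next same-or-higher heading
            end = i
            break
    return "\n".join(lines[start:end]).strip()  # includes the heading itself
```

Note that deeper subheadings inside the matched section are kept, which is what step 4's "same or higher level" boundary implies.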
**Why in the MCP layer, not the REST API:** The REST API serves multiple consumers. Section-level reads are an agent UX optimization — the MCP tool can implement it by fetching the full page from the REST API and slicing locally. This avoids adding complexity to the API surface.

**Alternative considered:** Returning multiple sections in one call (e.g., `sections=["Russian Substitution", "Planting Window"]`). Deferred — the common case is one section per call, and multiple calls are cheap.

## Agent workflow after these changes

1. **`semantic_search("Russian fertilizer constraints")`** — returns 5 results with full chunk text, section paths, and page word counts.
2. Agent reads the snippets. Two chunks from `Fertilizer Supply Crisis` are relevant (sections "Russian Substitution" and "Priority Queue"). One chunk from `P4 Economic Transmission` is relevant.
3. For the 500-word sections where the 150-word snippet isn't enough: **`read_note("Trends/Fertilizer Supply Crisis", section="Russian Substitution")`** — returns ~500 words instead of ~4,200.
4. Agent has the context it needs. Total cost: ~1,500 tokens (5 snippets + 1 section read) vs. ~6,500 tokens today (5 truncated snippets + 1 full page load that's mostly irrelevant).

## Implementation scope

| Change | Repo | Files | Complexity |
|--------|------|-------|------------|
| Section-aware chunking | otterwiki-semantic-search | `chunking.py`, tests | Medium — new heading parser, preserve existing paragraph logic within sections |
| Full chunk text + metadata | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — remove truncation, add fields to response |
| Configurable dedup | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — parameterize existing logic |
| Section-level read | otterwiki-mcp | MCP tool definition, markdown parser | Medium — heading tree parser, error handling for ambiguous matches |

All changes are backward-compatible. Existing consumers see richer results but don't break. The `section` parameter on `read_note` is optional.

**Reindexing:** Changes 1-3 require a full reindex after deployment. The new chunk boundaries and metadata fields are only populated for newly indexed content. `POST /api/v1/reindex` handles this.

ccd74b Claude (MCP) 2026-03-16 18:06:44
[mcp] [design] Add deployment notes from review findings: reindex, sidecar growth, escaping, deploy order
## Deployment notes

**Reindex is mandatory.** Deploy the new `otterwiki-semantic-search` code, then immediately `POST /api/v1/reindex` on each wiki instance. Until reindex completes, old-format chunks (missing `section`, `section_path`, `total_chunks`, `page_word_count`) will return `None` for those fields in search results. The search layer uses `.get()` so it won't crash, but the enriched MCP formatting will degrade silently.
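The degradation mode can be shown in miniature. `old_meta` below is a made-up pre-reindex record, not real data from the index.

```python
# Pre-reindex chunk metadata lacks the new fields; .get() lets result
# formatting degrade to None instead of raising KeyError. Illustrative only.
old_meta = {"page_path": "Trends/Fertilizer Supply Crisis", "chunk_index": 0}

result = {
    "section": old_meta.get("section"),            # None until reindexed
    "section_path": old_meta.get("section_path"),  # None until reindexed
    "chunk_index": old_meta["chunk_index"],        # pre-existing field is safe
}
```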
**FAISS sidecar growth.** New metadata fields add ~160 bytes per chunk to `embeddings.json`. For a 10,000-chunk index, the sidecar grows from ~1.4MB to ~2.9MB. The FAISS backend loads the full sidecar into memory on startup and re-serializes on every upsert. This is acceptable for current corpus sizes but worth monitoring for large multi-tenant deployments.
**Heading content in results.** `section`, `section_path`, and the `[prefix]` in `text`/`snippet` contain raw heading text from wiki pages. Consumers rendering these fields as HTML must escape them. The API returns JSON (`Content-Type: application/json`), so the API layer itself is safe.
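For example, a consumer rendering these fields into HTML could escape with the standard library. This is illustrative, not code from either repo.

```python
import html

# Heading text is raw wiki content; escape before interpolating into HTML.
section = 'Russian <img src=x onerror=alert(1)> Substitution'
safe = html.escape(section)  # the markup becomes inert text in HTML
```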
**Deploy order.** `otterwiki-semantic-search` must deploy and reindex before `otterwiki-mcp` changes are useful. The MCP `section` parameter on `read_note` is independent (parses content client-side), but `format_semantic_results` expects the new result fields which only appear after reindex.
## What this design does NOT address

- **Embedding model upgrade.** MiniLM-L6-v2's 256-token window is a real constraint but adequate for ~150-word chunks with header prefixes. A model upgrade (to 512-token context) would allow larger chunks and is worth evaluating separately.
- **Multi-tenant indexing.** Tracked in [[Tasks/Semantic_Search_Architecture]] and [[Tasks/Semantic_Search_Multi_Tenant]]. Orthogonal to this work.
- **In-process embedding risks.** The ONNX model in the gunicorn worker and daemon thread shutdown are operational concerns, not search quality concerns. Tracked in [[Tasks/Semantic_Search_Architecture]].