Properties
category: spec
tags: [design, semantic-search, mcp]
last_updated: 2026-03-16
confidence: high
Semantic Search V2: Section-Aware Chunking and Targeted Reads
This page describes planned improvements to otterwiki-semantic-search and otterwiki-mcp to make semantic search more useful to agent consumers (Claude.ai, Claude Code). It supersedes the chunking and result format sections of Design/Semantic_Search.
See also: Tasks/Semantic_Search_Architecture (multi-tenant issues), the empirical findings on the 3GW wiki at Meta/Page Size And Search Quality.
Problem
The current semantic search pipeline has three compounding weaknesses for agent use:
1. **Chunks straddle topic boundaries.** The chunker splits on paragraph breaks (`\n\n`) with no awareness of markdown headings. A 500-word section under `## Russian Substitution Constraints` gets split into 3-4 chunks, and chunks near section boundaries blend content from adjacent, unrelated sections. The resulting embeddings are diluted and match poorly.
2. **Search results are truncated to 150 characters.** Chunks are ~150 words, but the search API truncates the returned snippet to 150 characters, discarding ~80% of the retrieved content. The agent can't evaluate relevance from a sentence fragment, so it almost always follows up with `read_note`, which loads the entire page.
3. **`read_note` is all-or-nothing.** Once the agent decides it needs more context than the snippet provides, the only option is loading the full page. A 4,000-word page costs ~5,000 tokens of context window to retrieve a 500-word section.
The net effect: semantic search routes to the right page but the agent pays full context cost anyway. The search step adds latency without saving tokens.
Constraints
MiniLM-L6-v2 has a 256 wordpiece token context window. Input beyond 256 tokens is silently truncated — it does not contribute to the embedding. At ~1.3 tokens per word, this means effective content per chunk is capped at ~190 words. The current TARGET_WORDS = 150 fits within this limit with room for metadata prefixes.
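The budget arithmetic above can be sanity-checked with a quick sketch (the 1.3 tokens-per-word ratio is this page's rough estimate, not a measured constant):

```python
# Rough token-budget arithmetic for MiniLM-L6-v2 (256 wordpiece tokens).
# The 1.3 tokens/word ratio is this page's estimate, not a measured value.
MODEL_TOKEN_LIMIT = 256
TOKENS_PER_WORD = 1.3
TARGET_WORDS = 150

max_words = int(MODEL_TOKEN_LIMIT / TOKENS_PER_WORD)   # hard cap on words that embed at all
chunk_tokens = int(TARGET_WORDS * TOKENS_PER_WORD)     # tokens consumed by a target-size chunk
headroom = MODEL_TOKEN_LIMIT - chunk_tokens            # budget left for metadata prefixes

print(max_words, chunk_tokens, headroom)  # 196 195 61
```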
Any chunk size increase requires switching to a model with a longer context window (e.g., E5-small-v2 or BGE-small at 512 tokens). This design keeps MiniLM and works within the 256-token budget.
Design
Change 1: Section-aware chunking
In: otterwiki-semantic-search, chunking.py
Replace paragraph-only splitting with heading-aware splitting:
- Strip YAML frontmatter (unchanged).
- Parse the markdown into sections by splitting on heading lines (`^#{1,6}\s`). Track a header stack: the path of headings from the page title down to the current section (e.g., `["Fertilizer Supply Crisis", "Russian Substitution"]`).
- Within each section, apply the existing paragraph-accumulation algorithm (target ~150 words, sentence-boundary fallback for oversized paragraphs).
- Hard rule: Never merge content from different sections into the same chunk. A section boundary is always a chunk boundary, even if the preceding chunk is short.
- Floor: If a section is under ~50 words, merge it with the next section at the same or deeper heading level. This prevents stub headings from producing uselessly small chunks.
- Overlap: Continue the 35-word overlap between chunks within the same section. Do not carry overlap across section boundaries — the header prefix (below) provides sufficient context bridging.
Header prefix: Prepend the header path to each chunk's text before embedding:
[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...
The bracketed prefix costs ~10-20 wordpiece tokens, reducing effective content to ~130 words per chunk. This is acceptable — a topically coherent 130-word chunk with a descriptive prefix produces a sharper embedding than a topically mixed 150-word chunk without one.
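A minimal sketch of the heading-aware split and header-path prefix, assuming a simple regex-based parser. Names and details are illustrative, not the actual chunking.py code; the ~50-word floor merge, overlap, and paragraph accumulation are omitted:

```python
import re

# Illustrative heading-aware splitter; not the actual chunking.py implementation.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)

def split_sections(markdown: str, page_title: str):
    """Return (header_path, section_text) pairs; a heading always starts a new section."""
    matches = list(HEADING_RE.finditer(markdown))
    sections = []
    stack = [(0, page_title)]  # (heading level, heading text); level 0 = page title

    # Preamble before the first heading belongs to the page title itself.
    preamble = markdown[: matches[0].start()] if matches else markdown
    if preamble.strip():
        sections.append(([page_title], preamble.strip()))

    for i, m in enumerate(matches):
        level, text = len(m.group(1)), m.group(2).strip()
        while stack[-1][0] >= level:  # pop siblings/deeper headings to maintain the path
            stack.pop()
        stack.append((level, text))
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[m.end(): end].strip()
        if body:  # a section boundary is always a chunk boundary
            sections.append(([t for _, t in stack], body))
    return sections

def prefix_chunk(header_path, chunk_text):
    """Prepend the bracketed header path before embedding."""
    return f"[{' > '.join(header_path)}] {chunk_text}"
```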
Chunk metadata gains a new field:
{
  "page_path": "Trends/Fertilizer Supply Crisis",
  "chunk_index": 2,
  "section": "Russian Substitution",
  "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
  "title": "Fertilizer Supply Crisis",
  "category": "trend",
  "tags": "economics, agriculture"
}
Change 2: Return full chunk text in search results
In: otterwiki-semantic-search, index.py
Remove the 150-character snippet truncation. The search API returns the full chunk text (~150 words) as the snippet field. This is small enough to be cheap in context and large enough to evaluate relevance without a follow-up read.
Response format (new fields are listed below):

{
  "query": "Russian fertilizer substitution",
  "results": [
    {
      "name": "Trends/Fertilizer Supply Crisis",
      "path": "Trends/Fertilizer Supply Crisis",
      "snippet": "[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...(full ~150 words)...",
      "distance": 0.42,
      "section": "Russian Substitution",
      "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
      "chunk_index": 2,
      "total_chunks": 12,
      "page_word_count": 4188
    }
  ],
  "total": 1
}
New fields:
- `section` / `section_path`: where in the page this chunk lives. Gives the agent a handle for targeted reads (Change 4).
- `chunk_index` / `total_chunks`: positional context.
- `page_word_count`: lets the agent estimate the context cost of a full `read_note` and decide whether it's worth it.
Change 3: Configurable per-page deduplication
In: otterwiki-semantic-search, index.py
Currently the search deduplicates to one chunk per page. This is too aggressive — if three sections of a page are relevant, the agent sees only one.
Add a max_chunks_per_page parameter (default 2, max 5) to the search API:
GET /api/v1/semantic-search?q=economic+transmission&n=5&max_chunks_per_page=3
The deduplication logic changes from "keep best chunk per page" to "keep best N chunks per page." Total results are still capped at n.
Default of 2 balances breadth (seeing multiple pages) against depth (seeing multiple sections of the most relevant page).
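A sketch of what the relaxed deduplication could look like (function name and result shape are illustrative, not the actual index.py code):

```python
from collections import defaultdict

def dedup_results(ranked_chunks, n, max_chunks_per_page=2):
    """Keep the best max_chunks_per_page chunks per page; cap total results at n.

    ranked_chunks: result dicts sorted best-first by distance (shape is assumed).
    """
    per_page = defaultdict(int)
    kept = []
    for chunk in ranked_chunks:
        if per_page[chunk["page_path"]] >= max_chunks_per_page:
            continue  # this page already contributed its quota
        per_page[chunk["page_path"]] += 1
        kept.append(chunk)
        if len(kept) == n:
            break
    return kept
```

With the default of 2, a page with three strong chunks contributes its two best, and the remaining slots go to other pages.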
Change 4: Section-level read via MCP
In: otterwiki-mcp (or otterwiki-api REST plugin)
Add a section parameter to the read_note MCP tool:
read_note(path="Trends/Fertilizer Supply Crisis", section="Russian Substitution")
Behavior:
- Load the full page content.
- Parse markdown headings into a tree.
- Find the section matching the `section` parameter. Match against heading text, case-insensitive. If ambiguous (multiple headings with the same text), accept a `/`-delimited path: `"Country Dependencies/Pakistan"`.
- Return everything from the matched heading to the next heading at the same or higher level.
- Include the heading itself in the returned content.
- If no match, return an error listing available sections (so the agent can retry with the correct name).
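The slicing behavior above could be sketched like this on the MCP side (illustrative only; the `/`-delimited disambiguation path is omitted for brevity):

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)

def read_section(page_text: str, section: str) -> str:
    """Return the named section (heading included), sliced up to the next heading
    at the same or higher level; raise with available names on no match."""
    matches = list(HEADING_RE.finditer(page_text))
    names = [m.group(2).strip() for m in matches]
    for i, m in enumerate(matches):
        if m.group(2).strip().lower() != section.lower():
            continue  # case-insensitive match on heading text
        level = len(m.group(1))
        end = len(page_text)
        for nxt in matches[i + 1:]:
            if len(nxt.group(1)) <= level:  # same-or-higher heading ends the section
                end = nxt.start()
                break
        return page_text[m.start(): end].rstrip()
    # No match: surface the available sections so the agent can retry.
    raise ValueError(f"Section {section!r} not found. Available: {names}")
```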
Why in the MCP layer, not the REST API: The REST API serves multiple consumers. Section-level reads are an agent UX optimization — the MCP tool can implement it by fetching the full page from the REST API and slicing locally. This avoids adding complexity to the API surface.
Alternative considered: Returning multiple sections in one call (e.g., sections=["Russian Substitution", "Planting Window"]). Deferred — the common case is one section per call, and multiple calls are cheap.
Agent workflow after these changes
1. `semantic_search("Russian fertilizer constraints")` returns 5 results with full chunk text, section paths, and page word counts.
2. Agent reads the snippets. Two chunks from `Fertilizer Supply Crisis` are relevant (sections "Russian Substitution" and "Priority Queue"). One chunk from `P4 Economic Transmission` is relevant.
3. For the 500-word sections where the 150-word snippet isn't enough: `read_note("Trends/Fertilizer Supply Crisis", section="Russian Substitution")` returns ~500 words instead of ~4,200.
4. Agent has the context it needs. Total cost: ~1,500 tokens (5 snippets + 1 section read) vs. ~6,500 tokens today (5 truncated snippets + 1 full page load that's mostly irrelevant).
Implementation scope
| Change | Repo | Files | Complexity |
|---|---|---|---|
| Section-aware chunking | otterwiki-semantic-search | chunking.py, tests | Medium: new heading parser, preserve existing paragraph logic within sections |
| Full chunk text + metadata | otterwiki-semantic-search | index.py, routes.py, tests | Low: remove truncation, add fields to response |
| Configurable dedup | otterwiki-semantic-search | index.py, routes.py, tests | Low: parameterize existing logic |
| Section-level read | otterwiki-mcp | MCP tool definition, markdown parser | Medium: heading tree parser, error handling for ambiguous matches |
All changes are backward-compatible. Existing consumers see richer results but don't break. The section parameter on read_note is optional.
Reindexing: Changes 1-3 require a full reindex after deployment. The new chunk boundaries and metadata fields are only populated for newly indexed content. POST /api/v1/reindex handles this.
Deployment notes
Reindex is mandatory. Deploy the new otterwiki-semantic-search code, then immediately POST /api/v1/reindex on each wiki instance. Until reindex completes, old-format chunks (missing section, section_path, total_chunks, page_word_count) will return None for those fields in search results. The search layer uses .get() so it won't crash, but the enriched MCP formatting will degrade silently.
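A defensive reader along these lines (illustrative; any field name not documented above, such as the stored `text` key, is an assumption) keeps pre-reindex chunks from breaking the MCP layer:

```python
def enrich_result(raw: dict) -> dict:
    """Pass new metadata fields through, defaulting to None for old-format chunks.

    The 'text' key for the stored chunk body is a hypothetical name.
    """
    return {
        "name": raw["page_path"],
        "snippet": raw["text"],
        "section": raw.get("section"),            # None until reindexed
        "section_path": raw.get("section_path"),
        "chunk_index": raw.get("chunk_index"),
        "total_chunks": raw.get("total_chunks"),
        "page_word_count": raw.get("page_word_count"),
    }
```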
FAISS sidecar growth. New metadata fields add ~160 bytes per chunk to embeddings.json. For a 10,000-chunk index, the sidecar grows from ~1.4MB to ~2.9MB. The FAISS backend loads the full sidecar into memory on startup and re-serializes on every upsert. This is acceptable for current corpus sizes but worth monitoring for large multi-tenant deployments.
Heading content in results. section, section_path, and the [prefix] in text/snippet contain raw heading text from wiki pages. Consumers rendering these fields as HTML must escape them. The API returns JSON (Content-Type: application/json), so the API layer itself is safe.
Deploy order. otterwiki-semantic-search must deploy and reindex before otterwiki-mcp changes are useful. The MCP section parameter on read_note is independent (parses content client-side), but format_semantic_results expects the new result fields which only appear after reindex.
What this design does NOT address
- Embedding model upgrade. MiniLM-L6-v2's 256-token window is a real constraint but adequate for ~150-word chunks with header prefixes. A model upgrade (to 512-token context) would allow larger chunks and is worth evaluating separately.
- Multi-tenant indexing. Tracked in Tasks/Semantic_Search_Architecture and Tasks/Semantic_Search_Multi_Tenant. Orthogonal to this work.
- In-process embedding risks. The ONNX model in the gunicorn worker and daemon thread shutdown are operational concerns, not search quality concerns. Tracked in Tasks/Semantic_Search_Architecture.