---
category: spec
tags: [design, semantic-search, mcp]
last_updated: 2026-03-16
confidence: high
---

# Semantic Search V2: Section-Aware Chunking and Targeted Reads

This page describes planned improvements to `otterwiki-semantic-search` and `otterwiki-mcp` to make semantic search more useful to agent consumers (Claude.ai, Claude Code). It supersedes the chunking and result format sections of [[Design/Semantic_Search]].

See also: [[Tasks/Semantic_Search_Architecture]] (multi-tenant issues) and the empirical findings on the 3GW wiki at `Meta/Page Size And Search Quality`.

## Problem

The current semantic search pipeline has three compounding weaknesses for agent use:

1. **Chunks straddle topic boundaries.** The chunker splits on paragraph breaks (`\n\n`) with no awareness of markdown headings. A 500-word section under `## Russian Substitution Constraints` gets split into 3-4 chunks, and chunks near section boundaries blend content from adjacent, unrelated sections. The resulting embeddings are diluted and match poorly.

2. **Search results are truncated to 150 characters.** Chunks are ~150 words, but the search API truncates the returned snippet to 150 *characters* — discarding ~80% of the retrieved content. The agent can't evaluate relevance from a sentence fragment, so it almost always follows up with `read_note`, which loads the *entire* page.

3. **`read_note` is all-or-nothing.** Once the agent decides it needs more context than the snippet provides, the only option is loading the full page. A 4,000-word page costs ~5,000 tokens of context window to retrieve a 500-word section.

The net effect: semantic search routes to the right page but the agent pays full context cost anyway. The search step adds latency without saving tokens.

## Constraints

**MiniLM-L6-v2 has a context window of 256 wordpiece tokens.** Input beyond 256 tokens is silently truncated — it does not contribute to the embedding. At ~1.3 tokens per word, this caps effective content per chunk at ~190 words. The current `TARGET_WORDS = 150` fits within this limit with room for metadata prefixes.

Any chunk size increase requires switching to a model with a longer context window (e.g., E5-small-v2 or BGE-small at 512 tokens). This design keeps MiniLM and works within the 256-token budget.
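The token budget can be sanity-checked with simple arithmetic. The constants below are this page's estimates (the ~1.3 tokens/word ratio and a ~20-token worst-case header prefix from Change 1), not measured values:

```python
# Wordpiece budget for MiniLM-L6-v2's 256-token window.
# TOKENS_PER_WORD and PREFIX_TOKENS are this document's estimates.
WINDOW_TOKENS = 256
TOKENS_PER_WORD = 1.3
PREFIX_TOKENS = 20  # worst-case bracketed header prefix

max_content_words = int((WINDOW_TOKENS - PREFIX_TOKENS) / TOKENS_PER_WORD)
headroom = max_content_words - 150  # vs. TARGET_WORDS = 150
# max_content_words ≈ 181, so TARGET_WORDS = 150 leaves ~30 words of headroom
```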

## Design

### Change 1: Section-aware chunking

**In:** `otterwiki-semantic-search`, `chunking.py`

Replace paragraph-only splitting with heading-aware splitting:

1. Strip YAML frontmatter (unchanged).
2. Parse the markdown into sections by splitting on heading lines (`^#{1,6}\s`). Track a **header stack** — the path of headings from the page title down to the current section (e.g., `["Fertilizer Supply Crisis", "Russian Substitution"]`).
3. Within each section, apply the existing paragraph-accumulation algorithm (target ~150 words, sentence-boundary fallback for oversized paragraphs).
4. **Hard rule:** Never merge content from different sections into the same chunk. A section boundary is always a chunk boundary, even if the preceding chunk is short.
5. **Floor:** If a section is under ~50 words, merge it with the next section at the same or deeper heading level. This prevents stub headings from producing uselessly small chunks.
6. **Overlap:** Continue the 35-word overlap between chunks *within* the same section. Do not carry overlap across section boundaries — the header prefix (below) provides sufficient context bridging.

**Header prefix:** Prepend the header path to each chunk's text before embedding:

```
[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...
```

The bracketed prefix costs ~10-20 wordpiece tokens, reducing effective content to ~130 words per chunk. This is acceptable — a topically coherent 130-word chunk with a descriptive prefix produces a sharper embedding than a topically mixed 150-word chunk without one.
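Building the prefix is a one-liner; `with_header_prefix` is a hypothetical name for illustration:

```python
def with_header_prefix(header_path: list[str], chunk_text: str) -> str:
    """Prepend the bracketed header path used at embedding time,
    e.g. '[Fertilizer Supply Crisis > Russian Substitution] Russia...'."""
    return f"[{' > '.join(header_path)}] {chunk_text}"
```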

**Chunk metadata** gains a new field:

```json
{
  "page_path": "Trends/Fertilizer Supply Crisis",
  "chunk_index": 2,
  "section": "Russian Substitution",
  "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
  "title": "Fertilizer Supply Crisis",
  "category": "trend",
  "tags": "economics, agriculture"
}
```

### Change 2: Return full chunk text in search results

**In:** `otterwiki-semantic-search`, `index.py`

Remove the 150-character snippet truncation. The search API returns the full chunk text (~150 words) as the `snippet` field. This is small enough to be cheap in context and large enough to evaluate relevance without a follow-up read.

**Response format** (additions in bold):

```json
{
  "query": "Russian fertilizer substitution",
  "results": [
    {
      "name": "Trends/Fertilizer Supply Crisis",
      "path": "Trends/Fertilizer Supply Crisis",
      "snippet": "[Fertilizer Supply Crisis > Russian Substitution] Russia cannot substitute...(full ~150 words)...",
      "distance": 0.42,
      "section": "Russian Substitution",
      "section_path": ["Fertilizer Supply Crisis", "Russian Substitution"],
      "chunk_index": 2,
      "total_chunks": 12,
      "page_word_count": 4188
    }
  ],
  "total": 1
}
```

New fields:
- **`section`** / **`section_path`** — where in the page this chunk lives. Gives the agent a handle for targeted reads (Change 4).
- **`chunk_index`** / **`total_chunks`** — positional context.
- **`page_word_count`** — lets the agent estimate the context cost of a full `read_note` and decide whether it's worth it.

### Change 3: Configurable per-page deduplication

**In:** `otterwiki-semantic-search`, `index.py`

Currently the search deduplicates to one chunk per page. This is too aggressive — if three sections of a page are relevant, the agent sees only one.

Add a `max_chunks_per_page` parameter (default 2, max 5) to the search API:

```
GET /api/v1/semantic-search?q=economic+transmission&n=5&max_chunks_per_page=3
```

The deduplication logic changes from "keep best chunk per page" to "keep best N chunks per page." Total results are still capped at `n`.

Default of 2 balances breadth (seeing multiple pages) against depth (seeing multiple sections of the most relevant page).
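The "best N chunks per page" logic amounts to a counting pass over distance-sorted results. A minimal sketch, assuming results arrive sorted by ascending distance with a `path` key (the real `index.py` shapes may differ):

```python
from collections import defaultdict

def dedup(results, n, max_chunks_per_page=2):
    """Keep up to max_chunks_per_page chunks per page, capped at n total.

    Assumes `results` is already sorted best-first. Sketch only.
    """
    per_page = defaultdict(int)
    kept = []
    for r in results:
        if per_page[r["path"]] < max_chunks_per_page:
            per_page[r["path"]] += 1
            kept.append(r)
        if len(kept) == n:
            break
    return kept
```

With `max_chunks_per_page=1` this reduces to the current keep-best-chunk-per-page behavior, so the parameter is strictly a generalization.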

### Change 4: Section-level read via MCP

**In:** `otterwiki-mcp` (or `otterwiki-api` REST plugin)

Add a `section` parameter to the `read_note` MCP tool:

```
read_note(path="Trends/Fertilizer Supply Crisis", section="Russian Substitution")
```

**Behavior:**

1. Load the full page content.
2. Parse markdown headings into a tree.
3. Find the section matching the `section` parameter. Match against heading text, case-insensitive. If ambiguous (multiple headings with the same text), accept a `/`-delimited path: `"Country Dependencies/Pakistan"`.
4. Return everything from the matched heading to the next heading at the same or higher level.
5. Include the heading itself in the returned content.
6. If no match, return an error listing available sections (so the agent can retry with the correct name).
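The slicing in steps 2-6 can be sketched as below. `read_section` is a hypothetical helper name; this version handles the case-insensitive match and the error listing, but not the `/`-delimited path disambiguation:

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)")

def read_section(content: str, section: str) -> str:
    """Return one section of a markdown page: the matched heading plus
    everything up to the next heading at the same or higher level.
    Sketch only; the MCP tool wraps this around a full-page REST fetch.
    """
    lines = content.splitlines()
    headings = [
        (i, len(m.group(1)), m.group(2).strip())
        for i, line in enumerate(lines)
        if (m := HEADING_RE.match(line))
    ]
    for idx, (start, level, text) in enumerate(headings):
        if text.lower() == section.lower():
            end = len(lines)
            for later_start, later_level, _ in headings[idx + 1:]:
                if later_level <= level:  # same-or-higher heading ends it
                    end = later_start
                    break
            return "\n".join(lines[start:end])
    available = ", ".join(t for _, _, t in headings)
    raise ValueError(f"section not found; available sections: {available}")
```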

**Why in the MCP layer, not the REST API:** The REST API serves multiple consumers. Section-level reads are an agent UX optimization — the MCP tool can implement it by fetching the full page from the REST API and slicing locally. This avoids adding complexity to the API surface.

**Alternative considered:** Returning multiple sections in one call (e.g., `sections=["Russian Substitution", "Planting Window"]`). Deferred — the common case is one section per call, and multiple calls are cheap.

## Agent workflow after these changes

1. **`semantic_search("Russian fertilizer constraints")`** — returns 5 results with full chunk text, section paths, and page word counts.
2. Agent reads the snippets. Two chunks from `Fertilizer Supply Crisis` are relevant (sections "Russian Substitution" and "Priority Queue"). One chunk from `P4 Economic Transmission` is relevant.
3. For the 500-word sections where the 150-word snippet isn't enough: **`read_note("Trends/Fertilizer Supply Crisis", section="Russian Substitution")`** — returns ~500 words instead of ~4,200.
4. Agent has the context it needs. Total cost: ~1,500 tokens (5 snippets + 1 section read) vs. ~6,500 tokens today (5 truncated snippets + 1 full page load that's mostly irrelevant).

## Implementation scope

| Change | Repo | Files | Complexity |
|--------|------|-------|------------|
| Section-aware chunking | otterwiki-semantic-search | `chunking.py`, tests | Medium — new heading parser, preserve existing paragraph logic within sections |
| Full chunk text + metadata | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — remove truncation, add fields to response |
| Configurable dedup | otterwiki-semantic-search | `index.py`, `routes.py`, tests | Low — parameterize existing logic |
| Section-level read | otterwiki-mcp | MCP tool definition, markdown parser | Medium — heading tree parser, error handling for ambiguous matches |

All changes are backward-compatible. Existing consumers see richer results but don't break. The `section` parameter on `read_note` is optional.

**Reindexing:** Changes 1-3 require a full reindex after deployment. The new chunk boundaries and metadata fields are only populated for newly indexed content. `POST /api/v1/reindex` handles this.

## Deployment notes

**Reindex is mandatory.** Deploy the new `otterwiki-semantic-search` code, then immediately `POST /api/v1/reindex` on each wiki instance. Until reindex completes, old-format chunks (missing `section`, `section_path`, `total_chunks`, `page_word_count`) will return `None` for those fields in search results. The search layer uses `.get()` so it won't crash, but the enriched MCP formatting will degrade silently.

**FAISS sidecar growth.** New metadata fields add ~160 bytes per chunk to `embeddings.json`. For a 10,000-chunk index, the sidecar grows from ~1.4MB to ~2.9MB. The FAISS backend loads the full sidecar into memory on startup and re-serializes on every upsert. This is acceptable for current corpus sizes but worth monitoring for large multi-tenant deployments.

**Heading content in results.** `section`, `section_path`, and the `[prefix]` in `text`/`snippet` contain raw heading text from wiki pages. Consumers rendering these fields as HTML must escape them. The API returns JSON (`Content-Type: application/json`), so the API layer itself is safe.

**Deploy order.** `otterwiki-semantic-search` must deploy and reindex before `otterwiki-mcp` changes are useful. The MCP `section` parameter on `read_note` is independent (parses content client-side), but `format_semantic_results` expects the new result fields which only appear after reindex.

## What this design does NOT address

- **Embedding model upgrade.** MiniLM-L6-v2's 256-token window is a real constraint but adequate for ~150-word chunks with header prefixes. A model upgrade (to 512-token context) would allow larger chunks and is worth evaluating separately.
- **Multi-tenant indexing.** Tracked in [[Tasks/Semantic_Search_Architecture]] and [[Tasks/Semantic_Search_Multi_Tenant]]. Orthogonal to this work.
- **In-process embedding risks.** The ONNX model in the gunicorn worker and daemon thread shutdown are operational concerns, not search quality concerns. Tracked in [[Tasks/Semantic_Search_Architecture]].