Properties

category: reference
tags: [meta, design, prd, semantic-search]
last_updated: 2026-03-12
confidence: high

Original PRD Semantic Search

This page is part of the original single-tenant PRD, split across five wiki pages: Design/Original PRD Overview | Design/Original PRD API | Design/Original PRD Semantic Search | Design/Original PRD MCP | Design/Original PRD Note Schema

Component 2: Chroma Semantic Search Plugin

Goal

Maintain a vector index of all wiki pages in ChromaDB, enabling semantic/similarity search. When a page is created, updated, or deleted, the index is updated automatically.

Implementation approach

This should hook into Otterwiki's page save/delete lifecycle. Again, investigate whether the plugin hook system supports after_save / after_delete style hooks.

If hooks exist for page lifecycle events: Build as a plugin.
If not: Add hook calls to the Page.save() and Page.delete() methods in wiki.py, and build the Chroma logic as a plugin that registers for those hooks. Alternatively, have the API plugin handle indexing as a side effect of PUT/DELETE operations, and add a /api/v1/reindex endpoint for bulk rebuild.
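If the hooks have to be added by hand, the dispatch side can stay very small. The following is a minimal sketch of a generic hook registry, not Otterwiki's actual plugin API: the class name `HookRegistry` and the event names `after_save` and `after_delete` are hypothetical, chosen to match the wording above.

```python
import logging
from collections import defaultdict
from typing import Any, Callable

log = logging.getLogger(__name__)


class HookRegistry:
    """Hypothetical registry mapping lifecycle event names to handlers."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[..., Any]]] = defaultdict(list)

    def register(self, event: str, handler: Callable[..., Any]) -> None:
        """Subscribe a handler to an event such as "after_save"."""
        self._handlers[event].append(handler)

    def fire(self, event: str, **kwargs: Any) -> None:
        """Call every handler for the event with the given keyword arguments."""
        for handler in list(self._handlers.get(event, [])):
            try:
                handler(**kwargs)
            except Exception:
                # An indexing failure should not block the page save itself.
                log.exception("hook %r failed", event)
```

Under this sketch, Page.save() would call something like `hooks.fire("after_save", page_path=..., content=...)` after the commit succeeds, and the Chroma plugin would register its indexing function for that event.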

ChromaDB configuration

Collection name: otterwiki_pages
Embedding: Use Chroma's default all-MiniLM-L6-v2 sentence-transformer (runs locally, no API key needed, small footprint). Note: verify the model's exact maximum sequence length at implementation time. If it is 256 tokens, the ~200-token chunk target described below stays within it.
Metadata stored per chunk: {page_path, page_name, category, tags, last_updated, chunk_index} — extracted from YAML frontmatter. The page_path and chunk_index fields are used for deduplication and reassembly.
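As an illustration of the per-chunk metadata, the sketch below builds the ids, documents, and metadata dictionaries for one page. The function name `build_chunk_records` is hypothetical. Two assumptions are made that the text above does not state: page_name is taken from the last segment of page_path, and tags are flattened to a comma-separated string so that every metadata value is a scalar (Chroma metadata has traditionally been limited to string, number, and boolean values; check this against the installed version).

```python
from typing import Any


def build_chunk_records(
    page_path: str, frontmatter: dict[str, Any], chunks: list[str]
) -> dict[str, list]:
    """Build parallel ids/documents/metadatas lists for one page's chunks."""
    tags = frontmatter.get("tags") or []
    base = {
        "page_path": page_path,
        # Assumption: the page name is the last path segment.
        "page_name": page_path.rsplit("/", 1)[-1],
        "category": str(frontmatter.get("category", "")),
        # Assumption: tags are joined into one string to keep values scalar.
        "tags": ",".join(str(tag) for tag in tags),
        "last_updated": str(frontmatter.get("last_updated", "")),
    }
    ids = [f"{page_path}::chunk_{i}" for i in range(len(chunks))]
    metadatas = [{**base, "chunk_index": i} for i in range(len(chunks))]
    return {"ids": ids, "documents": list(chunks), "metadatas": metadatas}
```

The returned dictionary has the keyword names that a Chroma collection's `add()` accepts, so it could be passed as `collection.add(**records)`.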

Chunking strategy

Pages are split into overlapping chunks for embedding. Each chunk is stored as a separate Chroma document. This ensures semantic search quality is independent of page length — a 300-word note and a 1500-word note are both fully indexed.

Chunking algorithm:

Strip YAML frontmatter from content (metadata is stored separately, not embedded).
Split on paragraph boundaries (double newline \n\n).
Accumulate paragraphs into chunks of ~200 tokens (~150 words). If a single paragraph exceeds 200 tokens, split it at sentence boundaries (. followed by a capital letter or newline).
Add ~50 tokens of overlap between adjacent chunks — repeat the last 1–2 sentences of the previous chunk at the start of the next. This prevents concepts spanning a boundary from being lost.
Assign each chunk an ID: {page_path}::chunk_{index} (e.g., Trends/Iran Attrition Strategy::chunk_0).

Short pages: If the entire page body (after frontmatter) is under 200 tokens, store it as a single chunk. No need to split.

Example: A 600-word page would produce about 5 chunks of ~150 words each, because ~35 words of every chunk after the first repeat the end of the previous one, so each chunk adds only ~115 new words.

def chunk_page(content: str, target_tokens: int = 200, overlap_tokens: int = 50) -> list[str]:
    """Split page content into overlapping chunks for embedding.

    Args:
        content: Page body text (frontmatter already stripped)
        target_tokens: Approximate tokens per chunk (~0.75 words per token)
        overlap_tokens: Approximate overlap between adjacent chunks

    Returns:
        List of chunk strings
    """
    # Implementation: split on paragraphs, accumulate to target size,
    # carry overlap from previous chunk. Fall back to sentence splitting
    # for oversized paragraphs.
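One way the stub above could be filled in is sketched here. It is not a definitive implementation. It makes three simplifying assumptions: tokens are estimated from word counts using the 0.75 words-per-token heuristic rather than a real tokenizer, the overlap is carried as the trailing words of the previous chunk rather than as whole sentences, and whitespace inside a chunk is collapsed to single spaces.

```python
import re

WORDS_PER_TOKEN = 0.75  # rough heuristic, not a real tokenizer

# Sentence boundary: a period followed by whitespace and a capital, or a newline.
_SENTENCE_SPLIT = re.compile(r"(?<=\.)\s+(?=[A-Z])|\n+")


def chunk_page(content: str, target_tokens: int = 200, overlap_tokens: int = 50) -> list[str]:
    """Split page content into overlapping chunks for embedding."""
    target_words = max(1, int(target_tokens * WORDS_PER_TOKEN))
    overlap_words = max(0, int(overlap_tokens * WORDS_PER_TOKEN))

    # Short pages are stored as a single chunk.
    if len(content.split()) <= target_words:
        body = content.strip()
        return [body] if body else []

    # Units are whole paragraphs, or sentences for oversized paragraphs.
    units: list[str] = []
    for para in content.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para.split()) > target_words:
            units.extend(s.strip() for s in _SENTENCE_SPLIT.split(para) if s.strip())
        else:
            units.append(para)

    chunks: list[str] = []
    current: list[str] = []  # words in the chunk being built
    fresh = 0  # words in `current` that are not carried-over overlap
    for unit in units:
        words = unit.split()
        if fresh and len(current) + len(words) > target_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_words:] if overlap_words else []
            fresh = 0
        current.extend(words)
        fresh += len(words)
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than the target still becomes one oversized chunk in this sketch. Whether to split such sentences further is left open.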

Search result deduplication

Semantic search queries Chroma for the top n * 3 chunks (to account for multiple chunks from the same page), then deduplicates by page_path, keeping the best-matching (lowest distance) chunk per page, and returns the top n unique pages.

The snippet in the search response is the text of the best-matching chunk for that page, truncated to ~150 characters. This means the snippet is contextually relevant to the query, not just the page's opening paragraph.
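The deduplication step can be sketched as a pure function over a Chroma query result. The function name `dedupe_by_page` is hypothetical. The sketch assumes a single query text, so it reads only the first inner list of each result field, and it assumes each chunk's metadata carries page_path and page_name as described earlier.

```python
from typing import Any


def dedupe_by_page(
    result: dict[str, Any], n: int, snippet_chars: int = 150
) -> list[dict[str, Any]]:
    """Collapse chunk-level hits to the best hit per page and keep the top n."""
    # Chroma returns one inner list per query text; one query is assumed here.
    hits = zip(result["documents"][0], result["metadatas"][0], result["distances"][0])
    best: dict[str, dict[str, Any]] = {}
    for document, metadata, distance in hits:
        path = metadata["page_path"]
        # Keep only the lowest-distance chunk seen so far for each page.
        if path not in best or distance < best[path]["distance"]:
            best[path] = {
                "name": metadata.get("page_name", path),
                "path": path,
                "snippet": document[:snippet_chars],
                "distance": distance,
            }
    return sorted(best.values(), key=lambda hit: hit["distance"])[:n]
```

The input would come from something like `collection.query(query_texts=[q], n_results=n * 3)`. If one page accounts for more than three of the top chunks, fewer than n pages can come back; raising the multiplier is one way to handle that.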

API endpoints (added to the REST API)

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/v1/semantic-search?q=<query>&n=5` | Semantic similarity search. Returns top N results as `{name, path, snippet, distance}`. Results are deduplicated by page. |
| `POST` | `/api/v1/reindex` | Rebuild the entire Chroma index from the Git repo. Deletes all existing chunks and re-indexes all pages. For initial population and recovery. |

Index maintenance

On PUT /api/v1/pages/<path> (create/update): delete all existing chunks for that page path, then re-chunk and insert. This is simpler and safer than trying to diff chunks.
On DELETE /api/v1/pages/<path>: delete all chunks for that page path.
On page save via Otterwiki web UI: if hooks are available, also update Chroma. If not, run a periodic sync (see below).
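The delete-then-insert rule can be sketched against the two collection methods it needs. The function names `remove_page` and `reindex_page` are hypothetical, and `ChunkCollection` only describes the subset of a Chroma collection this sketch relies on (`delete` with a `where` filter, and `add`). Metadata values are assumed to be scalars already.

```python
from typing import Any, Protocol


class ChunkCollection(Protocol):
    """The subset of a Chroma collection used by this sketch."""

    def delete(self, where: dict[str, Any]) -> None: ...

    def add(
        self, ids: list[str], documents: list[str], metadatas: list[dict[str, Any]]
    ) -> None: ...


def remove_page(collection: ChunkCollection, page_path: str) -> None:
    """Delete every chunk stored for one page."""
    collection.delete(where={"page_path": page_path})


def reindex_page(
    collection: ChunkCollection,
    page_path: str,
    chunks: list[str],
    metadata: dict[str, Any],
) -> None:
    """Replace all chunks for a page: delete the old ones, then insert the new."""
    remove_page(collection, page_path)
    if not chunks:
        return  # nothing to insert for an empty page
    collection.add(
        ids=[f"{page_path}::chunk_{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[
            {**metadata, "page_path": page_path, "chunk_index": i}
            for i in range(len(chunks))
        ],
    )
```

Deleting first also covers the case where an updated page produces fewer chunks than before, which an upsert by chunk ID alone would leave behind as stale entries.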

Fallback: periodic sync

If lifecycle hooks are unavailable or unreliable, implement a background sync that runs every 60 seconds:

git log --since=<last_sync_time> --name-only to find changed files
Re-index only those files in Chroma
Update last_sync_time
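The first step can be sketched as a thin wrapper around git. The function names are hypothetical, and the sketch assumes pages are stored as `.md` files in the repository. `--pretty=format:` is added to the command from the list above so that commit headers are suppressed and only file names remain in the output.

```python
import subprocess


def parse_changed_files(git_output: str, suffix: str = ".md") -> list[str]:
    """Return unique changed paths, in first-seen order, from --name-only output."""
    seen: dict[str, None] = {}
    for line in git_output.splitlines():
        path = line.strip()
        # Assumption: only files with the page suffix are wiki pages.
        if path.endswith(suffix):
            seen.setdefault(path, None)
    return list(seen)


def changed_files_since(repo_dir: str, last_sync_time: str) -> list[str]:
    """List page files touched by commits newer than last_sync_time."""
    completed = subprocess.run(
        ["git", "log", f"--since={last_sync_time}", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_changed_files(completed.stdout)
```

The list includes deleted files as well as modified ones, so the caller would need to check whether each path still exists and either re-index the page or remove its chunks.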

State persistence: last_sync_time is stored in a small file at /app-data/chroma_sync_state.json containing {"last_sync": "2026-03-09T14:22:00Z"}. This persists across container restarts.

First boot / missing state: If the state file doesn't exist, or if the Chroma collection is empty, perform a full reindex of all pages. This is the same operation as POST /api/v1/reindex.
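A minimal sketch of the state handling follows. The function names are hypothetical. The sketch goes slightly beyond the text in two places, both assumptions: an unreadable or malformed state file is treated the same as a missing one, and the file is written through a temporary file and renamed so that a crash mid-write cannot leave a truncated file behind.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def load_last_sync(state_path: Path) -> Optional[str]:
    """Return the stored last_sync timestamp, or None if it cannot be read."""
    try:
        return json.loads(state_path.read_text())["last_sync"]
    except (FileNotFoundError, json.JSONDecodeError, KeyError, TypeError):
        # Assumption: a corrupt state file is handled like a missing one.
        return None


def save_last_sync(state_path: Path, when: Optional[datetime] = None) -> str:
    """Persist the sync time as a UTC ISO 8601 string and return it."""
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    tmp_path = state_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"last_sync": stamp}))
    tmp_path.replace(state_path)  # rename over the old file in one step
    return stamp


def needs_full_reindex(state_path: Path, chunk_count: int) -> bool:
    """True on first boot, missing state, or an empty collection."""
    return load_last_sync(state_path) is None or chunk_count == 0
```

Here `chunk_count` would come from the collection, for example `collection.count()`, and `state_path` would be `Path("/app-data/chroma_sync_state.json")`.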

Race condition mitigation: If a page is saved via the web UI and queried via semantic search within the sync window (up to 60 seconds), the search may return stale results. This is acceptable — the full-text search endpoint (/api/v1/search) reads directly from Git and is always current. The MCP server can fall back to full-text search when recency matters.

Implementation: Use a background thread started on Flask app initialization, NOT a cron job. threading.Timer fires only once, so the callback must schedule the next timer after each run. This keeps everything in one process and avoids external dependencies.

The periodic sync ensures edits made via the web UI are reflected in semantic search even without hooks.
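The background thread for the periodic sync could look roughly like the sketch below. The class name `PeriodicSync` is hypothetical. Because a `threading.Timer` runs its function once, the sketch creates a new timer after each run. It assumes the callback takes no arguments and that a failed run should be logged and retried on the next tick rather than stop the loop.

```python
import logging
import threading
from typing import Callable, Optional

log = logging.getLogger(__name__)


class PeriodicSync:
    """Run `callback` every `interval` seconds on a background timer thread."""

    def __init__(self, interval: float, callback: Callable[[], None]) -> None:
        self.interval = interval
        self.callback = callback
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()
        self._stopped = True

    def start(self) -> None:
        with self._lock:
            self._stopped = False
            self._schedule()

    def stop(self) -> None:
        with self._lock:
            self._stopped = True
            if self._timer is not None:
                self._timer.cancel()

    def _schedule(self) -> None:
        # Caller must hold the lock. Daemon threads do not block process exit.
        self._timer = threading.Timer(self.interval, self._run)
        self._timer.daemon = True
        self._timer.start()

    def _run(self) -> None:
        try:
            self.callback()
        except Exception:
            log.exception("periodic sync failed; retrying on the next tick")
        finally:
            with self._lock:
                if not self._stopped:
                    self._schedule()
```

At app initialization this would be started with something like `PeriodicSync(60, run_sync).start()`, where `run_sync` is a hypothetical function performing the three sync steps. Because the next timer is scheduled only after the callback returns, runs cannot overlap. If the app is served by several worker processes, each one would start its own timer, which may need guarding.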