Properties
category: reference
tags: [meta, design, prd, semantic-search, embedding]
last_updated: 2026-03-14
confidence: high

Async Embedding Pipeline

Superseded. This page describes the DynamoDB Streams + Lambda embedding pipeline for AWS. See Design/VPS_Architecture for the current plan (SQLite queue + in-process background worker). The chunking algorithm, MiniLM model choice, and FAISS indexing logic carry forward; the DynamoDB Streams trigger mechanism does not.

This page describes the revised architecture for semantic search indexing. It replaces the Bedrock-based embedding pipeline described in the original PRD with a zero-fixed-cost alternative.

See also: Design/Semantic_Search (original single-tenant PRD), Design/Data_Model, Design/Operations


Design goals

  1. No fixed monthly costs for semantic search infrastructure. Every cost should be per-use.
  2. No dependency on Bedrock or any paid embedding API.
  3. Semantic search available to all users, not gated behind premium tier.
  4. Acceptable latency for the use case: research wikis are mostly written by AI agents and searched minutes-to-hours later. Near-real-time indexing is not required.

Why not Bedrock

The per-token cost of Bedrock Titan Text Embeddings V2 is negligible (~$0.02 per million input tokens). The real cost is the VPC interface endpoint required for Lambda to reach Bedrock from inside the VPC. At $0.01/hr per AZ, that's ~$7.20/month in a single-AZ dev setup and ~$14.40/month in a 2-AZ production deployment. This is a fixed cost that exists whether anyone uses semantic search or not.

The same logic applies to SQS: reaching it from inside the VPC also requires an interface endpoint (~$7.20/month/AZ) if the wiki Lambda needs to post messages directly.

Both costs are avoidable.


Architecture

Page write (wiki Lambda, VPC)
  → DynamoDB write: {wiki_id, page_path, action, timestamp} to reindex_queue table
  → DynamoDB Streams captures the change
  → Lambda service polls the stream (outside the function's VPC context)
  → Embedding Lambda (VPC, EFS mount) is invoked with the stream record:
      1. Read page content from EFS repo
      2. Chunk page (same algorithm as otterwiki-semantic-search)
      3. Embed chunks using all-MiniLM-L6-v2 (runs locally in the Lambda)
      4. Load current FAISS index from EFS
      5. Update index (remove old vectors for page, add new ones)
      6. Write updated index + sidecar metadata to EFS
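Step 2 of the pipeline above can be sketched as a simple overlapping word-window chunker. This is illustrative only: the real pipeline reuses the chunking algorithm from otterwiki-semantic-search, and the window sizes here are invented.

```python
def chunk_page(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split page text into overlapping word-window chunks.

    Sketch only -- the deployed pipeline uses the otterwiki-semantic-search
    chunker; max_words/overlap values here are assumptions.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        # Each window overlaps the previous one by `overlap` words so that
        # sentences spanning a boundary are embedded in at least one chunk.
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```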

Key properties

No new VPC endpoints required. The wiki Lambda already has a free gateway endpoint for DynamoDB (deployed in Phase 0). The embedding Lambda needs VPC access for EFS (free — Lambda VPC attachment has no per-endpoint cost). The DynamoDB Streams polling is done by the Lambda service, not by the function itself, so no endpoint is needed for that either.

DynamoDB Streams → Lambda is free. GetRecords API calls invoked through Lambda triggers are not charged. You pay only for the Lambda invocation time and the DynamoDB write (~$0.00000125 per item).

MiniLM runs locally. The all-MiniLM-L6-v2 sentence-transformer is a ~80MB model that produces 384-dimensional vectors. It runs on CPU and is fast enough for batch embedding of wiki pages. The embedding Lambda should be deployed as a container image to accommodate the model size (the Lambda zip deployment limit is 250MB unzipped, including layers; the model plus its dependencies would be tight). Alternatively, the model can be stored on EFS and loaded at init time.
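The "loaded at init time" pattern can be sketched as a memoized lazy loader, so that invocations which never touch embedding skip the load entirely. The `MODEL_DIR` environment variable and EFS path here are assumptions; loading a SentenceTransformer from a local directory is standard sentence-transformers usage.

```python
import functools
import os

@functools.lru_cache(maxsize=1)
def get_model():
    """Load MiniLM once per Lambda container, on first use.

    The heavy import is deferred into the function body so that cold
    starts which never embed (e.g. plain page reads) pay nothing.
    The EFS model path is an assumption about the mount layout.
    """
    from sentence_transformers import SentenceTransformer
    model_dir = os.environ.get("MODEL_DIR", "/mnt/efs/models/all-MiniLM-L6-v2")
    return SentenceTransformer(model_dir)
```

`lru_cache(maxsize=1)` makes repeat calls within a warm container free.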

FAISS index dimensions change. Bedrock Titan produces 1024-dimensional vectors. MiniLM produces 384-dimensional vectors. This means smaller indexes (~1.5KB per vector instead of ~4KB) and slightly lower semantic quality on multilingual content. For English-language research wikis, MiniLM is more than adequate.

Search path

The search path is synchronous and handled by the wiki Lambda (which already has the FAISS index on EFS):

  1. Embed the query using MiniLM (loaded at Lambda init or on first search request)
  2. Load FAISS index + sidecar from EFS
  3. Search top K×3 vectors
  4. Deduplicate by page_path, keep best chunk per page
  5. Return top K results with page paths and matching chunk snippets
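Steps 4–5 above reduce chunk-level hits to page-level results. A minimal sketch, assuming the hits arrive already sorted best-first (as FAISS returns them):

```python
def dedupe_hits(hits, k):
    """Collapse chunk-level hits to page-level results.

    `hits` is a list of (score, page_path, snippet) tuples sorted
    best-first. Keeps the best-scoring chunk per page and returns the
    top `k` pages. Relies on dict insertion order (Python 3.7+).
    """
    best = {}
    for score, page_path, snippet in hits:
        # First occurrence of a page_path is its best chunk,
        # since the input is sorted best-first.
        if page_path not in best:
            best[page_path] = (score, page_path, snippet)
    return list(best.values())[:k]
```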

Latency consideration: Loading MiniLM into the wiki Lambda adds to cold start time. The model import is already ~500ms of the current cold start (via the semantic search plugin). If this proves unacceptable, the search path can also be moved to the embedding Lambda, invoked synchronously from outside the VPC by the MCP Lambda (which is not VPC-bound). This adds a hop but avoids loading MiniLM in the hot path.


Trigger mechanism: why DynamoDB Streams over SQS

The wiki Lambda is VPC-bound (for EFS access). To post a message to SQS from inside the VPC, it would need either a NAT gateway (~$32/month) or an SQS VPC interface endpoint (~$7.20/month/AZ). Both defeat the goal of zero fixed costs.

DynamoDB, by contrast, uses a free gateway endpoint that's already deployed. The wiki Lambda already writes to DynamoDB for user/wiki metadata operations. Adding a write to a reindex_queue table (or enabling Streams on an existing table) costs only the per-item DynamoDB write.

The Lambda service polls DynamoDB Streams externally — the embedding Lambda receives the stream record as an invocation event, without needing to reach out to any service itself. No new endpoints.


Reindex queue table

ReindexQueue {
  wiki_id: string,           // partition key: "{owner_id}#{wiki_slug}"
  page_path: string,         // sort key
  action: "upsert" | "delete",
  timestamp: ISO8601
}
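The wiki Lambda's write to this table can be sketched in low-level DynamoDB attribute-value form. The helper name and the ISO-8601 timestamp construction are illustrative; the field names follow the schema above.

```python
from datetime import datetime, timezone

def reindex_item(wiki_id: str, page_path: str, action: str) -> dict:
    """Build a ReindexQueue item in DynamoDB attribute-value form,
    ready for a low-level put_item call. Illustrative helper."""
    if action not in ("upsert", "delete"):
        raise ValueError(f"unknown action: {action}")
    return {
        "wiki_id": {"S": wiki_id},       # partition key: "{owner_id}#{wiki_slug}"
        "page_path": {"S": page_path},   # sort key
        "action": {"S": action},
        "timestamp": {"S": datetime.now(timezone.utc).isoformat()},
    }
```

The wiki Lambda would then call `dynamodb.put_item(TableName="ReindexQueue", Item=reindex_item(...))` via boto3, travelling over the existing free DynamoDB gateway endpoint.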

DynamoDB Streams is enabled with NEW_IMAGE view type (the embedding Lambda needs the new item to know what to reindex, not the old one).

Idempotency: DynamoDB Streams guarantees at-least-once delivery. The embedding Lambda must be idempotent — re-embedding a page that hasn't changed is wasted work but not incorrect. The Lambda should compare the page's git commit hash against a stored value in the FAISS sidecar metadata to skip unnecessary re-embedding.
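The commit-hash check can be a one-liner against the sidecar metadata. The sidecar layout assumed here (a page_path → last-indexed commit hash map) is illustrative:

```python
def needs_reindex(sidecar: dict, page_path: str, commit_hash: str) -> bool:
    """Return True if the page must be re-embedded.

    `sidecar` maps page_path -> git commit hash at last indexing time
    (assumed layout; the real metadata lives beside the FAISS index on
    EFS). Because stream delivery is at-least-once, this check only
    saves work -- skipping it would still be correct, just wasteful.
    """
    return sidecar.get(page_path) != commit_hash
```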

Batching: The Lambda event source mapping can be configured with a batch size (e.g., 10 records) and a batching window (e.g., 5 seconds). This means multiple rapid page writes (common during an MCP agent session) are processed in a single Lambda invocation, amortizing the model load time.


Cost model

Item                            | Cost                       | Notes
DynamoDB write (reindex queue)  | ~$0.00000125/item          | On-demand pricing
DynamoDB Streams reads          | $0                         | Free when consumed by Lambda
Embedding Lambda invocation     | ~$0.0001–0.001/invocation  | Depends on batch size and page length
MiniLM model                    | $0                         | Open source, runs locally
FAISS index storage (EFS)       | ~$0.016/GB/month           | A 200-page wiki ≈ 600 vectors × 1.5KB ≈ 1MB
VPC endpoints                   | $0                         | DynamoDB gateway (free, already deployed)
Total per 200-page wiki         | < $0.01/month              | Effectively zero at rest

Compare to the Bedrock approach: ~$14.40–28.80/month in fixed VPC interface endpoint costs (Bedrock + SQS, in one or two AZs) before any usage.


Migration from current deployment

The dev deployment (Phase 1) currently uses Bedrock titan-embed-text-v2 for the semantic search plugin on Lambda (P1-6). The FAISS backend and Bedrock adapter were built in P1-2 with a pluggable embedding interface (EMBEDDING_MODEL=local|bedrock).

To migrate:

  1. Switch EMBEDDING_MODEL to local (uses the existing all-MiniLM-L6-v2 adapter)
  2. Deploy the embedding Lambda as a container image (or load model from EFS)
  3. Create the ReindexQueue DynamoDB table with Streams enabled
  4. Add the DynamoDB Streams → Lambda event source mapping in Pulumi
  5. Add the DynamoDB write to the wiki Lambda's page write path
  6. Full reindex of all existing wikis (one-time, triggered manually)
  7. Remove the Bedrock VPC interface endpoint from Pulumi

The FAISS indexes will need to be rebuilt because the vector dimensionality changes (1024 → 384). This is a one-time operation.


Open questions

  1. MiniLM in the wiki Lambda vs. embedding Lambda only. Loading MiniLM adds ~500ms to cold start. If semantic search queries are rare relative to page reads, it may be better to keep MiniLM out of the wiki Lambda entirely and route search requests through the embedding Lambda (invoked synchronously by the MCP Lambda, which is outside the VPC).

  2. Container image vs. EFS model loading. A container image is simpler (everything bundled) but has a ~10s cold start for large images. Loading from EFS means a smaller Lambda package but adds model load time on cold start. Needs benchmarking.

  3. Reindex queue table vs. Streams on existing Wikis table. If the wiki Lambda already updates the Wikis table's last_accessed or page_count on every page write, Streams on that table could trigger the embedding Lambda without a separate table. But this means every DynamoDB write to the Wikis table (including non-page-write operations) triggers the Lambda, which would need filtering. A dedicated table is cleaner.

  4. Event filtering. DynamoDB Streams + Lambda supports event filtering (filter expressions on the stream records). This could be used to filter by action type or deduplicate rapid writes to the same page. Worth investigating to reduce unnecessary Lambda invocations.
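For reference, event source mapping filters use the same pattern syntax as EventBridge. A sketch of a FilterCriteria value (boto3/API capitalization) that skips REMOVE stream records — e.g. TTL cleanup deletions of already-processed queue items — so they never invoke the embedding Lambda; which fields to filter on is exactly the open question above:

```json
{
  "Filters": [
    { "Pattern": "{\"eventName\": [\"INSERT\", \"MODIFY\"]}" }
  ]
}
```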