Vector Index Out of Sync

The Problem

The vector database becomes inconsistent with the source data: documents deleted from the source still appear in search, updated content shows old versions, and new documents are missing from the index.

Symptoms

❌ Deleted documents still returned in search
❌ Updated content shows previous version
❌ New documents added 2 hours ago not searchable
❌ "Document not found" when clicking result
❌ Index rebuild required weekly

Real-World Example

Timeline:
10:00 AM: Delete "Old Product Guide" from Confluence
10:30 AM: Twig sync runs, removes from source DB
11:00 AM: User queries "product guide"

Result: "Old Product Guide" still in top results
→ Vector DB not updated
→ Embedding still exists
→ Points to deleted document

User clicks → "404 Not Found"
Confusing and frustrating experience

Deep Technical Analysis

Async Embedding Pipeline

Vector updates lag behind source changes:

Pipeline Stages:

1. Source change detected (immediate)
2. Document fetched from API (1-5 min)
3. Content chunked (seconds)
4. Embeddings generated (30s - 5 min)
5. Vector DB updated (seconds)

Total latency: roughly 2-10 minutes, even with an empty queue
→ Index always behind source
→ Eventual consistency, not strong consistency

The Queue Backup:

High volume of changes:
→ 500 documents updated simultaneously
→ Embedding queue: 500 jobs

Processing rate: 10 docs/minute
→ Takes 50 minutes to process all
→ Last document: 50-minute lag

User queries during this time:
→ Index reflects mix of old and new
→ Inconsistent state
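
The queue arithmetic above can be sketched as a small helper. The function name and parameters are illustrative only (not a Twig API), and the sketch assumes a single queue drained at a constant rate.

```python
def queue_lag_minutes(position_in_queue: int, docs_per_minute: float) -> float:
    """Minutes until the job at a 1-based queue position has been processed."""
    if docs_per_minute <= 0:
        raise ValueError("docs_per_minute must be positive")
    return position_in_queue / docs_per_minute

# 500 queued documents at 10 docs/minute: the last one waits 50 minutes,
# and the document in the middle of the queue waits 25.
worst_case = queue_lag_minutes(500, 10)
midpoint = queue_lag_minutes(250, 10)
print(f"worst case: {worst_case:.0f} min, midpoint: {midpoint:.0f} min")
```

Under these assumptions, any query issued in that window sees a mix of old and new embeddings, with the share of stale ones shrinking as the queue drains.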

Deletion Propagation

Removing embeddings is error-prone:

Soft Delete vs Hard Delete:

Source system:
→ Document marked "archived"
→ Still exists in API
→ But shouldn't be in knowledge base

Twig must:
→ Detect "archived" status
→ Remove from vector DB
→ But: How to detect?

Naive sync:
→ Only processes "active" documents
→ Archived docs never queried
→ Embeddings never deleted
→ Stale entries accumulate
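
One way to catch what a naive sync misses is to compare the set of document IDs in the index against the set the source reports as active. The sketch below assumes both sets can be listed in full; the function and variable names are hypothetical.

```python
def find_stale_doc_ids(indexed_doc_ids: set[str], active_doc_ids: set[str]) -> set[str]:
    """Return document IDs that are still indexed but no longer active in the source."""
    return indexed_doc_ids - active_doc_ids

indexed = {"doc_1", "doc_2", "doc_3"}   # IDs currently present in the vector DB
active = {"doc_1", "doc_3", "doc_4"}    # IDs the source reports as active
stale = find_stale_doc_ids(indexed, active)
print(stale)
```

Archived or deleted documents show up in the difference even though the sync never queried them, because the comparison starts from what is indexed rather than from what is active.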

Orphaned Embeddings:

Scenario:
→ Document ID: doc_123
→ Embedded as chunks: chunk_123_1, chunk_123_2, chunk_123_3

Document deleted:
→ Source DB row removed
→ Twig sync: "doc_123 not found"
→ Should delete chunks

But:
→ Mapping doc_123 → chunks lost
→ Cannot query "which chunks belong to doc_123?"
→ Orphaned chunks remain in vector DB

Solution: Store document_id metadata with each chunk
→ Query: WHERE metadata.doc_id = "doc_123"
→ Delete all matching chunks
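
A minimal sketch of that idea, using a toy in-memory store in place of a real vector DB. The class and method names are assumptions for illustration; real vector databases expose their own metadata filter syntax.

```python
from dataclasses import dataclass, field

@dataclass
class InMemoryVectorStore:
    """Toy stand-in for a vector DB that keeps doc_id metadata on every chunk."""
    vectors: dict[str, dict] = field(default_factory=dict)

    def upsert(self, chunk_id: str, embedding: list[float], doc_id: str) -> None:
        self.vectors[chunk_id] = {"embedding": embedding, "metadata": {"doc_id": doc_id}}

    def delete_by_doc_id(self, doc_id: str) -> int:
        """Delete every chunk whose metadata.doc_id matches; return how many were removed."""
        matching = [cid for cid, v in self.vectors.items()
                    if v["metadata"]["doc_id"] == doc_id]
        for cid in matching:
            del self.vectors[cid]
        return len(matching)

store = InMemoryVectorStore()
for i in (1, 2, 3):
    store.upsert(f"chunk_123_{i}", [0.1 * i], doc_id="doc_123")
store.upsert("chunk_456_1", [0.9], doc_id="doc_456")

removed = store.delete_by_doc_id("doc_123")
print(removed, list(store.vectors))
```

Because the mapping lives on the chunk itself, the delete still works when the source row is already gone.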

Update vs Delete+Insert

Updating existing embeddings:

Update in Place:

Document updated:
→ New content, new embedding

Options:
A) Update existing vector in place
   → Maintains same vector ID
   → Overwrites previous embedding

B) Delete old, insert new
   → New vector ID
   → More explicit

Option A problems:
→ Vector DB may not support updates (append-only)
→ Or updates expensive (rebuild index)

Option B problems:
→ Brief window where doc absent (delete → insert gap)
→ Query during gap: Document missing

Chunk-Level Updates:

Document has 10 chunks:
→ User edits paragraph 5 (chunk 5)

Options:
1. Re-embed entire document (all 10 chunks)
   → Safe, comprehensive
   → But: 9 chunks unchanged, wasted compute

2. Re-embed only chunk 5
   → Efficient
   → But: Must track chunk boundaries
   → What if edit shifts boundaries?

Example:
→ Original chunk 5: Tokens 2000-2500
→ User adds 300 tokens to paragraph 4
→ Chunk 5 now: Tokens 2300-2800 (shifted!)
→ Must re-chunk entire document
→ Back to option 1

Chunk-level updates rarely feasible
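
The boundary-shift effect can be demonstrated by hashing chunks before and after an edit. This sketch assumes fixed-size token chunking (500 tokens per chunk), which is a simplification of the paragraph-based example above; all names are illustrative.

```python
import hashlib

def chunk_tokens(tokens: list[str], size: int) -> list[list[str]]:
    """Split tokens into consecutive fixed-size chunks (the last may be shorter)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def chunk_hash(chunk: list[str]) -> str:
    return hashlib.sha256(" ".join(chunk).encode("utf-8")).hexdigest()

def changed_chunk_indices(old: list[str], new: list[str], size: int) -> list[int]:
    """Indices of chunks in the new document whose content differs from the old one."""
    old_hashes = [chunk_hash(c) for c in chunk_tokens(old, size)]
    new_hashes = [chunk_hash(c) for c in chunk_tokens(new, size)]
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]

original = [f"t{i}" for i in range(5000)]               # 10 chunks of 500 tokens
inserted = [f"new{i}" for i in range(300)]
edited = original[:1700] + inserted + original[1700:]   # 300 tokens added inside chunk index 3

changed = changed_chunk_indices(original, edited, size=500)
print(changed)
```

Only the chunks before the edit keep their hashes. Every chunk from the edit point onward changes, so a single-paragraph edit still forces most of the document to be re-embedded.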

Vector Database Consistency Models

Different DBs have different guarantees:

Eventual Consistency:

Write to vector DB:
→ Acknowledged immediately
→ But: Not yet searchable
→ Index update async (seconds to minutes)

Query during this window:
→ New embedding not returned
→ Appears missing

Common in distributed vector DBs:
→ Pinecone, Weaviate, Milvus
→ Optimized for write throughput

Read-Your-Writes Consistency:

User uploads document:
→ Wait for embedding + index
→ Confirm "Document indexed"
→ User's next query includes it

But:
→ Other users' queries may not see it yet
→ Index replication lag (multi-region)

Per-user consistency, not global
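
The "wait, then confirm" step can be approximated by polling until the new vector is visible to queries. This is a sketch under assumptions: `is_visible` is a hypothetical callable that would wrap whatever lookup the vector DB offers, and the timeout values are arbitrary.

```python
import time
from typing import Callable

def wait_until_visible(is_visible: Callable[[], bool],
                       timeout_s: float = 5.0,
                       poll_interval_s: float = 0.05) -> bool:
    """Poll until is_visible() returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_visible():
            return True
        time.sleep(poll_interval_s)
    return is_visible()  # one final check after the deadline

# Simulated index that only exposes the new vector on the third check.
checks = {"count": 0}

def fake_is_visible() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3

indexed = wait_until_visible(fake_is_visible, timeout_s=1.0, poll_interval_s=0.01)
print("Document indexed" if indexed else "Still indexing...")
```

A confirmation obtained this way only covers the replica that was polled, which is why the guarantee stays per-user rather than global.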

Strong Consistency:

Write to vector DB:
→ Blocks until indexed
→ Searchable immediately

Pros: No sync issues
Cons: Higher latency, lower throughput

Rare in vector DBs:
→ Most prioritize speed over consistency

Multi-Index Management

Running multiple indexes simultaneously:

Blue-Green Index Swapping:

Blue index (active):
→ Serving queries
→ Contains embeddings v1

Green index (building):
→ Background re-embedding
→ Contains embeddings v2

Swap:
→ Point queries to Green
→ Blue becomes staging

Benefits:
→ No downtime
→ Atomic switch
→ Rollback possible

Drawbacks:
→ 2x storage (both indexes exist)
→ Complex orchestration

The Swap Timing:

Green index ready at 3:00 PM
But:
→ 50 documents updated since build started
→ Green index missing these updates

Options:
1. Abort swap, rebuild Green (time-consuming)
2. Apply delta updates to Green before swap
3. Accept temporary inconsistency

No perfect solution
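
Option 2 (apply delta updates before the swap) might look like the sketch below. Indexes are modeled as plain dictionaries, and `IndexAlias` and `apply_deltas` are hypothetical names; a real deployment would use the alias or collection-switch mechanism of its vector DB.

```python
from typing import Optional

class IndexAlias:
    """Toy alias that routes queries to whichever index is currently active."""
    def __init__(self, active: dict[str, str]) -> None:
        self.active = active

    def swap(self, new_index: dict[str, str]) -> dict[str, str]:
        """Point the alias at new_index and return the previous one for rollback."""
        previous, self.active = self.active, new_index
        return previous

def apply_deltas(index: dict[str, str],
                 deltas: list[tuple[str, str, Optional[str]]]) -> None:
    """Replay (op, doc_id, value) changes recorded while the index was building."""
    for op, doc_id, value in deltas:
        if op == "upsert":
            index[doc_id] = value
        elif op == "delete":
            index.pop(doc_id, None)

blue = {"doc_1": "v1", "doc_2": "v1"}
green = {"doc_1": "v2", "doc_2": "v2"}   # rebuilt from a snapshot taken at build start
deltas = [("upsert", "doc_3", "v2"), ("delete", "doc_2", None)]  # changes since the snapshot

alias = IndexAlias(blue)
apply_deltas(green, deltas)   # catch Green up first
old = alias.swap(green)       # then switch queries over; keep Blue for rollback
print(alias.active)
```

Changes that arrive while the deltas are being replayed still leave a small gap, so this narrows the inconsistency window rather than closing it.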

Metadata Staleness

Document metadata out of sync:

Metadata Updates:

Vector DB entry:
{
  "embedding": [0.1, 0.2, ...],
  "metadata": {
    "title": "API Guide",
    "updated_at": "2024-01-01",
    "author": "John"
  }
}

Document updated in source:
→ Title changed to "API Reference"
→ Author changed to "Jane"

Embedding unchanged (content same)
→ But metadata stale
→ Filters/facets wrong

User filters by author="Jane":
→ Document not returned (metadata says "John")

Metadata-Only Updates:

Efficiency optimization:
→ If content unchanged, don't re-embed
→ But: Always update metadata

Challenge:
→ Detecting content-only vs metadata-only changes
→ Hash comparison needed
→ Adds complexity

Failure mode:
→ Assume metadata-only
→ But content actually changed
→ Embedding stale
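
The hash comparison can be sketched as follows. The record layout (a stored `content_hash` next to the metadata) and the function names are assumptions for illustration. Comparing hashes of the actual content, rather than trusting a change flag from the source, is what guards against the failure mode above.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_update(stored: dict, incoming_text: str, incoming_metadata: dict) -> str:
    """Return 'reembed', 'metadata_only', or 'noop' for an incoming document version."""
    if stored["content_hash"] != content_hash(incoming_text):
        return "reembed"          # content changed: the embedding is stale
    if stored["metadata"] != incoming_metadata:
        return "metadata_only"    # same content: refresh metadata, keep the embedding
    return "noop"

stored = {
    "content_hash": content_hash("How to call the API."),
    "metadata": {"title": "API Guide", "author": "John"},
}
new_meta = {"title": "API Reference", "author": "Jane"}

print(plan_update(stored, "How to call the API.", new_meta))
print(plan_update(stored, "How to call the v2 API.", new_meta))
```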

Concurrent Modification Conflicts

Simultaneous updates cause issues:

Race Condition:

Time 10:00:
→ Process A: Starts embedding doc_123

Time 10:02:
→ User updates doc_123 in source

Time 10:03:
→ Process B: Starts embedding doc_123 (newer version)

Time 10:05:
→ Process A finishes, writes old embedding
→ Process B finishes, writes new embedding (overwrites A)

Final state: New embedding (correct)

But if the timing differs:
Time 10:05:
→ Process B finishes, writes new embedding
Time 10:06:
→ Process A finishes, writes old embedding (overwrites B!)

Final state: Old embedding (wrong!)

Optimistic Locking:

Solution:
→ Include version number with document
→ Process A: Fetches doc_123 v5
→ Process B: Fetches doc_123 v6
→ Process A writes: "Update only if the current version is still 5"
→ Process B writes: "Update only if the current version is still 6"

Process A fails (current version is 6, not 5)
→ Doesn't overwrite newer version
→ Conflict detected
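
A minimal sketch of a version-guarded write, using a variant of the check described above: instead of matching an exact expected version, it rejects any write whose version is not newer than the stored one. The store is an in-memory toy and all names are hypothetical. A real vector DB would need a conditional write (or an external lock) to make the check and the write atomic.

```python
class VersionConflict(Exception):
    """Raised when an incoming write is not newer than what is stored."""

class VersionedEmbeddingStore:
    def __init__(self) -> None:
        self._rows: dict[str, tuple[int, list[float]]] = {}

    def write(self, doc_id: str, version: int, embedding: list[float]) -> None:
        """Store the embedding unless an equal or newer version is already present."""
        current = self._rows.get(doc_id)
        if current is not None and current[0] >= version:
            raise VersionConflict(
                f"{doc_id}: stored v{current[0]} >= incoming v{version}")
        self._rows[doc_id] = (version, embedding)

    def version_of(self, doc_id: str) -> int:
        return self._rows[doc_id][0]

store = VersionedEmbeddingStore()
store.write("doc_123", 6, [0.2])        # Process B finishes first with v6
try:
    store.write("doc_123", 5, [0.1])    # Process A arrives late with v5
    late_write_applied = True
except VersionConflict:
    late_write_applied = False

print(late_write_applied, store.version_of("doc_123"))
```

With this guard the final state no longer depends on which process finishes last.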

Cross-Region Replication Lag

Distributed deployments have sync delays:

Multi-Region Vector DB:

Region US-East (primary):
→ Document embedded
→ Vector written immediately

Region EU-West (replica):
→ Replication lag: 2-10 seconds
→ Vector not yet available

User in Europe queries immediately:
→ New document missing
→ Appears not indexed

Eventual consistency across regions

The Split-Brain Problem:

Network partition:
→ US-East and EU-West disconnected

Both regions accept writes:
→ US: Document A updated to version 2
→ EU: Same document A updated to version 2' (different)

Partition heals:
→ Both version 2 and 2' exist
→ Conflict resolution needed
→ Which version wins?

Last-write-wins: Data loss possible
Merge: Complex, may be incorrect
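
To make the data-loss risk concrete, here is a small sketch of last-write-wins resolution. It assumes each write carries a wall-clock timestamp that is comparable across regions (itself a strong assumption); the `Write` type and the tie-break on region name are illustrative choices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    region: str
    content: str
    timestamp: float   # wall-clock seconds; assumes comparable clocks across regions

def last_write_wins(a: Write, b: Write) -> tuple[Write, Write]:
    """Return (winner, discarded); ties are broken deterministically by region name."""
    winner = max(a, b, key=lambda w: (w.timestamp, w.region))
    discarded = b if winner is a else a
    return winner, discarded

us = Write("US-East", "version 2", timestamp=100.0)
eu = Write("EU-West", "version 2'", timestamp=100.5)
winner, discarded = last_write_wins(us, eu)
print(f"kept {winner.content!r} from {winner.region}, dropped {discarded.content!r}")
```

The discarded write is lost silently unless it is logged or queued for review.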

How to Solve

Combine the following:
→ Implement idempotent upsert operations (delete+insert)
→ Store document_id metadata with every vector
→ Track document versions for optimistic concurrency
→ Use reconciliation jobs to detect orphaned vectors
→ Accept eventual consistency and surface status indicators ("indexing...")

See Index Synchronization.



Key Takeaways