Vector Index Out of Sync
The vector database becomes inconsistent with the source data: documents deleted from sources still appear in search, updated content shows old versions, and new documents are missing from the index.
The Problem
The vector database becomes inconsistent with the source data: documents deleted from sources still appear in search, updated content shows old versions, and new documents are missing from the index.
Symptoms
- ❌ Deleted documents still returned in search
- ❌ Updated content shows previous version
- ❌ New documents added 2 hours ago not searchable
- ❌ "Document not found" when clicking result
- ❌ Index rebuild required weekly
Real-World Example
Timeline:
10:00 AM: Delete "Old Product Guide" from Confluence
10:30 AM: Twig sync runs, removes from source DB
11:00 AM: User queries "product guide"
Result: "Old Product Guide" still in top results
→ Vector DB not updated
→ Embedding still exists
→ Points to deleted document
User clicks → "404 Not Found"
Confusing and frustrating experience
Deep Technical Analysis
Async Embedding Pipeline
Vector updates lag behind source changes:
Pipeline Stages:
1. Source change detected (immediate)
2. Document fetched from API (1-5 min)
3. Content chunked (seconds)
4. Embeddings generated (30s - 5 min)
5. Vector DB updated (seconds)
Total latency: roughly 2-10 minutes
→ Index always behind source
→ Eventual consistency, not strong consistency
The Queue Backup:
High volume of changes:
→ 500 documents updated simultaneously
→ Embedding queue: 500 jobs
Processing rate: 10 docs/minute
→ Takes 50 minutes to process all
→ Last document: 50-minute lag
User queries during this time:
→ Index reflects mix of old and new
→ Inconsistent state
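The queue arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes a FIFO queue drained at a constant rate; the function name `worst_case_lag_minutes` is hypothetical and not part of any Twig API.

```python
import math


def worst_case_lag_minutes(queued_docs: int, docs_per_minute: float) -> int:
    """Minutes until the last queued document is embedded.

    Assumes a FIFO queue processed at a constant rate (an idealization).
    """
    if docs_per_minute <= 0:
        raise ValueError("docs_per_minute must be positive")
    return math.ceil(queued_docs / docs_per_minute)


# The scenario from the text: 500 queued updates at 10 docs/minute.
lag = worst_case_lag_minutes(500, 10)
print(lag)  # 50
```

Real pipelines rarely run at a constant rate, so a figure like this is better read as a lower bound on the lag of the last document.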
Deletion Propagation
Removing embeddings is error-prone:
Soft Delete vs Hard Delete:
Source system:
→ Document marked "archived"
→ Still exists in API
→ But shouldn't be in knowledge base
Twig must:
→ Detect "archived" status
→ Remove from vector DB
→ But: How to detect?
Naive sync:
→ Only processes "active" documents
→ Archived docs never queried
→ Embeddings never deleted
→ Stale entries accumulate
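One way to avoid the naive-sync blind spot is to compare what is indexed against what is currently active, rather than only iterating over active documents. The sketch below assumes each source record is a dict with hypothetical `id` and `status` fields and that the set of indexed document IDs can be listed; both are illustrative assumptions.

```python
def find_stale_doc_ids(source_docs: list, indexed_doc_ids: set) -> set:
    """Return indexed document IDs that are no longer active in the source.

    This catches both archived documents (still in the API, wrong status)
    and hard-deleted documents (absent from the API entirely).
    """
    active = {d["id"] for d in source_docs if d.get("status") == "active"}
    return indexed_doc_ids - active


# Hypothetical data: doc_2 was archived, doc_3 was hard-deleted.
source_docs = [
    {"id": "doc_1", "status": "active"},
    {"id": "doc_2", "status": "archived"},
]
indexed = {"doc_1", "doc_2", "doc_3"}

stale = find_stale_doc_ids(source_docs, indexed)
print(sorted(stale))  # ['doc_2', 'doc_3']
```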
Orphaned Embeddings:
Scenario:
→ Document ID: doc_123
→ Embedded as chunks: chunk_123_1, chunk_123_2, chunk_123_3
Document deleted:
→ Source DB row removed
→ Twig sync: "doc_123 not found"
→ Should delete chunks
But:
→ Mapping doc_123 → chunks lost
→ Cannot query "which chunks belong to doc_123?"
→ Orphaned chunks remain in vector DB
Solution: Store document_id metadata with each chunk
→ Query: WHERE metadata.doc_id = "doc_123"
→ Delete all matching chunks
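The metadata-based delete can be sketched with a small in-memory stand-in for a vector DB. `InMemoryVectorStore` and its methods are hypothetical; a real vector database would expose its own filter-delete call, which this sketch does not try to reproduce.

```python
class InMemoryVectorStore:
    """Toy stand-in for a vector DB; every chunk carries a doc_id in metadata."""

    def __init__(self):
        self._rows = {}  # chunk_id -> {"vector": [...], "metadata": {...}}

    def upsert(self, chunk_id, vector, metadata):
        self._rows[chunk_id] = {"vector": vector, "metadata": metadata}

    def delete_where_doc_id(self, doc_id):
        """Delete every chunk whose metadata.doc_id matches; return the count."""
        doomed = [
            cid for cid, row in self._rows.items()
            if row["metadata"].get("doc_id") == doc_id
        ]
        for cid in doomed:
            del self._rows[cid]
        return len(doomed)

    def chunk_ids(self):
        return sorted(self._rows)


store = InMemoryVectorStore()
for n in (1, 2, 3):
    store.upsert(f"chunk_123_{n}", [0.1 * n], {"doc_id": "doc_123"})
store.upsert("chunk_456_1", [0.9], {"doc_id": "doc_456"})

removed = store.delete_where_doc_id("doc_123")
print(removed, store.chunk_ids())  # 3 ['chunk_456_1']
```

Because the document ID lives on each chunk, the delete does not depend on a separate doc-to-chunk mapping surviving.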
Update vs Delete+Insert
Updating existing embeddings:
Update in Place:
Document updated:
→ New content, new embedding
Options:
A) Update existing vector in place
→ Maintains same vector ID
→ Overwrites previous embedding
B) Delete old, insert new
→ New vector ID
→ More explicit
Option A problems:
→ Vector DB may not support updates (append-only)
→ Or updates expensive (rebuild index)
Option B problems:
→ Brief window where doc absent (delete → insert gap)
→ Query during gap: Document missing
Chunk-Level Updates:
Document has 10 chunks:
→ User edits paragraph 5 (chunk 5)
Options:
1. Re-embed entire document (all 10 chunks)
→ Safe, comprehensive
→ But: 9 chunks unchanged, wasted compute
2. Re-embed only chunk 5
→ Efficient
→ But: Must track chunk boundaries
→ What if edit shifts boundaries?
Example:
→ Original chunk 5: Tokens 2000-2500
→ User adds 300 tokens to paragraph 4
→ Chunk 5 now: Tokens 2300-2800 (shifted!)
→ Must re-chunk entire document
→ Back to option 1
Chunk-level updates rarely feasible
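The boundary-shift effect is easy to reproduce. The sketch below assumes fixed-size chunking of 500 tokens over a 5,000-token document (an illustrative choice, not necessarily how Twig chunks) and inserts 300 tokens inside the fourth chunk.

```python
def chunk(tokens, size):
    """Split a token list into consecutive fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


original = [f"t{i}" for i in range(5000)]  # 10 chunks of 500 tokens
# Insert 300 new tokens at position 1800, inside the fourth chunk.
edited = original[:1800] + [f"new{i}" for i in range(300)] + original[1800:]

before = chunk(original, 500)
after = chunk(edited, 500)

# Indexes (0-based) of chunks whose content differs after the edit.
changed = [
    i for i, c in enumerate(after)
    if i >= len(before) or c != before[i]
]
print(changed)                 # [3, 4, 5, 6, 7, 8, 9, 10]
print(edited.index("t2000"))   # 2300: the old chunk 5 start has shifted by 300
```

Only the three chunks before the edit are untouched; every chunk from the edit point onward has different content, which is why re-embedding "just chunk 5" does not work here.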
Vector Database Consistency Models
Different DBs have different guarantees:
Eventual Consistency:
Write to vector DB:
→ Acknowledged immediately
→ But: Not yet searchable
→ Index update async (seconds to minutes)
Query during this window:
→ New embedding not returned
→ Appears missing
Common in distributed vector DBs:
→ Pinecone, Weaviate, Milvus
→ Optimized for write throughput
Read-Your-Writes Consistency:
User uploads document:
→ Wait for embedding + index
→ Confirm "Document indexed"
→ User's next query includes it
But:
→ Other users' queries may not see it yet
→ Index replication lag (multi-region)
Per-user consistency, not global
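A simple way to approximate read-your-writes for the uploading user is to poll until the document is searchable before confirming. This is a sketch under assumptions: `wait_until_searchable` and `FakeIndex` are hypothetical names, and the `is_searchable` callback stands in for whatever lookup the real index offers.

```python
import time


def wait_until_searchable(is_searchable, doc_id, timeout_s=30.0, poll_interval_s=0.5):
    """Poll until is_searchable(doc_id) is true or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_searchable(doc_id):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


class FakeIndex:
    """Test double: the document becomes visible after a few lookups."""

    def __init__(self, visible_after_checks):
        self.checks = 0
        self.visible_after_checks = visible_after_checks

    def contains(self, doc_id):
        self.checks += 1
        return self.checks > self.visible_after_checks


index = FakeIndex(visible_after_checks=2)
indexed = wait_until_searchable(
    index.contains, "doc_123", timeout_s=5.0, poll_interval_s=0.01
)
print(indexed, index.checks)  # True 3
```

This only covers the client that waits; as noted above, other users or other regions may still see the older state.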
Strong Consistency:
Write to vector DB:
→ Blocks until indexed
→ Searchable immediately
Pros: No sync issues
Cons: Higher latency, lower throughput
Rare in vector DBs:
→ Most prioritize speed over consistency
Multi-Index Management
Running multiple indexes simultaneously:
Blue-Green Index Swapping:
Blue index (active):
→ Serving queries
→ Contains embeddings v1
Green index (building):
→ Background re-embedding
→ Contains embeddings v2
Swap:
→ Point queries to Green
→ Blue becomes staging
Benefits:
→ No downtime
→ Atomic switch
→ Rollback possible
Drawbacks:
→ 2x storage (both indexes exist)
→ Complex orchestration
The Swap Timing:
Green index ready at 3:00 PM
But:
→ 50 documents updated since build started
→ Green index missing these updates
Options:
1. Abort swap, rebuild Green (time-consuming)
2. Apply delta updates to Green before swap
3. Accept temporary inconsistency
No perfect solution
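Option 2 (apply delta updates to Green before the swap) might look like the sketch below. It models each index as a plain dict and the change log as a list of `(op, doc_id, payload)` tuples; `IndexAlias` and `apply_deltas` are hypothetical names used for illustration only.

```python
class IndexAlias:
    """Queries go through the alias; swapping repoints it in one assignment."""

    def __init__(self, active):
        self.active = active

    def swap(self, new_index):
        """Point the alias at new_index and return the previous index."""
        old, self.active = self.active, new_index
        return old


def apply_deltas(index, deltas):
    """Replay changes recorded while the new index was being built."""
    for op, doc_id, payload in deltas:
        if op == "upsert":
            index[doc_id] = payload
        elif op == "delete":
            index.pop(doc_id, None)
        else:
            raise ValueError(f"unknown delta op: {op}")


blue = {"doc_1": "emb_v1", "doc_2": "emb_v1"}
green = {"doc_1": "emb_v2", "doc_2": "emb_v2"}  # built from an older snapshot
# Changes that arrived in the source while Green was building.
deltas = [("upsert", "doc_3", "emb_v2"), ("delete", "doc_2", None)]

alias = IndexAlias(blue)
apply_deltas(green, deltas)
previous = alias.swap(green)  # keep Blue around for rollback
print(sorted(alias.active))   # ['doc_1', 'doc_3']
```

Changes that arrive between the last delta replay and the swap are still missed, so this narrows the inconsistency window rather than closing it.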
Metadata Staleness
Document metadata out of sync:
Metadata Updates:
Vector DB entry:
{
"embedding": [0.1, 0.2, ...],
"metadata": {
"title": "API Guide",
"updated_at": "2024-01-01",
"author": "John"
}
}
Document updated in source:
→ Title changed to "API Reference"
→ Author changed to "Jane"
Embedding unchanged (content same)
→ But metadata stale
→ Filters/facets wrong
User filters by author="Jane":
→ Document not returned (metadata says "John")
Metadata-Only Updates:
Efficiency optimization:
→ If content unchanged, don't re-embed
→ But: Always update metadata
Challenge:
→ Detecting content-only vs metadata-only changes
→ Hash comparison needed
→ Adds complexity
Failure mode:
→ Assume metadata-only
→ But content actually changed
→ Embedding stale
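The hash comparison can be sketched as follows. It assumes a content hash was stored alongside the vector at indexing time; the field names (`content_hash`, `metadata`, `content`) and the function `plan_update` are illustrative assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the document body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(stored: dict, incoming: dict) -> str:
    """Decide what the sync must do for one document.

    Content is checked first, so a change is never mistaken for metadata-only.
    """
    if content_hash(incoming["content"]) != stored["content_hash"]:
        return "re-embed"       # new embedding and new metadata
    if incoming["metadata"] != stored["metadata"]:
        return "metadata-only"  # keep the vector, rewrite metadata
    return "skip"


stored = {
    "content_hash": content_hash("How to call the API."),
    "metadata": {"title": "API Guide", "author": "John"},
}
renamed = {
    "content": "How to call the API.",
    "metadata": {"title": "API Reference", "author": "Jane"},
}
rewritten = {
    "content": "How to call the v2 API.",
    "metadata": {"title": "API Guide", "author": "John"},
}

print(plan_update(stored, renamed))    # metadata-only
print(plan_update(stored, rewritten))  # re-embed
```

Deriving the decision from the hash, rather than from what the source system says changed, avoids the failure mode where a content change is assumed to be metadata-only.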
Concurrent Modification Conflicts
Simultaneous updates cause issues:
Race Condition:
Time 10:00:
→ Process A: Starts embedding doc_123
Time 10:02:
→ User updates doc_123 in source
Time 10:03:
→ Process B: Starts embedding doc_123 (newer version)
Time 10:05:
→ Process A finishes, writes old embedding
→ Process B finishes, writes new embedding (overwrites A)
Final state: New embedding (correct)
But if timing different:
Time 10:05:
→ Process B finishes, writes new embedding
Time 10:06:
→ Process A finishes, writes old embedding (overwrites B!)
Final state: Old embedding (wrong!)
Optimistic Locking:
Solution:
→ Include version number with document
→ Process A: Fetches doc_123 v5
→ Process B: Fetches doc_123 v6
→ Process A writes: "Update only if stored version < 5"
→ Process B writes: "Update only if stored version < 6"
If Process B's write lands first (stored version = 6):
→ Process A's write fails (version check)
→ Doesn't overwrite newer version
→ Conflict detected
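A version-checked write might be sketched like this. It assumes one common form of the check, "write only if the incoming version is newer than what is stored", and uses a hypothetical in-memory `VersionedStore`; a real vector DB would need a conditional write or an equivalent check inside a transaction.

```python
class VersionedStore:
    """Toy store that rejects writes older than what it already holds."""

    def __init__(self):
        self._rows = {}  # doc_id -> (version, embedding)

    def write_if_newer(self, doc_id, version, embedding):
        """Apply the write only if it is newer than the stored version."""
        current = self._rows.get(doc_id)
        if current is not None and current[0] >= version:
            return False  # stale write: conflict detected, nothing changed
        self._rows[doc_id] = (version, embedding)
        return True

    def get(self, doc_id):
        return self._rows.get(doc_id)


store = VersionedStore()
# The "bad" ordering from the race above: B (v6) finishes before A (v5).
b_ok = store.write_if_newer("doc_123", 6, "embedding-of-v6")
a_ok = store.write_if_newer("doc_123", 5, "embedding-of-v5")

print(b_ok, a_ok)            # True False
print(store.get("doc_123"))  # (6, 'embedding-of-v6')
```

In the other ordering (A first, then B), both writes succeed and B's newer embedding wins, so the final state is correct either way.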
Cross-Region Replication Lag
Distributed deployments have sync delays:
Multi-Region Vector DB:
Region US-East (primary):
→ Document embedded
→ Vector written immediately
Region EU-West (replica):
→ Replication lag: 2-10 seconds
→ Vector not yet available
User in Europe queries immediately:
→ New document missing
→ Appears not indexed
Eventual consistency across regions
The Split-Brain Problem:
Network partition:
→ US-East and EU-West disconnected
Both regions accept writes:
→ US: Document A updated to version 2
→ EU: Same document A updated to version 2' (different)
Partition heals:
→ Both version 2 and 2' exist
→ Conflict resolution needed
→ Which version wins?
→ Last-write-wins: Data loss possible
→ Merge: Complex, may be incorrect
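The data-loss risk of last-write-wins can be shown in a few lines. The record shape and the `written_at` timestamps below are illustrative assumptions; real systems would also have to deal with clock skew between regions.

```python
def last_write_wins(a: dict, b: dict) -> dict:
    """Keep whichever conflicting write has the later timestamp."""
    return a if a["written_at"] >= b["written_at"] else b


# Two divergent edits of document A made during the partition.
us_write = {"region": "US-East", "content": "version 2", "written_at": 100.0}
eu_write = {"region": "EU-West", "content": "version 2'", "written_at": 101.5}

winner = last_write_wins(us_write, eu_write)
loser = us_write if winner is eu_write else eu_write

print(winner["content"])  # version 2'
print(loser["content"])   # version 2 (silently discarded)
```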
How to Solve
Combine the following practices:
- Implement idempotent upsert operations (delete + insert).
- Store document_id metadata with every vector.
- Track document versions for optimistic concurrency.
- Use reconciliation jobs to detect orphaned vectors.
- Accept eventual consistency and surface it with status indicators ("indexing...").
See Index Synchronization.
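As an illustration of the idempotent delete + insert idea, the sketch below replaces all chunks of a document using deterministic chunk IDs, so running the same sync twice leaves the index in the same state. The index is modeled as a dict, the chunk text stands in for the embedding, and `replace_document` is a hypothetical helper rather than a Twig or vector DB API.

```python
def replace_document(index: dict, doc_id: str, chunks: list, version: int) -> None:
    """Delete every chunk of doc_id, then insert the new chunks.

    Chunk IDs are derived from doc_id and position, so a retry overwrites
    the same keys instead of creating duplicates.
    """
    stale = [cid for cid, row in index.items() if row["doc_id"] == doc_id]
    for cid in stale:
        del index[cid]
    for n, text in enumerate(chunks, start=1):
        index[f"{doc_id}_chunk_{n}"] = {
            "doc_id": doc_id,
            "version": version,
            "text": text,  # placeholder for the real embedding vector
        }


index = {
    f"doc_123_chunk_{n}": {"doc_id": "doc_123", "version": 5, "text": f"old {n}"}
    for n in (1, 2, 3)
}

new_chunks = ["new intro", "new body"]  # the updated document has only 2 chunks
replace_document(index, "doc_123", new_chunks, version=6)
first_run = dict(index)
replace_document(index, "doc_123", new_chunks, version=6)  # retry of the same job

print(sorted(index))       # ['doc_123_chunk_1', 'doc_123_chunk_2']
print(index == first_run)  # True
```

The delete step is what removes the leftover third chunk from the older version; a plain overwrite by ID would have left it orphaned.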
Last updated January 26, 2026


