RAG Scenarios and Solutions
Vector Database Performance
Vector similarity search slows down as the database grows: queries take seconds instead of milliseconds, index builds time out, and memory usage spirals out of control.
The Problem
Vector similarity search slows down as the database grows: queries take seconds instead of milliseconds, index builds time out, and memory usage spirals out of control.
Symptoms
- ❌ Query latency >1 second (was <100ms)
- ❌ Index build takes hours
- ❌ Out-of-memory errors
- ❌ CPU usage spikes during queries
- ❌ Throughput drops as DB grows
Real-World Example
Vector DB performance degradation:
10K vectors: 50ms query time ✓
100K vectors: 150ms query time ✓
1M vectors: 800ms query time ⚠️
10M vectors: 3+ seconds query time ✗
User experience degrades:
→ Page loads feel slow
→ Real-time chat delayed
→ Users frustrated
Cause: O(n) brute-force search doesn't scale
Need approximate search algorithms
Deep Technical Analysis
Brute-Force vs Approximate Search
Exact vs approximate nearest neighbors:
Brute-Force (Exact):
Algorithm:
For each vector in database:
→ Compute cosine similarity with query
→ Track top-K
Complexity: O(n × d)
→ n = number of vectors
→ d = dimensions
1M vectors × 1536 dims:
→ ~1.5 billion multiply-add operations per query
→ ~500ms on a modern CPU (rough, hardware-dependent estimate)
Becomes impractical for interactive queries beyond ~1M vectors
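The exact scan described above can be sketched in a few lines. This is a minimal pure-Python illustration, not a tuned implementation; the function names `cosine_similarity` and `brute_force_top_k` and the toy vectors are assumptions made for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query: O(n * d) work per query."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    # nlargest keeps only k items in memory while scanning all n scores.
    return heapq.nlargest(k, scored)


# Toy data (hypothetical): four 2-dimensional vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
top = brute_force_top_k([1.0, 0.1], vectors, k=2)
print(top)
```

Every query touches all n vectors, which is why latency grows linearly with the size of the database.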
Approximate Nearest Neighbor (ANN):
Algorithms: HNSW, IVF, NSG, etc.
Trade accuracy for speed:
→ May miss some true nearest neighbors
→ But: Returns "good enough" results
→ 10-100x faster
Complexity: roughly O(log n) per query (HNSW, as observed in practice)
→ Logarithmic growth
→ Scales to billions of vectors
10M vectors: ~100ms query time
→ Acceptable for user-facing apps
HNSW Index Structure
Hierarchical Navigable Small World graphs:
Graph Construction:
Build multi-layer graph:
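→ Layer 0: All vectors (dense)
→ Layer 1: A small fraction of vectors (~1/M with the standard level distribution, about 6% for M=16)
→ Layer 2: A fraction of layer 1 (~1/M² of all vectors)
→ Top layer: Few vectors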
Search starts at top:
→ Quickly narrow region
→ Descend to lower layers
→ Refine search
Log(n) hops to find neighbors
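How many vectors land on each layer depends on how levels are drawn at insert time. The sketch below simulates the exponentially decaying level draw used by HNSW, `floor(-ln(U) * level_mult)`. The helper names are assumptions for illustration. With `level_mult = 1 / ln(2)` each layer holds about half of the one below; the HNSW paper's suggested `1 / ln(M)` thins the layers much faster.

```python
import math
import random
from collections import Counter


def draw_level(level_mult, rng):
    """Draw the top layer for a new node: floor(-ln(U) * level_mult), U in (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
    return int(-math.log(u) * level_mult)


def layer_fractions(n, level_mult, seed=0):
    """Fraction of n simulated nodes present on each layer.

    A node whose top layer is L is present on layers 0..L.
    """
    rng = random.Random(seed)
    top_levels = Counter(draw_level(level_mult, rng) for _ in range(n))
    fractions = []
    for layer in range(max(top_levels) + 1):
        present = sum(c for lvl, c in top_levels.items() if lvl >= layer)
        fractions.append(present / n)
    return fractions


M = 16  # connections per node (assumed value for this example)
fractions = layer_fractions(100_000, level_mult=1.0 / math.log(M))
for layer, frac in enumerate(fractions):
    print(f"layer {layer}: {frac:.4%} of vectors")
```

Because each layer is a geometric fraction of the one below, the number of layers grows with log(n), which is where the logarithmic hop count comes from.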
The Memory Trade-off:
HNSW stores:
→ Vectors themselves (n × d × 4 bytes)
→ Graph edges (n × M × 4 bytes)
→ M = connections per node (16-64)
10M vectors × 1536 dims × 4 bytes ≈ 61 GB, or 57 GiB (vectors)
10M vectors × 32 edges × 4 bytes ≈ 1.3 GB (graph)
Total: ~63 GB RAM
High memory requirement
→ Expensive infrastructure
→ But: Fast queries
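The memory arithmetic above is easy to reproduce for other sizes. This is a rough estimator under the same assumptions as the text (4-byte floats, 4-byte edge IDs, about M edges per node); the function name `hnsw_memory_bytes` is made up for the example, and real indexes add per-layer and allocator overhead on top.

```python
def hnsw_memory_bytes(n, dims, m, bytes_per_float=4, bytes_per_edge=4):
    """Rough HNSW footprint: raw vectors plus about m edge IDs per node."""
    vector_bytes = n * dims * bytes_per_float
    graph_bytes = n * m * bytes_per_edge
    return vector_bytes, graph_bytes


vector_bytes, graph_bytes = hnsw_memory_bytes(n=10_000_000, dims=1536, m=32)
total = vector_bytes + graph_bytes
for label, size in [("vectors", vector_bytes), ("graph", graph_bytes), ("total", total)]:
    # Print both decimal gigabytes and binary gibibytes to avoid unit confusion.
    print(f"{label}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

The vectors dominate the footprint, so compressing them matters far more than reducing M.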
Build Time Challenge:
Inserting 10M vectors:
→ Each insert: O(log n) + edge updates
→ Total: O(n log n)
→ ~2-4 hours on standard hardware
During build:
→ Index not queryable
→ Or: Degraded performance
→ Requires maintenance windows
Inverted File Index (IVF) Approach
Clustering-based search:
Clustering Strategy:
1. Cluster vectors into N groups (k-means)
2. Store cluster centroids
3. Assign each vector to nearest centroid
Query:
1. Find nearest K centroids to query (fast)
2. Search only vectors in those K clusters
3. Return top results
Reduces search space:
→ Search K/N of database
→ If K=10, N=1000: Search 1% of vectors
→ Up to ~100x fewer distance computations (assuming evenly sized clusters)
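The clustering and query steps above can be sketched as follows. This is a toy pure-Python version under stated assumptions: a plain Lloyd's k-means, squared Euclidean distance, random Gaussian data, and an `n_probe` parameter playing the role of K (the number of clusters searched). All function names are illustrative.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(vectors, n_clusters, iterations=10, seed=0):
    """Plain Lloyd's k-means; empty clusters keep their previous centroid."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(iterations):
        members = [[] for _ in range(n_clusters)]
        for v in vectors:
            members[nearest(v, centroids)].append(v)
        for c, group in enumerate(members):
            if group:
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def build_ivf(vectors, centroids):
    """Inverted lists: centroid index -> IDs of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, n_probe, k):
    """Scan only the n_probe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids = kmeans(vectors, n_clusters=20)
lists = build_ivf(vectors, centroids)
ids, scanned = ivf_search(vectors[0], vectors, centroids, lists, n_probe=2, k=5)
print(f"scanned {scanned} of {len(vectors)} vectors, top IDs: {ids}")
```

Raising `n_probe` scans more clusters, which improves recall at the cost of latency; true neighbors that fall in unprobed clusters are missed.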
The Re-Ranking Pattern:
Two-stage search:
1. IVF approximate search → 100 candidates
2. Exact brute-force on 100 → top-10
Fast first stage (approximate)
+ Accurate second stage (exact)
= Good balance
Total time: 20ms + 5ms = 25ms
→ vs 500ms brute-force
→ 20x faster
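A minimal sketch of the two-stage pattern follows. To stay self-contained, the first stage is a deliberately crude stand-in (it scores on a prefix of the dimensions only) rather than a real IVF search. The function names, the `dims_used` parameter, and the random data are all assumptions for illustration.

```python
import heapq
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def approximate_candidates(query, vectors, n_candidates, dims_used=4):
    """Stand-in first stage: score on a prefix of the dimensions only.

    In the pattern described above this would be the IVF search.
    """
    q = query[:dims_used]
    return heapq.nlargest(
        n_candidates, range(len(vectors)), key=lambda i: dot(q, vectors[i][:dims_used])
    )


def rerank_exact(query, candidate_ids, vectors, k):
    """Second stage: exact scores, but only for the shortlisted candidates."""
    return heapq.nlargest(k, candidate_ids, key=lambda i: dot(query, vectors[i]))


rng = random.Random(7)
vectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(5000)]
query = vectors[123]

shortlist = approximate_candidates(query, vectors, n_candidates=100)
top10 = rerank_exact(query, shortlist, vectors, k=10)
print(top10)
```

The exact stage can only reorder what the approximate stage returned, so the shortlist size (100 here) bounds the recall of the final top-10.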
Quantization for Memory Reduction
Compress vectors to use less RAM:
Scalar Quantization (int8):
Original vector: 1536 floats × 4 bytes = 6 KB
Quantized: 1536 × 1 byte = 1.5 KB
→ 75% memory savings
Product Quantization:
→ Divide 1536 dims into 8 subspaces (192 dims each)
→ Cluster each subspace (256 clusters)
→ Store cluster ID (1 byte) instead of 192 floats
→ 8 bytes per vector instead of 6 KB
Retrieval:
→ Approximate similarity from cluster IDs
→ Re-rank top candidates with exact vectors
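The subspace scheme above (split into subspaces, cluster each one, store a one-byte cluster ID per subspace) can be sketched as follows. To keep the example short, codebook training is replaced by sampling existing sub-vectors as centroids; a real system would run k-means per subspace. The dimensions are scaled down to 32, and every function name is an assumption made for the example.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vector, n_sub):
    """Cut a vector into n_sub equal-length sub-vectors."""
    step = len(vector) // n_sub
    return [vector[i * step:(i + 1) * step] for i in range(n_sub)]


def train_codebooks(vectors, n_sub, n_codes, seed=0):
    """Stand-in for per-subspace k-means: sample n_codes sub-vectors as centroids."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, n_codes)
    return [[split(v, n_sub)[s] for v in sample] for s in range(n_sub)]


def encode(vector, codebooks):
    """One byte per subspace: the ID of the nearest centroid in that subspace."""
    parts = split(vector, len(codebooks))
    return bytes(
        min(range(len(book)), key=lambda c: sq_dist(part, book[c]))
        for part, book in zip(parts, codebooks)
    )


def approx_sq_dist(query, code, codebooks):
    """Asymmetric distance: exact query sub-vectors against stored centroid IDs."""
    parts = split(query, len(codebooks))
    return sum(sq_dist(part, book[c]) for part, book, c in zip(parts, codebooks, code))


rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(300)]
codebooks = train_codebooks(vectors, n_sub=8, n_codes=256)  # 256 IDs fit in one byte
codes = [encode(v, codebooks) for v in vectors]
print(f"{len(vectors[0]) * 4} bytes per float32 vector -> {len(codes[0])} bytes per code")
```

Distances computed from codes are approximate, which is why the retrieval step re-ranks the top candidates against the full-precision vectors.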
Accuracy-Speed Trade-off:
Float32 (full precision):
→ Exact similarity
→ 6 KB per vector
→ Slowest
Float16 (half precision):
→ Slight accuracy loss (~0.1%)
→ 3 KB per vector
→ 2x memory savings
Int8 (scalar quantization):
→ Moderate accuracy loss (~2-5%)
→ 1.5 KB per vector
→ 4x memory savings
Binary (extreme):
→ Significant accuracy loss (~10-15%)
→ 192 bytes per vector
→ 32x memory savings
Choose based on requirements
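The per-vector sizes in this comparison follow directly from the bits spent on each dimension. This short calculation reproduces them for 1536 dimensions; `bytes_per_vector` is an illustrative helper, not a library function.

```python
def bytes_per_vector(dims, bits_per_dim):
    """Storage for one vector when each dimension takes bits_per_dim bits."""
    return dims * bits_per_dim // 8


DIMS = 1536
precisions = {"float32": 32, "float16": 16, "int8": 8, "binary": 1}
sizes = {name: bytes_per_vector(DIMS, bits) for name, bits in precisions.items()}

for name, size in sizes.items():
    ratio = sizes["float32"] / size
    print(f"{name:8s} {size:5d} bytes per vector ({ratio:.0f}x smaller than float32)")
```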
Sharding and Distribution
Scale horizontally:
Database Sharding:
10M vectors split across 10 shards:
→ Shard 1: 1M vectors
→ Shard 2: 1M vectors
→ ...
→ Shard 10: 1M vectors
Query:
→ Parallel search across all shards
→ Each returns top-10
→ Merge results → final top-10
Latency: Max(shard query times)
→ Not sum
→ Parallelization benefit
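The scatter-gather query above can be sketched with a thread pool. Here each shard is an in-memory dictionary and the threads stand in for network calls to separate shard servers; the shard layout, document IDs, and function names are assumptions for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Local top-k for one shard; shard maps document ID -> vector."""
    return heapq.nlargest(k, ((dot(query, v), doc_id) for doc_id, v in shard.items()))


def search_all_shards(shards, query, k):
    """Scatter the query to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, query, k), shards)
        merged = [hit for partial in partials for hit in partial]
    # Each shard returned at most k hits, so the global top-k is among them.
    return heapq.nlargest(k, merged)


shards = [
    {"a": [1.0, 0.0], "b": [0.2, 0.9]},
    {"c": [0.9, 0.1], "d": [-1.0, 0.0]},
    {"e": [0.5, 0.5]},
]
top = search_all_shards(shards, query=[1.0, 0.0], k=2)
print(top)
```

Because the merge waits for every shard, the slowest shard sets the overall latency.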
The Hot Shard Problem:
Uneven data distribution:
→ Shard 1: Popular documents (high QPS)
→ Shard 10: Rarely queried documents
Shard 1 becomes bottleneck:
→ Overloaded
→ Slower responses
→ Drags down overall latency
Solution:
→ Rebalance shards periodically
→ Or: Hash-based distribution
→ Or: Replicate hot shards
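One way to implement hash-based distribution is to derive the shard from a stable hash of the document ID. The sketch below assumes string IDs and a fixed shard count; `shard_for` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def shard_for(doc_id, n_shards):
    """Stable shard assignment from a hash of the document ID."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


# Distribute 100,000 synthetic IDs over 10 shards and inspect the balance.
counts = Counter(shard_for(f"doc-{i}", 10) for i in range(100_000))
print(sorted(counts.items()))
```

Hashing spreads documents evenly regardless of topic or popularity. Note that plain modulo hashing reassigns most IDs when the shard count changes, so resizing the cluster means moving data.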
Write Amplification
Index updates are expensive:
Single Vector Insert:
HNSW insert:
1. Find insertion point (log n hops)
2. Add vector to layer 0
3. Create M edges to neighbors
4. Update neighbor edges (reciprocal)
5. Potentially propagate to upper layers
Operations: 50-100 per insert
→ vs 1 operation for raw storage
Write amplification: 50-100x
Bulk vs Incremental:
Incremental updates:
→ Add 1 vector at a time
→ Rebuild local graph structure
→ Fast per-vector
→ But: Index quality degrades over time
Bulk rebuild:
→ Build entire index from scratch
→ Optimal structure
→ Slow (hours)
→ Downtime or dual-index strategy
Query Optimization
Improve search performance:
Batch Queries:
Process queries in batches:
→ 10 queries at once
→ Amortize index traversal overhead
→ Better CPU cache utilization
Single query: 100ms
Batch of 10: 400ms (40ms each)
→ 2.5x throughput improvement
Trade-off: Adds latency (wait for batch)
→ Good for background processing
→ Bad for user-facing queries
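The throughput figure above is just queries divided by elapsed time. This small calculation uses the example timings from the text; `throughput_qps` is an illustrative helper.

```python
def throughput_qps(n_queries, total_ms):
    """Queries completed per second given the wall-clock time for n_queries."""
    return n_queries / (total_ms / 1000.0)


single = throughput_qps(1, 100)    # one query in 100 ms
batched = throughput_qps(10, 400)  # ten queries in 400 ms
print(f"single: {single:.0f} qps, batched: {batched:.0f} qps, gain: {batched / single:.1f}x")
```

The gain applies to throughput only: every query in the batch waits for the whole 400 ms batch, plus however long it took the batch to fill.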
Prefiltering vs Postfiltering:
Query with metadata filter:
"Find similar docs WHERE category='API'"
Prefilter approach:
1. Filter vectors by metadata
2. Build temporary index on filtered set (or scan it exactly if it is small)
3. Search filtered set
Postfilter approach:
1. Search entire index
2. Get top-100 candidates
3. Filter by metadata
4. Return top-10 after filtering
Prefilter: Better if filter is selective (10% match)
Postfilter: Better if filter is broad (90% match)
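The difference between the two approaches shows up clearly on a toy data set. In this sketch both variants rank by exact dot product instead of using an index, so it illustrates only the filtering order; the document layout and function names are assumptions for the example.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prefilter_search(query, docs, predicate, k):
    """Filter by metadata first, then rank only the matching vectors."""
    matching = [d for d in docs if predicate(d)]
    return heapq.nlargest(k, matching, key=lambda d: dot(query, d["vector"]))


def postfilter_search(query, docs, predicate, k, n_candidates):
    """Rank everything, keep n_candidates, then drop the ones that fail the filter."""
    candidates = heapq.nlargest(n_candidates, docs, key=lambda d: dot(query, d["vector"]))
    return [d for d in candidates if predicate(d)][:k]


def is_api(doc):
    return doc["category"] == "API"


docs = [
    {"id": 1, "category": "API", "vector": [0.9, 0.1]},
    {"id": 2, "category": "Guide", "vector": [1.0, 0.0]},
    {"id": 3, "category": "API", "vector": [0.2, 0.8]},
    {"id": 4, "category": "Guide", "vector": [0.8, 0.3]},
]
query = [1.0, 0.0]

pre = prefilter_search(query, docs, is_api, k=2)
post = postfilter_search(query, docs, is_api, k=2, n_candidates=2)
print([d["id"] for d in pre], [d["id"] for d in post])
```

With a candidate pool of only two, the postfilter variant returns a single document even though two match the filter. This is the failure mode that makes postfiltering a poor fit for selective filters unless the candidate pool is enlarged.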
Monitoring and Diagnosis
Track performance metrics:
Key Metrics:
Latency percentiles:
→ p50: 50ms (median)
→ p90: 120ms (90th percentile)
→ p99: 500ms (99th percentile)
Why percentiles?
→ Average hides outliers
→ p99 affects user experience
Tail latency:
→ 1% of queries take 500ms or longer
→ Roughly 1 in 100 requests feels slow to the user
→ p99 latency is the metric to watch
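Percentiles can be computed directly from recorded latencies. The sketch below uses the nearest-rank definition and a hypothetical sample shaped to match the figures above; other percentile definitions interpolate and can give slightly different values.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: mostly fast, with a slow tail.
latencies = [50] * 80 + [120] * 18 + [500] * 2

mean = sum(latencies) / len(latencies)
p50, p90, p99 = (percentile(latencies, p) for p in (50, 90, 99))
print(f"mean={mean:.1f}ms p50={p50}ms p90={p90}ms p99={p99}ms")
```

The mean sits close to the median here and gives no hint of the 500ms tail, which only the p99 exposes.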
Degradation Signals:
Warning signs:
→ Latency creeping up over time
→ Memory usage increasing
→ Query throughput decreasing
Causes:
→ Index fragmentation (needs rebuild)
→ Hot spots (rebalance needed)
→ Memory pressure (swap, OOM)
Proactive monitoring prevents outages
How to Solve
Combine the following:
- Use approximate search (HNSW, IVF) instead of brute-force
- Implement product quantization for memory savings
- Shard the database horizontally
- Batch queries where possible
- Rebuild indexes periodically
- Monitor p99 latency
- Use SSDs for disk-based indexes
See Vector DB Performance.
Last updated January 26, 2026


