RAG Scenarios and Solutions
Duplicate Content in Vector DB
Same or highly similar content embedded multiple times, wasting storage, increasing costs, and causing repetitive or confused AI responses.
The Problem
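When the same or highly similar content is embedded more than once, the vector database stores redundant vectors. That wastes storage, raises embedding and hosting costs, and fills retrieval results with near-identical chunks, which leads to repetitive or confused AI responses.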
Symptoms
- ❌ Same answer repeated multiple times
- ❌ Storage costs higher than expected
- ❌ Retrieval returns near-identical chunks
- ❌ AI cites same source 3+ times
- ❌ Multiple versions of same doc embedded
Real-World Example
Knowledge base contains:
→ "FAQ v1.pdf" (ingested January)
→ "FAQ v2.pdf" (ingested March, 90% overlap with v1)
→ "FAQ-copy.pdf" (duplicate file, different name)
Query: "How to reset password?"
Retrieved chunks:
→ Chunk A (FAQ v1): "Reset password: Click forgot password..."
→ Chunk B (FAQ v2): "Reset password: Click forgot password..." (identical)
→ Chunk C (FAQ-copy): "Reset password: Click forgot password..." (identical)
AI response cites all three (redundant)
Storage: 3x cost for same content
Deep Technical Analysis
Sources of Duplication
Document Re-Ingestion:
Common scenario:
→ Ingest doc_v1.pdf
→ Update to doc_v2.pdf
→ Re-ingest without deleting v1
→ Both versions coexist
Result: Duplicate + outdated data
Cross-Source Duplication:
Same content in multiple places:
→ Help Center article
→ Internal wiki (copy/paste of article)
→ PDF export of article
All ingested → 3x duplicate
Chunking Overlap:
Sliding window chunking:
→ Chunk 1: Tokens 0-500 (with 10% overlap)
→ Chunk 2: Tokens 450-950
→ Overlap: Tokens 450-500 duplicated
Some overlap intentional (context preservation)
Too much overlap = duplication
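The overlap arithmetic above can be checked with a small sketch. The function names, the default chunk size of 500 tokens, and the default overlap of 50 tokens are illustrative assumptions chosen to match the example ranges; they do not come from any particular chunking library.

```python
def sliding_window_spans(total_tokens, chunk_size=500, overlap=50):
    """Return (start, end) token ranges (end exclusive) for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    spans = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        spans.append((start, end))
        if end == total_tokens:
            break
        start += step
    return spans


def duplicated_fraction(spans):
    """Fraction of stored tokens that repeat tokens already stored in an earlier chunk."""
    if not spans:
        return 0.0
    stored = sum(end - start for start, end in spans)
    unique = spans[-1][1] - spans[0][0]
    return (stored - unique) / stored


spans = sliding_window_spans(950)             # two chunks: 0-500 and 450-950
modest = duplicated_fraction(spans)           # 50 repeated tokens out of 1,000 stored
heavy = duplicated_fraction(sliding_window_spans(950, overlap=250))
```

Comparing `modest` and `heavy` shows how quickly the share of repeated tokens grows as the overlap setting increases.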
Detection Strategies
Exact Duplicate Detection:
Hash-based:
→ Hash each chunk text (MD5, SHA-256)
→ Store hash
→ Before inserting, check if hash exists
Fast, catches exact duplicates
Misses: Paraphrases, minor edits
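A minimal sketch of the hash check described above, using Python's standard `hashlib`. The names `chunk_hash`, `is_new_chunk`, and `seen_hashes` are hypothetical, and the whitespace and case normalization is an added assumption (the text above only calls for hashing the chunk text). In a real system the hash set would live in a database or in vector metadata rather than in memory.

```python
import hashlib
import re

seen_hashes = set()  # assumption: in-memory stand-in for a persistent hash index


def chunk_hash(text, normalize=True):
    """SHA-256 hex digest of a chunk, optionally after light normalization."""
    if normalize:
        # Assumption: collapse whitespace and lowercase so trivial formatting
        # differences do not defeat exact matching.
        text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_new_chunk(text):
    """Return True and record the hash if the chunk has not been seen before."""
    digest = chunk_hash(text)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True


first = is_new_chunk("Reset password: Click forgot password...")
repeat = is_new_chunk("Reset  password: click forgot password...")  # same after normalization
paraphrase = is_new_chunk("To reset your password, click forgot password...")
```

Here `repeat` is rejected because it normalizes to the same text, while `paraphrase` is accepted as new, which is the limitation noted above.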
Semantic Duplicate Detection:
Cosine similarity between embeddings:
→ If similarity > 0.95 (very high; the right threshold depends on the embedding model)
→ Likely duplicate/near-duplicate
Example:
→ "Reset your password" vs "Reset password"
→ Different text, same meaning
→ Embeddings very similar
→ Flag as duplicate
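The similarity check can be sketched in plain Python as follows. The two-dimensional vectors are toy values standing in for real embeddings, and `is_semantic_duplicate` with its 0.95 default is a hypothetical helper that mirrors the threshold above rather than a recommended setting.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def is_semantic_duplicate(a, b, threshold=0.95):
    """Flag two embeddings as near-duplicates when similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold


# Toy vectors (assumption): real embeddings have hundreds or thousands of dimensions.
reset_password = [1.0, 0.0]
reset_your_password = [0.99, 0.05]
billing_question = [0.0, 1.0]

near_dup = is_semantic_duplicate(reset_password, reset_your_password)
unrelated = is_semantic_duplicate(reset_password, billing_question)
```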
Fuzzy Matching:
Levenshtein distance:
→ Edit distance (insertions, deletions, substitutions) between texts
→ If distance < 5% of length
→ Near-duplicate
Catches typos and small edits; heavier rewording exceeds the threshold
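A sketch of the edit-distance check, implemented with the standard dynamic-programming recurrence so it needs no third-party package. The helper name `is_near_duplicate` and the choice to measure the 5% against the longer of the two texts are assumptions; the text above does not say which length to use.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_near_duplicate(a, b, max_ratio=0.05):
    """True when the edit distance is below max_ratio of the longer text."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are identical
    return levenshtein(a, b) / longest < max_ratio


typo = is_near_duplicate(
    "Click forgot password to reset your password.",
    "Click forgot pasword to reset your password.",
)
rephrased = is_near_duplicate("Reset your password", "Reset password")
```

The single-character typo falls under the threshold, while dropping the word "your" from a short string does not, so short rephrasings are better handled by the embedding-based check.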
Deduplication Strategies
Pre-Ingestion Dedup:
Before embedding:
1. Hash new chunks
2. Check against existing hashes
3. Skip if duplicate
Prevents ingestion entirely
Most efficient
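The three steps above might look like the following sketch. `ingest`, `known_hashes`, and `fake_embed` are hypothetical names; `fake_embed` stands in for a real embedding API call so the example can show that duplicates never reach it.

```python
import hashlib

embed_calls = []  # records every text sent to the (fake) embedding function


def fake_embed(text):
    """Placeholder for a real embedding call (assumption, not a real API)."""
    embed_calls.append(text)
    return [float(len(text))]


def ingest(chunks, known_hashes, embed):
    """Embed only chunks whose hash is not already known.

    Returns the new (hash, vector) records and updates known_hashes in place.
    """
    new_records = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # 1. hash
        if digest in known_hashes:                                 # 2. check
            continue                                               # 3. skip
        known_hashes.add(digest)
        new_records.append((digest, embed(text)))
    return new_records


known = set()
first_batch = ingest(["Reset password", "Change email", "Reset password"], known, fake_embed)
second_batch = ingest(["Change email", "Delete account"], known, fake_embed)
```

Because the check happens before `embed` is called, the duplicate chunks cost neither an embedding request nor a stored vector.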
Post-Ingestion Dedup:
After ingestion:
1. Compute pairwise similarities
2. Identify duplicates (similarity > 0.95)
3. Delete lower-priority duplicates
Use when:
→ Cleanup needed
→ Legacy data has duplicates
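A cleanup pass along these lines could be sketched as below. The record layout (`id`, `vector`, `priority`) is an assumption, with `priority` standing for whatever rule decides which copy survives, such as a newer version or a preferred source. The brute-force pairwise comparison is quadratic, so for a large index a nearest-neighbor query per vector would be a more realistic choice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find_duplicates(records, threshold=0.95):
    """Return ids of lower-priority records that duplicate a kept record."""
    # Visit higher-priority records first so they are the ones that get kept.
    ordered = sorted(records, key=lambda r: r["priority"], reverse=True)
    kept = []
    to_delete = []
    for record in ordered:
        if any(cosine(record["vector"], k["vector"]) > threshold for k in kept):
            to_delete.append(record["id"])
        else:
            kept.append(record)
    return to_delete


# Toy data (assumption): priority 2 marks the newer FAQ version.
records = [
    {"id": "faq_v1_chunk", "vector": [1.0, 0.0], "priority": 1},
    {"id": "faq_v2_chunk", "vector": [0.99, 0.05], "priority": 2},
    {"id": "billing_chunk", "vector": [0.0, 1.0], "priority": 1},
]
stale_ids = find_duplicates(records)
```

The returned ids would then be passed to the vector database's delete operation.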
Version-Aware Ingestion:
Track document versions:
{
  "document_id": "FAQ",
  "version": 2,
  "chunks": [...]
}
On re-ingest:
→ Delete chunks where document_id="FAQ" AND version < 2
→ Add new version
Automatic cleanup
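The re-ingest rule above can be illustrated with an in-memory stand-in for a vector store. `InMemoryStore` and `ingest_version` are hypothetical names, and the guard that ignores a version that is not newer than the stored one is an added assumption to keep repeated runs from duplicating data; a real vector database would do the deletion with a metadata filter.

```python
class InMemoryStore:
    """Toy store: each record carries document_id and version metadata."""

    def __init__(self):
        self.records = []

    def ingest_version(self, document_id, version, chunks):
        """Replace older versions of a document; return the number of chunks added."""
        stored = [r["version"] for r in self.records if r["document_id"] == document_id]
        if stored and max(stored) >= version:
            return 0  # assumption: same or newer version already stored, do nothing
        # Delete chunks where document_id matches AND version is older.
        self.records = [
            r for r in self.records
            if not (r["document_id"] == document_id and r["version"] < version)
        ]
        # Add the new version.
        self.records.extend(
            {"document_id": document_id, "version": version, "text": text}
            for text in chunks
        )
        return len(chunks)


store = InMemoryStore()
store.ingest_version("FAQ", 1, ["Reset password v1", "Change email v1"])
store.ingest_version("Pricing", 1, ["Plans"])
added = store.ingest_version("FAQ", 2, ["Reset password v2", "Change email v2", "Delete account"])
repeated = store.ingest_version("FAQ", 2, ["Reset password v2"])
faq_versions = {r["version"] for r in store.records if r["document_id"] == "FAQ"}
```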
Storage Impact
Cost Calculation:
10,000 unique chunks
Duplicates equal to 20% of unique chunks → 2,000 duplicates
Storage:
→ 12,000 vectors vs 10,000 (20% extra storage)
Embedding cost:
→ 2,000 duplicate embeddings generated
→ Wasted API calls
Retrieval:
→ More data to search → slightly slower
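The arithmetic above can be reproduced with a short helper. `duplication_overhead` is a hypothetical name, and the sketch interprets the duplication rate as duplicates relative to unique chunks, which is the reading that yields 2,000 duplicates from 10,000 unique chunks. It also reports the duplicates' share of the whole index, which is a different and smaller percentage.

```python
def duplication_overhead(unique_chunks, duplicate_ratio):
    """Summarize the storage impact of duplicates.

    duplicate_ratio is duplicates divided by unique chunks (0.20 means 20%).
    """
    duplicates = round(unique_chunks * duplicate_ratio)
    total_vectors = unique_chunks + duplicates
    return {
        "duplicates": duplicates,
        "total_vectors": total_vectors,
        "extra_storage_pct": 100 * duplicates / unique_chunks,
        "share_of_index_pct": 100 * duplicates / total_vectors,
    }


impact = duplication_overhead(10_000, 0.20)
```

For the example figures, `impact` reports 2,000 duplicates, 12,000 stored vectors, and 20% extra storage, with duplicates making up roughly one sixth of the index.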
How to Solve
Combine the following measures:
- Implement hash-based exact duplicate detection at ingestion
- Run semantic similarity deduplication (cosine > 0.95) periodically
- Use version-aware ingestion (delete old versions on update)
- Track document_id and version metadata
- Prefer a single source of truth (don't ingest the same content from multiple sources)
- Monitor a duplicate rate metric
See Duplicate Management.
Last updated January 26, 2026


