RAG Scenarios and Solutions

Duplicate Content in Vector DB

The same or highly similar content is embedded multiple times, which wastes storage, raises costs, and leads to repetitive or confused AI responses.

The Problem

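When the same content reaches the vector database through several files, versions, or sources, every copy is chunked, embedded, and stored separately. Retrieval then returns several near-identical chunks for a single query, and the AI response becomes repetitive or confused.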

Symptoms

  • ❌ Same answer repeated multiple times
  • ❌ Storage costs higher than expected
  • ❌ Retrieval returns near-identical chunks
  • ❌ AI cites same source 3+ times
  • ❌ Multiple versions of same doc embedded

Real-World Example

Knowledge base contains:
→ "FAQ v1.pdf" (ingested January)
→ "FAQ v2.pdf" (ingested March, 90% overlap with v1)
→ "FAQ-copy.pdf" (duplicate file, different name)

Query: "How to reset password?"

Retrieved chunks:
→ Chunk A (FAQ v1): "Reset password: Click forgot password..."
→ Chunk B (FAQ v2): "Reset password: Click forgot password..." (identical)
→ Chunk C (FAQ-copy): "Reset password: Click forgot password..." (identical)

AI response cites all three (redundant)
Storage: 3x cost for same content

Deep Technical Analysis

Sources of Duplication

Document Re-Ingestion:

Common scenario:
→ Ingest doc_v1.pdf
→ Update to doc_v2.pdf
→ Re-ingest without deleting v1
→ Both versions coexist

Result: Duplicate + outdated data

Cross-Source Duplication:

Same content in multiple places:
→ Help Center article
→ Internal wiki (copy/paste of article)
→ PDF export of article

All ingested → 3x duplicate

Chunking Overlap:

Sliding window chunking:
→ Chunk 1: Tokens 0-500 (with 10% overlap)
→ Chunk 2: Tokens 450-950
→ Overlap: Tokens 450-500 duplicated

Some overlap intentional (context preservation)
Too much overlap = duplication
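
To make the overlap concrete, here is a minimal sketch of sliding-window chunking over a token list. The function name `sliding_window_chunks` and its parameters are illustrative assumptions, not part of any particular library; the sizes mirror the 500-token window and 50-token overlap described above.

```python
def sliding_window_chunks(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens.

    Hypothetical helper for illustration; returns (start_index, window) pairs.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


tokens = list(range(950))  # stand-in for 950 real tokens
chunks = sliding_window_chunks(tokens)
for start, window in chunks:
    print(f"tokens {start}-{start + len(window)}")

# Tokens that appear in both windows are stored (and embedded) twice.
shared = set(chunks[0][1]) & set(chunks[1][1])
print(f"{len(shared)} tokens duplicated")
```

Raising `overlap` shrinks the step between windows, so the same input produces more chunks and a larger share of repeated tokens.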

Detection Strategies

Exact Duplicate Detection:

Hash-based:
→ Hash each chunk text (MD5, SHA-256)
→ Store hash
→ Before inserting, check if hash exists

Fast, catches exact duplicates
Misses: Paraphrases, minor edits
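
A rough sketch of the fingerprinting step, using SHA-256 from the standard library. The whitespace and case normalization is an added assumption (the text above hashes the chunk as-is); it lets trivially reformatted copies collide while real edits still produce a different hash. `chunk_fingerprint` is a hypothetical name.

```python
import hashlib
import re


def chunk_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the chunk text.

    Assumption: whitespace is collapsed and case is folded first, so that
    copies differing only in formatting hash to the same value.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = chunk_fingerprint("Reset password: Click forgot password...")
reformatted = chunk_fingerprint("Reset  password: click Forgot password...")
edited = chunk_fingerprint("Reset your password: Click forgot password...")

print(original == reformatted)  # same content, different spacing and case
print(original == edited)       # one added word changes the hash
```

The second comparison illustrates the limitation noted above: a single added word yields an unrelated hash, so minor edits need a different detection method.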

Semantic Duplicate Detection:

Cosine similarity between embeddings:
→ If similarity > 0.95 (very high; treat the exact threshold as a starting point and tune it for your embedding model)
→ Likely duplicate/near-duplicate

Example:
→ "Reset your password" vs "Reset password"
→ Different text, same meaning
→ Embeddings very similar
→ Flag as duplicate
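
The comparison can be sketched in plain Python as follows. The three-dimensional vectors are toy values chosen for illustration, not output from a real embedding model, and `is_semantic_duplicate` is a hypothetical helper; the 0.95 threshold is taken from the text above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def is_semantic_duplicate(u, v, threshold=0.95):
    """Flag a pair as duplicate when similarity exceeds the threshold."""
    return cosine_similarity(u, v) > threshold


# Toy embeddings (assumed values, for illustration only).
emb_reset_your_password = [0.90, 0.10, 0.40]
emb_reset_password = [0.88, 0.12, 0.41]
emb_billing_question = [0.10, 0.90, 0.20]

print(is_semantic_duplicate(emb_reset_your_password, emb_reset_password))
print(is_semantic_duplicate(emb_reset_your_password, emb_billing_question))
```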

Fuzzy Matching:

Levenshtein distance:
→ Edit distance between texts
→ If distance < 5% of length
→ Near-duplicate

Catches typos and small wording edits
Misses: Larger rephrasings (use the semantic check for those)
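
A minimal sketch of the edit-distance check, written without third-party packages. The dynamic-programming routine is the standard Levenshtein algorithm; `is_near_duplicate` and its `max_ratio` parameter are illustrative names, with the 5% cutoff taken from the rule above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_near_duplicate(a: str, b: str, max_ratio: float = 0.05) -> bool:
    """True when the edit distance is below `max_ratio` of the longer text."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # two empty strings are identical
    return levenshtein(a, b) / longest < max_ratio


typo = is_near_duplicate(
    "Click forgot password to reset your account",
    "Click forgot pasword to reset your account",
)
reworded = is_near_duplicate("Reset your password", "Reset password")
print(typo, reworded)
```

On short texts a single dropped word already exceeds the 5% cutoff, which is why the second pair is not flagged here even though its meaning is the same.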

Deduplication Strategies

Pre-Ingestion Dedup:

Before embedding:
1. Hash new chunks
2. Check against existing hashes
3. Skip if duplicate

Duplicates never reach the embedding step
Most efficient: no embedding call or storage is spent on them
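
The three steps above can be sketched as a small ingestion wrapper. Everything here is a hypothetical stand-in: `DedupIngestor` is not a real library class, `fake_embed` replaces an actual embedding API call, and the in-memory list replaces a vector database.

```python
import hashlib


def fake_embed(text):
    """Placeholder for a real embedding call (assumption for this sketch)."""
    return [float(len(text))]


class DedupIngestor:
    """Skips chunks whose exact text has already been ingested."""

    def __init__(self, embed):
        self.embed = embed
        self.seen_hashes = set()
        self.vectors = []   # (hash, embedding) pairs; stand-in for a vector DB
        self.skipped = 0

    def ingest(self, chunks):
        for text in chunks:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in self.seen_hashes:
                self.skipped += 1  # duplicate: no embedding call is made
                continue
            self.seen_hashes.add(digest)
            self.vectors.append((digest, self.embed(text)))


ingestor = DedupIngestor(fake_embed)
ingestor.ingest([
    "Reset password: Click forgot password...",  # FAQ v1
    "Reset password: Click forgot password...",  # FAQ v2, identical chunk
    "Reset password: Click forgot password...",  # FAQ-copy, identical chunk
    "Update billing details under Settings.",
])
print(len(ingestor.vectors), "stored,", ingestor.skipped, "skipped")
```

In a real deployment the hash set would need to persist between ingestion runs, for example as a metadata field on each stored vector.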

Post-Ingestion Dedup:

After ingestion:
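1. Compute pairwise similarities (an exhaustive comparison grows quadratically with the number of vectors; a nearest-neighbor query per vector scales better)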
2. Identify duplicates (similarity > 0.95)
3. Delete lower-priority duplicates

Use when:
→ Cleanup needed
→ Legacy data has duplicates
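
One way to sketch the cleanup pass is a greedy scan that visits records from highest to lowest priority and marks any record that is too similar to one already kept. The record layout (`id`, `vector`, `priority`) and the function `find_duplicates` are assumptions for illustration, and the exhaustive comparison is only practical for small collections.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_duplicates(records, threshold=0.95):
    """Return ids of lower-priority records that duplicate a kept record.

    Assumed record shape: {"id": str, "vector": list[float], "priority": int}.
    """
    kept, to_delete = [], []
    for rec in sorted(records, key=lambda r: r["priority"], reverse=True):
        if any(cosine(rec["vector"], k["vector"]) > threshold for k in kept):
            to_delete.append(rec["id"])
        else:
            kept.append(rec)
    return to_delete


records = [
    {"id": "faq_v1", "vector": [0.99, 0.05], "priority": 1},
    {"id": "faq_v2", "vector": [1.00, 0.00], "priority": 2},
    {"id": "faq_copy", "vector": [1.00, 0.00], "priority": 0},
    {"id": "billing", "vector": [0.00, 1.00], "priority": 1},
]
to_delete = find_duplicates(records)
print(sorted(to_delete))
```

What counts as priority is a design choice; newest version or most authoritative source are typical candidates.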

Version-Aware Ingestion:

Track document versions:
{
  document_id: "FAQ",
  version: 2,
  chunks: [...]
}

On re-ingest:
→ Delete chunks where document_id="FAQ" AND version < 2
→ Add new version

Automatic cleanup
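
A minimal in-memory sketch of the re-ingest rule. `VersionedStore` is a hypothetical class standing in for a vector database that supports deleting by metadata filter; embeddings are left out to keep the version logic visible.

```python
class VersionedStore:
    """Toy store that keeps only the newest version of each document."""

    def __init__(self):
        self.chunks = []  # dicts with document_id, version, text

    def ingest(self, document_id, version, texts):
        # Delete chunks where document_id matches AND version is older.
        self.chunks = [
            c for c in self.chunks
            if not (c["document_id"] == document_id and c["version"] < version)
        ]
        # Add the new version.
        self.chunks.extend(
            {"document_id": document_id, "version": version, "text": t}
            for t in texts
        )


store = VersionedStore()
store.ingest("FAQ", 1, ["Reset password: Click forgot password..."])
store.ingest("Billing", 1, ["Update billing details under Settings."])
store.ingest("FAQ", 2, ["Reset password: Click forgot password...",
                        "Enable two-factor authentication in Security."])

faq_versions = {c["version"] for c in store.chunks if c["document_id"] == "FAQ"}
print(faq_versions, len(store.chunks))
```

This sketch does not guard against ingesting the same version twice; pairing it with the hash check at ingestion would cover that case.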

Storage Impact

Cost Calculation:

10,000 unique chunks
Duplicates equal to 20% of the unique chunks → 2,000 duplicates

Storage:
→ 12,000 vectors vs 10,000 (20% extra cost; duplicates make up about 17% of what is stored)

Embedding cost:
→ 2,000 duplicate embeddings generated
→ Wasted API calls

Retrieval:
→ More data to search → slightly slower
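
The arithmetic can be worked through in a few lines. `duplication_overhead` is an illustrative helper, and it assumes one embedding call per chunk.

```python
def duplication_overhead(unique_chunks, duplicate_chunks):
    """Summarize what duplicates add on top of the unique content.

    Assumption: every chunk, duplicate or not, costs one embedding call.
    """
    stored = unique_chunks + duplicate_chunks
    return {
        "stored_vectors": stored,
        "extra_storage": duplicate_chunks / unique_chunks,  # vs. a clean index
        "duplicate_share": duplicate_chunks / stored,       # of what is stored
        "wasted_embedding_calls": duplicate_chunks,
    }


report = duplication_overhead(unique_chunks=10_000, duplicate_chunks=2_000)
for name, value in report.items():
    print(name, value)
```

The two ratios answer different questions: `extra_storage` is the cost increase relative to a deduplicated index, while `duplicate_share` is the fraction of the current index that could be deleted.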

How to Solve

Combine the following:

  • Implement hash-based exact duplicate detection at ingestion
  • Run semantic similarity deduplication (cosine > 0.95) periodically
  • Use version-aware ingestion (delete old versions on update)
  • Track document_id and version metadata
  • Prefer a single source of truth (don't ingest the same content from multiple sources)
  • Monitor a duplicate rate metric

See Duplicate Management.


Last updated January 26, 2026