RAG Scenarios and Solutions
GDPR Right to Be Forgotten in Vector DBs
When users request data deletion under GDPR Article 17, removing their data from vector embeddings is technically complex and often incomplete.
The Problem
Embeddings carry no built-in link back to the person they describe. An Article 17 erasure request can therefore clear the source systems while personal data remains in the vector store, with no reliable way to find it or to show it is gone.
Symptoms
- ❌ Cannot locate user's vectors
- ❌ Text deleted but embeddings remain
- ❌ No mapping: user data → vectors
- ❌ Partial deletion leaves traces
- ❌ Cannot prove complete erasure
Real-World Example
User requests deletion:
"Delete all my data per GDPR Article 17"
Company deletes:
→ Source documents from document DB ✓
→ User account from auth DB ✓
But vector DB still contains:
→ Embeddings of user's emails
→ Chunks mentioning user's name
→ Context where user participated
How to find and delete these vectors?
→ No direct identifier linking vectors to user
→ Cannot execute complete erasure
→ The erasure request is not fully honored (likely GDPR non-compliance)
Deep Technical Analysis
Vector-to-Source Mapping Problem
Embedding Anonymity:
Text: "Email from john.smith@example.com regarding..."
→ Embedded as: [0.234, -0.567, 0.891, ...]
→ Vector has no inherent link to "john.smith"
Deletion request:
→ Search vectors for john.smith?
→ Semantic search might miss variations
→ Cannot guarantee finding all vectors
Metadata Dependency:
Store identifying metadata alongside each vector:
{
  "vector": [0.234, ...],
  "metadata": {
    "user_id": "12345",
    "document_id": "doc789",
    "source": "email"
  }
}
Enables:
→ Query: "Find all vectors where user_id=12345"
→ Delete matching vectors
→ But: Metadata must be comprehensive
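One way to keep the metadata comprehensive is to reject untagged vectors at ingestion time. The sketch below is a toy in-memory store, not any particular vector DB's API; `TaggedStore`, `VectorRecord`, and `REQUIRED_KEYS` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Assumption: these three keys are the minimum needed to trace a vector to its owner.
REQUIRED_KEYS = {"user_id", "document_id", "source"}


@dataclass
class VectorRecord:
    vector_id: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


class TaggedStore:
    """Toy in-memory store that refuses vectors without ownership metadata."""

    def __init__(self) -> None:
        self.records: dict[str, VectorRecord] = {}

    def upsert(self, record: VectorRecord) -> None:
        # Fail loudly at ingestion instead of discovering gaps at deletion time.
        missing = REQUIRED_KEYS - set(record.metadata)
        if missing:
            raise ValueError(f"missing metadata keys: {sorted(missing)}")
        self.records[record.vector_id] = record

    def ids_for_user(self, user_id: str) -> list[str]:
        # Equivalent of "find all vectors where user_id=..."
        return [vid for vid, rec in self.records.items()
                if rec.metadata["user_id"] == user_id]


store = TaggedStore()
store.upsert(VectorRecord(
    "v1",
    [0.234, -0.567, 0.891],
    {"user_id": "12345", "document_id": "doc789", "source": "email"},
))
```

Enforcing the check in `upsert` means a later erasure request can rely on the `user_id` lookup for every vector the store holds. It does not cover secondary references in other users' content.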
Secondary References:
User's data appears in others' content:
→ Email TO john.smith (from someone else)
→ Comments mentioning john.smith
→ Collaborative docs with john.smith's edits
Delete these too?
→ Generally yes, if the content identifies the user (subject to the exemptions in Article 17(3))
→ But: Hard to detect all references
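A lexical scan over the stored chunk text can catch some of these secondary references. The sketch below assumes the chunk text is kept alongside the vectors and that a list of known identifiers for the user is available; `find_secondary_references` is a hypothetical helper, and exact matching will miss nicknames, misspellings, and indirect mentions.

```python
import re


def build_identifier_pattern(identifiers: list[str]) -> re.Pattern[str]:
    """Compile one case-insensitive pattern matching any known identifier."""
    escaped = [re.escape(s) for s in identifiers if s]
    if not escaped:
        # An empty pattern would match every chunk, so refuse it.
        raise ValueError("at least one non-empty identifier is required")
    return re.compile("|".join(escaped), re.IGNORECASE)


def find_secondary_references(chunks: dict[str, str], identifiers: list[str]) -> list[str]:
    """Return the IDs of chunks whose stored text mentions any identifier."""
    pattern = build_identifier_pattern(identifiers)
    return sorted(cid for cid, text in chunks.items() if pattern.search(text))


chunks = {
    "c1": "Email to john.smith@example.com about the Q3 budget",
    "c2": "Meeting notes: John Smith approved the plan",
    "c3": "Quarterly revenue summary",
}
hits = find_secondary_references(chunks, ["john.smith@example.com", "John Smith"])
```

Flagged chunks are candidates for review rather than automatic deletion, since a name match alone does not establish that the text concerns the requesting user.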
Deletion Strategies
Metadata Filtering:
1. Tag all chunks with user identifiers
2. On deletion request:
   DELETE FROM vector_db
   WHERE metadata->>'user_id' = '12345';
3. Verify: Count remaining matches
Requires:
→ Comprehensive tagging at ingestion
→ Query capability by metadata
→ Not all vector DBs support this efficiently
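The delete-then-verify steps can be sketched with Python's built-in `sqlite3` module standing in for the vector store. The `vectors` table and its columns are assumptions for illustration (the embedding column is omitted for brevity), and `user_id` is stored as its own column rather than inside a JSON blob.

```python
import sqlite3


def delete_user_vectors(conn: sqlite3.Connection, user_id: str) -> dict[str, int]:
    """Delete all rows tagged with user_id and report before/after counts."""
    count_sql = "SELECT COUNT(*) FROM vectors WHERE user_id = ?"
    before = conn.execute(count_sql, (user_id,)).fetchone()[0]
    with conn:  # commits on success, rolls back if the DELETE raises
        conn.execute("DELETE FROM vectors WHERE user_id = ?", (user_id,))
    # Step 3: verify by counting the remaining matches.
    remaining = conn.execute(count_sql, (user_id,)).fetchone()[0]
    return {"before": before, "deleted": before - remaining, "remaining": remaining}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vectors (id TEXT PRIMARY KEY, user_id TEXT, document_id TEXT)")
conn.executemany(
    "INSERT INTO vectors VALUES (?, ?, ?)",
    [("v1", "12345", "doc789"), ("v2", "12345", "doc790"), ("v3", "67890", "doc791")],
)
result = delete_user_vectors(conn, "12345")
```

A non-zero `remaining` value after the delete signals that the request needs follow-up before it is recorded as complete.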
Re-Embedding After Deletion:
1. Delete source documents with user data
2. Trigger re-ingestion of entire knowledge base
3. Rebuild vector index from scratch
Pros:
+ Guaranteed complete removal
+ No orphaned vectors
Cons:
- Expensive (re-embed everything)
- Downtime during rebuild
- Not scalable for frequent deletions
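A minimal sketch of the rebuild, assuming source documents carry a `user_id` field and that the new index can be built off to the side and swapped in to shorten downtime. `toy_embed` is a placeholder rather than a real embedding model, and the swap step is a design choice added here, not something the steps above prescribe.

```python
from typing import Callable

Embedder = Callable[[str], list[float]]


def toy_embed(text: str) -> list[float]:
    """Placeholder embedder (hypothetical); a real system would call its model."""
    return [float(len(text)), float(text.count(" "))]


def rebuild_index(documents: dict[str, dict], embed: Embedder) -> dict[str, list[float]]:
    """Re-embed every remaining source document into a fresh index."""
    return {doc_id: embed(doc["text"]) for doc_id, doc in documents.items()}


documents = {
    "doc1": {"user_id": "12345", "text": "Email from john.smith@example.com"},
    "doc2": {"user_id": "67890", "text": "Quarterly revenue summary"},
}
live_index = rebuild_index(documents, toy_embed)

# Step 1: delete the source documents that belong to the erased user.
documents = {doc_id: doc for doc_id, doc in documents.items()
             if doc["user_id"] != "12345"}

# Steps 2-3: build the replacement index separately, then swap it in.
new_index = rebuild_index(documents, toy_embed)
live_index = new_index
```

Because the new index is derived only from documents that still exist, no vector for the erased user can survive the swap, which is the property that makes this strategy attractive despite its cost.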
Soft Deletion:
Don't actually delete vectors:
→ Mark as deleted in metadata
→ Filter out at query time
Pros:
+ Reversible (backup/recovery)
+ Fast
Cons:
- Data still exists, so on its own this likely does not satisfy an erasure request
- Requires filtering layer
- Storage still used
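The sketch below illustrates soft deletion as a tombstone plus a query-time filter. The `purge` step is an added assumption: one way to address the "data still exists" concern is to hard-delete tombstoned records after a short recovery window. `SoftDeleteStore` is a hypothetical toy class.

```python
from datetime import datetime, timezone


class SoftDeleteStore:
    """Toy store that tombstones records and hides them at query time."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def add(self, vector_id: str, user_id: str, vector: list[float]) -> None:
        self.records[vector_id] = {"user_id": user_id, "vector": vector, "deleted_at": None}

    def soft_delete_user(self, user_id: str) -> int:
        """Mark the user's records as deleted; return how many were marked."""
        now = datetime.now(timezone.utc).isoformat()
        marked = 0
        for rec in self.records.values():
            if rec["user_id"] == user_id and rec["deleted_at"] is None:
                rec["deleted_at"] = now
                marked += 1
        return marked

    def visible_ids(self) -> list[str]:
        """Filtering layer: only records without a tombstone are searchable."""
        return [vid for vid, rec in self.records.items() if rec["deleted_at"] is None]

    def purge(self) -> int:
        """Hard-delete every tombstoned record; return how many were removed."""
        doomed = [vid for vid, rec in self.records.items() if rec["deleted_at"] is not None]
        for vid in doomed:
            del self.records[vid]
        return len(doomed)


soft_store = SoftDeleteStore()
soft_store.add("v1", "12345", [0.1])
soft_store.add("v2", "12345", [0.2])
soft_store.add("v3", "67890", [0.3])
marked = soft_store.soft_delete_user("12345")
visible_after_soft_delete = soft_store.visible_ids()
purged = soft_store.purge()
```

Until `purge` runs, the vectors are hidden from queries but still stored, which is exactly the trade-off listed under Cons.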
Vector DB Capabilities
Deletion Support by Platform:
Pinecone:
→ Delete by ID: Yes
→ Delete by metadata filter: Yes (availability has varied by index type; confirm for your index)
→ Batch deletion: Yes
Weaviate:
→ Delete by filter: Yes
→ Cascade deletion: Yes
Chroma:
→ Delete by ID: Yes
→ Filter-based: Yes (metadata where filter on delete)
PostgreSQL + pgvector:
→ Standard SQL DELETE
→ Full filtering support
Performance Concerns:
Large-scale deletion:
→ Delete 100,000 vectors for one user
→ May lock the index
→ May degrade query performance while it runs
→ May require a maintenance window
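One common mitigation is to delete in bounded batches so that no single operation holds the index for long. The sketch below is a generic pattern under that assumption: `delete_batch` stands in for whatever batch-delete call the chosen store provides, and the batch size and pause are illustrative values, not tuned recommendations.

```python
import time
from typing import Callable, Iterator


def batched(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def delete_in_batches(
    ids: list[str],
    delete_batch: Callable[[list[str]], None],
    batch_size: int = 1000,
    pause_seconds: float = 0.0,
) -> int:
    """Delete IDs batch by batch, optionally pausing to let queries through."""
    deleted = 0
    for batch in batched(ids, batch_size):
        delete_batch(batch)
        deleted += len(batch)
        if pause_seconds:
            time.sleep(pause_seconds)
    return deleted


# Toy store and batch-delete callback for demonstration.
toy_vectors = {f"v{i}": [0.0] for i in range(2500)}
batch_sizes: list[int] = []


def toy_delete_batch(batch: list[str]) -> None:
    batch_sizes.append(len(batch))
    for vid in batch:
        toy_vectors.pop(vid, None)


total_deleted = delete_in_batches(list(toy_vectors), toy_delete_batch, batch_size=1000)
```

Smaller batches trade total run time for shorter individual stalls; the right size depends on the store and workload.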
Audit Trail
Proving Deletion:
GDPR's accountability principle (Article 5(2)) means you should be able to demonstrate that erasure happened:
→ Log deletion timestamp
→ Count vectors before/after
→ Store deletion certificate
Example log:
{
  "user_id": "12345",
  "deletion_requested": "2024-01-15T10:00:00Z",
  "vectors_deleted": 15234,
  "documents_deleted": 89,
  "completed": "2024-01-15T10:15:23Z",
  "verified_by": "admin@company.com"
}
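A log entry of this shape can be assembled from the before/after counts gathered during deletion. In the sketch below, `build_deletion_record` is a hypothetical helper, and the `vectors_remaining` field is an addition not present in the example log, included so the record shows the verification result as well as the deleted count.

```python
import json
from datetime import datetime, timezone


def build_deletion_record(
    user_id: str,
    requested_at: str,
    vectors_before: int,
    vectors_after: int,
    documents_deleted: int,
    verified_by: str,
) -> dict:
    """Assemble an audit entry from counts taken before and after deletion."""
    return {
        "user_id": user_id,
        "deletion_requested": requested_at,
        "vectors_deleted": vectors_before - vectors_after,
        "vectors_remaining": vectors_after,  # assumption: extra field for verification
        "documents_deleted": documents_deleted,
        "completed": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "verified_by": verified_by,
    }


record = build_deletion_record(
    user_id="12345",
    requested_at="2024-01-15T10:00:00Z",
    vectors_before=15234,
    vectors_after=0,
    documents_deleted=89,
    verified_by="admin@company.com",
)
log_line = json.dumps(record, sort_keys=True)  # one JSON object per log line
```

Serializing each record as a single JSON line keeps the log append-only and easy to search when a deletion has to be evidenced later.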
How to Solve
Combine the following steps:
1. Tag all vectors with user and document IDs at ingestion
2. Implement metadata-based deletion (DELETE WHERE user_id = X)
3. Run a semantic search for residual references
4. Maintain an audit log of deletions
5. Consider re-indexing for guaranteed erasure
6. Verify deletion with count queries
See GDPR Compliance.
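The residual-reference search can be sketched as a cosine-similarity sweep over whatever vectors remain after the metadata-based delete. The probe vector is assumed to be the embedding of the user's identifiers, produced by the same model as the index; the toy vectors and the 0.85 threshold are illustrative assumptions, and `residual_candidates` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def residual_candidates(
    index: dict[str, list[float]],
    probe: list[float],
    threshold: float = 0.85,
) -> list[tuple[str, float]]:
    """Return (vector_id, score) pairs at or above threshold, best match first."""
    scored = [(vid, cosine_similarity(vec, probe)) for vid, vec in index.items()]
    matches = [(vid, score) for vid, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Toy vectors left in the index after the metadata-based delete.
remaining_index = {
    "v7": [0.9, 0.1, 0.0],
    "v8": [0.0, 1.0, 0.0],
    "v9": [0.8, 0.2, 0.1],
}
probe = [1.0, 0.0, 0.0]  # stand-in for the embedded user identifiers
candidates = residual_candidates(remaining_index, probe)
candidate_ids = [vid for vid, _ in candidates]
```

Similarity hits are leads for manual review, not proof of a reference: semantic search can both miss variations and flag unrelated text.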
Last updated January 26, 2026


