GDPR Right to Be Forgotten in Vector Databases

The Problem

When users request data deletion under GDPR Article 17 (the right to erasure), removing every trace of their data from a vector database is technically complex and often incomplete.

Symptoms

❌ Cannot locate user's vectors
❌ Text deleted but embeddings remain
❌ No mapping: user data → vectors
❌ Partial deletion leaves traces
❌ Cannot prove complete erasure

Real-World Example

User requests deletion:
"Delete all my data per GDPR Article 17"

Company deletes:
→ Source documents from document DB ✓
→ User account from auth DB ✓

But vector DB still contains:
→ Embeddings of user's emails
→ Chunks mentioning user's name
→ Context where user participated

How to find and delete these vectors?
→ No direct identifier linking vectors to user
→ Complete erasure cannot be carried out or verified
→ Potential GDPR violation

Deep Technical Analysis

Vector-to-Source Mapping Problem

Embedding Anonymity:

Text: "Email from john.smith@example.com regarding..."
→ Embedded as: [0.234, -0.567, 0.891, ...]
→ Vector has no inherent link to "john.smith"

Deletion request:
→ Search vectors for john.smith?
→ Semantic search might miss variations
→ Cannot guarantee finding all vectors

Metadata Dependency:

Solution: Store metadata with vectors:
{
  vector: [0.234, ...],
  metadata: {
    user_id: "12345",
    document_id: "doc789",
    source: "email"
  }
}

Enables:
→ Query: "Find all vectors where user_id=12345"
→ Delete matching vectors
→ But: Metadata must be comprehensive
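
A minimal sketch of tagging at ingestion, assuming an in-memory dictionary stands in for the vector store. The names VectorRecord, fake_embed, ingest_chunks, and find_by_user are hypothetical, and the embedding function is a placeholder rather than a real model. The point is only that every chunk is stored with ownership metadata, so the lookup by user_id described above becomes possible:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VectorRecord:
    """One stored chunk: the embedding plus the metadata needed to find it later."""
    vector_id: str
    vector: List[float]
    metadata: Dict[str, str]


def fake_embed(text: str) -> List[float]:
    """Placeholder embedding (hypothetical); a real system would call its model here."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def ingest_chunks(store: Dict[str, VectorRecord], user_id: str,
                  document_id: str, source: str, chunks: List[str]) -> List[str]:
    """Embed each chunk and refuse to store it without ownership metadata."""
    if not user_id or not document_id:
        raise ValueError("user_id and document_id are required for every chunk")
    ids = []
    for index, chunk in enumerate(chunks):
        vector_id = f"{document_id}:{index}"
        store[vector_id] = VectorRecord(
            vector_id=vector_id,
            vector=fake_embed(chunk),
            metadata={"user_id": user_id, "document_id": document_id, "source": source},
        )
        ids.append(vector_id)
    return ids


def find_by_user(store: Dict[str, VectorRecord], user_id: str) -> List[str]:
    """Equivalent of 'find all vectors where user_id = ...'."""
    return [vid for vid, rec in store.items() if rec.metadata["user_id"] == user_id]


store: Dict[str, VectorRecord] = {}
ingest_chunks(store, "12345", "doc789", "email", ["Hello team", "Quarterly numbers"])
ingest_chunks(store, "67890", "doc790", "email", ["Unrelated message"])
print(find_by_user(store, "12345"))
```

Rejecting untagged chunks at write time is one way to keep the metadata comprehensive; a chunk that enters the store without a user_id cannot be found by this kind of query later.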

Secondary References:

User's data appears in others' content:
→ Email TO john.smith (from someone else)
→ Comments mentioning john.smith
→ Collaborative docs with john.smith's edits

Delete these too?
→ GDPR says: Yes, if the content identifies the user
→ But: Hard to detect all references
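
One partial mitigation is a literal scan of stored chunk text for known identifiers. The sketch below assumes the chunk text is kept alongside each vector; the function names and sample data are hypothetical. It also illustrates the limitation stated above, because an abbreviated form of the name is not caught:

```python
import re
from typing import Dict, List


def build_identifier_pattern(identifiers: List[str]) -> re.Pattern:
    """Compile one case-insensitive pattern matching any known identifier literally."""
    escaped = [re.escape(item) for item in identifiers if item]
    if not escaped:
        raise ValueError("at least one identifier is required")
    return re.compile("|".join(escaped), re.IGNORECASE)


def find_residual_references(chunks: Dict[str, str], identifiers: List[str]) -> List[str]:
    """Return the IDs of chunks whose text contains any of the identifiers."""
    pattern = build_identifier_pattern(identifiers)
    return sorted(cid for cid, text in chunks.items() if pattern.search(text))


# Hypothetical chunk texts, keyed by vector ID.
chunks = {
    "c1": "Email to John.Smith@example.com about the renewal",
    "c2": "Meeting notes: John Smith approved the budget",
    "c3": "Release checklist for version 2.1",
    "c4": "J. Smith left a comment on the draft",
}
hits = find_residual_references(chunks, ["john.smith@example.com", "John Smith"])
print(hits)  # c4 is missed: "J. Smith" matches neither identifier
```

A scan like this narrows the search but does not prove absence, so flagged chunks and near misses may still need manual review.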

Deletion Strategies

Metadata Filtering:

1. Tag all chunks with user identifiers
2. On deletion request:
   DELETE FROM vector_db
   WHERE metadata->>'user_id' = '12345'
3. Verify: Count remaining matches

Requires:
→ Comprehensive tagging at ingestion
→ Query capability by metadata
→ Not all vector DBs support this efficiently
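
The three steps can be sketched with SQLite from the Python standard library standing in for the vector database. The table layout, column names, and the delete_user_vectors helper are assumptions made for illustration; a real deployment would use its own store's filter syntax:

```python
import sqlite3

# In-memory database standing in for the vector store (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vectors (vector_id TEXT PRIMARY KEY, user_id TEXT NOT NULL, embedding BLOB)"
)
# An index on user_id keeps the filter from scanning every row.
conn.execute("CREATE INDEX idx_vectors_user ON vectors (user_id)")
rows = [("v1", "12345", b"\x00"), ("v2", "12345", b"\x01"), ("v3", "67890", b"\x02")]
conn.executemany("INSERT INTO vectors VALUES (?, ?, ?)", rows)


def delete_user_vectors(conn: sqlite3.Connection, user_id: str):
    """Delete every vector tagged with user_id and return (deleted, remaining)."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute("DELETE FROM vectors WHERE user_id = ?", (user_id,))
        deleted = cursor.rowcount
    # Verification step: count rows that still match the user.
    remaining = conn.execute(
        "SELECT COUNT(*) FROM vectors WHERE user_id = ?", (user_id,)
    ).fetchone()[0]
    return deleted, remaining


deleted, remaining = delete_user_vectors(conn, "12345")
print(deleted, remaining)
```

The parameterized query avoids interpolating the user ID into SQL, and the follow-up count gives a number that can be recorded as evidence of the deletion.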

Re-Embedding After Deletion:

1. Delete source documents with user data
2. Trigger re-ingestion of entire knowledge base
3. Rebuild vector index from scratch

Pros:
+ Guaranteed complete removal
+ No orphaned vectors

Cons:
- Expensive (re-embed everything)
- Downtime during rebuild
- Not scalable for frequent deletions
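
A rough sketch of the rebuild approach, assuming documents live in a dictionary keyed by document ID and that fake_embed is a placeholder for the real embedding model. All names here are hypothetical:

```python
from typing import Callable, Dict, List


def chunk_text(text: str, size: int = 40) -> List[str]:
    """Split text into fixed-size character chunks (simplistic, for illustration)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(text: str) -> List[float]:
    """Placeholder embedding (hypothetical); a real system would call its model here."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def rebuild_index(documents: Dict[str, dict],
                  embed: Callable[[str], List[float]]) -> Dict[str, dict]:
    """Build a brand-new index from the documents that still exist."""
    new_index: Dict[str, dict] = {}
    for doc_id, doc in documents.items():
        for i, chunk in enumerate(chunk_text(doc["text"])):
            new_index[f"{doc_id}:{i}"] = {"vector": embed(chunk), "user_id": doc["user_id"]}
    return new_index


documents = {
    "doc1": {"user_id": "12345", "text": "Email from the user about renewal terms"},
    "doc2": {"user_id": "67890", "text": "Public release notes"},
}
live_index = rebuild_index(documents, fake_embed)

# Step 1: delete the user's source documents.
documents = {d: doc for d, doc in documents.items() if doc["user_id"] != "12345"}
# Steps 2-3: re-ingest everything that remains, then swap the rebuilt index in.
live_index = rebuild_index(documents, fake_embed)
print(sorted(live_index))
```

Building the replacement index off to the side and swapping the reference at the end is one way to shorten the downtime listed above, although the cost of re-embedding everything remains.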

Soft Deletion:

Don't actually delete vectors:
→ Mark as deleted in metadata
→ Filter out at query time

Pros:
+ Reversible (backup/recovery)
+ Fast

Cons:
- Data still exists, so on its own this is unlikely to satisfy an erasure request
- Requires filtering layer
- Storage still used
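
The sketch below shows the tombstone-and-filter pattern and why it leaves data behind. The deleted_at field and the purge function are assumptions added for illustration; the purge is a follow-up hard delete that the soft-deletion approach itself does not include:

```python
from datetime import datetime, timezone
from typing import Dict, List

# Hypothetical index: vector ID -> record with a deleted_at tombstone field.
index: Dict[str, dict] = {
    "v1": {"user_id": "12345", "text": "renewal email", "deleted_at": None},
    "v2": {"user_id": "67890", "text": "release notes", "deleted_at": None},
}


def soft_delete(index: Dict[str, dict], user_id: str) -> int:
    """Mark the user's records as deleted without removing them."""
    now = datetime.now(timezone.utc).isoformat()
    marked = 0
    for rec in index.values():
        if rec["user_id"] == user_id and rec["deleted_at"] is None:
            rec["deleted_at"] = now
            marked += 1
    return marked


def query(index: Dict[str, dict]) -> List[str]:
    """Stand-in for similarity search: return only IDs not marked deleted."""
    return [vid for vid, rec in index.items() if rec["deleted_at"] is None]


def purge(index: Dict[str, dict]) -> int:
    """Physically remove every record that carries a tombstone."""
    doomed = [vid for vid, rec in index.items() if rec["deleted_at"] is not None]
    for vid in doomed:
        del index[vid]
    return len(doomed)


marked = soft_delete(index, "12345")
visible = query(index)
still_stored = "v1" in index  # True: the data is hidden, not erased
purged = purge(index)
print(marked, visible, still_stored, purged)
```

Until the purge runs, the record is only hidden from queries, which is the compliance concern noted in the cons above.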

Vector DB Capabilities

Deletion Support by Platform:

Pinecone:
→ Delete by ID: Yes
→ Delete by metadata filter: Yes (support has varied by index type; check the current docs)
→ Batch deletion: Yes

Weaviate:
→ Delete by filter: Yes
→ Cascade deletion: Yes

Chroma:
→ Delete by ID: Yes
→ Filter-based: Yes (delete accepts a where metadata filter)

PostgreSQL + pgvector:
→ Standard SQL DELETE
→ Full filtering support

Performance Concerns:

Large-scale deletion:
→ Example: deleting 100,000 vectors for one user
→ May lock the index
→ May degrade query performance
→ May require a maintenance window
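
One common way to soften these effects is to delete in bounded batches instead of issuing a single huge delete. This is a sketch under assumptions: delete_fn is a hypothetical callback wrapping whatever delete-by-ID call the store offers, and the batch size and pause are illustrative values, not tuned recommendations:

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of ids with at most batch_size items each."""
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def delete_in_batches(ids: Sequence[str], delete_fn: Callable[[List[str]], None],
                      batch_size: int = 1000, pause_seconds: float = 0.0) -> int:
    """Delete ids batch by batch, optionally pausing so queries can interleave."""
    deleted = 0
    for batch in batched(ids, batch_size):
        delete_fn(batch)
        deleted += len(batch)
        if pause_seconds:
            time.sleep(pause_seconds)
    return deleted


# In-memory stand-in for the vector store: vector ID -> owning user.
store = {f"v{i}": "12345" for i in range(2500)}
calls: List[int] = []


def delete_from_store(batch: List[str]) -> None:
    """Hypothetical delete-by-ID call; records the batch size for inspection."""
    calls.append(len(batch))
    for vid in batch:
        store.pop(vid, None)


total = delete_in_batches(list(store), delete_from_store, batch_size=1000)
print(total, calls)
```

Whether batching actually avoids index locks depends on the store, so it is worth measuring query latency during a trial deletion before relying on it.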

Audit Trail

Proving Deletion:

GDPR's accountability principle (Article 5(2)) requires being able to demonstrate compliance:
→ Log deletion timestamp
→ Count vectors before/after
→ Store deletion certificate

Example log:
{
  "user_id": "12345",
  "deletion_requested": "2024-01-15T10:00:00Z",
  "vectors_deleted": 15234,
  "documents_deleted": 89,
  "completed": "2024-01-15T10:15:23Z",
  "verified_by": "admin@company.com"
}
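
A record with these fields could be assembled as in the sketch below. The function name and the rule that a record is only issued when the post-deletion count is zero are assumptions made for illustration, not requirements taken from the regulation:

```python
import json
from datetime import datetime, timezone


def utc_now() -> str:
    """Current UTC time in the same ISO 8601 form used by the example log."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_deletion_record(user_id: str, count_before: int, count_after: int,
                          documents_deleted: int, requested_at: str,
                          verified_by: str) -> dict:
    """Create an audit record, refusing to certify an incomplete deletion."""
    if count_after != 0:
        raise RuntimeError(f"{count_after} vectors still match user {user_id}")
    return {
        "user_id": user_id,
        "deletion_requested": requested_at,
        "vectors_deleted": count_before - count_after,
        "documents_deleted": documents_deleted,
        "completed": utc_now(),
        "verified_by": verified_by,
    }


record = build_deletion_record(
    user_id="12345",
    count_before=15234,
    count_after=0,
    documents_deleted=89,
    requested_at="2024-01-15T10:00:00Z",
    verified_by="admin@company.com",
)
print(json.dumps(record, indent=2))
```

Deriving vectors_deleted from the before and after counts ties the log entry to the verification query instead of to a number reported by the delete call alone.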

How to Solve

1. Tag all vectors with user and document IDs at ingestion
2. Implement metadata-based deletion (DELETE WHERE user_id=X)
3. Perform semantic search for residual references
4. Maintain an audit log of deletions
5. Consider re-indexing for guaranteed erasure
6. Verify deletion with count queries

See GDPR Compliance.
