RAG Scenarios and Solutions

Embedding Service Privacy

Third-party embedding APIs (OpenAI, Cohere) process sensitive text to generate embeddings, creating privacy exposure during the embedding generation phase.

The Problem

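Embedding a document through a hosted API (OpenAI, Cohere) means sending its full plaintext to the provider. Anything in that text, including confidential or regulated content, is visible to the provider's systems while the embedding is generated, and what happens to it afterward depends on the provider's policies rather than your own controls.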

Symptoms

  • ❌ Sensitive text sent to external API
  • ❌ No control over embedding provider's data handling
  • ❌ Cannot verify data deletion by provider
  • ❌ Compliance risks with third-party processors
  • ❌ Unclear data retention policies

Real-World Example

Company ingests confidential documents:
→ "Q4 Revenue: $500M (confidential)"
→ Sends to OpenAI Embeddings API
→ OpenAI processes text, returns vector

Privacy concerns:
→ OpenAI sees: "Q4 Revenue: $500M (confidential)"
→ Does OpenAI log this? (Enterprise: no, but this relies on trust)
→ Does it train on it? (Enterprise: no, per the ToS)
→ Can we verify? (No direct audit capability)
→ What if OpenAI is breached? (Any data held there could be exposed)

Deep Technical Analysis

Third-Party Data Processing

API Request Flow:

Your system:
→ Raw text: "Confidential merger with AcmeCorp..."

HTTPS POST to api.openai.com/v1/embeddings:
{
  "input": "Confidential merger with AcmeCorp...",
  "model": "text-embedding-ada-002"
}

OpenAI servers:
→ Receive plaintext
→ Process through embedding model
→ Return vector

Risk: OpenAI sees raw sensitive text
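The request flow above can be made concrete with a short sketch that builds the HTTPS request without sending it. It uses only the Python standard library; the function name `build_embedding_request` and the placeholder API key are assumptions for illustration. The point is that the sensitive string sits verbatim in the request body: TLS protects it in transit, but the provider decrypts it on arrival.

```python
import json
import urllib.request

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(
    text: str, api_key: str, model: str = "text-embedding-ada-002"
) -> urllib.request.Request:
    """Build (but do not send) the POST request for an embedding call."""
    body = json.dumps({"input": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical key; nothing is transmitted in this sketch.
request = build_embedding_request(
    "Confidential merger with AcmeCorp...", api_key="sk-placeholder"
)

# The raw text is present, unmodified, in the payload the provider receives.
print(request.data.decode("utf-8"))
```

Inspecting the payload this way during a design review is a quick check of exactly which fields would leave your infrastructure.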

Data Retention Policies:

OpenAI (Enterprise):
→ Zero data retention (per the vendor's stated policy)
→ Not used for training
→ Deleted after processing

Cohere:
→ Similar policies (confirm against the current terms)

But:
→ Must trust vendor
→ Cannot independently verify
→ Compliance auditors may not accept vendor assurances alone

Compliance Implications

GDPR Data Processors:

Embedding API provider = Data Processor (when the text contains personal data):
→ Requires a Data Processing Agreement (DPA)
→ Provider must meet GDPR processor obligations
→ Provider must maintain adequate security measures

Check vendor DPA:
→ OpenAI: Provides DPA
→ Cohere: Provides DPA
→ Verify coverage

BAA for HIPAA:

If embedding PHI:
→ Must have Business Associate Agreement
→ Not all vendors offer BAA
→ OpenAI: Enterprise only
→ Alternatives: Self-host

Industry-Specific:

Financial (PCI-DSS):
→ Cardholder data to third-party?
→ May violate PCI requirements

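Defense (ITAR):
→ Controlled technical data must not be exported or made accessible to foreign persons without authorization
→ Standard commercial cloud embedding APIs are generally unsuitable
→ Self-hosting in a controlled environment is the usual answer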
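The compliance rules above can be encoded as a routing decision made before any text is embedded. The following is a minimal sketch: the `Sensitivity` categories, the `choose_backend` function, and the choice to always keep cardholder and ITAR data in-house are assumptions for illustration, and the real mapping should come from your compliance team.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    PERSONAL = "personal"      # personal data under GDPR
    PHI = "phi"                # protected health information (HIPAA)
    CARDHOLDER = "cardholder"  # cardholder data (PCI-DSS)
    ITAR = "itar"              # export-controlled technical data


def choose_backend(
    sensitivity: Sensitivity, has_dpa: bool = False, has_baa: bool = False
) -> str:
    """Return "external_api" or "self_hosted" for a document class."""
    # Conservative assumption: these classes never go to a third party.
    if sensitivity in (Sensitivity.ITAR, Sensitivity.CARDHOLDER):
        return "self_hosted"
    # PHI may go to a vendor only under a Business Associate Agreement.
    if sensitivity is Sensitivity.PHI:
        return "external_api" if has_baa else "self_hosted"
    # Personal data requires a Data Processing Agreement with the vendor.
    if sensitivity is Sensitivity.PERSONAL:
        return "external_api" if has_dpa else "self_hosted"
    return "external_api"


print(choose_backend(Sensitivity.PHI))                # self_hosted
print(choose_backend(Sensitivity.PHI, has_baa=True))  # external_api
```

Defaulting to the self-hosted path whenever an agreement is missing means a gap in paperwork cannot silently turn into a data transfer.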

Self-Hosted Alternatives

Open Source Embedding Models:

sentence-transformers:
→ all-MiniLM-L6-v2
→ all-mpnet-base-v2
→ Runs locally, no API call

Deployment:
→ Docker container
→ GPU optional (faster with GPU)
→ No data leaves infrastructure
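A minimal sketch of the self-hosted path, assuming the `sentence-transformers` package is installed. The helper names `embed_locally`, `cosine_similarity`, and `demo` are illustrative. The library downloads model weights on first use, so a strictly isolated environment would need the weights fetched in advance; the text being embedded is processed in-process and is not sent anywhere.

```python
import math
from typing import List, Sequence


def embed_locally(texts: List[str], model_name: str = "all-MiniLM-L6-v2"):
    """Embed texts in-process; no embedding API call is made."""
    # Imported lazily so the similarity helper works without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # fetches weights on first use
    return model.encode(texts)               # one vector per input text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def demo() -> None:
    """Compare a query against a confidential sentence, entirely locally."""
    vectors = embed_locally(
        ["Q4 Revenue: $500M (confidential)", "What was fourth-quarter revenue?"]
    )
    print(cosine_similarity(vectors[0], vectors[1]))
```

Calling `demo()` prints a single similarity score. Swapping `model_name` to `all-mpnet-base-v2` trades speed for the larger model listed above.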

Quality Trade-offs:

OpenAI text-embedding-ada-002:
→ 1536 dimensions
→ High retrieval quality
→ But: Cloud API

sentence-transformers/all-mpnet-base-v2:
→ 768 dimensions
→ Good quality (roughly 90% of OpenAI; the gap depends on benchmark and domain, so measure on your own data)
→ Self-hostable

For sensitive data, this quality trade-off is usually acceptable.
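Whether the quality gap is acceptable is best checked on your own queries rather than taken from a general figure. One way to do that, sketched under assumptions, is to compute recall@k for each model over a small labeled query set and compare the means. The function names and the toy document IDs below are made up for illustration and are not measurements.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy rankings for two queries; IDs are invented for this example.
api_runs = [(["d1", "d2", "d3"], {"d1", "d3"}), (["d4", "d5", "d6"], {"d4"})]
local_runs = [(["d1", "d2", "d7"], {"d1", "d3"}), (["d4", "d5", "d6"], {"d4"})]

api_score = mean_recall_at_k(api_runs, k=3)
local_score = mean_recall_at_k(local_runs, k=3)
print(f"local model reaches {local_score / api_score:.0%} of API recall@3")
```

Running the API side of such a comparison should use non-sensitive or synthetic queries, since the goal is to avoid sending confidential text to the provider in the first place.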

Data Minimization

Pre-Processing:

Before sending to API:
→ Remove explicit PII (names, IDs)
→ Replace with tokens: "[NAME]", "[ID]"
→ Embed redacted text

Trade-off:
→ Reduced privacy risk
→ But: Semantic search becomes less effective
→ Queries that depend on a redacted value, such as "Find John Smith's email", no longer match
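The pre-processing step can be sketched with regular expressions. This is a minimal illustration, not a complete PII detector: the patterns cover only email addresses and US-style SSNs, names must be supplied by the caller, and the `[EMAIL]` token is an addition alongside the `[NAME]` and `[ID]` tokens mentioned above. The function name `redact` is an assumption.

```python
import re
from typing import Iterable

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str, known_names: Iterable[str] = ()) -> str:
    """Replace emails, SSN-like IDs, and known names with placeholder tokens."""
    # Emails first, so a name embedded in an address is not half-replaced.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[ID]", text)
    # Longest names first, so "John Smith" is handled before "John".
    for name in sorted(known_names, key=len, reverse=True):
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


original = "Email John Smith at john.smith@example.com, ID 123-45-6789"
print(redact(original, known_names=["John Smith"]))
# Email [NAME] at [EMAIL], ID [ID]
```

Only the redacted string is sent to the embedding API. Keeping the mapping from tokens back to real values inside your own infrastructure lets you restore them when results are displayed.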

How to Solve

For sensitive data, self-host embedding models (sentence-transformers) to avoid third-party exposure. If using APIs:

  • Execute a DPA/BAA with the provider
  • Verify zero-retention policies
  • Implement PII redaction before embedding
  • Use enterprise API tiers with contractual protections
  • Monitor vendor security posture

See Embedding Privacy.


Last updated January 26, 2026