Embedding Service Privacy
Third-party embedding APIs (OpenAI, Cohere) process sensitive text to generate embeddings, creating privacy exposure during the embedding generation phase.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
The Problem
Third-party embedding APIs (OpenAI, Cohere) process sensitive text to generate embeddings, creating privacy exposure during the embedding generation phase.
Symptoms
- ❌ Sensitive text sent to external API
- ❌ No control over embedding provider's data handling
- ❌ Cannot verify data deletion by provider
- ❌ Compliance risks with third-party processors
- ❌ Unclear data retention policies
Real-World Example
Company ingests confidential documents:
→ "Q4 Revenue: $500M (confidential)"
→ Sends to OpenAI Embeddings API
→ OpenAI processes text, returns vector
Privacy concerns:
→ OpenAI sees: "Q4 Revenue: $500M (confidential)"
→ Does OpenAI log this? (Enterprise: no, but this relies on trust)
→ Does OpenAI train on it? (Enterprise: no, per the ToS)
→ Can we verify either claim? (No: there is no direct audit capability)
→ What if OpenAI is breached? (Any retained data could be exposed)
Deep Technical Analysis
Third-Party Data Processing
API Request Flow:
Your system:
→ Raw text: "Confidential merger with AcmeCorp..."
HTTPS POST to api.openai.com/v1/embeddings:
{
  "input": "Confidential merger with AcmeCorp...",
  "model": "text-embedding-ada-002"
}
OpenAI servers:
→ Receive plaintext
→ Process through embedding model
→ Return vector
Risk: OpenAI sees raw sensitive text
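To make the exposure concrete, here is a minimal Python sketch of the request described above. It only builds the request object and never sends it. The function name `build_embedding_request` and the placeholder API key are illustrative assumptions; the endpoint and payload fields are taken from the flow above.

```python
import json
import urllib.request

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"  # endpoint from the flow above


def build_embedding_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the HTTPS POST for one embedding call."""
    payload = {"input": text, "model": "text-embedding-ada-002"}
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "sk-placeholder" is a dummy key used only for illustration.
request = build_embedding_request("Confidential merger with AcmeCorp...", "sk-placeholder")

# TLS protects the body in transit, but the provider decodes it as plaintext.
body_seen_by_provider = json.loads(request.data)
print(body_seen_by_provider["input"])
```

Inspecting the request body this way is a quick check of exactly which fields leave your infrastructure before any call is made.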
Data Retention Policies:
OpenAI (Enterprise):
→ Zero data retention (the vendor's claim, not independently verified)
→ Not used for training
→ Deleted after processing
Cohere:
→ Similar policies
But:
→ Must trust vendor
→ Cannot independently verify
→ Compliance auditors may not accept
Compliance Implications
GDPR Data Processors:
Embedding API = Data Processor:
→ Requires Data Processing Agreement (DPA)
→ Must follow GDPR obligations
→ Must have adequate security
Check vendor DPA:
→ OpenAI: Provides DPA
→ Cohere: Provides DPA
→ Verify coverage
BAA for HIPAA:
If embedding PHI:
→ Must have Business Associate Agreement
→ Not all vendors offer BAA
→ OpenAI: Enterprise only
→ Alternatives: Self-host
Industry-Specific:
Financial (PCI-DSS):
→ Cardholder data to third-party?
→ May violate PCI requirements
Defense (ITAR):
→ Controlled technical data must not be exported or exposed to foreign persons
→ General-purpose cloud embedding APIs typically cannot be used
→ Must self-host in a controlled environment
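The compliance rules above can be encoded as a routing check that runs before any text is embedded. The following is a sketch under stated assumptions: the classification labels (`"itar"`, `"cardholder"`, `"phi"`, `"personal"`, `"public"`), the `VendorAgreements` record, and the function name are all hypothetical, and the PCI branch is deliberately conservative rather than a legal determination.

```python
from dataclasses import dataclass

SELF_HOSTED = "self_hosted"
THIRD_PARTY_API = "third_party_api"


@dataclass(frozen=True)
class VendorAgreements:
    """Contracts currently signed with the embedding vendor (hypothetical record)."""

    has_dpa: bool = False  # GDPR Data Processing Agreement
    has_baa: bool = False  # HIPAA Business Associate Agreement


def choose_embedding_backend(classification: str, agreements: VendorAgreements) -> str:
    """Return where text with the given classification may be embedded."""
    if classification in ("itar", "cardholder"):
        # ITAR data must be self-hosted; cardholder data is kept in-house here
        # because sending it to a third party may violate PCI requirements.
        return SELF_HOSTED
    if classification == "phi":
        return THIRD_PARTY_API if agreements.has_baa else SELF_HOSTED
    if classification == "personal":
        return THIRD_PARTY_API if agreements.has_dpa else SELF_HOSTED
    if classification == "public":
        return THIRD_PARTY_API
    # Unknown labels fail closed.
    return SELF_HOSTED


# A DPA alone does not cover PHI, so this routes to the self-hosted model.
print(choose_embedding_backend("phi", VendorAgreements(has_dpa=True)))
```

Failing closed on unknown labels means a missing or misspelled classification never results in an external API call.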
Self-Hosted Alternatives
Open Source Embedding Models:
sentence-transformers:
→ all-MiniLM-L6-v2
→ all-mpnet-base-v2
→ Runs locally, no API call
Deployment:
→ Docker container
→ GPU optional (faster with GPU)
→ No data leaves infrastructure
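A minimal sketch of local embedding with the sentence-transformers library might look like the following. The wrapper names `load_local_model` and `embed_locally` and the `Encoder` protocol are assumptions introduced for illustration; `SentenceTransformer(...)` and its `encode()` method are the library's own. The library import is deferred so the module can be loaded on machines where it is not installed.

```python
from typing import List, Protocol, Sequence


class Encoder(Protocol):
    """Anything with a sentence-transformers style encode() method."""

    def encode(self, sentences: List[str]): ...


def load_local_model(name: str = "all-MiniLM-L6-v2") -> Encoder:
    """Load a model in-process; requires `pip install sentence-transformers`."""
    from sentence_transformers import SentenceTransformer  # third-party import

    return SentenceTransformer(name)


def embed_locally(texts: Sequence[str], model: Encoder) -> List[List[float]]:
    """Embed texts in-process, with no call to an embedding provider."""
    vectors = model.encode(list(texts))
    return [[float(x) for x in vector] for vector in vectors]


# Usage (commented out because it needs the library and model weights):
# model = load_local_model()
# vectors = embed_locally(["Q4 Revenue: $500M (confidential)"], model)
```

Note that the first load normally downloads the model weights from the model hub. For a fully offline deployment, one option is to bake the weights into the Docker image at build time.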
Quality Trade-offs:
OpenAI text-embedding-ada-002:
→ 1536 dimensions
→ Very high quality
→ But: Cloud API
sentence-transformers/all-mpnet-base-v2:
→ 768 dimensions
→ Good quality (roughly 90% of OpenAI's as a rough estimate; measure on your own data)
→ Self-hostable
For sensitive data: Quality trade-off acceptable
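Whether the trade-off is acceptable depends on your own corpus, so it helps to measure retrieval quality directly instead of relying on a general estimate. The sketch below computes recall@k for two rankings of the same queries. The function name, the query and document IDs, and the rankings are hypothetical toy data, not benchmark results.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Mean fraction of each query's relevant documents found in its top-k results."""
    scores = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # skip queries with no labeled answers
        top_k = set(ranked.get(query_id, [])[:k])
        scores.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical relevance labels and rankings for two queries.
relevant = {"q1": {"d1", "d2"}, "q2": {"d3"}}
api_ranked = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d8", "d7"]}
local_ranked = {"q1": ["d1", "d9", "d2"], "q2": ["d8", "d3", "d7"]}

api_score = recall_at_k(api_ranked, relevant, k=2)
local_score = recall_at_k(local_ranked, relevant, k=2)
print(api_score, local_score)
```

Running the same labeled queries through both models gives a like-for-like number for the quality gap on the data that actually matters.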
Data Minimization
Pre-Processing:
Before sending to API:
→ Remove explicit PII (names, IDs)
→ Replace with tokens: "[NAME]", "[ID]"
→ Embed redacted text
Trade-off:
→ Reduced privacy risk
→ But: Semantic search less effective
→ "Find John Smith's email" won't work
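The redaction step and its cost to search can be sketched as follows. This is a simplified illustration and not a complete PII detector: the name list is assumed to come from a source you already maintain (for example an employee directory), and the ID pattern is an assumed format chosen for the example.

```python
import re
from typing import Iterable

# Assumed ID format for illustration: three digits, two digits, four digits.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str, known_names: Iterable[str]) -> str:
    """Replace known names with [NAME] and ID-shaped numbers with [ID]."""
    # Longest names first so "John Smith" is replaced before "John".
    for name in sorted(known_names, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return ID_PATTERN.sub("[ID]", text)


document = redact("Employee John Smith, ID 123-45-6789", ["John Smith"])
query = redact("Find John Smith's email", ["John Smith"])
print(document)  # Employee [NAME], ID [ID]
print(query)     # Find [NAME]'s email
```

Once redacted, the query and every document about any person share the same `[NAME]` token, which is why name-specific searches stop working.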
How to Solve
For sensitive data, self-host an embedding model (for example, sentence-transformers) to avoid third-party exposure. If you do use an embedding API:
- Execute a DPA or BAA with the provider
- Verify the provider's zero-retention policy
- Redact PII before embedding
- Use enterprise API tiers with contractual protections
- Monitor the vendor's security posture
See Embedding Privacy.
Last updated January 26, 2026


