Rag Scenarios And Solutions
Private LLM for Sensitive Data
Organizations with highly sensitive data cannot use cloud LLM APIs due to data governance policies, requiring fully private inference infrastructure.
TL;DR
Organizations with highly sensitive data cannot use cloud LLM APIs due to data governance policies, requiring fully private inference infrastructure.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
Organizations with highly sensitive data cannot use cloud LLM APIs due to data governance policies, requiring fully private inference infrastructure.
Symptoms
- ❌ Cloud APIs rejected by security
- ❌ Data cannot leave premises
- ❌ Need air-gapped deployment
- ❌ Compliance requires private models
- ❌ Cannot use OpenAI/Anthropic APIs
Real-World Example
Defense contractor builds RAG:
→ Knowledge base: Classified documents
→ Cannot send queries to OpenAI (cloud)
→ Data residency: Must stay on-premise
Requirements:
→ Self-hosted LLM
→ No internet connectivity
→ Full data sovereignty
→ Comparable performance to GPT-4
Deep Technical Analysis
Cloud API Privacy Concerns
Data Exposure:
Cloud LLM APIs:
→ Query sent over internet to vendor
→ Retrieved context included in request
→ Potentially logged for training/monitoring
→ Third-party processors see data
Even with enterprise agreements:
→ Some organizations cannot accept risk
→ Regulatory requirements (ITAR, FedRAMP)
→ Must use private models
Zero Data Retention Policies:
Some vendors offer:
→ OpenAI: Zero retention (Enterprise)
→ Anthropic: No training on customer data
But still:
→ Data in transit through vendor systems
→ Temporary processing exposure
→ Not acceptable for highest security tiers
Self-Hosted Model Options
Open Source LLMs:
Llama 2 (70B):
→ Quality ~GPT-3.5 level
→ Self-hostable
→ Free for commercial use
Mistral (7B/8x7B):
→ Strong performance
→ Efficient inference
Falcon (40B/180B):
→ Open weights
→ Competitive quality
Infrastructure Requirements:
Llama 2 70B:
→ 4x A100 GPUs (80GB each)
→ ~280GB VRAM total
→ Cost: ~$40K hardware
→ Or cloud GPU instances: $10-20/hour
For production:
→ Load balancing
→ Redundancy
→ Monitoring
→ DevOps overhead
Quantization Trade-offs:
Reduce model size:
→ 70B model at FP16: 140GB
→ 70B model at INT8: 70GB
→ 70B model at INT4: 35GB
Quality degradation:
→ FP16: 100% quality (baseline)
→ INT8: ~98% quality
→ INT4: ~92% quality
Fit on fewer GPUs but lower accuracy
Embedding Model Privacy
Self-Hosted Embeddings:
sentence-transformers (open source):
→ all-MiniLM-L6-v2: Fast, good quality
→ all-mpnet-base-v2: Higher quality
→ CPU or GPU inference
Runs locally:
→ No API calls
→ No data leakage
→ Full control
On-Device Embedding:
For edge deployment:
→ ONNX runtime
→ Quantized models
→ Inference on CPU
Enables:
→ Fully offline RAG
→ No network dependency
Air-Gapped Deployment
Disconnected Environment:
No internet access:
→ Download models beforehand
→ Transfer via physical media
→ Install in isolated network
Challenges:
→ Model updates: Manual process
→ No telemetry/monitoring (external)
→ Local logging only
Supply Chain Security:
Verify model provenance:
→ Check cryptographic signatures
→ Audit model weights (backdoors?)
→ Review training data sources
Open source advantage:
→ Weights inspectable
→ Community vetted
→ Reproducible builds
How to Solve
Deploy open-source LLMs (Llama 2 70B, Mistral) on-premise + use self-hosted embedding models (sentence-transformers) + implement quantization (INT8) to reduce hardware needs + set up air-gapped deployment for classified data + use vLLM or Text Generation Inference for efficient serving. See Private Models.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/privacy/private-models.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Related Pages
Last updated January 26, 2026


