RAG Scenarios and Solutions
Cold Start Problem
New knowledge bases or freshly added documents perform poorly in retrieval because they lack query patterns, user feedback, and usage data to optimize results.
The Problem
A new knowledge base, or a batch of freshly added documents, has no query patterns, user feedback, or usage data to draw on, so retrieval falls back to raw embedding similarity and result quality suffers.
Symptoms
- ❌ First queries after setup return poor results
- ❌ New documents rank lower than older ones
- ❌ No personalization or optimization initially
- ❌ Quality improves slowly over weeks
- ❌ "Warm-up" period required
Real-World Example
Day 1: Add 1,000 documents to new knowledge base
First user query: "API authentication methods"
Result quality: 6/10
→ Generic semantic matching only
→ No understanding of which docs are most helpful
→ No query→document patterns learned
Day 30: After 500 queries
Same query: "API authentication methods"
Result quality: 9/10
→ System learned this query often needs OAuth guide
→ Certain docs consistently clicked
→ Ranking optimized based on feedback
Cold start = poor initial experience
Deep Technical Analysis
Zero-Shot Semantic Matching
Initial retrieval has no context:
Pure Embedding Similarity:
New knowledge base:
→ No query history
→ No click-through data
→ No document performance metrics
Retrieval purely based on:
→ Cosine similarity(query_embedding, doc_embedding)
→ No additional signals
→ No personalization
Works okay but not optimal
→ Semantic understanding from pre-trained model
→ But: Doesn't know YOUR domain patterns
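As a minimal sketch of what "purely semantic" retrieval amounts to, the snippet below ranks documents by cosine similarity alone. The three-dimensional vectors and document IDs are made-up illustrative values, not output from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for zero vectors; treat as no match
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_embedding, doc_embeddings):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, emb))
        for doc_id, emb in doc_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical values for illustration only).
docs = {
    "oauth-guide": [0.9, 0.1, 0.0],
    "billing-faq": [0.0, 0.2, 0.9],
    "getting-started": [0.5, 0.5, 0.5],
}
query = [1.0, 0.0, 0.0]
ranking = rank_by_similarity(query, docs)
```

Nothing in this ranking reflects which document users actually found helpful; that is the gap the rest of this page describes.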
Domain Adaptation Gap:
Pre-trained embedding model:
→ Typically trained on general-purpose text (e.g. Wikipedia, books, web pages)
→ General-purpose semantic understanding
Your specific domain:
→ "TPS report" (company-specific)
→ "GTM strategy" (internal acronym)
→ Product names, internal tools
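Model has little or no exposure to these terms
→ Embeds them by surface form, not by what they mean in your organization
→ Poor retrieval for domain queries
Needs fine-tuning of the model, or ranking signals that accumulate with usage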
Lack of Query→Document Patterns
No historical data to learn from:
User Behavior Signals (Missing):
Mature system knows:
→ Query "setup guide" → User clicks doc #3
→ Query "troubleshoot errors" → User clicks doc #7
→ Query "API limits" → User reads doc #12 fully
Can boost rankings:
→ Doc #3 ranks higher for "setup" queries
→ Doc #7 for "troubleshoot" queries
Cold start system:
→ No patterns learned
→ Cannot optimize rankings
→ Purely semantic matching
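One way such a boost could work is sketched below: clicks are counted per query term, and a document's share of those clicks is added to its semantic score. The `ClickBooster` class, the term-level counting, and the `boost_weight` value are all assumptions made for illustration, not a description of any particular system.

```python
from collections import defaultdict

class ClickBooster:
    """Blend semantic scores with per-term click counts (hypothetical scheme)."""

    def __init__(self, boost_weight=0.1):
        self.boost_weight = boost_weight
        # clicks[term][doc_id] -> number of recorded clicks
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, doc_id):
        for term in query.lower().split():
            self.clicks[term][doc_id] += 1

    def rerank(self, query, semantic_scores):
        """semantic_scores: dict doc_id -> similarity. Returns sorted (doc_id, score)."""
        # Only terms that have click history contribute to the boost.
        terms = [t for t in query.lower().split() if t in self.clicks]
        total = sum(sum(self.clicks[t].values()) for t in terms)
        adjusted = {}
        for doc_id, score in semantic_scores.items():
            doc_clicks = sum(self.clicks[t].get(doc_id, 0) for t in terms)
            share = doc_clicks / total if total else 0.0  # 0.0 during cold start
            adjusted[doc_id] = score + self.boost_weight * share
        return sorted(adjusted.items(), key=lambda pair: pair[1], reverse=True)
```

With no recorded clicks the boost is zero and the output is the plain semantic ranking, which is exactly the cold start behavior described above.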
Query Reformulation Unknown:
Mature system observes:
→ User queries "how do I"
→ Gets poor results
→ Reformulates to "guide for"
→ Gets good results
Learning: "how do I" queries work better with "guide" docs
Cold start:
→ No reformulation patterns
→ Cannot suggest better queries
→ User struggles more
Document Quality Uncertainty
No implicit feedback signals:
Click-Through Rate (CTR) Unknown:
Mature system tracks:
→ Doc A: 45% CTR (frequently clicked)
→ Doc B: 8% CTR (rarely clicked)
Interpretation:
→ Doc A likely higher quality or better titled
→ Boost Doc A in rankings
Cold start:
→ All documents: CTR undefined (no impressions yet)
→ Cannot distinguish quality
→ Treat all equally
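A common way to make CTR usable before much data exists is to smooth it toward a prior. The sketch below assumes a prior CTR of 0.2 backed by 20 pseudo-impressions; both numbers are placeholders to tune, not recommendations.

```python
def smoothed_ctr(clicks, impressions, prior_ctr=0.2, prior_strength=20):
    """CTR shrunk toward prior_ctr; equals prior_ctr when there is no data.

    prior_strength is the number of pseudo-impressions given to the prior,
    so real data dominates once impressions greatly exceed it.
    """
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)

new_doc = smoothed_ctr(clicks=0, impressions=0)       # falls back to the prior
lucky_doc = smoothed_ctr(clicks=1, impressions=1)     # one click barely moves it
mature_doc = smoothed_ctr(clicks=450, impressions=1000)
```

A brand-new document therefore starts at a neutral estimate instead of zero, and a single early click does not catapult it to the top.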
Dwell Time Not Measured:
Mature system:
→ Doc A: Average read time 2 minutes (users find answer quickly)
→ Doc B: Average read time 8 minutes (comprehensive, users read fully)
→ Doc C: Average read time 10 seconds (users bounce immediately)
Learning:
→ Doc C likely poor quality or misleading title
→ De-prioritize in rankings
Cold start:
→ No dwell time data
→ Cannot identify low-quality docs
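The bounce signal above could be computed roughly as follows. The 15-second bounce cutoff, the 60% bounce-rate limit, and the 20-view minimum are assumed values for this sketch; the minimum-view guard is what keeps cold documents from being judged on too little data.

```python
def flag_bouncy_docs(dwell_by_doc, bounce_seconds=15, max_bounce_rate=0.6, min_views=20):
    """Return doc_ids where most readers leave almost immediately.

    dwell_by_doc: dict doc_id -> list of dwell times in seconds.
    """
    flagged = []
    for doc_id, dwell_times in dwell_by_doc.items():
        if len(dwell_times) < min_views:
            continue  # cold start: not enough views to judge this document
        bounces = sum(1 for seconds in dwell_times if seconds < bounce_seconds)
        if bounces / len(dwell_times) > max_bounce_rate:
            flagged.append(doc_id)
    return flagged
```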
No Personalization
User preferences unknown:
Individual User History:
Mature system per user:
→ User often queries about "API integration"
→ Rarely queries about "billing"
Personalization:
→ Boost API docs for this user
→ De-prioritize billing docs
Cold start:
→ No user history
→ Same results for everyone
→ Less relevant
Team/Organization Patterns:
Mature system for team:
→ Engineering team queries: 80% technical docs
→ Sales team queries: 70% pricing/features
Personalization:
→ Engineers see technical docs first
→ Sales sees business docs first
Cold start:
→ No team patterns
→ Everyone sees same results
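A simple form of this personalization is a topic-affinity boost, sketched below. It assumes each document carries a single topic label and that past queries have already been mapped to topics; the function names and the `weight` value are hypothetical.

```python
from collections import Counter

def topic_affinity(query_topics):
    """Turn a user's (or team's) past query topics into normalized weights."""
    counts = Counter(query_topics)
    total = sum(counts.values())
    if total == 0:
        return {}  # cold start: no history, so no preferences
    return {topic: n / total for topic, n in counts.items()}

def personalize(scored_docs, doc_topics, affinity, weight=0.1):
    """scored_docs: dict doc_id -> score; doc_topics: dict doc_id -> topic label."""
    adjusted = {
        doc_id: score + weight * affinity.get(doc_topics.get(doc_id), 0.0)
        for doc_id, score in scored_docs.items()
    }
    return sorted(adjusted.items(), key=lambda pair: pair[1], reverse=True)
```

With an empty history the affinity map is empty and every user sees the same ranking, matching the cold start case above.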
Embedding Space Calibration
Vector similarities need calibration:
Score Distribution Unknown:
After 1000 queries, system learns:
→ Similarity > 0.85: Highly relevant (95% precision)
→ Similarity 0.75-0.85: Moderately relevant (70% precision)
→ Similarity < 0.75: Weakly relevant (30% precision)
Can set threshold: Only return > 0.75
Cold start:
→ Don't know score distribution
→ Is 0.80 good or bad for this domain?
→ Hard to set thresholds
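Given even a small set of relevance judgments, the score distribution can be estimated directly. The sketch below assumes you have (similarity, is_relevant) pairs from labeled queries and reuses the 0.75 and 0.85 boundaries from the example above purely as starting points.

```python
def precision_by_bucket(labeled, edges=(0.75, 0.85)):
    """Estimate precision per similarity band.

    labeled: iterable of (similarity, is_relevant) pairs.
    Returns dict band -> precision, or None for a band with no samples.
    """
    low_edge, high_edge = edges
    buckets = {"low": [], "mid": [], "high": []}
    for score, relevant in labeled:
        if score > high_edge:
            name = "high"
        elif score >= low_edge:
            name = "mid"
        else:
            name = "low"
        buckets[name].append(1 if relevant else 0)
    return {
        name: (sum(hits) / len(hits) if hits else None)
        for name, hits in buckets.items()
    }
```

If the measured precision in a band is far from what you expected, that suggests the threshold for this domain should move.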
Relative vs Absolute Scoring:
Some documents consistently score high:
→ "Getting Started" guide always 0.88+
→ Generic, matches many queries
Other docs score lower but more specific:
→ "Advanced Kubernetes Configuration" = 0.72
→ But perfect for that niche query
Mature system:
→ Adjusts for document-specific baselines
→ Penalizes generic docs
→ Rewards specific matches
Cold start:
→ Takes scores at face value
→ Generic docs dominate
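One possible way to adjust for document-specific baselines is to subtract each document's typical score from its raw score. This is a sketch under assumptions: the fallback to a global mean for documents with little history, and the `min_history` cutoff, are illustrative choices.

```python
from statistics import mean

def adjusted_score(raw_score, doc_history, global_mean, min_history=5):
    """Score relative to what this document usually gets.

    doc_history: past similarity scores of this document across queries.
    Documents with too little history fall back to the global mean baseline.
    """
    if len(doc_history) >= min_history:
        baseline = mean(doc_history)
    else:
        baseline = global_mean
    return raw_score - baseline

# "Getting Started" scores high on nearly everything, so 0.88 is unremarkable.
generic = adjusted_score(0.88, [0.88, 0.89, 0.90, 0.88, 0.91], global_mean=0.6)
# The niche doc usually scores low, so 0.72 is a strong match for it.
specific = adjusted_score(0.72, [0.40, 0.45, 0.38, 0.42, 0.41], global_mean=0.6)
```

After adjustment the niche document outranks the generic one for this query, even though its raw similarity is lower.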
Cold Start Mitigation Strategies
Techniques to improve initial quality:
1. Pre-Warming with Synthetic Queries:
Before launch:
1. Generate synthetic queries from documents
→ Extract titles: "API Authentication Guide"
→ Create query: "how to authenticate API"
2. Test retrieval quality
3. Identify poor-performing docs
4. Improve doc content or metadata
Provides baseline before real users
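The pre-warming steps above can be sketched as a small harness. The query templates are deliberately simple placeholders (an LLM could generate richer queries), and `retrieve` stands in for whatever search function your system exposes; its `(query, k) -> list of doc_ids` signature is an assumption.

```python
def synthetic_queries(title):
    """Derive a few simple queries from a document title (template-based)."""
    core = title.lower().strip()
    if core.endswith(" guide"):
        core = core[: -len(" guide")]
    return [core, f"{core} guide", f"help with {core}"]

def find_weak_docs(titles, retrieve, k=3):
    """Find documents that their own synthetic queries fail to retrieve.

    titles: dict doc_id -> title.
    retrieve: callable (query, k) -> list of doc_ids (your search function).
    Returns a list of (doc_id, hits, total_queries) for docs with any miss.
    """
    weak = []
    for doc_id, title in titles.items():
        queries = synthetic_queries(title)
        hits = sum(1 for query in queries if doc_id in retrieve(query, k))
        if hits < len(queries):
            weak.append((doc_id, hits, len(queries)))
    return weak
```

Documents returned by `find_weak_docs` are candidates for better titles, metadata, or content before real users arrive.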
2. Import Historical Data:
If migrating from old system:
→ Export query logs
→ Import click-through data
→ Bootstrap new system with history
Advantages:
→ Immediate patterns
→ No true cold start
Limitations:
→ Old system may have different ranking
→ Historical data may be stale
3. Active Learning / Human Feedback:
During cold start period:
1. Flag uncertain results (borderline similarity)
2. Request human review: "Was this helpful?"
3. Use feedback to train quickly
4. Accelerate learning curve
Week 1: 50 labeled examples
→ Can improve quality more than 500 unlabeled queries, because each label is an explicit relevance judgment
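Step 1 above, flagging uncertain results, might look like the sketch below. The 0.70 to 0.80 uncertainty band and the `limit` are assumed values; results closest to the middle of the band are treated as the most uncertain and are queued first.

```python
def select_for_review(results, low=0.70, high=0.80, limit=5):
    """Pick borderline results to show with a "Was this helpful?" prompt.

    results: list of (doc_id, similarity).
    Returns up to `limit` results inside [low, high], most uncertain first.
    """
    midpoint = (low + high) / 2
    borderline = [(doc_id, s) for doc_id, s in results if low <= s <= high]
    return sorted(borderline, key=lambda pair: abs(pair[1] - midpoint))[:limit]

queue = select_for_review([("a", 0.92), ("b", 0.76), ("c", 0.71), ("d", 0.40)])
```

Clearly good and clearly bad matches are skipped, so the limited reviewer attention goes where a label is most informative.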
4. Content-Based Features:
Instead of only query patterns, use:
→ Document metadata (date, author, type)
→ Document length (comprehensive vs brief)
→ Link graph (which docs reference which)
→ Section structure (well-organized?)
Signals available immediately
→ No user behavior needed
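These signals could be combined into a single prior score that is available on day one. The sketch below is one possible weighting; the field names (`updated`, `word_count`, `heading_count`), the caps, and the weights are all assumptions to adapt to your own metadata.

```python
from datetime import date

def content_prior(doc, today, inbound_links):
    """Metadata-only quality prior in [0, 1]. Weights are illustrative.

    doc: dict with "updated" (date), "word_count" (int), "heading_count" (int).
    inbound_links: number of other docs that reference this one.
    """
    age_days = max((today - doc["updated"]).days, 0)
    freshness = 1.0 / (1.0 + age_days / 365.0)          # 1.0 today, 0.5 after a year
    length_score = min(doc["word_count"] / 1000.0, 1.0)  # capped at 1000 words
    link_score = min(inbound_links / 10.0, 1.0)          # capped at 10 links
    structure = 1.0 if doc["heading_count"] >= 3 else 0.5
    return 0.3 * freshness + 0.2 * length_score + 0.3 * link_score + 0.2 * structure
```

The prior can be blended with the similarity score at query time and phased out as behavioral data accumulates.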
The Chicken-and-Egg Problem
Poor quality → Low usage → No data → Poor quality:
Vicious Cycle:
Day 1:
→ Search quality mediocre (cold start)
→ Users get frustrated
→ Users stop using system
→ No query data generated
→ Cannot improve
Stays stuck in cold start
Virtuous Cycle (if overcome):
Day 1:
→ Search quality okay (with mitigation)
→ Users somewhat satisfied
→ Users continue using
→ Query data accumulates
→ Quality improves
→ Users more satisfied
→ More usage
→ Better data
→ Higher quality
Positive feedback loop
Multi-Tenancy Cold Start
Each customer starts from zero:
Per-Customer Learning:
SaaS RAG platform:
→ Customer A: 1 year, 10K queries (mature)
→ Customer B: Just signed up (cold start)
Cannot share patterns:
→ Different domains
→ Different doc structure
→ Different user behavior
Customer B must learn independently
→ Weeks to reach Customer A quality
Cross-Customer Transfer Learning:
Potential optimization:
→ Learn general patterns across all customers
→ "query X → doc type Y" works broadly
→ Bootstrap new customers with general model
Challenges:
→ Privacy concerns (cross-customer data)
→ Domain differences
→ Limited effectiveness
Rarely implemented in practice
Temporal Cold Start
Knowledge base changes over time:
Content Refresh:
Mature knowledge base:
→ 1000 docs, well-optimized rankings
Add 100 new docs:
→ New content has no query history
→ Ranks poorly initially
→ "Cold start" for new docs only
Mixed state:
→ Old docs: Optimized
→ New docs: Cold start
→ Inconsistent quality
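A temporary boost for new documents can offset their missing history. The sketch below uses an exponentially decaying boost; the maximum boost of 0.10 and the 14-day half-life are assumed values, chosen only to show the shape of the curve.

```python
def recency_boost(age_days, max_boost=0.10, half_life_days=14):
    """Boost that starts at max_boost for a new doc and halves every half_life_days."""
    return max_boost * 0.5 ** (age_days / half_life_days)

def score_with_recency(similarity, age_days):
    """Similarity plus a decaying boost, so new docs get exposure early on."""
    return similarity + recency_boost(age_days)
```

Because the boost fades on its own, a new document has to earn its long-term rank through real usage signals.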
Concept Drift:
Year 1: Users query about "API v1"
Year 2: API v2 released
→ New queries about "API v2"
→ No historical patterns for v2
→ Must learn new query→document mappings
Even mature systems face cold starts
→ On new topics/features
How to Solve
Combine the following:
- Pre-warm the system with synthetic query generation
- Use content-based features (metadata, structure) from day one
- Collect explicit feedback ("Was this helpful?")
- Temporarily boost recently added documents
- Apply transfer learning from similar domains, if available
See Cold Start Mitigation Strategies above.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/vectors/cold-start.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
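As a small illustration, the request URL can be built with the standard library so that the question is properly encoded. The host `https://docs.example.com` is a placeholder, since this page gives only the path; substitute the real base URL of the documentation site.

```python
from urllib.parse import urlencode

# Path taken from the GET example above.
PAGE_PATH = "/dev/rag-scenarios-and-solutions/vectors/cold-start.md"

def build_ask_url(base_url, question):
    """Return the page URL with the question encoded in the ask parameter."""
    query_string = urlencode({"ask": question})
    return f"{base_url.rstrip('/')}{PAGE_PATH}?{query_string}"

# base_url here is a hypothetical placeholder host.
url = build_ask_url("https://docs.example.com", "How do I pre-warm?")
```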
Last updated January 26, 2026


