Poor Semantic Search Results

The Problem

Queries return irrelevant documents, miss obviously related content, or surface documents that don't semantically match the user's intent.

Symptoms

❌ Search for "authentication" returns "authorization" docs only
❌ Query "how to debug errors" returns pricing pages
❌ Exact keyword match ranks lower than unrelated docs
❌ Synonyms not recognized ("car" doesn't match "automobile")
❌ Users frustrated with search quality

Real-World Example

Query: "How do I reset my password?"

Top results returned:
1. "Account Security Best Practices" (score: 0.78)
2. "Password Requirements Policy" (score: 0.76)
3. "Two-Factor Authentication Setup" (score: 0.74)

Missing from results:
- "Password Reset Guide" (score: 0.68) ← Should be #1!

Problem: Semantic similarity favors "password" + "security" 
over actual "password reset" procedure

Deep Technical Analysis

Embedding Space Limitations

Vector embeddings have inherent constraints:

Dimensionality and Information Loss:

Document: 2,000 tokens (rich information)
↓ Embedding model
Embedding: a single 1,536-dimensional vector (compressed)

Fixed-size output regardless of document length
→ Information loss inevitable
→ Nuances collapsed
→ Subtle differences erased

Two documents:
1. "How to reset password"
2. "How to change password"

May have nearly identical embeddings
→ Both about password modification
→ Semantic difference ("reset" vs "change") lost
→ Retrieved interchangeably

The Polysemy Problem:

Word: "bank"
Meanings:
1. Financial institution
2. River bank
3. Blood bank
4. Memory bank

Single static embedding for "bank":
→ Averages all contexts from training data
→ No single meaning dominant
→ Query "bank account" may retrieve river banks

Contextual embeddings help but don't eliminate the issue

Query-Document Mismatch

User queries differ from document language:

Vocabulary Gap:

User query: "My app crashed"
Document title: "Application Failure Troubleshooting"

Terms:
→ "app" vs "application"
→ "crashed" vs "failure"

Embeddings may not capture equivalence
→ Trained on formal text
→ User uses casual language
→ Semantic gap

Better if document also includes:
"If your app crashes or fails..."
→ Contains both formal and casual terms

Question-Answer Asymmetry:

User query (question form):
"How do I authenticate with OAuth?"

Document content (declarative):
"OAuth authentication requires three steps: ..."

Query embedding: encodes interrogative structure
Document embedding: encodes declarative structure

Similarity may be lower than expected
→ Different sentence structures
→ Despite identical topic

Training Data Bias

Embedding models reflect training corpus:

Domain Specificity:

General embedding model trained on:
→ Wikipedia, books, web pages
→ Broad general knowledge
→ Limited technical depth

Company's specialized domain:
→ Proprietary API terms
→ Internal acronyms (TPS, GTM, CRM)
→ Product-specific jargon

Model doesn't "understand" domain terms
→ Treats as arbitrary strings
→ Poor retrieval for specialized queries

Temporal Bias:

Model trained in 2021:
→ "GPT" associated with "generative pre-trained"
→ Weak association with "chatbot"

Query in 2024: "GPT chatbot features"
→ Model's understanding outdated
→ Doesn't reflect current usage
→ Retrieval suboptimal

Cosine Similarity Limitations

Similarity metric has blind spots:

Magnitude vs Direction:

Cosine similarity: cos(θ) = A·B / (||A|| ||B||)
→ Measures angle, not magnitude
→ Two vectors same direction: high similarity
→ Regardless of vector length

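Document A: embedding magnitude 0.5
Document B: embedding magnitude 2.0

If same direction: cosine similarity = 1.0 for both
→ Treated as equally relevant
→ Any signal carried by vector length is discarded
→ Magnitude is not a reliable measure of how much information a document holds, and many embedding models return unit-length vectors anyway

Alternative: Euclidean distance
→ Considers magnitude
→ Ranks identically to cosine similarity when vectors are unit-normalized
→ Less common in RAG systems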
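The magnitude-versus-direction point can be checked numerically. The following is a minimal sketch using made-up toy vectors (not output from any real embedding model): two documents point in the same direction as the query but differ in length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    length = norm(a)
    return [x / length for x in a]

# Toy vectors (illustrative only): same direction as the query, different lengths.
query = [1.0, 2.0, 2.0]
doc_a = [0.5 * x for x in query]  # shorter vector
doc_b = [2.0 * x for x in query]  # longer vector

# Cosine similarity ignores length, so both documents score the same.
cos_a = cosine_similarity(query, doc_a)
cos_b = cosine_similarity(query, doc_b)

# Euclidean distance is sensitive to length, so the scores differ.
dist_a = euclidean_distance(query, doc_a)
dist_b = euclidean_distance(query, doc_b)

# After unit-normalization the length difference disappears.
unit_dist_a = euclidean_distance(normalize(query), normalize(doc_a))
unit_dist_b = euclidean_distance(normalize(query), normalize(doc_b))

print(cos_a, cos_b)
print(dist_a, dist_b)
print(unit_dist_a, unit_dist_b)
```

For unit-length vectors, squared Euclidean distance equals 2 - 2·cos(θ), so the two metrics produce the same ranking. Switching metrics only changes results when the stored vectors are not normalized.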

The Hubness Problem:

In high-dimensional spaces:
→ Some vectors become "hubs"
→ Close to many other vectors
→ Retrieved disproportionately often

Example:
Document about "getting started":
→ Generic, broad topic
→ Embedding is central in vector space
→ Matches many queries
→ Always in top-10 results
→ Crowds out more specific, relevant docs

Mitigation: Hubness reduction algorithms
→ Rare in production RAG systems
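Before reaching for a reduction algorithm, it can help to measure whether hubs exist at all. One common diagnostic is the k-occurrence count: how many other documents list a given document among their k nearest neighbors. The sketch below is a brute-force version on toy vectors; the function name `k_occurrence` and the data are assumptions for illustration, and a real corpus would need an approximate nearest-neighbor index instead of the O(n²) loop.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def k_occurrence(vectors, k):
    """Count how often each vector appears in the other vectors' top-k neighbor lists."""
    counts = Counter()
    for i, v in enumerate(vectors):
        # Score every other vector against v, best match first.
        neighbors = sorted(
            ((cosine(v, w), j) for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        for _, j in neighbors[:k]:
            counts[j] += 1
    return counts

# Toy corpus: index 0 sits "in the middle" of the three specific documents.
vectors = [
    [1.0, 1.0, 1.0],  # generic, "getting started" style document
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

counts = k_occurrence(vectors, k=1)
hub_index, hub_count = counts.most_common(1)[0]
print(hub_index, hub_count)
```

A document whose count is far above the corpus average is a candidate hub and is worth inspecting, for example by splitting it into more specific chunks.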

Negative Retrieval and Exclusions

Semantic search struggles with negation:

The NOT Problem:

Query: "authentication WITHOUT OAuth"
User wants: API key, JWT, Basic auth
Excludes: OAuth methods

Embedding of "authentication WITHOUT OAuth":
→ Still contains "OAuth" token
→ High similarity to OAuth docs
→ Returns exactly what user wanted to avoid

Semantic search is inclusive, not exclusive
→ Cannot filter out concepts
→ Treats "WITHOUT" as just another word
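One workaround is to handle the exclusion outside the embedding: strip the negated part from the query before embedding it, then drop retrieved documents that mention the excluded term. The sketch below is a minimal, purely lexical version. The helper names, the `WITHOUT` keyword convention, and the document shape (`id`, `text`) are assumptions for illustration.

```python
import re

def split_exclusions(query):
    """Split 'X WITHOUT Y' into the positive query and a list of excluded terms."""
    parts = re.split(r"\s+without\s+", query, maxsplit=1, flags=re.IGNORECASE)
    positive = parts[0].strip()
    excluded = []
    if len(parts) == 2:
        excluded = [term.lower() for term in re.findall(r"\w+", parts[1])]
    return positive, excluded

def drop_excluded(docs, excluded):
    """Remove documents whose text contains any excluded term (whole-word match)."""
    kept = []
    for doc in docs:
        words = set(re.findall(r"\w+", doc["text"].lower()))
        if not words.intersection(excluded):
            kept.append(doc)
    return kept

positive, excluded = split_exclusions("authentication WITHOUT OAuth")

# Hypothetical retrieval results for the positive query "authentication".
retrieved = [
    {"id": "oauth-guide", "text": "OAuth authentication flow"},
    {"id": "api-keys", "text": "Authenticate with API keys"},
    {"id": "jwt", "text": "JWT bearer token authentication"},
]

kept_ids = [doc["id"] for doc in drop_excluded(retrieved, excluded)]
print(positive, excluded, kept_ids)
```

Only `positive` would be sent to the embedding model. The lexical filter is blunt: it also drops a document that mentions the excluded term in passing, so a metadata tag per auth method would be more robust where available.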

Contrastive Queries:

Query: "Differences between REST and GraphQL"

Ideal results:
→ Comparison documents
→ "REST vs GraphQL" articles

Actual results:
→ Mix of REST docs and GraphQL docs
→ High similarity to both concepts
→ But no direct comparison
→ User must synthesize the comparison themselves

Retrieval K Parameter Tuning

Top-K selection affects quality:

Too Small K:

K=3 (retrieve top 3 documents)

Scenario:
→ Top 3 all about API authentication
→ Query also needs rate limiting info
→ Rate limit doc ranked #4
→ Excluded from context
→ LLM can't answer rate limit questions

Narrow context, incomplete coverage

Too Large K:

K=20 (retrieve top 20 documents)

Scenario:
→ Top 3 highly relevant
→ Ranks 4-20 marginally relevant
→ Fill context window
→ Dilute signal with noise
→ LLM distracted by irrelevant info

Broad context, reduced precision

Dynamic K:

Adaptive approach:
1. Retrieve top K=50 candidates
2. Apply similarity threshold (e.g., >0.75)
3. Return only above threshold
4. If <3 results: Lower threshold
5. If >15 results: Raise threshold

Adjusts K based on query specificity
→ Broad query: More results
→ Specific query: Fewer, higher-quality results
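The adaptive steps above can be sketched as follows. This is one possible reading, not a definitive implementation: the function name, the 0.05 adjustment step, and the rule that tightening never goes below the minimum are assumptions. Candidates are expected as `(doc_id, similarity)` pairs sorted by descending score.

```python
def select_dynamic_k(candidates, threshold=0.75, min_results=3, max_results=15, step=0.05):
    """Pick results by similarity threshold, adjusting it when too few or too many pass."""
    def above(t):
        return [c for c in candidates if c[1] > t]

    selected = above(threshold)
    if len(selected) < min_results:
        # Too few hits: relax the threshold step by step.
        while len(selected) < min_results and threshold > 0.0:
            threshold = max(0.0, threshold - step)
            selected = above(threshold)
    elif len(selected) > max_results:
        # Too many hits: tighten, but never drop below min_results.
        while len(selected) > max_results:
            tighter = above(threshold + step)
            if len(tighter) < min_results:
                break
            threshold += step
            selected = tighter
    return selected[:max_results], threshold

# Scores from the password-reset example earlier on this page.
specific = [
    ("account-security", 0.78),
    ("password-requirements", 0.76),
    ("two-factor-setup", 0.74),
    ("password-reset-guide", 0.68),
]
specific_hits, specific_threshold = select_dynamic_k(specific)

# A broad query where many candidates clear the initial threshold.
broad = [("doc-%d" % i, 0.99 - 0.01 * i) for i in range(20)]
broad_hits, broad_threshold = select_dynamic_k(broad)

print(len(specific_hits), len(broad_hits))
```

Note that in the password-reset example the threshold relaxes just enough to admit a third result, yet the actual reset guide at 0.68 still falls below it. Thresholding adjusts how many results are returned; it does not fix a ranking that is wrong to begin with.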

Reranking and Two-Stage Retrieval

Initial retrieval may need refinement:

The Speed-Accuracy Trade-off:

Stage 1: Fast embedding retrieval
→ Retrieve top 50 candidates
→ ~100ms latency
→ Decent recall, imperfect precision

Stage 2: Slow reranking
→ Cross-encoder model
→ Score each of 50 candidates
→ +500ms latency
→ Excellent precision

Total: 600ms
→ Better quality
→ Higher cost
→ User-noticeable delay

Skip reranking for speed?
→ 100ms latency
→ Lower quality
→ Users frustrated with bad results

Trade-off: Speed vs quality

Reranker Model Selection:

Options:
1. Same embedding model (redundant)
2. Different embedding model (limited gain)
3. Cross-encoder (best, but slowest)
4. LLM-based scoring (expensive)

Cross-encoder:
→ Processes query + document together
→ Outputs relevance score
→ More accurate than separate embeddings
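→ But: one forward pass per candidate at query time (O(n) for n candidates)
→ Embedding: document vectors precomputed at index time; a query needs one embedding plus an approximate nearest-neighbor lookup (sub-linear, though not strictly O(1))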

Reranking all 50: 50 forward passes
→ GPU/CPU intensive
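The two-stage shape can be sketched independently of any particular model. In the sketch below both scorers are toy stand-ins, labeled as such: `word_overlap` plays the role of fast embedding similarity and `phrase_aware` plays the role of a cross-encoder that looks at query and document together. Neither is a real model, and all names are hypothetical.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score, candidates=50, top_k=3):
    """Stage 1: shortlist with a cheap scorer. Stage 2: rerank the shortlist only."""
    shortlist = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:candidates]
    # The expensive scorer runs once per shortlisted document, not once per corpus document.
    reranked = sorted(shortlist, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]

def word_overlap(query, doc):
    """Stand-in for embedding similarity: fraction of query words present in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def phrase_aware(query, doc):
    """Stand-in for a cross-encoder: counts query word pairs that appear adjacent in the doc."""
    words = query.lower().split()
    bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return sum(1 for bigram in bigrams if bigram in doc.lower())

docs = [
    "password requirements and reset frequency policy",
    "password reset guide",
    "two-factor authentication setup",
]

results = retrieve_then_rerank("password reset", docs, word_overlap, phrase_aware, candidates=2, top_k=2)
print(results)
```

Here the cheap scorer cannot separate the first two documents because both contain both query words; the second stage promotes the one where the words form the phrase the user typed. The cost of stage 2 scales with `candidates`, which is the knob behind the latency trade-off described above.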

Metadata Filtering vs Semantic Search

Combining structured and unstructured queries:

Hybrid Queries:

User intent: "Recent API docs"
→ Semantic: "API documentation"
→ Filter: date > 2024-01-01

Two-stage approach:
1. Filter by metadata (date)
2. Semantic search within filtered set

Or:
1. Semantic search (get top 100)
2. Filter results by metadata
3. Return top-K after filtering

Different orderings yield different results

The Filter Cardinality Problem:

Scenario:
→ 100,000 total documents
→ User queries: "API docs in Python"
→ Metadata filter: language="Python"
→ Only 500 Python docs exist

Option A: Filter first
→ Search within 500 docs
→ Fast, but limited pool
→ May miss relevant non-Python docs that are instructive

Option B: Search first
→ Get top 100 from all 100K
→ Only 5 are Python docs
→ User wanted more Python examples

Optimal: Depends on user intent
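The difference between the two orderings is easy to reproduce on a toy corpus. The sketch below assumes documents shaped as dicts with hypothetical `id`, `language`, and `vec` fields, and uses a dot product as a stand-in for the vector store's similarity metric.

```python
def similarity(query_vec, doc_vec):
    # Dot product as a stand-in for whatever metric the vector store uses.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def top_by_similarity(query_vec, docs, limit):
    return sorted(docs, key=lambda d: similarity(query_vec, d["vec"]), reverse=True)[:limit]

def filter_then_search(query_vec, docs, predicate, top_k):
    """Option A: restrict the pool by metadata, then rank what is left."""
    allowed = [d for d in docs if predicate(d)]
    return top_by_similarity(query_vec, allowed, top_k)

def search_then_filter(query_vec, docs, predicate, top_k, pool):
    """Option B: rank everything, keep the top `pool`, then apply the metadata filter."""
    candidates = top_by_similarity(query_vec, docs, pool)
    return [d for d in candidates if predicate(d)][:top_k]

docs = [
    {"id": "java-1", "language": "Java", "vec": [1.0, 0.0]},
    {"id": "java-2", "language": "Java", "vec": [0.9, 0.1]},
    {"id": "go-1", "language": "Go", "vec": [0.85, 0.1]},
    {"id": "py-1", "language": "Python", "vec": [0.8, 0.2]},
    {"id": "py-2", "language": "Python", "vec": [0.5, 0.5]},
    {"id": "py-3", "language": "Python", "vec": [0.4, 0.6]},
]

def is_python(doc):
    return doc["language"] == "Python"

query_vec = [1.0, 0.0]
option_a = [d["id"] for d in filter_then_search(query_vec, docs, is_python, top_k=2)]
option_b = [d["id"] for d in search_then_filter(query_vec, docs, is_python, top_k=2, pool=4)]
print(option_a, option_b)
```

With a pool of 4, only one Python document survives the post-filter, so Option B returns fewer than the requested two results. Enlarging `pool` narrows the gap at the cost of scoring more candidates.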

Cold Start and New Documents

Recently added content underperforms:

The Freshness Problem:

New document added today:
→ Perfect answer to user query
→ But: Ranks #47 in results

Why?
→ No user interaction history
→ No click-through data
→ No implicit feedback
→ Ranked on semantic similarity alone

Older documents:
→ Often revised in response to user feedback
→ Wording tuned toward common queries
→ Slight advantage in ranking

New doc needs time to "prove" itself

Implicit Boosting:

Document clicked frequently:
→ Indicates user satisfaction
→ Should rank higher for similar queries

But pure semantic search:
→ No feedback loop
→ Each query independent
→ Ignores user behavior

Learning-to-rank systems:
→ Incorporate click-through rate
→ Adjust scores based on engagement
→ But: Complex infrastructure
→ Rare in simple RAG systems
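Short of a full learning-to-rank system, engagement can be blended into the score with a simple additive boost. The sketch below is one illustrative formula, not a standard: the weight, the smoothing priors, and all names are assumptions. Smoothing the click-through rate gives a brand-new document a neutral prior instead of a zero, which softens the cold-start penalty.

```python
def boosted_score(similarity, clicks, impressions, weight=0.1, prior_clicks=1, prior_impressions=10):
    """Add a small click-through bonus to the semantic similarity score."""
    # Smoothed CTR: a document with no history gets prior_clicks / prior_impressions.
    smoothed_ctr = (clicks + prior_clicks) / (impressions + prior_impressions)
    return similarity + weight * smoothed_ctr

# Hypothetical candidates: (doc_id, similarity, clicks, impressions).
candidates = [
    ("a", 0.80, 5, 200),    # shown often, rarely clicked
    ("b", 0.79, 90, 200),   # shown often, clicked often
    ("new", 0.80, 0, 0),    # added today, no history yet
]

ranked = sorted(
    candidates,
    key=lambda c: boosted_score(c[1], c[2], c[3]),
    reverse=True,
)
ranked_ids = [doc_id for doc_id, _, _, _ in ranked]
print(ranked_ids)
```

With these numbers the frequently clicked document overtakes a slightly more similar one, and the new document lands between them rather than at the bottom. The `weight` value controls how far engagement may override semantic similarity and would need tuning against real query logs.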

How to Solve

Combine the following:
1. Fine-tune embeddings on domain-specific data
2. Implement two-stage retrieval with reranking
3. Use hybrid search (semantic + keyword)
4. Adjust K dynamically based on score distribution
5. Add metadata filtering
6. Boost recently updated documents

See Search Quality Optimization.
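For the hybrid search item, the semantic and keyword result lists have to be merged somehow. The source does not name a fusion method; the sketch below uses reciprocal rank fusion (RRF) as one common choice, with the conventional constant k=60. The document IDs and both input rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical semantic ranking, mirroring the password-reset example on this page.
semantic = [
    "account-security-best-practices",
    "password-requirements-policy",
    "two-factor-authentication-setup",
    "password-reset-guide",
]

# Hypothetical keyword ranking for the same query.
keyword = [
    "password-reset-guide",
    "reset-api-token",
]

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Because RRF works on ranks rather than raw scores, it does not require the semantic and keyword scores to be on a comparable scale. A document that appears in both lists, like the reset guide here, accumulates two contributions and rises above documents found by only one retriever.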

