RAG Scenarios and Solutions
Multilingual Embedding Issues
Monolingual embedding models perform poorly across languages: English queries fail to match non-English documents, translating before embedding loses meaning, and cross-lingual retrieval fails.
The Problem
Symptoms
- ❌ Spanish query returns no relevant results even though Spanish docs exist
- ❌ English query doesn't match the equivalent French document
- ❌ Chinese text is split into fragments or unknown tokens, so its embeddings carry little meaning
- ❌ Must maintain separate indexes per language
- ❌ Machine translation degrades quality
Real-World Example
Knowledge base contains:
→ 500 English docs
→ 200 Spanish docs
→ 100 French docs
User query (Spanish): "¿Cómo autenticar API?"
Translation: "How to authenticate API?"
Embedding model (English-only):
→ Splits the Spanish words into subword fragments it rarely saw in training
→ Poor semantic representation
→ Returns English docs (wrong language)
→ Misses Spanish "Guía de Autenticación" (perfect match!)
Result: User gets English docs they can't read
Deep Technical Analysis
Monolingual Model Limitations
English-trained models fail on other languages:
Vocabulary Coverage:
English tokenizer:
→ Trained on English text
→ Vocabulary: about 50K subwords, almost all English
Spanish text: "autenticación"
→ Not in the vocabulary as a whole word
→ Tokenized as, for example: ["aut", "##ent", "##ic", "##aci", "##ón"]
→ 5 subword pieces, each known to the tokenizer but carrying little meaning on its own
vs English: "authentication"
→ Typically a single known token
→ Proper semantic representation
Fragmented tokenization degrades the Spanish embedding
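The fragmentation described above can be reproduced with a toy greedy longest-match tokenizer in the style of WordPiece. This is a minimal sketch: the vocabulary is invented for illustration (it is not a real model's vocabulary) and the function name `wordpiece_split` is hypothetical. Real tokenizers have tens of thousands of entries, but the matching mechanism is similar.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
TOY_VOCAB = {"authentication", "aut", "##ent", "##ic", "##aci", "##ón"}

english = wordpiece_split("authentication", TOY_VOCAB)
spanish = wordpiece_split("autenticación", TOY_VOCAB)
print(english)
print(spanish)
```

The English word survives as one piece, while the Spanish word is split into five pieces that the model must recombine into a meaning it rarely saw during training.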
Semantic Space Misalignment:
English model learns:
→ "dog" ≈ "puppy" ≈ "canine"
Doesn't learn:
→ "dog" ≈ "perro" (Spanish)
→ "dog" ≈ "chien" (French)
→ "dog" ≈ "犬" (Japanese)
Cross-lingual relationships missing
→ Cannot match concepts across languages
Translation-Based Approaches
Translating before embedding:
Query Translation:
Approach:
1. Detect query language
2. Translate to English (if not English)
3. Embed translated query
4. Search English-embedded docs
Problems:
→ Translation errors compound with embedding errors
→ Ambiguous words: Spanish "banco" → "bank" (financial) or "bench"?
→ Context lost in translation
→ Idioms don't translate literally
→ "Está lloviendo a cántaros" → "It's raining pitchers"?
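The query-translation steps above can be sketched as a small pipeline. The `detect`, `translate`, `embed`, and `search` callables are placeholders supplied by the caller (hypothetical, not a specific library's API), and the stubs in the demo return canned values. The sketch only fixes the order of operations and shows where a translation error would propagate into retrieval.

```python
from typing import Callable, List, Sequence


def translate_then_search(
    query: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], List[str]],
    index_lang: str = "en",
) -> List[str]:
    """Detect the query language, translate if needed, then embed and search."""
    lang = detect(query)
    if lang != index_lang:
        # Any mistranslation here is silently carried into the embedding.
        query = translate(query, lang, index_lang)
    return search(embed(query))


# Demo with stand-in components (all hypothetical stubs).
translate_calls = []

def stub_detect(text):
    return "es" if "¿" in text else "en"

def stub_translate(text, src, dst):
    translate_calls.append((src, dst))
    return "How to authenticate API?"

def stub_embed(text):
    return [float(len(text))]

def stub_search(vector):
    return ["doc-%d" % int(vector[0])]

spanish_result = translate_then_search(
    "¿Cómo autenticar API?", stub_detect, stub_translate, stub_embed, stub_search
)
english_result = translate_then_search(
    "reset password", stub_detect, stub_translate, stub_embed, stub_search
)
```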
Document Translation:
Approach:
1. Translate all docs to English
2. Embed English versions
3. Store original + translation
Problems:
→ Expensive (translate 1000s of docs)
→ Translation quality varies
→ Loses original phrasing/nuance
→ Technical terms mistranslated
→ Updates require re-translation
The Round-Trip Problem:
When the translation is accurate:
Original Spanish doc: "Autenticación de dos factores"
→ Translated to English: "Two-factor authentication"
→ English version embedded
→ User queries in Spanish: "autenticación 2FA"
→ Query translated to English: "2FA authentication"
→ Search → Match!
When the translation is wrong:
Original Spanish doc: "Reiniciar contraseña"
→ Translated to English: "Restart password" (wrong!)
→ Should be: "Reset password"
→ User queries in Spanish: "restablecer contraseña"
→ Query translated to English: "reset password"
→ Search → weak or no match ("restart" ≠ "reset")
Translation errors break retrieval
Multilingual Embedding Models
Models trained on multiple languages:
Multilingual BERT (mBERT):
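Trained on:
→ Wikipedia text in 104 languages simultaneously
→ Shared subword vocabulary across languages
→ Cross-lingual alignment that emerges from joint training, without parallel translation data
Benefits:
→ "dog" and "perro" tend to have similar embeddings
→ Can match across languages
Limitations:
→ Often lower quality than a strong monolingual model in that model's own language
→ Model capacity is shared across 104 languages (each gets less of it)
→ Still biased toward high-resource languages (English)
→ Sentence-level retrieval usually needs a variant fine-tuned for sentence embeddings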
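One way to check whether a given model actually places translations close together is to compare cosine similarity for translation pairs against unrelated pairs. In the sketch below, `embed` stands for whatever embedding function is in use; the lookup table standing in for it contains made-up 2-D vectors chosen only to make the example runnable, and `alignment_gap` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def alignment_gap(embed, translation_pairs, unrelated_pairs):
    """Mean similarity of translation pairs minus mean similarity of unrelated pairs."""
    def mean_similarity(pairs):
        return sum(cosine(embed(a), embed(b)) for a, b in pairs) / len(pairs)
    return mean_similarity(translation_pairs) - mean_similarity(unrelated_pairs)


# Made-up vectors standing in for a real model's output.
TOY_VECTORS = {
    "dog": [1.0, 0.0],
    "perro": [0.9, 0.1],
    "chien": [0.8, 0.2],
    "invoice": [0.0, 1.0],
}

gap = alignment_gap(
    TOY_VECTORS.__getitem__,
    translation_pairs=[("dog", "perro"), ("dog", "chien")],
    unrelated_pairs=[("dog", "invoice")],
)
print(round(gap, 3))
```

A gap near zero would suggest the model does not distinguish translations from unrelated text, which is the behavior expected from an English-only model.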
Language-Specific Performance (indicative figures; actual results depend on model and task):
English: 92% accuracy (high-resource)
Spanish: 85% accuracy (medium-resource)
Vietnamese: 72% accuracy (low-resource)
Swahili: 58% accuracy (very low-resource)
Quality degrades for rare languages
→ Less training data available
→ Poorer representations
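Because quality varies by language, it may be worth measuring retrieval separately per language on your own labeled queries rather than relying on one aggregate score. A minimal sketch, assuming each evaluation record carries a language tag, the ranked document IDs the retriever returned, and the ID of the single relevant document (this record layout and the sample records are assumptions):

```python
from collections import defaultdict


def recall_at_k_by_language(records, k=5):
    """Fraction of queries per language whose relevant doc appears in the top k."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec["lang"]] += 1
        if rec["relevant_id"] in rec["ranked_ids"][:k]:
            hits[rec["lang"]] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Hypothetical evaluation records.
records = [
    {"lang": "en", "ranked_ids": ["a", "b"], "relevant_id": "a"},
    {"lang": "en", "ranked_ids": ["c", "d"], "relevant_id": "d"},
    {"lang": "sw", "ranked_ids": ["e", "f"], "relevant_id": "z"},
    {"lang": "sw", "ranked_ids": ["g", "h"], "relevant_id": "g"},
]

scores = recall_at_k_by_language(records, k=2)
print(scores)
```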
Code-Switching and Mixed Content
Documents mix languages:
Within-Document Language Mixing:
English doc with Spanish terms:
"Configure the autenticación de usuario in settings."
Or:
Technical doc with English API terms:
"Pour configurer l'API, utilisez authenticate() method."
Single language model struggles:
→ English model: "autenticación" tokenized badly
→ French model: "authenticate()" tokenized badly
Need model that handles mixed content
The Technical Term Problem:
Universal technical vocabulary:
→ "API", "database", "OAuth", "GitHub"
→ Used across all languages
→ Spelling is usually identical; only pronunciation varies
French doc: "Utiliser l'API OAuth avec GitHub"
Spanish doc: "Usar la API OAuth con GitHub"
English doc: "Using the OAuth API with GitHub"
Technical terms should align across languages
→ But monolingual models don't ensure this
Character Encoding Issues
Accented and non-Latin text has encoding and script-handling problems:
Unicode Normalization:
Same character, different representations:
→ "é" = U+00E9 (single character)
→ "é" = U+0065 + U+0301 (e + combining acute)
Visually identical, different bytes
→ Different tokens
→ Different embeddings
→ Search fails
Must normalize before embedding:
→ NFC (composed) vs NFD (decomposed)
→ Consistent encoding required
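The normalization step can be done with Python's standard `unicodedata` module. The sketch below shows the two representations of "é" comparing unequal until both are normalized to NFC; the helper name `normalize_for_embedding` is hypothetical, and the same function should be applied to both documents at index time and queries at search time.

```python
import unicodedata


def normalize_for_embedding(text: str) -> str:
    """Return the NFC (composed) form so equivalent strings become identical."""
    return unicodedata.normalize("NFC", text)


composed = "caf\u00e9"      # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e (U+0065) followed by combining acute (U+0301)

print(composed == decomposed)
print(normalize_for_embedding(composed) == normalize_for_embedding(decomposed))
```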
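Right-to-Left Languages:
Arabic, Hebrew:
→ Text displays right-to-left
→ Direction is a rendering concern
→ Stored in logical (reading) order in memory, like any other text
Embedding model expectations:
→ Models read tokens in logical order, so direction alone is not the problem
→ Problems tend to come from limited training data, text extracted in visual order (e.g. from PDFs), and invisible directional control characters
→ Bidirectional text (mixed LTR/RTL) is the most error-prone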
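For bidirectional text, one preprocessing step that may help is flagging strings that contain RTL characters and stripping invisible directional control characters before embedding, so that two visually identical strings do not tokenize differently. This is a sketch under that assumption, not a complete bidi treatment; the helper names are hypothetical, and the control set lists the Unicode directional marks, embeddings, overrides, and isolates.

```python
import unicodedata

# Invisible directional formatting characters (ALM, LRM, RLM, LRE..RLO, LRI..PDI).
BIDI_CONTROLS = {
    "\u061c", "\u200e", "\u200f",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def strip_bidi_controls(text: str) -> str:
    """Remove invisible directional controls; visible characters are untouched."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


def contains_rtl(text: str) -> bool:
    """True if any character has a right-to-left bidi class (Hebrew 'R', Arabic 'AL')."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


hebrew = "\u05e9\u05dc\u05d5\u05dd"          # Hebrew word "shalom"
arabic = "\u0645\u0631\u062d\u0628\u0627"    # Arabic word "marhaban"
wrapped = "\u202b" + hebrew + "\u202c API"   # RTL embedding marks around the word

cleaned = strip_bidi_controls(wrapped)
print(contains_rtl(cleaned), len(wrapped), len(cleaned))
```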
CJK (Chinese, Japanese, Korean):
No spaces between words (Chinese and Japanese; Korean does use spaces):
→ "我喜欢编程" (Chinese: "I like programming")
→ 5 characters, 0 spaces
Whitespace-based English tokenization:
→ Finds no word boundaries
→ Falls back to single characters or unknown tokens
→ Loses word-level semantics
Need proper word segmentation:
→ "我 喜欢 编程" (I / like / programming)
→ Or a tokenizer trained on CJK text
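True word segmentation needs a dictionary or a trained segmenter, which is outside the standard library. As a dependency-free fallback, the sketch below swaps in a different technique: overlapping character bigrams, a common approach for lexical (keyword) indexing of unsegmented text. It is not word segmentation, and the helper names are hypothetical. The range check covers only the main CJK Unified Ideographs block, which is an intentional simplification.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def char_bigrams(text: str):
    """Overlapping two-character windows over the ideographs in `text`."""
    chars = [ch for ch in text if is_cjk_ideograph(ch)]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


bigrams = char_bigrams("我喜欢编程")
print(bigrams)
```

The bigrams happen to include the real words "喜欢" (like) and "编程" (programming) alongside two windows that span word boundaries, which is why this works better for keyword recall than for producing clean tokens.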
Language Detection Challenges
Determining document/query language:
Automatic Detection:
Tools: langdetect, langid, fastText
Short text problems:
→ "API key" (English or universal?)
→ "OK" (English, Spanish, many others)
→ "Taxi" (English, Spanish, French, etc.)
Detection is unreliable on very short text (fewer than about 5 words)
→ Default to English?
→ Try multiple languages?
→ Ask user?
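The "default to English" option for short text can be sketched as a thin wrapper around whichever detector is in use. The `detector` argument is any callable that maps text to a language code (for example, a function from one of the tools listed above); the wrapper name, the 5-word threshold, and the `(language, confident)` return shape are assumptions made for illustration.

```python
def detect_language(text, detector, min_words=5, default="en"):
    """Return (language_code, confident).

    Falls back to `default` when the text is too short to trust or the
    detector raises. Note: whitespace word counting does not suit
    unsegmented scripts such as Chinese or Japanese.
    """
    if len(text.split()) < min_words:
        return default, False
    try:
        return detector(text), True
    except Exception:
        return default, False


# Stand-in detector that always answers Spanish (hypothetical stub).
def always_spanish(text):
    return "es"

short_result = detect_language("API key", always_spanish)
long_result = detect_language("¿Cómo puedo autenticar la API de pagos?", always_spanish)
print(short_result, long_result)
```

Returning the confidence flag alongside the code lets the caller decide between the other two options in the list: searching several languages, or asking the user.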
Mixed Language Documents:
Document:
"Introduction [English]
Chapitre 1: Installation [French]
Chapter 2: Configuration [English]
Capítulo 3: Troubleshooting [Spanish]"
What is the document's language?
→ Multi-language
→ Predominant language: English (50%)
→ But important content in others
How to embed?
→ Per-section with language tags?
→ As single multilingual embedding?
→ Multiple embeddings per doc?
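The per-section option can be sketched as one record per section, each carrying its own language tag, with the predominant language computed for document-level metadata. The section list mirrors the example document above; the record layout, the `manual` document ID, and the helper names are assumptions, and the language tags are taken as given rather than detected.

```python
from collections import Counter

sections = [
    {"title": "Introduction", "lang": "en"},
    {"title": "Chapitre 1: Installation", "lang": "fr"},
    {"title": "Chapter 2: Configuration", "lang": "en"},
    {"title": "Capítulo 3: Troubleshooting", "lang": "es"},
]


def predominant_language(sections):
    """Most common section language and its share of all sections."""
    counts = Counter(s["lang"] for s in sections)
    lang, n = counts.most_common(1)[0]
    return lang, n / len(sections)


def section_records(doc_id, sections):
    """One indexable record per section, each tagged with its own language."""
    return [
        {"id": f"{doc_id}#{i}", "doc_id": doc_id, "lang": s["lang"], "text": s["title"]}
        for i, s in enumerate(sections)
    ]


doc_language = predominant_language(sections)
records = section_records("manual", sections)
print(doc_language, [r["lang"] for r in records])
```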
Cross-Lingual Search Strategies
Retrieval across language boundaries:
Approach 1: Separate Indexes Per Language:
English index: English docs
Spanish index: Spanish docs
French index: French docs
Query in Spanish:
→ Search Spanish index only
→ Get Spanish results
→ Fast, simple
Limitations:
→ Cannot find related English docs
→ User might benefit from English docs too
→ Knowledge siloed by language
Approach 2: Unified Multilingual Index:
Single index with multilingual embeddings:
→ All docs regardless of language
→ Cross-lingual retrieval possible
Query in Spanish:
→ Retrieve Spanish docs (highest similarity)
→ Also retrieve English docs (lower similarity, but relevant)
User can see both:
→ Preferred language first
→ Other languages as fallback
Hybrid Approach:
Language metadata + multilingual search:
1. Detect query language
2. Boost docs in the same language (e.g. 2x score multiplier)
3. Still include docs in other languages
4. Present results with language tags
Best of both:
→ Preferred language prioritized
→ Other languages accessible
→ User choice
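The boosting step can be sketched as a re-ranking pass over hits from a unified multilingual index. The hit layout (`id`, `lang`, `score`), the sample scores, and the function name are assumptions for illustration; the 2x multiplier follows the text above and would normally be tuned.

```python
def rerank_with_language_boost(hits, query_lang, boost=2.0):
    """Multiply the score of same-language hits by `boost`, then sort descending.

    Assumes non-negative similarity scores; a multiplier would push a
    negative score further down instead of up.
    """
    boosted = [
        {**hit, "score": hit["score"] * (boost if hit["lang"] == query_lang else 1.0)}
        for hit in hits
    ]
    return sorted(boosted, key=lambda hit: hit["score"], reverse=True)


# Made-up similarity scores for a Spanish query.
hits = [
    {"id": "en-auth-guide", "lang": "en", "score": 0.62},
    {"id": "es-guia-autenticacion", "lang": "es", "score": 0.55},
    {"id": "fr-guide-auth", "lang": "fr", "score": 0.40},
]

ranked = rerank_with_language_boost(hits, query_lang="es")
print([hit["id"] for hit in ranked])
```

The Spanish guide moves to the top while the English and French documents stay in the list as fallbacks, each still carrying its language tag for display.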
How to Solve
- Use a multilingual embedding model (for example, one built on mBERT or XLM-RoBERTa)
- Normalize Unicode (NFC) before embedding
- Detect the language of queries and documents, and store it as metadata
- Boost same-language results but still allow cross-lingual retrieval
- Consider per-section embeddings for mixed-language docs
See Multilingual Search.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/vectors/multilingual-embeddings.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


