Multilingual Embedding Issues

The Problem

Embedding models often perform poorly across languages: English queries fail to match non-English documents, translation loses meaning, and cross-lingual retrieval breaks down.

Symptoms

❌ Spanish query returns no results despite Spanish docs existing
❌ English query doesn't match French equivalent document
❌ Chinese characters embedded as gibberish
❌ Must maintain separate indexes per language
❌ Machine translation degrades quality

Real-World Example

Knowledge base contains:
→ 500 English docs
→ 200 Spanish docs
→ 100 French docs

User query (Spanish): "¿Cómo autenticar API?"
Translation: "How to authenticate API?"

Embedding model (English-only):
→ Splits Spanish words into fragmented subword pieces or unknown tokens
→ Poor semantic representation
→ Returns English docs (wrong language)
→ Misses Spanish "Guía de Autenticación" (perfect match!)

Result: User gets English docs they can't read

Deep Technical Analysis

Monolingual Model Limitations

English-trained models fail on other languages:

Vocabulary Coverage:

English tokenizer:
→ Trained on English text
→ Vocabulary: 50K English words/subwords

Spanish text: "autenticación"
→ Not in the English vocabulary as a whole word
→ Tokenized as (illustrative split): ["aut", "##ent", "##ic", "##aci", "##ón"]
→ 5 subword fragments, none of which carries the word's meaning on its own

vs English: "authentication"
→ Single known token
→ Proper semantic representation

Spanish embeddings are degraded by this fragmentation
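
The fragmentation above can be reproduced with a small sketch of greedy longest-match subword splitting (the WordPiece-style scheme used by BERT-family tokenizers). The vocabulary here is a hypothetical toy set chosen to mirror the example; a real tokenizer's vocabulary and resulting split will differ.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: one whole English word plus generic fragments.
TOY_VOCAB = {"authentication", "aut", "##ent", "##ic", "##aci", "##ón"}

english_pieces = wordpiece_tokenize("authentication", TOY_VOCAB)
spanish_pieces = wordpiece_tokenize("autenticación", TOY_VOCAB)

print(english_pieces)  # one whole-word token
print(spanish_pieces)  # several fragments
```

The English word maps to a single vocabulary entry, while the Spanish word is covered only by short fragments that the model has mostly seen inside unrelated English words.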

Semantic Space Misalignment:

English model learns:
→ "dog" ≈ "puppy" ≈ "canine"

Doesn't learn:
→ "dog" ≈ "perro" (Spanish)
→ "dog" ≈ "chien" (French)
→ "dog" ≈ "犬" (Japanese)

Cross-lingual relationships missing
→ Cannot match concepts across languages

Translation-Based Approaches

Translating before embedding:

Query Translation:

Approach:
1. Detect query language
2. Translate to English (if not English)
3. Embed translated query
4. Search English-embedded docs

Problems:
→ Translation errors compound
→ "Bank" → "Banco" (financial) or "Orilla" (river)?
→ Context lost in translation
→ Idioms don't translate well
→ "It's raining cats and dogs" → ?
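
The four query-translation steps can be sketched as a small pipeline. All three components are passed in as callables because the source does not name a specific detector, translation service, or embedding model; `fake_detect`, `fake_translate`, and `fake_embed` are stand-in stubs for illustration, not real APIs.

```python
from typing import Callable, Sequence


def embed_query_via_translation(
    query: str,
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str, str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Detect the query language, translate to English if needed, then embed."""
    lang = detect_language(query)
    text = query if lang == "en" else translate_to_english(query, lang)
    return embed(text)


# Stub components for demonstration only (hypothetical, not real services).
def fake_detect(text: str) -> str:
    return "es"


def fake_translate(text: str, source_lang: str) -> str:
    return "How to authenticate API?"


def fake_embed(text: str) -> Sequence[float]:
    return [float(len(text))]  # placeholder "vector"


vector = embed_query_via_translation(
    "¿Cómo autenticar API?", fake_detect, fake_translate, fake_embed
)
```

Every error made by the detector or the translator flows straight into the embedding, which is why the problems listed above compound.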

Document Translation:

Approach:
1. Translate all docs to English
2. Embed English versions
3. Store original + translation

Problems:
→ Expensive (translate 1000s of docs)
→ Translation quality varies
→ Loses original phrasing/nuance
→ Technical terms mistranslated
→ Updates require re-translation

The Round-Trip Problem:

Original Spanish: "Autenticación de dos factores"
→ Translate to English: "Two-factor authentication"
→ Embed English version
→ User queries in Spanish: "autenticación 2FA"
→ Translate to English: "2FA authentication"
→ Search → Match!

But:
Original Spanish: "Reiniciar contraseña"
→ Translate to English: "Restart password" (wrong!)
→ Should be: "Reset password"
→ Embed the mistranslated English version
→ User queries in Spanish: "restablecer contraseña"
→ Translate to English: "Reset password"
→ Search → weak or NO MATCH ("restart" ≠ "reset")

A single translation error on either side breaks retrieval

Multilingual Embedding Models

Models trained on multiple languages:

Multilingual BERT (mBERT):

Trained on:
→ 104 languages simultaneously
→ Shared subword vocabulary across languages
→ Cross-lingual alignment emerges from joint training (no explicit parallel-text objective)

Benefits:
→ "dog" and "perro" tend to have similar embeddings
→ Can match across languages

Limitations:
→ Often lower quality than a strong monolingual model
→ Model capacity is shared across 104 languages (each gets less of it)
→ Still biased toward high-resource languages (English)

Language-Specific Performance (illustrative):

English: 92% accuracy (high-resource)
Spanish: 85% accuracy (medium-resource)
Vietnamese: 72% accuracy (low-resource)
Swahili: 58% accuracy (very low-resource)

Quality degrades for rare languages
→ Less training data available
→ Poorer representations

Code-Switching and Mixed Content

Documents mix languages:

Within-Document Language Mixing:

English doc with Spanish terms:
"Configure the autenticación de usuario in settings."

Or:
Technical doc with English API terms:
"Pour configurer l'API, utilisez authenticate() method."

Single language model struggles:
→ English model: "autenticación" tokenized badly
→ French model: "authenticate()" tokenized badly

Need model that handles mixed content

The Technical Term Problem:

Universal technical vocabulary:
→ "API", "database", "OAuth", "GitHub"
→ Used across all languages
→ Pronunciation may vary

French doc: "Utiliser l'API OAuth avec GitHub"
Spanish doc: "Usar la API OAuth con GitHub"
English doc: "Using the OAuth API with GitHub"

Technical terms should align across languages
→ But monolingual models don't ensure this

Character Encoding Issues

Accented and non-Latin text raises encoding and tokenization problems:

Unicode Normalization:

Same character, different representations:
→ "é" = U+00E9 (single character)
→ "é" = U+0065 + U+0301 (e + combining acute)

Visually identical, different bytes
→ Different tokens
→ Different embeddings
→ Search fails

Must normalize before embedding:
→ Pick one form: NFC (composed) or NFD (decomposed)
→ Apply the same form to documents at index time and to queries at search time
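
A minimal sketch of the normalization step, using Python's standard `unicodedata` module. The helper name `normalize_for_embedding` is an assumption for illustration; the choice of NFC follows the recommendation in this page.

```python
import unicodedata


def normalize_for_embedding(text: str) -> str:
    """Return the NFC (composed) form so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The raw strings differ even though they render identically.
raw_equal = composed == decomposed

# After normalization both collapse to the same single code point.
normalized_equal = (
    normalize_for_embedding(composed) == normalize_for_embedding(decomposed)
)

print(raw_equal, normalized_equal)
```

Calling the same helper on both the indexing path and the query path keeps visually identical strings from producing different tokens.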

Right-to-Left Languages:

Arabic, Hebrew:
→ Text is read right-to-left
→ Direction is applied at rendering time
→ In memory, characters are stored in logical (reading) order

Embedding model expectations:
→ Models trained mostly on left-to-right scripts
→ May have poor vocabulary coverage for RTL scripts
→ Bidirectional text (mixed LTR/RTL) is harder still
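
As a rough illustration, Python strings hold right-to-left text in logical (reading) order, and the standard `unicodedata` module exposes each character's bidirectional class. The helper `is_bidirectional` below is a hypothetical name; it only flags strings that mix strong LTR and RTL characters so they can be inspected or handled separately.

```python
import unicodedata

# Hebrew "shalom": shin, lamed, vav, final mem, in reading order.
hebrew = "\u05e9\u05dc\u05d5\u05dd"

# Index 0 is the first letter a reader sees (shin), not the last one drawn.
first_letter_name = unicodedata.name(hebrew[0])

# Every Hebrew letter has the strong right-to-left class "R".
bidi_classes = [unicodedata.bidirectional(ch) for ch in hebrew]


def is_bidirectional(text: str) -> bool:
    """True if the text mixes strong LTR ("L") and strong RTL ("R"/"AL") characters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    has_rtl = bool(classes & {"R", "AL"})
    has_ltr = "L" in classes
    return has_rtl and has_ltr


mixed = "API " + hebrew
print(first_letter_name, bidi_classes, is_bidirectional(mixed))
```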

CJK (Chinese, Japanese, Korean):

No spaces between words:
→ "我喜欢编程" (Chinese: "I like programming")
→ 5 characters, 0 spaces

English tokenizer assumes spaces:
→ Treats each character separately
→ Loses word-level semantics

Need proper CJK word segmentation:
→ "我 喜欢 编程" (I / like / programming)
→ Proper tokenization
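
The segmentation above can be sketched with forward maximum matching, a classic dictionary-based technique. The three-entry dictionary is a toy assumption; production systems typically rely on a dedicated segmenter or on a multilingual subword tokenizer instead.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Segment text by repeatedly taking the longest dictionary word at the cursor."""
    words, i = [], 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary covering only this example sentence.
TOY_DICTIONARY = {"我", "喜欢", "编程"}

segments = forward_max_match("我喜欢编程", TOY_DICTIONARY)
print(" ".join(segments))
```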

Language Detection Challenges

Determining document/query language:

Automatic Detection:

Tools: langdetect, langid, fastText

Short text problems:
→ "API key" (English or universal?)
→ "OK" (English, Spanish, many others)
→ "Taxi" (English, Spanish, French, etc.)

Detection is often unreliable on very short text (fewer than ~5 words)
→ Default to English?
→ Try multiple languages?
→ Ask user?
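
One way to handle the short-text case is to refuse to guess and let the caller pick a fallback. This is a sketch under assumptions: `detector` is a hypothetical callable returning a `(language, confidence)` pair, and the `min_words` and `min_confidence` thresholds are illustrative values, not tuned recommendations.

```python
from typing import Callable, Optional, Tuple


def detect_language_or_none(
    text: str,
    detector: Callable[[str], Tuple[str, float]],
    min_words: int = 5,
    min_confidence: float = 0.8,
) -> Optional[str]:
    """Return a language code, or None when the input is too short or ambiguous."""
    # Whitespace word count is a crude length signal; it undercounts
    # unsegmented scripts such as Chinese or Japanese.
    if len(text.split()) < min_words:
        return None
    lang, confidence = detector(text)
    return lang if confidence >= min_confidence else None
```

A `None` result lets the caller choose among the options listed above: default to a language, search across all languages, or ask the user.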

Mixed Language Documents:

Document:
"Introduction [English]
Chapitre 1: Installation [French]
Chapter 2: Configuration [English]
Capítulo 3: Troubleshooting [Spanish]"

What is the document's language?
→ Multi-language
→ Predominant language: English (50%)
→ But important content in others

How to embed?
→ Per-section with language tags?
→ As single multilingual embedding?
→ Multiple embeddings per doc?
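
The per-section option can be sketched as one record per section, each tagged with its language. The `SectionRecord` type, the `embed_sections` helper, and the placeholder `embed` function are hypothetical names for illustration; section language tags are assumed to come from detection or authoring metadata.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class SectionRecord:
    doc_id: str
    section_index: int
    language: str
    text: str
    vector: Sequence[float]


def embed_sections(
    doc_id: str,
    sections: List[Tuple[str, str]],
    embed: Callable[[str], Sequence[float]],
) -> List[SectionRecord]:
    """Embed each (language, text) section separately and keep its language tag."""
    return [
        SectionRecord(doc_id, index, language, text, embed(text))
        for index, (language, text) in enumerate(sections)
    ]


mixed_doc = [
    ("en", "Introduction"),
    ("fr", "Chapitre 1: Installation"),
    ("en", "Chapter 2: Configuration"),
    ("es", "Capítulo 3: Troubleshooting"),
]

# Placeholder embedding function; a real model call would go here.
records = embed_sections("doc-1", mixed_doc, embed=lambda text: [float(len(text))])
```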

Cross-Lingual Search Strategies

Retrieval across language boundaries:

Approach 1: Separate Indexes Per Language:

English index: English docs
Spanish index: Spanish docs
French index: French docs

Query in Spanish:
→ Search Spanish index only
→ Get Spanish results
→ Fast, simple

Limitations:
→ Cannot find related English docs
→ User might benefit from English docs too
→ Knowledge siloed by language

Approach 2: Unified Multilingual Index:

Single index with multilingual embeddings:
→ All docs regardless of language
→ Cross-lingual retrieval possible

Query in Spanish:
→ Retrieve Spanish docs (highest similarity)
→ Also retrieve English docs (lower similarity, but relevant)

User can see both:
→ Preferred language first
→ Other languages as fallback

Hybrid Approach:

Language metadata boosting + multilingual search:
1. Detect query language
2. Boost docs in same language (2x multiplier)
3. But still include other languages
4. Present results with language tags

Best of both:
→ Preferred language prioritized
→ Other languages accessible
→ User choice
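
The boosting step can be sketched as a rerank over hits returned by a multilingual index. The hit structure (`id`, `lang`, `score`) and the function name are assumptions; the 2x multiplier comes from the steps above. A multiplicative boost only behaves sensibly for non-negative scores, so negative similarities would need rescaling first.

```python
def rerank_with_language_boost(hits, query_lang, boost=2.0):
    """Multiply same-language scores by `boost`, then sort by score descending."""
    rescored = [
        {**hit, "score": hit["score"] * (boost if hit["lang"] == query_lang else 1.0)}
        for hit in hits
    ]
    return sorted(rescored, key=lambda hit: hit["score"], reverse=True)


# Hypothetical hits with made-up similarity scores, for illustration only.
hits = [
    {"id": "en-auth-guide", "lang": "en", "score": 0.70},
    {"id": "es-guia-autenticacion", "lang": "es", "score": 0.60},
    {"id": "fr-guide-auth", "lang": "fr", "score": 0.50},
]

ranked = rerank_with_language_boost(hits, query_lang="es")
ranked_ids = [hit["id"] for hit in ranked]
```

Documents in other languages stay in the result list with their original scores, so they remain available as a fallback below the boosted same-language hits.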

How to Solve

→ Use multilingual embedding models (mBERT, XLM-RoBERTa)
→ Normalize Unicode encoding (NFC)
→ Implement language detection
→ Boost same-language results but allow cross-lingual retrieval
→ Store language metadata
→ Consider per-section embeddings for mixed-language docs

See Multilingual Search.
