Character Encoding in Chunks
Text with special characters, emojis, or non-ASCII content breaks during embedding or retrieval, causing garbled text in AI responses.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
The Problem
Text with special characters, emojis, or non-ASCII content breaks during embedding or retrieval, causing garbled text in AI responses.
Symptoms
- ❌ Foreign language text displays as "???" or boxes
- ❌ Emojis become broken characters
- ❌ Math symbols are corrupted
- ❌ Smart quotes turn into sequences such as â€œ or �
- ❌ Retrieval fails on non-ASCII queries
Real-World Example
Source document (UTF-8):
"Price: €500 for 10× improvement 🚀"
After ingestion (wrong encoding):
"Price: €500 for 10× improvement ?"
AI response cites corrupted text:
→ User sees garbage characters
→ Cannot understand pricing
→ Trust degraded
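One way to end up with exactly this output is a legacy step that re-encodes text into a single-byte code page and silently replaces anything it cannot represent. The sketch below assumes that code page is Windows-1252 (an assumption; the page does not say which step caused the loss). The euro sign and multiplication sign exist in Windows-1252, so they survive, while the emoji does not.

```python
source = "Price: €500 for 10× improvement 🚀"

# Assumption: some ingestion step re-encodes to Windows-1252 and replaces
# unencodable characters with "?" instead of raising an error.
lossy_bytes = source.encode("cp1252", errors="replace")
ingested = lossy_bytes.decode("cp1252")

print(ingested)  # Price: €500 for 10× improvement ?
```

Because the emoji has been replaced by a literal "?", the original character cannot be recovered from the stored chunk; the document has to be re-ingested from the source.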
Deep Technical Analysis
Encoding Mismatches
UTF-8 vs Latin-1:
Document encoded as UTF-8:
→ "café" = [0x63, 0x61, 0x66, 0xC3, 0xA9]
Read as Latin-1 (wrong):
→ Interprets each UTF-8 byte as one Latin-1 character
→ Displays: "cafÃ©"
Must detect/declare encoding correctly
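The mismatch can be reproduced with the Python standard library alone. This is a minimal sketch: it encodes "café" as UTF-8, decodes the bytes with the wrong codec, and then shows that the damage is reversible as long as no lossy step has happened in between.

```python
text = "café"
utf8_bytes = text.encode("utf-8")
print([hex(b) for b in utf8_bytes])  # ['0x63', '0x61', '0x66', '0xc3', '0xa9']

# Wrong: every byte is treated as one Latin-1 character.
wrong = utf8_bytes.decode("latin-1")
print(wrong)  # cafÃ©

# The round trip is reversible because Latin-1 maps every byte to a character.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired)  # café
```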
Windows-1252 vs UTF-8:
Smart quotes (Word docs):
→ “ ” (curly quotes)
→ Stored as the single bytes 0x93 and 0x94 in Windows-1252
If treated as UTF-8:
→ Those bytes are not valid UTF-8, so they display as � or ?
→ Common with Office doc imports
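The reverse mismatch can be sketched the same way. Here a string with curly quotes is encoded as Windows-1252 and then decoded as UTF-8; a strict decode raises an error, while a lenient decode inserts the replacement character U+FFFD.

```python
quoted = "\u201cHello\u201d"            # curly-quoted "Hello"
cp1252_bytes = quoted.encode("cp1252")  # b'\x93Hello\x94'

# Strict decoding refuses the invalid bytes.
try:
    cp1252_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding swaps each invalid byte for U+FFFD.
lenient = cp1252_bytes.decode("utf-8", errors="replace")
print(strict_failed)  # True
print(lenient)        # �Hello�
```

Failing loudly with the strict decode is usually the safer default during ingestion, since the lenient path discards the original bytes.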
Embedding Model Limitations
Model Vocabulary:
Some embedding models:
→ Trained primarily on English ASCII
→ Limited non-ASCII support
→ May handle poorly:
- Chinese characters
- Arabic script
- Cyrillic
- Emoji
Result: Poor embeddings for non-English
Normalization:
Pre-process before embedding:
→ Convert smart quotes to straight quotes
→ “ ” → " "
→ – (en-dash) → - (hyphen)
Reduces encoding issues
But: Loses typographic distinctions (e.g. an en-dash in a number range vs. a hyphen inside a word)
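A minimal sketch of this normalization step, using only `str.translate`. The function name `normalize_punctuation` is hypothetical, and the single curly quotes are an addition beyond the characters listed above.

```python
# Map typographic punctuation to ASCII equivalents.
PUNCTUATION_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark (assumed addition)
    "\u2019": "'",  # right single quotation mark (assumed addition)
    "\u2013": "-",  # en-dash
})


def normalize_punctuation(text: str) -> str:
    """Return text with smart quotes and en-dashes replaced by ASCII."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("\u201cpages 10\u201320\u201d"))  # "pages 10-20"
```

If the original punctuation matters for display, one option is to embed the normalized text but store the unmodified text alongside it for citation.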
Detection Strategies
Encoding Detection:
Use chardet library (Python):
→ Detects encoding probabilistically
→ "This looks like UTF-8 with 95% confidence"
Apply detected encoding:
→ Decode file correctly
→ Re-encode as UTF-8 standard
Prevents misinterpretation
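The detect-then-standardize flow might look like the sketch below. It assumes the third-party `chardet` package may or may not be installed, so it tries strict UTF-8 first, consults `chardet.detect` only when that fails, and otherwise falls back to Windows-1252. The function name `decode_to_text`, the 0.5 confidence threshold, and the fallback codec are all illustrative choices, not recommendations from this page.

```python
from typing import Tuple


def decode_to_text(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for raw file bytes."""
    # Valid UTF-8 is rarely anything else, so try it strictly first.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    encoding = None
    try:
        import chardet  # optional third-party dependency

        guess = chardet.detect(raw)  # dict with "encoding" and "confidence"
        if guess["encoding"] and guess["confidence"] >= 0.5:  # assumed threshold
            encoding = guess["encoding"]
    except ImportError:
        pass

    encoding = encoding or fallback
    try:
        return raw.decode(encoding, errors="replace"), encoding
    except LookupError:
        # Python has no codec under the detected name; use the fallback.
        return raw.decode(fallback, errors="replace"), fallback


text, used = decode_to_text("café".encode("utf-8"))
utf8_bytes = text.encode("utf-8")  # standardize: store everything as UTF-8
print(used)  # utf-8
```

Detection is probabilistic and unreliable on very short inputs, so it helps to log the encoding that was used for each document.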
Validation:
After ingestion, check:
→ Any � (replacement character)?
→ Unexpectedly high share of non-ASCII characters, or mojibake sequences such as Ã©?
→ Flag for review
Alerts to encoding problems
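These checks can be sketched as a small validator that runs on each chunk after ingestion. The name `validate_chunk`, the 0.3 ratio threshold, and the mojibake pattern are assumptions for illustration. The pattern only covers the most common UTF-8-read-as-Latin-1 or Windows-1252 signatures, and the ratio check will also flag legitimate non-English text, so results are meant for human review rather than automatic rejection.

```python
import re
from typing import List

REPLACEMENT_CHAR = "\ufffd"
# Heuristic: "Ã" or "Â" followed by a character in U+0080..U+00BF, or "â€",
# are typical leftovers of UTF-8 bytes decoded with a single-byte codec.
MOJIBAKE_PATTERN = re.compile("[\u00c3\u00c2][\u0080-\u00bf]|\u00e2\u20ac")


def validate_chunk(text: str, max_non_ascii_ratio: float = 0.3) -> List[str]:
    """Return a list of issue labels; an empty list means nothing was flagged."""
    issues = []
    if REPLACEMENT_CHAR in text:
        issues.append("replacement character")
    if MOJIBAKE_PATTERN.search(text):
        issues.append("possible mojibake")
    if text:
        non_ascii = sum(1 for ch in text if ord(ch) > 127)
        if non_ascii / len(text) > max_non_ascii_ratio:
            issues.append("high non-ASCII ratio")
    return issues


print(validate_chunk("Price list for 2024"))  # []
```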
PDF Extraction Issues
OCR vs Native Text:
PDFs with scanned images:
→ OCR extracts text
→ OCR errors common:
- l vs I vs 1 (ambiguous)
- o vs 0
- Special chars misread
Native PDFs (better):
→ Embedded text preserved
→ Higher fidelity
Font Encoding:
Custom fonts in PDFs:
→ Character mapping non-standard
→ Extraction gives wrong characters
Example:
→ Displays "А" (Cyrillic A)
→ Extracts "A" (Latin A)
→ Looks same, semantically different
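Lookalike substitutions of this kind can sometimes be spotted by checking whether a single word mixes scripts. The sketch below derives a rough script label from the first word of each character's Unicode name via `unicodedata.name`; the helper names are hypothetical. It only catches words that mix scripts, so a word in which every letter was substituted would pass unnoticed.

```python
import unicodedata
from typing import List, Set


def scripts_in_word(word: str) -> Set[str]:
    """Return rough script labels (e.g. LATIN, CYRILLIC) for letters in word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def mixed_script_words(text: str) -> List[str]:
    """Return whitespace-separated words whose letters span several scripts."""
    return [w for w in text.split() if len(scripts_in_word(w)) > 1]


# "P\u0410RIS" contains a Cyrillic А between Latin letters.
suspicious = mixed_script_words("P\u0410RIS price list")
print(len(suspicious))  # 1
```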
How to Solve
- Detect encoding with chardet before processing
- Standardize all content to UTF-8
- Normalize problematic characters (smart quotes, dashes)
- Use multilingual embedding models for non-English content
- Validate extracted text for replacement characters
- Prefer native-text PDFs over OCR when possible
- Test with a multi-language eval set
See Character Encoding.
Last updated January 26, 2026


