Character Encoding in Chunks

The Problem

Text with special characters, emojis, or non-ASCII content breaks during embedding or retrieval, causing garbled text in AI responses.

Symptoms

❌ Foreign language text displays as "???" or boxes
❌ Emojis become broken characters
❌ Math symbols corrupted
❌ Smart quotes become garbled characters (e.g. â€œ)
❌ Retrieval fails on non-ASCII queries

Real-World Example

Source document (UTF-8):
"Price: €500 for 10× improvement 🚀"

After ingestion (wrong encoding):
"Price: â‚¬500 for 10Ã— improvement ?"

AI response cites corrupted text:
→ User sees garbage characters
→ Cannot understand pricing
→ Trust degraded
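
The corruption in this example can be reproduced with the standard library. The sketch below assumes the ingestion step decoded UTF-8 bytes as Windows-1252, which is what the "â‚¬" pattern suggests. The emoji is left out because its "?" points to a separate lossy step that this sketch does not model.

```python
# Source text, written with escapes so the file encoding cannot interfere.
source = "Price: \u20ac500 for 10\u00d7 improvement"  # euro sign, multiplication sign

# Simulate the faulty ingestion: UTF-8 bytes decoded as Windows-1252 (assumption).
garbled = source.encode("utf-8").decode("cp1252")
print(garbled)

# When no bytes were dropped, reversing the two steps restores the original.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == source)
```

The reverse round trip only works while every garbled character still maps back to a single Windows-1252 byte. Once a character has been replaced by "?" the information is gone.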

Deep Technical Analysis

Encoding Mismatches

UTF-8 vs Latin-1:

Document encoded as UTF-8:
→ "café" = [0x63, 0x61, 0x66, 0xC3, 0xA9]

Read as Latin-1 (wrong):
→ Interprets UTF-8 bytes as Latin-1
→ Displays: "cafÃ©"

Must detect/declare encoding correctly
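
A minimal sketch of this mismatch, using only built-in codecs. It shows the UTF-8 bytes of "café" and what happens when they are decoded with the wrong and the right encoding.

```python
# "café" in UTF-8: the "é" is stored as two bytes, 0xC3 0xA9.
raw = "caf\u00e9".encode("utf-8")
print([hex(b) for b in raw])

# Wrong: Latin-1 maps each byte to one character, so "é" becomes two characters.
wrong = raw.decode("latin-1")
print(wrong)

# Right: decode with the encoding the bytes were written in.
right = raw.decode("utf-8")
print(right)
```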

Windows-1252 vs UTF-8:

Smart quotes (Word docs):
→ “ ” (curly quotes)
→ Windows-1252 encoding

If treated as UTF-8:
→ Displays as � or ?
→ Common with Office doc imports
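
The following sketch illustrates the failure with built-in codecs only. It encodes curly quotes as Windows-1252, then decodes those bytes as UTF-8 both strictly and leniently.

```python
# Curly quotes as single Windows-1252 bytes (0x93 and 0x94).
raw = "\u201cquoted\u201d".encode("cp1252")
print(raw)

# Strict UTF-8 decoding rejects these bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("not valid UTF-8:", exc.reason)

# Lenient decoding substitutes U+FFFD, the replacement character.
lossy = raw.decode("utf-8", errors="replace")
print(lossy)

# Decoding with the actual encoding recovers the quotes.
fixed = raw.decode("cp1252")
print(fixed)
```

Strict decoding is often the safer default during ingestion, since it surfaces the mismatch instead of silently storing replacement characters.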

Embedding Model Limitations

Model Vocabulary:

Some embedding models:
→ Trained primarily on English (mostly ASCII) text
→ Limited non-ASCII support
→ May handle poorly:
  - Chinese characters
  - Arabic script
  - Cyrillic
  - Emoji

Result: Poor embeddings for non-English

Normalization:

Pre-process before embedding:
→ Convert smart quotes to straight quotes
→ “ ” → " "
→ – (en-dash) → - (hyphen)

Reduces encoding issues
But: Loses semantic nuance
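
One way to sketch this normalization step is a translation table. The table name, the function name, and the exact set of characters below are illustrative assumptions. The single quotes and the em dash go beyond the examples listed above.

```python
# Hypothetical mapping; extend it to match the characters in your corpus.
PUNCTUATION_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark (assumption)
    "\u2019": "'",  # right single quotation mark (assumption)
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash (assumption)
})


def normalize_punctuation(text: str) -> str:
    """Replace typographic quotes and dashes with ASCII equivalents."""
    return text.translate(PUNCTUATION_MAP)


print(normalize_punctuation("\u201cpages 3\u20135\u201d"))
```

Because the mapping is lossy, it may be worth applying it only to the text sent to the embedding model while storing the original text for display.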

Detection Strategies

Encoding Detection:

Use chardet library (Python):
→ Detects encoding probabilistically
→ "This looks like UTF-8 with 95% confidence"

Apply detected encoding:
→ Decode file correctly
→ Re-encode as UTF-8 standard

Prevents misinterpretation
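
A hedged sketch of the detect-then-standardize flow. `chardet.detect` is the library's real entry point and returns a dict with `encoding` and `confidence` keys. The function name `decode_bytes`, the `min_confidence` threshold, the strict UTF-8 first attempt, and the Windows-1252 fallback are assumptions made for illustration.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:  # keep the sketch runnable without it
    chardet = None


def decode_bytes(raw: bytes, min_confidence: float = 0.5) -> str:
    """Decode raw bytes to str, preferring strict UTF-8 over guessing."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if chardet is not None:
        guess = chardet.detect(raw)  # e.g. {'encoding': ..., 'confidence': ...}
        encoding = guess.get("encoding")
        confidence = guess.get("confidence") or 0.0
        if encoding and confidence >= min_confidence:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass

    # Last resort (assumption): Windows-1252, replacing undefined bytes.
    return raw.decode("cp1252", errors="replace")


# Standardize: whatever came in, store UTF-8.
utf8_bytes = decode_bytes("caf\u00e9".encode("utf-8")).encode("utf-8")
print(utf8_bytes)
```

Detection is probabilistic and tends to be unreliable on short inputs, so running it on a whole file rather than on individual chunks is usually the better choice.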

Validation:

After ingestion, check:
→ Any � (replacement character)?
→ Unexpectedly high share of non-ASCII characters?
→ Flag for review

Alerts to encoding problems
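
These checks can be sketched as a small function that returns reasons to review a chunk. The function name, the marker list, and the 30% threshold are hypothetical and would need tuning per corpus.

```python
from typing import List

REPLACEMENT_CHAR = "\ufffd"

# Hypothetical marker list: fragments that typically appear when UTF-8 bytes
# are decoded as Windows-1252. Extend it for your own corpus.
MOJIBAKE_MARKERS = (
    "\u00c3\u00a9",        # misread e-acute
    "\u00e2\u20ac",        # prefix of misread curly quotes and dashes
    "\u00e2\u201a\u00ac",  # misread euro sign
)


def encoding_warnings(text: str, max_non_ascii_ratio: float = 0.3) -> List[str]:
    """Return human-readable reasons to review a chunk (empty list if none)."""
    warnings: List[str] = []
    if not text:
        return warnings

    replaced = text.count(REPLACEMENT_CHAR)
    if replaced:
        warnings.append(f"{replaced} replacement character(s) found")

    if any(marker in text for marker in MOJIBAKE_MARKERS):
        warnings.append("possible mojibake sequence found")

    non_ascii_ratio = sum(ord(ch) > 127 for ch in text) / len(text)
    if non_ascii_ratio > max_non_ascii_ratio:
        warnings.append(f"non-ASCII ratio {non_ascii_ratio:.0%} exceeds threshold")

    return warnings


print(encoding_warnings("bad \ufffd char"))
```

The ratio check will also flag legitimate non-Latin text such as Chinese or Arabic, so it is best treated as a prompt for review and set per expected language.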

PDF Extraction Issues

OCR vs Native Text:

PDFs with scanned images:
→ OCR extracts text
→ OCR errors common:
  - l vs I vs 1 (ambiguous)
  - o vs 0
  - Special chars misread

Native PDFs (better):
→ Embedded text preserved
→ Higher fidelity

Font Encoding:

Custom fonts in PDFs:
→ Character mapping non-standard
→ Extraction gives wrong characters

Example:
→ Displays "А" (Cyrillic A)
→ Extracts "A" (Latin A)
→ Looks same, semantically different
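
The difference between the two look-alike letters is visible in their Unicode names. The sketch below uses the standard `unicodedata` module. The helpers `scripts_in` and `has_mixed_scripts` are hypothetical and rely on a rough heuristic: the first word of each letter's Unicode name.

```python
import unicodedata

cyrillic_a = "\u0410"
latin_a = "A"

print(cyrillic_a == latin_a)          # the two letters are different code points
print(unicodedata.name(cyrillic_a))
print(unicodedata.name(latin_a))


def scripts_in(word: str) -> set:
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }


def has_mixed_scripts(word: str) -> bool:
    """Flag words whose letters come from more than one script."""
    return len(scripts_in(word)) > 1


print(has_mixed_scripts("P\u0410RIS"))  # Cyrillic letter inside a Latin word
print(has_mixed_scripts("PARIS"))
```

A mixed-script check only catches words where some letters were substituted. A word extracted entirely in the wrong script would pass, so comparing against the expected language of the document is a useful complement.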

How to Solve

→ Detect encoding with chardet before processing
→ Standardize all content to UTF-8
→ Normalize problematic characters (smart quotes, dashes)
→ Use multilingual embedding models for non-English content
→ Validate extracted text for replacement characters
→ Prefer native-text PDFs over OCR when possible
→ Test with a multi-language eval set

See Character Encoding.

