Missing Context in Images

The Problem

Images, diagrams, and screenshots in documents often lack alt text or descriptions, so the information they carry never reaches the text index that RAG retrieval searches.

Symptoms

❌ "See diagram below" - AI can't see diagram
❌ Charts and graphs not described
❌ Architecture diagrams lost
❌ Screenshots provide no text
❌ Visual instructions unusable

Real-World Example

Documentation: "Follow these steps: [Screenshot of UI showing 5 buttons]"

Extracted text: "Follow these steps: [Image]"

Query: "How do I configure settings?"
→ Retrieved chunk mentions "follow steps"
→ But steps are in image (not extracted)

AI response: "The documentation mentions configuration steps but
doesn't provide details."

Visual info lost

Deep Technical Analysis

Image Extraction Challenges

Text Extraction from PDFs:

PDF contains:
→ Text (extractable)
→ Images (stored as binary data; their content is not converted to text automatically)

Standard extraction:
→ Gets text: "See figure 3:"
→ Skips image content

Figure 3 has critical info:
→ Diagram of architecture
→ Flow chart of process
→ Lost in extraction

HTML Image Alt Text:

Good HTML:
<img src="diagram.png" alt="System architecture showing frontend, API, and database">
→ Alt text provides context

Bad HTML:
<img src="diagram.png" alt="image">
→ No useful context

Missing HTML:
<img src="diagram.png">
→ No alt text at all

Depends on source quality
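The three cases above (good, bad, and missing alt text) can be checked at ingestion. The following is a minimal sketch using Python's standard html.parser module. The class name AltTextAuditor, the helper audit_alt_text, and the GENERIC_ALTS list are illustrative assumptions, not part of any particular ingestion tool.

```python
from html.parser import HTMLParser

# Assumed list of placeholder alt values that carry no real description.
# An empty alt ("") is treated as generic here, although in HTML it can
# also legitimately mark a purely decorative image.
GENERIC_ALTS = {"", "image", "img", "picture", "photo", "figure", "diagram"}


class AltTextAuditor(HTMLParser):
    """Collects (src, alt, status) for every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src", "")
        alt = attr_map.get("alt")
        if alt is None:
            status = "missing"       # no alt attribute at all
        elif alt.strip().lower() in GENERIC_ALTS:
            status = "generic"       # alt present but uninformative
        else:
            status = "descriptive"   # alt usable as chunk context
        self.images.append((src, alt, status))


def audit_alt_text(html):
    """Return a list of (src, alt, status) tuples for the given HTML."""
    auditor = AltTextAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.images


if __name__ == "__main__":
    page = (
        '<img src="a.png" alt="System architecture showing frontend, API, and database">'
        '<img src="b.png" alt="image">'
        '<img src="c.png">'
    )
    for src, alt, status in audit_alt_text(page):
        print(src, status)
```

Images flagged as "generic" or "missing" are the ones that would need OCR or a generated description before they contribute anything to retrieval.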

OCR for Image Text

Embedded Text in Images:

Screenshot with text:
→ Button labels
→ Menu items
→ Error messages

Without OCR:
→ Text lost

With OCR (Tesseract, Cloud Vision API):
→ Extract text from image
→ Include in chunk content

Enables retrieval of visual text
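Including OCR output in chunk content might look like the sketch below. The OCR engine is passed in as a plain callable so the example runs without external dependencies. The names build_chunk_with_ocr and fake_ocr are hypothetical, and in practice the callable would wrap a real engine such as Tesseract or a cloud OCR service.

```python
from typing import Callable, Iterable


def build_chunk_with_ocr(
    text: str,
    image_paths: Iterable[str],
    ocr: Callable[[str], str],
) -> str:
    """Append the text recovered from each image to the chunk text."""
    parts = [text.strip()]
    for path in image_paths:
        # Collapse whitespace: OCR output often contains stray line breaks.
        recovered = " ".join(ocr(path).split())
        if recovered:
            parts.append(f"[Image text from {path}: {recovered}]")
    return "\n".join(parts)


def fake_ocr(path: str) -> str:
    """Stand-in for a real OCR engine (hypothetical, for illustration only)."""
    return {"settings.png": "Save\n Cancel   Reset"}.get(path, "")


if __name__ == "__main__":
    chunk = build_chunk_with_ocr(
        "Follow these steps:", ["settings.png", "logo.png"], fake_ocr
    )
    print(chunk)
```

Images that yield no text (here logo.png) add nothing to the chunk, so the original text is never polluted with empty placeholders.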

OCR Limitations:

Works well for:
→ High-resolution screenshots
→ Clear typography
→ Good contrast

Fails for:
→ Handwriting
→ Low resolution
→ Complex backgrounds
→ Stylized fonts

Accuracy roughly 80-95% (varies with image quality, font, and language)
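Because accuracy varies, one common mitigation is to drop low-confidence words before indexing. The sketch below assumes the OCR engine reports a per-word confidence on a 0-100 scale, as Tesseract does. The function name filter_ocr_words and the threshold of 60 are illustrative assumptions.

```python
from typing import List, Tuple


def filter_ocr_words(
    words: List[Tuple[str, float]], min_conf: float = 60.0
) -> Tuple[str, float]:
    """Keep words at or above the threshold and report the share kept."""
    kept = [word for word, conf in words if conf >= min_conf]
    share = len(kept) / len(words) if words else 0.0
    return " ".join(kept), share


if __name__ == "__main__":
    # Hypothetical OCR output: (word, confidence) pairs.
    ocr_words = [("Save", 96.0), ("Canc3l", 41.0), ("Reset", 88.0)]
    text, share = filter_ocr_words(ocr_words)
    print(text, f"{share:.0%} of words kept")
```

A low kept share is a useful signal that the image is a poor OCR candidate (handwriting, low resolution) and should go to a vision-language model instead.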

Vision Language Models

Image Understanding:

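Modern approach: use vision-language models that generate text
→ GPT-4 Vision
→ LLaVA (open source)

Note: CLIP is an embedding model, not a captioning model. It cannot write descriptions (see Multimodal Embeddings).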

Process:
1. Extract image from document
2. Send to vision model
3. Prompt: "Describe this diagram in detail"
4. Model output: Text description
5. Embed description with document text

Makes visual content searchable
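The five-step process above can be sketched as follows. The vision model is injected as a callable so the example stays independent of any provider's API. Chunk, ingest_with_descriptions, and fake_describe are hypothetical names, and the embedding step (step 5) is left to whatever embedding pipeline already handles text chunks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

PROMPT = "Describe this diagram in detail"


@dataclass
class Chunk:
    text: str
    source_image: Optional[str] = None  # set for image-derived chunks


def ingest_with_descriptions(
    doc_text: str,
    image_ids: List[str],
    describe: Callable[[str, str], str],
) -> List[Chunk]:
    """Return the document text plus one description chunk per image."""
    chunks = [Chunk(text=doc_text)]
    for image_id in image_ids:
        # Steps 2-4: send the image and prompt to the model, get text back.
        description = describe(image_id, PROMPT).strip()
        if description:
            chunks.append(
                Chunk(text=f"[Image description] {description}", source_image=image_id)
            )
    return chunks  # Step 5: pass these chunks to the text embedder.


def fake_describe(image_id: str, prompt: str) -> str:
    """Stand-in for a vision-language model call (hypothetical)."""
    return "Architecture diagram: frontend calls the API, which reads the database."


if __name__ == "__main__":
    for chunk in ingest_with_descriptions("See figure 3:", ["fig3.png"], fake_describe):
        print(chunk)
```

Keeping source_image on the chunk makes it possible to show the original image alongside the retrieved description.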

Cost Considerations:

GPT-4 Vision pricing (approximate; check current provider rates):
→ $0.01-0.03 per image (depending on resolution)

Large knowledge base:
→ 10,000 images
→ Cost: $100-300

One-time cost at ingestion
Worth it for image-heavy docs
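The estimate above is simple arithmetic, shown here as a small helper so it can be rerun with other image counts or rates. The per-image prices are the approximate figures quoted above, expressed in cents, and the function name is illustrative.

```python
from typing import Tuple


def estimate_captioning_cost(
    num_images: int, low_cents: int = 1, high_cents: int = 3
) -> Tuple[float, float]:
    """Return the (low, high) one-time ingestion cost in dollars."""
    # Work in whole cents to avoid floating-point drift, then convert.
    return num_images * low_cents / 100, num_images * high_cents / 100


if __name__ == "__main__":
    low, high = estimate_captioning_cost(10_000)
    print(f"${low:,.0f}-${high:,.0f}")  # prints "$100-$300"
```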

Multimodal Embeddings

CLIP Embeddings:

CLIP (OpenAI):
→ Embeds images and text in the same vector space
→ The text "cat photo" and an actual cat photo map to similar vectors

Use case:
→ Query: "Show me the authentication flow diagram"
→ Retrieves: Actual diagram image (embedded)
→ Can display image to user

Beyond just text retrieval
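Retrieval in a shared image-text space reduces to nearest-neighbor search over vectors. The sketch below uses small made-up vectors in place of real CLIP embeddings, which have hundreds of dimensions. The file names, vector values, and the search_images helper are all illustrative assumptions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_images(
    query_vec: List[float], image_index: Dict[str, List[float]], top_k: int = 1
) -> List[str]:
    """Return the names of the top_k images most similar to the query."""
    ranked = sorted(
        image_index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy vectors standing in for image embeddings (not real CLIP output).
IMAGE_INDEX = {
    "auth_flow.png": [0.9, 0.1, 0.2],
    "billing_chart.png": [0.1, 0.8, 0.3],
    "team_photo.jpg": [0.2, 0.2, 0.9],
}

if __name__ == "__main__":
    # Toy vector standing in for the embedded text query
    # "Show me the authentication flow diagram".
    query = [0.85, 0.15, 0.25]
    print(search_images(query, IMAGE_INDEX))  # prints ['auth_flow.png']
```

With real embeddings the same ranking step is usually delegated to a vector database, but the comparison it performs is the one shown here.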

How to Solve

→ Extract alt text from images where available
→ Implement OCR (Tesseract, Cloud Vision) for text in images
→ Use vision-language models (GPT-4 Vision) to describe diagrams and charts
→ Generate descriptive captions for images at ingestion
→ Embed image descriptions as text chunks
→ Consider multimodal embeddings (CLIP) for image-text search
→ Tag chunks with "has_image" metadata for context

See Image Handling.
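Tagging chunks with "has_image" metadata could be done by detecting the placeholders that extraction leaves behind, such as "[Image]". This is a minimal sketch: the placeholder pattern, the image_count field, and the tag_chunk helper are assumptions, and real extractors may emit different markers.

```python
import re
from typing import Any, Dict

# Assumed placeholder formats, e.g. "[Image]" or "[Screenshot of UI ...]".
IMAGE_MARKER = re.compile(r"\[(?:image|screenshot|figure)[^\]]*\]", re.IGNORECASE)


def tag_chunk(text: str) -> Dict[str, Any]:
    """Wrap chunk text with metadata saying whether it references images."""
    markers = IMAGE_MARKER.findall(text)
    return {
        "text": text,
        "metadata": {"has_image": bool(markers), "image_count": len(markers)},
    }


if __name__ == "__main__":
    print(tag_chunk("Follow these steps: [Image]"))
    print(tag_chunk("Plain paragraph with no visuals."))
```

At query time the flag lets the application tell the user that the retrieved passage depends on an image, instead of answering as if the text were complete.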

