RAG Scenarios and Solutions
Missing Context in Images
Images, diagrams, and screenshots in documents lack alt text or descriptions, making visual information inaccessible to RAG retrieval.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
Ingestion pipelines extract text but skip images, diagrams, and screenshots unless they carry alt text or a description. Any information that exists only in a visual never reaches the index, so RAG retrieval cannot surface it.
Symptoms
- ❌ "See diagram below" - AI can't see diagram
- ❌ Charts and graphs not described
- ❌ Architecture diagrams lost
- ❌ Screenshots provide no text
- ❌ Visual instructions unusable
Real-World Example
Documentation: "Follow these steps: [Screenshot of UI showing 5 buttons]"
Extracted text: "Follow these steps: [Image]"
Query: "How do I configure settings?"
→ Retrieved chunk mentions "follow steps"
→ But steps are in image (not extracted)
AI response: "The documentation mentions configuration steps but
doesn't provide details."
Result: the steps exist only in the screenshot, so they are lost before retrieval runs.
Deep Technical Analysis
Image Extraction Challenges
Text Extraction from PDFs:
PDF contains:
→ Text (extractable)
→ Images (not automatically extracted)
Standard extraction:
→ Gets text: "See figure 3:"
→ Skips image content
Figure 3 has critical info:
→ Diagram of architecture
→ Flow chart of process
→ Lost in extraction
HTML Image Alt Text:
Good HTML:
<img src="diagram.png" alt="System architecture showing frontend, API, and database">
→ Alt text provides context
Bad HTML:
<img src="diagram.png" alt="image">
→ No useful context
Missing HTML:
<img src="diagram.png">
→ No alt text at all
How much context alt text provides depends entirely on the quality of the source HTML.
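The three cases above can be told apart at ingestion time. The following is a minimal sketch using Python's standard `html.parser`; the `AltTextAudit` class, the `audit_alt_text` helper, and the `GENERIC_ALT` list of placeholder values are illustrative assumptions, not part of any library.

```python
from html.parser import HTMLParser

# Assumed list of alt values that carry no useful context.
GENERIC_ALT = {"", "image", "img", "picture", "photo", "screenshot"}


class AltTextAudit(HTMLParser):
    """Collects (src, alt, status) for every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        alt = attr.get("alt")
        if alt is None:
            status = "missing"   # no alt attribute (or no value) at all
        elif alt.strip().lower() in GENERIC_ALT:
            status = "generic"   # present but uninformative
        else:
            status = "useful"    # can be indexed as context
        self.images.append((attr.get("src"), alt, status))


def audit_alt_text(html):
    """Return a list of (src, alt, status) tuples for the given HTML string."""
    parser = AltTextAudit()
    parser.feed(html)
    parser.close()
    return parser.images
```

Images reported as "generic" or "missing" are the ones that would need OCR or a generated description before they become retrievable.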
OCR for Image Text
Embedded Text in Images:
Screenshot with text:
→ Button labels
→ Menu items
→ Error messages
Without OCR:
→ Text lost
With OCR (Tesseract, Cloud Vision API):
→ Extract text from image
→ Include in chunk content
Enables retrieval of visual text
OCR Limitations:
Works well for:
→ High-resolution screenshots
→ Clear typography
→ Good contrast
Fails for:
→ Handwriting
→ Low resolution
→ Complex backgrounds
→ Stylized fonts
Accuracy: roughly 80-95%, varying with image quality and OCR engine
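Including OCR output in chunk content can be sketched as follows. The OCR engine is passed in as a callable so the sketch runs without any engine installed; with Tesseract, one option would be to pass `pytesseract.image_to_string` (assuming `pytesseract` and Pillow images are available). The `add_ocr_text` name, the `min_chars` threshold, and the bracketed output format are hypothetical choices.

```python
from typing import Callable, Iterable


def add_ocr_text(chunk_text: str, images: Iterable[object],
                 ocr: Callable[[object], str], min_chars: int = 3) -> str:
    """Append text recognized in each image to the chunk's text."""
    parts = [chunk_text]
    for index, image in enumerate(images, start=1):
        # Collapse line breaks and repeated spaces in the OCR output.
        recognized = " ".join(ocr(image).split())
        # Skip empty or near-empty results, which are usually noise.
        if len(recognized) >= min_chars:
            parts.append(f"[Image {index} text: {recognized}]")
    return "\n".join(parts)
```

A length threshold is only a crude filter; for the failure cases listed above (handwriting, low resolution), a per-word confidence score from the engine would be a better signal where one is available.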
Vision Language Models
Image Understanding:
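Modern approach: use vision-language models that can generate text from an image
→ GPT-4 Vision
→ LLaVA (open source)
CLIP is related but produces embeddings rather than descriptions; see Multimodal Embeddings.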
Process:
1. Extract image from document
2. Send to vision model
3. Prompt: "Describe this diagram in detail"
4. Model output: Text description
5. Embed description with document text
Makes visual content searchable
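The five steps above might look like this in code. This is a sketch under assumptions: `describe` stands in for whichever vision model is used (no real API is called here), and the chunk dictionary layout is hypothetical.

```python
from typing import Callable, Dict, List

PROMPT = "Describe this diagram in detail"


def describe_images(doc_text: str, images: List[bytes],
                    describe: Callable[[bytes, str], str]) -> List[Dict]:
    """Return the document text plus one text chunk per image description."""
    chunks = [{"text": doc_text, "source": "text"}]
    for index, image in enumerate(images):       # step 1: images already extracted
        description = describe(image, PROMPT)    # steps 2-4: model returns text
        chunks.append({
            "text": description.strip(),         # step 5: embedded like any other text
            "source": "image_description",
            "image_index": index,
        })
    return chunks
```

Keeping the description in its own chunk, with a pointer back to the image, lets a retriever return the original image alongside the generated text.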
Cost Considerations:
GPT-4 Vision pricing (approximate, subject to change):
→ $0.01-0.03 per image (depending on resolution)
Large knowledge base:
→ 10,000 images
→ Cost: $100-300
One-time cost at ingestion
Worth it for image-heavy docs
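The estimate above is simple arithmetic, shown here as a small sketch. The `ingestion_cost_usd` helper is hypothetical; prices are passed in whole cents to avoid floating-point rounding.

```python
def ingestion_cost_usd(image_count: int, cents_per_image: int) -> float:
    """One-time description cost in dollars, computed from integer cents."""
    return image_count * cents_per_image / 100


low = ingestion_cost_usd(10_000, 1)    # $0.01 per image
high = ingestion_cost_usd(10_000, 3)   # $0.03 per image
print(f"${low:.0f}-{high:.0f}")        # prints "$100-300"
```

Because the cost is paid once at ingestion, only new or changed images add to it later.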
Multimodal Embeddings
CLIP Embeddings:
CLIP (OpenAI):
→ Embeds images and text in same space
→ "Cat photo" and actual cat photo = similar vectors
Use case:
→ Query: "Show me the authentication flow diagram"
→ Retrieves: Actual diagram image (embedded)
→ Can display image to user
Beyond just text retrieval
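Retrieval in a shared image-text space reduces to nearest-neighbor search over vectors. The sketch below uses toy 3-dimensional vectors in place of real CLIP output; the `cosine` and `search` helpers, the index layout, and all vector values are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_vec, index, top_k=1):
    """Return the top_k index entries most similar to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]),
                    reverse=True)
    return ranked[:top_k]


# Toy vectors standing in for real model output; images and text share one index.
index = [
    {"id": "auth-flow.png", "kind": "image", "vector": [0.9, 0.1, 0.0]},
    {"id": "pricing-table.png", "kind": "image", "vector": [0.0, 0.2, 0.9]},
    {"id": "intro paragraph", "kind": "text", "vector": [0.1, 0.9, 0.1]},
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "authentication flow diagram"
best = search(query_vec, index)[0]
```

In a real system the vectors would come from the model's image and text encoders, and the linear scan would be replaced by a vector database.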
How to Solve
Combine several techniques at ingestion:
- Extract alt text from images where available
- Implement OCR (Tesseract, Cloud Vision) for text in images
- Use vision-language models (GPT-4 Vision) to describe diagrams and charts
- Generate descriptive captions for images at ingestion
- Embed image descriptions as text chunks
- Consider multimodal embeddings (CLIP) for image-text search
- Tag chunks with "has_image" metadata for context
See Image Handling.
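The "has_image" tag can be sketched with a text heuristic. Detecting images from placeholder text is an assumption made here for illustration; an extractor that knows where images occur can set the flag directly. The `IMAGE_HINT` pattern and `tag_chunk` helper are hypothetical.

```python
import re

# Assumed markers: bracketed placeholders left by extraction, or prose
# that points at a visual ("see diagram", "see figure 3").
IMAGE_HINT = re.compile(
    r"\[(?:image|screenshot|figure)[^\]]*\]"
    r"|see (?:the )?(?:diagram|figure|screenshot|chart)",
    re.IGNORECASE,
)


def tag_chunk(text):
    """Wrap chunk text with metadata noting whether it refers to an image."""
    return {"text": text, "metadata": {"has_image": bool(IMAGE_HINT.search(text))}}
```

At answer time, a chunk with `has_image` set can prompt the system to say that relevant details may sit in an image rather than reporting that the documentation lacks them.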
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/data-quality/alt-text.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


