RAG Scenarios and Solutions
HTML to Text Conversion Problems
Converting HTML documents to plain text loses structure and formatting, lets navigation elements contaminate the content, and misses JavaScript-rendered content entirely.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
The Problem
Naive HTML-to-text conversion fails in three ways: it discards structure and formatting, it lets navigation elements contaminate the content, and it misses JavaScript-rendered content entirely.
Symptoms
- ❌ Navigation menus mixed into article text
- ❌ "Click here" buttons appear as plain text
- ❌ CSS-hidden content extracted (e.g., mobile menus)
- ❌ <div> soup with no semantic structure
- ❌ Ads and tracking scripts in extracted text
Real-World Example
<html>
<header>
<nav>Home | About | Products | Contact</nav>
</header>
<main>
<article>
<h1>Getting Started Guide</h1>
<p>Welcome to our platform...</p>
</article>
</main>
<footer>© 2024 Company | Privacy | Terms</footer>
</html>
Naive text extraction:
"Home About Products Contact Getting Started Guide Welcome to our platform... © 2024 Company Privacy Terms"
All elements flattened, navigation mixed with content
Deep Technical Analysis
Semantic HTML vs Div Soup
Modern HTML uses semantic tags:
Semantic HTML5:
<article>: Main content
<nav>: Navigation
<header>: Page/section header
<footer>: Page/section footer
<aside>: Sidebar/tangential content
<main>: Primary content
Best Case Extraction:
<main>
<article>
<p>Content to extract</p>
</article>
</main>
<nav>Menu items (ignore)</nav>
Algorithm:
1. Find <article> or <main>
2. Extract only from these tags
3. Ignore <nav>, <header>, <footer>
Works well for semantic HTML
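The three steps above can be sketched with Python's standard-library html.parser. This is a minimal illustration, not a production extractor: the names SemanticExtractor and extract_main_text are hypothetical, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"article", "main"}          # step 1: where content lives
CHROME_TAGS = {"nav", "header", "footer"}   # step 3: page chrome to ignore


class SemanticExtractor(HTMLParser):
    """Collect text inside <article>/<main>, skipping chrome nested within."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # > 0 while inside <article> or <main>
        self.chrome_depth = 0   # > 0 while inside <nav>, <header>, <footer>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.content_depth += 1
        elif tag in CHROME_TAGS:
            self.chrome_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.content_depth:
            self.content_depth -= 1
        elif tag in CHROME_TAGS and self.chrome_depth:
            self.chrome_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Step 2: keep text only when inside a content tag and outside chrome.
        if text and self.content_depth and not self.chrome_depth:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    parser = SemanticExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Applied to the page from the Real-World Example, this returns only "Getting Started Guide Welcome to our platform...". One caveat of this sketch: a <header> nested inside an <article> is also skipped, which may drop a title on some sites.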
Worst Case (Div Soup):
<div class="content">
<div class="header">Menu</div>
<div class="main">
<div class="article">Actual content</div>
</div>
<div class="sidebar">Ads</div>
</div>
No semantic tags, only <div> with classes
→ Must infer content from class names
→ "content", "main", "article" are hints
→ But no standards, site-specific
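One way to act on class-name hints is sketched below. The hint lists are assumptions chosen for illustration (real sites need per-site tuning), the names classify and ClassHintExtractor are hypothetical, and the stack-based tracking assumes every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical hint lists; there is no standard, so tune these per site.
CONTENT_HINTS = {"content", "main", "article", "post", "entry"}
NOISE_HINTS = {"header", "footer", "sidebar", "nav", "menu", "ad", "ads"}
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}  # never closed


def classify(attrs):
    """Return 'noise', 'content', or None based on the class attribute."""
    classes = set((dict(attrs).get("class") or "").lower().split())
    if classes & NOISE_HINTS:
        return "noise"
    if classes & CONTENT_HINTS:
        return "content"
    return None


class ClassHintExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # one label per open element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        inherited = self.stack[-1] if self.stack else None
        # Noise is sticky: anything nested inside a noise block stays noise.
        if inherited == "noise":
            label = "noise"
        else:
            label = classify(attrs) or inherited
        self.stack.append(label)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] == "content":
            self.parts.append(text)


def extract_by_class_hints(html: str) -> str:
    parser = ClassHintExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For the div-soup example above, this keeps "Actual content" and drops "Menu" and "Ads".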
CSS Display and Visibility
HTML content may be visually hidden:
Display: None:
<div style="display:none">
Mobile menu (hidden on desktop)
</div>
Extraction Issue:
Text extraction sees:
→ "Mobile menu (hidden on desktop)"
→ Even though visually hidden
Should the extracted text:
→ Include hidden content? (it exists in DOM)
→ Exclude hidden content? (user doesn't see it)
Use case dependent:
→ Mobile menu: Exclude (duplicate of main nav)
→ Spoiler/accordion: Include (real content, just collapsed)
Visibility: Hidden vs Opacity:0:
<div style="visibility:hidden">Content A</div>
<div style="opacity:0">Content B</div>
<div class="sr-only">Screen reader only text</div>
All invisible to sighted users
→ visibility:hidden: Layout space reserved
→ opacity:0: Transparent but present
→ sr-only: For accessibility
Should extract:
→ sr-only: Yes (descriptive text written for screen readers)
→ others: Debatable
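Because the right policy depends on the use case, a configurable filter is one option. The sketch below only inspects inline style attributes; rules that live in stylesheets (including whatever a site's sr-only class does) are not evaluated, so sr-only text is kept. The names DISPLAY_NONE, STRICT, and extract_visible_text are illustrative.

```python
import re
from html.parser import HTMLParser

# Default policy: drop display:none only. STRICT also drops visibility:hidden.
DISPLAY_NONE = re.compile(r"display\s*:\s*none", re.I)
STRICT = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self, hidden_pattern):
        super().__init__()
        self.hidden_pattern = hidden_pattern
        self.hidden_stack = []  # True for each open element that is hidden
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = bool(self.hidden_stack and self.hidden_stack[-1])
        # A child of a hidden element is hidden too.
        self.hidden_stack.append(
            parent_hidden or bool(self.hidden_pattern.search(style))
        )

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        hidden = bool(self.hidden_stack and self.hidden_stack[-1])
        if text and not hidden:
            self.parts.append(text)


def extract_visible_text(html: str, hidden_pattern=DISPLAY_NONE) -> str:
    parser = VisibleTextExtractor(hidden_pattern)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Passing hidden_pattern=STRICT switches to the stricter policy; collapsed accordions that should be indexed would need the default or a site-specific exception.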
Navigation and UI Elements
Page chrome contamination:
Navigation Extraction:
<nav>
<ul>
<li><a href="/">Home</a></li>
<li><a href="/about">About</a></li>
<li><a href="/products">Products</a></li>
</ul>
</nav>
Text extraction: "Home About Products"
Appears in every page:
→ 50 pages on site
→ All have "Home About Products" in extracted text
→ Repetitive noise
→ Dilutes unique content signal
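When navigation cannot be removed at the markup level, the repetition itself can be used as a signal. The sketch below drops any line that appears on more than a given share of pages; the function name strip_repeated_lines and the 0.5 threshold are assumptions, and the approach presumes each page's text has already been extracted with one block per line.

```python
from collections import Counter


def strip_repeated_lines(pages, max_share=0.5):
    """Remove lines that occur on more than max_share of the pages."""
    counts = Counter()
    for text in pages:
        # Count each distinct line once per page.
        unique_lines = {line.strip() for line in text.splitlines() if line.strip()}
        counts.update(unique_lines)

    limit = max_share * len(pages)
    cleaned = []
    for text in pages:
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and counts[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

With only a handful of pages the threshold is unreliable, so this works best on a full site crawl.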
Button and Link Text:
<button>Click here</button>
<a href="/signup">Learn more</a>
<a href="/docs">Read documentation →</a>
Extracted: "Click here Learn more Read documentation →"
Out of context:
→ "Click here" meaningless (click where?)
→ "Learn more" vague (learn about what?)
→ Arrow "→" is decorative (visual only)
Better to:
→ Extract link destination as context
→ "Sign up (Learn more)"
→ "Documentation (Read documentation)"
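A simple variant of this idea appends the raw link destination to the link text and trims trailing decorative arrows. Turning "/signup" into a readable label such as "Sign up" would need a site-specific mapping, which this sketch does not attempt; the names LinkContextExtractor and extract_with_link_context are hypothetical.

```python
from html.parser import HTMLParser

DECORATIVE = " →»›"  # trailing characters treated as visual-only


class LinkContextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.href = None
        self.link_text = []
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            text = " ".join(self.link_text).rstrip(DECORATIVE)
            if text:
                # Attach the destination so vague link text keeps some context.
                self.parts.append(f"{text} ({self.href})" if self.href else text)
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_link:
            self.link_text.append(text)
        else:
            self.parts.append(text)


def extract_with_link_context(html: str) -> str:
    parser = LinkContextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For the two links in the example above, this yields "Learn more (/signup) Read documentation (/docs)".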
Forms and Input Fields
Form elements have special extraction needs:
Form HTML:
<form>
<label for="email">Email:</label>
<input type="email" id="email" placeholder="you@example.com">
<button>Submit</button>
</form>
Extraction Variants:
Option 1: Extract labels only
"Email: Submit"
Option 2: Extract labels + placeholders
"Email: you@example.com Submit"
→ Placeholder looks like content (wrong)
Option 3: Skip forms entirely
(no text extracted)
Best practice:
→ Extract labels (field names)
→ Skip inputs, buttons, placeholders
→ "Email:" is content
→ "you@example.com" is just hint
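The best-practice rule above can be sketched as follows. Placeholders live in attributes, so a text-only pass never sees them; the sketch additionally skips the text of buttons and similar controls. The set SKIP_TAGS and the name extract_form_labels are assumptions for illustration.

```python
from html.parser import HTMLParser

# Controls whose inner text is UI, not content (an illustrative list).
SKIP_TAGS = {"button", "select", "option", "textarea"}


class FormLabelExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped control
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <input> carries its hint in the placeholder attribute, which is
        # never emitted as text, so it needs no special handling here.
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.parts.append(text)


def extract_form_labels(html: str) -> str:
    parser = FormLabelExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For the form shown above, the result is "Email:" with neither the placeholder nor "Submit".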
Script Tags and Style Blocks
Non-content elements:
JavaScript Inline:
<script>
function trackEvent() {
analytics.send('pageview');
}
</script>
Text Extraction:
Naive extraction includes:
"function trackEvent() { analytics.send('pageview'); }"
This is code, not content!
→ Should be excluded
→ But: How to distinguish from <code> blocks (which should be included)?
Solution:
→ Strip <script> tags entirely
→ Keep <code> and <pre> tags
CSS Inline:
<style>
.header { color: blue; font-size: 24px; }
</style>
Also not content
→ Should exclude
→ Many extraction libraries do this automatically, but low-level parsers may pass <style> text through as ordinary data, so verify
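The distinction between <script>/<style> and <code>/<pre> is easy to implement by tag name. In the sketch below, built on Python's html.parser (which reports script and style bodies as ordinary data), the first pair is dropped and everything else, including code samples, is kept. The name extract_content_text is illustrative.

```python
from html.parser import HTMLParser

NON_CONTENT_TAGS = {"script", "style"}


class ContentTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NON_CONTENT_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_CONTENT_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # <code> and <pre> are not in NON_CONTENT_TAGS, so their text is kept.
        if text and not self.skip_depth:
            self.parts.append(text)


def extract_content_text(html: str) -> str:
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Note that this also discards JSON-LD, which is delivered in a <script> tag; structured data has to be collected separately before stripping.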
Generated Content (CSS ::before/::after)
CSS can inject text:
Pseudo-Elements:
.warning::before {
content: "⚠️ Warning: ";
}
<div class="warning">System maintenance tonight</div>
Visual Rendering:
⚠️ Warning: System maintenance tonight
Text Extraction:
HTML DOM only contains:
"System maintenance tonight"
Missing: "⚠️ Warning: "
→ Generated by CSS, not in DOM
→ Text extraction sees incomplete sentence
Headless browser rendering needed:
→ Render page with CSS
→ Extract computed text (including ::before/::after)
→ More accurate but much slower
Table Extraction from HTML
HTML tables need structure preservation:
Table HTML:
<table>
<thead>
<tr><th>Product</th><th>Price</th></tr>
</thead>
<tbody>
<tr><td>Widget</td><td>$10</td></tr>
</tbody>
</table>
Extraction Formats:
Option 1: Markdown table
| Product | Price |
|---------|-------|
| Widget | $10 |
Option 2: CSV-style
Product, Price
Widget, $10
Option 3: Linearized prose
"Product: Widget, Price: $10"
Option 4: Just text (structure lost)
"Product Price Widget $10"
Best: Markdown (preserves structure, readable)
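A minimal sketch of the Markdown option follows. It assumes a single table whose first row is the header, with no colspan or rowspan, and the name table_to_markdown is hypothetical.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None   # list of cells while inside <tr>
        self.cell = None  # list of text fragments while inside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row, self.cell = [], None
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            # Escape pipes so cell text cannot break the Markdown layout.
            self.row.append(" ".join(self.cell).replace("|", "\\|"))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row, self.cell = None, None

    def handle_data(self, data):
        if self.cell is not None and data.strip():
            self.cell.append(data.strip())


def table_to_markdown(html: str) -> str:
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Merged cells and nested tables need extra handling that is out of scope for this sketch.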
Image Alt Text and Captions
Images carry semantic information:
Alt Text:
<img src="diagram.png" alt="System architecture diagram showing 3-tier design">
Extraction Importance:
Without alt text:
→ Image invisible to text extraction
→ "See diagram below" references nothing
→ Incomplete information
With alt text:
→ "System architecture diagram showing 3-tier design"
→ LLM has description of visual
→ Can partially answer questions about diagram
Alt text is critical content
→ Must include in extraction
Figure Captions:
<figure>
<img src="chart.png" alt="Performance chart">
<figcaption>Figure 1: Query performance over time</figcaption>
</figure>
Should extract:
"Figure 1: Query performance over time. [Image: Performance chart]"
Both caption and alt text provide context
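The following sketch emits alt text as an "[Image: ...]" marker and, inside a <figure>, places the markers after the caption to match the format above. The marker format and the name extract_with_images are assumptions; images with empty alt attributes are treated as decorative and skipped.

```python
from html.parser import HTMLParser


class ImageTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.figure_images = None  # buffered markers while inside <figure>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.figure_images = []
        elif tag == "img":
            alt = (dict(attrs).get("alt") or "").strip()
            if not alt:
                return  # empty alt: treat as decorative
            marker = f"[Image: {alt}]"
            if self.figure_images is not None:
                self.figure_images.append(marker)  # emit after the caption
            else:
                self.parts.append(marker)

    def handle_endtag(self, tag):
        if tag == "figure" and self.figure_images is not None:
            self.parts.extend(self.figure_images)
            self.figure_images = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def extract_with_images(html: str) -> str:
    parser = ImageTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For the figure above, this produces "Figure 1: Query performance over time [Image: Performance chart]".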
Microdata and Structured Data
Schema.org and other structured markup:
JSON-LD:
<script type="application/ld+json">
{
"@type": "Article",
"headline": "Getting Started Guide",
"author": "John Smith",
"datePublished": "2024-01-15"
}
</script>
Extraction Opportunity:
Structured data provides:
→ Article title
→ Author
→ Date
→ Other metadata
Can augment extracted text:
"Getting Started Guide (by John Smith, published 2024-01-15)"
Adds context beyond visible text
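Collecting JSON-LD takes only a few lines with the standard library. The sketch below handles the simple case shown above (a single Article object); the name describe_article is hypothetical, and real pages may use @graph arrays or nested author objects that need more handling than the small author check here.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            try:
                self.items.append(json.loads("".join(self.buffer)))
            except ValueError:
                pass  # malformed JSON-LD is ignored in this sketch
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)


def describe_article(html: str):
    """Return 'Headline (by Author, published Date)' or None."""
    parser = JsonLdCollector()
    parser.feed(html)
    parser.close()
    for item in parser.items:
        if isinstance(item, dict) and item.get("@type") == "Article":
            author = item.get("author")
            if isinstance(author, dict):  # author may be a nested object
                author = author.get("name")
            details = []
            if author:
                details.append(f"by {author}")
            if item.get("datePublished"):
                details.append(f"published {item['datePublished']}")
            headline = item.get("headline", "")
            return f"{headline} ({', '.join(details)})" if details else headline
    return None
```

The returned string can be prepended to the extracted body text so that each chunk carries its title, author, and date.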
Single Page Applications (SPAs)
JavaScript-rendered content:
Initial HTML (before JS):
<div id="root"></div>
<script src="app.js"></script>
After JavaScript Executes:
<div id="root">
<h1>Welcome</h1>
<p>Actual content rendered by React...</p>
</div>
The Empty Shell Problem:
HTTP request fetches:
→ Empty <div id="root"></div>
→ No content visible
Text extraction:
→ Nothing to extract!
Solution required:
→ Headless browser (Puppeteer, Playwright)
→ Execute JavaScript
→ Wait for content to load
→ Then extract
10-100x slower than static HTML extraction
→ But necessary for SPAs
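Because headless rendering is expensive, it can help to detect empty shells first and render only those pages. The heuristic below flags a page when it loads scripts but carries almost no visible text; the function name needs_js_rendering and the 200-character threshold are assumptions that should be tuned on real data.

```python
from html.parser import HTMLParser


class ShellProbe(HTMLParser):
    """Count script tags and visible text characters in static HTML."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.text_chars = 0
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text_chars += len(data.strip())


def needs_js_rendering(html: str, min_chars: int = 200) -> bool:
    """Guess whether the static HTML is an empty SPA shell."""
    probe = ShellProbe()
    probe.feed(html)
    probe.close()
    return probe.scripts > 0 and probe.text_chars < min_chars
```

Pages flagged by this check would then go through a headless browser such as Puppeteer or Playwright, while everything else stays on the fast static path.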
How to Solve
Combine the following techniques:
- Use semantic HTML tags (article, main) to identify content areas
- Strip navigation, headers, and footers
- Exclude display:none elements
- Extract alt text from images
- Use a headless browser for JavaScript-rendered content
- Convert tables to Markdown format
See HTML Extraction.
Last updated January 26, 2026


