Website Scraping Problems
Your website data source fails to crawl, only scrapes the homepage, or extracts garbled text instead of clean content.
The Problem
Your website data source fails to crawl, only scrapes the homepage, or extracts garbled text instead of clean content.
Symptoms
- ❌ Only 1-2 pages scraped from a 200-page site
- ❌ Content shows HTML tags mixed with text
- ❌ "Timeout" or "Connection Refused" errors
- ❌ JavaScript-rendered content missing
- ❌ Infinite crawl never completes
Real-World Example
Your documentation site: docs.company.com (300 pages)
After crawl: Only 8 pages in knowledge base
Crawl results (sample):
✓ /getting-started
✗ /api-reference (JavaScript-rendered)
✗ /guides/* (blocked by robots.txt)
✗ /admin/* (requires authentication)
Status: "Crawl completed with warnings"
Deep Technical Analysis
The JavaScript Rendering Problem
Many modern documentation sites use JavaScript frameworks (React, Vue, Next.js) that can render content client-side:
Traditional HTML (easy to scrape):
HTTP Request → Server returns:
<html>
<body>
<h1>Getting Started</h1>
<p>This is the content...</p>
</body>
</html>
Modern SPA (hard to scrape):
HTTP Request → Server returns:
<html>
<body>
<div id="root"></div>
<script src="app.js"></script>
</body>
</html>
Actual content lives in app.js:
→ Browser downloads JavaScript
→ JavaScript executes
→ React renders content dynamically
→ DOM populated with <h1>, <p>, etc.
Why This Breaks Scraping:
Standard HTTP Scraper:
1. GET https://docs.company.com/api-reference
2. Receive HTML response
3. Parse HTML with BeautifulSoup/Cheerio
4. Extract text from <body>
Result:
<div id="root"></div> ← Empty!
The actual content never appears because:
→ Scraper doesn't execute JavaScript
→ No browser engine to render React components
→ Only sees the empty shell HTML
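A crawler can often detect this case before indexing an empty page. The following is a minimal sketch, not Twig's actual logic: the `looks_like_empty_shell` helper and its 50-character threshold are hypothetical, and the heuristic simply flags responses that load scripts but carry almost no visible text.

```python
from html.parser import HTMLParser


class _VisibleTextCounter(HTMLParser):
    """Counts visible text characters, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.visible_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_empty_shell(html, min_visible_chars=50):
    """Heuristic (assumed threshold): scripts present but almost no visible text."""
    counter = _VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

A page flagged this way is a candidate for headless rendering rather than plain HTTP scraping.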
The Headless Browser Requirement:
To scrape JavaScript sites:
1. Launch headless Chrome (Puppeteer/Playwright)
2. Navigate to URL
3. Wait for JavaScript to execute
4. Wait for DOM to fully render
5. Extract rendered HTML
6. Parse content
Cost:
→ Often 50-100x slower than HTTP-only scraping (varies by site and wait strategy)
→ Requires Chrome/Chromium binary
→ High memory usage (each open page holds its own browser tab or context)
→ Prone to timeouts and crashes
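The steps above can be sketched with Playwright's synchronous Python API. This is an illustrative sketch that assumes Playwright and a Chromium build are installed; the `render_page` name and the 30-second timeout are hypothetical choices, and waiting for network idle is only one possible signal that rendering has finished.

```python
def render_page(url, timeout_ms=30_000):
    """Return the rendered HTML of `url` after client-side JavaScript has run."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # step 1: launch headless Chromium
        try:
            page = browser.new_page()
            # Steps 2-4: navigate, then wait until network activity settles.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()  # step 5: the rendered DOM serialized as HTML
        finally:
            browser.close()  # always release the browser, even after a timeout
```

The returned HTML can then be parsed like any static page (step 6). Reusing one browser across many pages, instead of launching one per URL as here, is a common way to reduce the memory and startup cost.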
Robots.txt and Crawl Restrictions
Many sites explicitly block scrapers via robots.txt:
Example robots.txt:
User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /api/
Crawl-delay: 10
User-agent: GPTBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
The Compliance Dilemma:
Ethical Scraping:
→ Must respect robots.txt
→ If /api-reference blocked → skip it
→ If crawl-delay: 10 → wait 10 seconds between requests
But:
→ User expects ALL documentation in knowledge base
→ "Why isn't /api-reference indexed?"
→ "The scraper is broken!"
Reality: It's obeying robots.txt
User wants: Override restrictions
Twig must: Follow legal/ethical guidelines
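Checking these rules does not require custom parsing. As a minimal sketch, Python's standard `urllib.robotparser` can evaluate the example robots.txt above; the `ExampleDocsBot` user agent is hypothetical and stands in for whatever name the crawler actually sends.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /api/
Crawl-delay: 10

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""

USER_AGENT = "ExampleDocsBot"  # hypothetical crawler name


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
allowed = rules.can_fetch(USER_AGENT, "https://docs.company.com/getting-started")
blocked = rules.can_fetch(USER_AGENT, "https://docs.company.com/admin/users")
delay = rules.crawl_delay(USER_AGENT)  # seconds to wait between requests, or None
```

Recording which rule caused a skip lets the crawl report say "blocked by robots.txt" instead of leaving the user to assume the scraper is broken.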
The Dynamic Robots.txt Problem:
Scenario:
1. Twig crawls site at 10:00 AM
2. robots.txt allows everything
3. 500 pages scraped successfully
4. Site owner adds a Disallow rule covering the crawler (for example a GPTBot block) at 2:00 PM
5. Next sync at 3:00 PM
6. Twig re-crawls, now blocked
Result:
→ Previous 500 pages stay in vector DB
→ New pages can't be added
→ Knowledge base becomes stale
→ User sees "sync stopped working"
Infinite Crawl and Link Cycle Detection
Web crawlers can get stuck in infinite loops:
The Pagination Problem:
Blog with infinite scroll:
/blog → /blog?page=2 → /blog?page=3 → ... → /blog?page=999
Each page has "Next →" link
Crawler follows every link
Never terminates
Or:
/search?q=test → /search?q=test&sort=date → /search?q=test&sort=date&page=2
Each URL is unique, but content is identical
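A common safeguard is to cap both crawl depth and total page count so that an endless "Next" chain cannot run forever. The sketch below is illustrative only: `bounded_crawl`, its limits, and the `get_links` callback (which stands in for fetching a page and extracting its links) are all hypothetical.

```python
from collections import deque


def bounded_crawl(start_url, get_links, max_pages=100, max_depth=5):
    """Breadth-first crawl that stops at a page budget or a depth limit."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # visit this page, but do not follow its links any deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def fake_blog_links(url):
    """Simulates a blog where every page links to a 'next' page, without end."""
    page = int(url.split("page=")[1]) if "page=" in url else 1
    return [f"/blog?page={page + 1}"]


pages = bounded_crawl("/blog", fake_blog_links, max_pages=50, max_depth=10)
```

With these limits the crawl ends after the start page plus ten levels of "Next" links, even though the simulated blog never runs out of pages.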
Cycle Detection Challenge:
URL normalization issues:
→ https://docs.company.com/guide
→ https://docs.company.com/guide/
→ https://docs.company.com/guide?ref=sidebar
→ https://docs.company.com/guide#overview
Are these the same page? To a crawler:
→ Different URLs (should crawl all)
To a human:
→ Same content (only crawl once)
Deduplication logic must:
1. Normalize URLs (remove trailing slashes, fragments, and tracking-only query params such as ref; keep params that change content, such as page)
2. Compare content hashes
3. Skip if already seen
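One way to sketch this deduplication logic uses only the standard library. The `TRACKING_PARAMS` set is an assumption about which query parameters never change page content; a real crawler would need a list tuned to the site.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not affect page content.
TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url):
    """Canonical form: lowercase host, no fragment, no trailing slash, no tracking params."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), ""))


def content_hash(text):
    """Hash of whitespace-normalized text, used to catch identical pages at different URLs."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()


variants = [
    "https://docs.company.com/guide",
    "https://docs.company.com/guide/",
    "https://docs.company.com/guide?ref=sidebar",
    "https://docs.company.com/guide#overview",
]
unique_urls = {normalize_url(u) for u in variants}
```

All four variants collapse to a single canonical URL, while a URL such as `/blog?page=2` keeps its `page` parameter because that parameter changes the content.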
The Subdomain Explosion:
Start URL: https://docs.company.com
Crawler finds links:
→ https://api.company.com (different subdomain)
→ https://blog.company.com (different subdomain)
→ https://status.company.com (different subdomain)
→ https://company.com (main site)
Should crawler follow these?
→ Yes? Might scrape entire company website (thousands of pages)
→ No? Might miss important API documentation on api.company.com
User expects: "Just crawl the docs"
Reality: "Docs" spans 5 subdomains
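An explicit scope rule makes this decision configurable rather than implicit. The sketch below assumes a user-supplied allowlist of hosts and optional path prefixes; the `in_scope` function and its parameters are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url, allowed_hosts, allowed_path_prefixes=("/",)):
    """Follow a link only if its host is allowlisted and its path matches a prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host in allowed_hosts and path.startswith(tuple(allowed_path_prefixes))


# Example: the user opts in to the docs and API subdomains only.
ALLOWED_HOSTS = {"docs.company.com", "api.company.com"}
links = [
    "https://docs.company.com/guide",
    "https://api.company.com/v1/reference",
    "https://blog.company.com/post",
    "https://company.com",
]
to_follow = [link for link in links if in_scope(link, ALLOWED_HOSTS)]
```

Defaulting to the start URL's host and letting the user add subdomains keeps "just crawl the docs" predictable without silently dropping `api.company.com`.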
Authentication and Gated Content
Many documentation sites require login:
The Auth Wall:
Public docs: https://docs.company.com/getting-started
Internal docs: https://docs.company.com/internal (401 Unauthorized)
HTTP Authentication Methods:
1. Basic Auth:
Authorization: Basic base64(username:password)
2. Cookie-based session:
→ Login at /login
→ Receive session cookie
→ Include cookie in subsequent requests
3. Bearer token (e.g. JWT):
Authorization: Bearer <token>
4. OAuth:
→ Complex flow with redirects
→ Hard to automate
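For the two header-based methods, the request header can be built as sketched below. The function names are hypothetical, and the credentials shown are placeholders; real credentials should come from a secret store rather than source code.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_auth_header(token):
    """Build the Authorization header for a bearer token such as a JWT."""
    return {"Authorization": f"Bearer {token}"}


headers = basic_auth_header("user", "pass")  # placeholder credentials
```

Cookie-based sessions and OAuth need a stateful login flow instead, which is why they are harder to automate.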
Scraping Challenge:
To scrape authenticated content:
1. Provide credentials to crawler
2. Crawler performs login
3. Obtains session/token
4. Includes auth in all requests
But:
→ User must provide credentials
→ Security risk (storing passwords)
→ Sessions expire → re-login needed
→ MFA/2FA blocks automated login
→ Some sites detect bot behavior and block
The Mixed Auth Problem:
Site structure:
/getting-started (public)
/advanced-features (public)
/admin-guide (requires login)
/enterprise-setup (requires login + admin role)
Crawler behavior:
→ Crawls public pages fine
→ Hits /admin-guide → 401 error
→ Should it:
a) Stop crawling? (incomplete knowledge base)
b) Skip and continue? (missing content)
c) Prompt user for credentials? (friction)
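Option (b) is easier to accept when skipped pages are reported rather than silently dropped. The following sketch assumes a hypothetical `fetch` callback that returns an HTTP status code and body; it indexes what it can and records a reason for everything else.

```python
def crawl_public_pages(urls, fetch):
    """Index reachable pages and record why each remaining page was skipped."""
    indexed, skipped = [], {}
    for url in urls:
        status, _body = fetch(url)
        if status == 200:
            indexed.append(url)
        elif status in (401, 403):
            skipped[url] = "requires authentication"
        else:
            skipped[url] = f"HTTP {status}"
    return indexed, skipped


# Simulated responses for the site structure above.
FAKE_SITE = {
    "/getting-started": (200, "..."),
    "/advanced-features": (200, "..."),
    "/admin-guide": (401, ""),
    "/enterprise-setup": (403, ""),
}
indexed, skipped = crawl_public_pages(list(FAKE_SITE), lambda url: FAKE_SITE[url])
```

Surfacing the `skipped` map in the crawl summary gives the user a concrete list of pages that need credentials, which can feed option (c) later.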
Content Extraction Accuracy
Extracting clean text from HTML is harder than it appears:
The Navigation/Footer Problem:
<html>
<body>
<nav>Home | About | Contact | Blog | Docs</nav>
<main>
<h1>Getting Started</h1>
<p>This is the actual content...</p>
</main>
<footer>© 2024 Company | Privacy | Terms</footer>
</body>
</html>
Naive extraction:
Extract all text from <body>:
"Home About Contact Blog Docs Getting Started This is the actual content... © 2024 Company Privacy Terms"
Problem: Navigation and footer mixed with content
Better extraction:
Use heuristics:
→ Identify <main>, <article>, or role="main"
→ Ignore <nav>, <header>, <footer>
→ Score elements by text density
→ Extract only high-density content areas
But:
→ Not all sites use semantic HTML
→ Heuristics fail on unusual layouts
→ May miss legitimate content in sidebars
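The simplest of these heuristics, ignoring known boilerplate containers, can be sketched with the standard library's `html.parser`. This is an assumption-laden sketch: the `SKIP_TAGS` set is illustrative, text-density scoring is not shown, and a site without semantic tags would pass through unfiltered.

```python
from html.parser import HTMLParser

# Containers treated as boilerplate (an assumed, illustrative list).
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any boilerplate container."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_main_text(html):
    extractor = MainTextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()
```

Applied to the navigation/footer example above, this keeps the heading and paragraph and drops the menu and copyright line.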
The Code Block Problem:
<pre><code>
npm install @company/sdk
npm start
</code></pre>
Extraction challenge:
Should crawler:
→ Include code blocks as-is?
→ Preserve formatting (indentation, newlines)?
→ Add metadata like "language: bash"?
→ Treat code differently than prose in embeddings?
Code in RAG:
→ User asks: "How do I install the SDK?"
→ Retrieved chunk: "npm install @company/sdk"
→ LLM needs to recognize this is a command, not prose
→ Formatting matters for code comprehension
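One possible answer is to extract code blocks separately, keep their newlines, and attach a language label as metadata. The sketch below assumes the site marks languages with a `language-*` class on `<code>`, a common highlighter convention that the example above does not show; when the class is absent, the language is left as `None`.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collects <pre> blocks verbatim, with an optional language from <code class>."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_pre = False
        self._language = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre, self._language, self._buffer = True, None, []
        elif tag == "code" and self._in_pre:
            for name in (dict(attrs).get("class") or "").split():
                if name.startswith("language-"):
                    self._language = name[len("language-"):]

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            code = "".join(self._buffer).strip("\n")  # keep inner newlines and indentation
            self.blocks.append({"language": self._language, "code": code})
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self._buffer.append(data)


def extract_code_blocks(html):
    extractor = CodeBlockExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.blocks
```

Storing the label next to the chunk lets the retrieval layer present the snippet to the LLM as a command rather than as prose.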
Sitemap vs Crawl Strategy
Sites may provide sitemaps for easier indexing:
Sitemap.xml:
<urlset>
<url>
<loc>https://docs.company.com/getting-started</loc>
<lastmod>2024-01-15</lastmod>
<priority>1.0</priority>
</url>
<url>
<loc>https://docs.company.com/api-reference</loc>
<lastmod>2024-01-20</lastmod>
<priority>0.8</priority>
</url>
</urlset>
Benefits:
Sitemap-based crawl:
→ Explicit list of all pages
→ Includes lastmod → skip unchanged pages
→ Faster than link crawling
→ No infinite loop risk
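The lastmod shortcut can be sketched with the standard library's XML parser. The `urls_changed_since` helper is hypothetical; it treats entries with no lastmod as changed, and it strips XML namespaces because real sitemaps usually declare one even though the example above does not.

```python
import xml.etree.ElementTree as ET
from datetime import date


def _local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}url'."""
    return tag.rsplit("}", 1)[-1]


def urls_changed_since(sitemap_xml, last_sync):
    """Return <loc> values whose <lastmod> is after `last_sync` (or is missing)."""
    changed = []
    for entry in ET.fromstring(sitemap_xml):
        if _local_name(entry.tag) != "url":
            continue
        fields = {_local_name(child.tag): (child.text or "").strip() for child in entry}
        loc = fields.get("loc")
        if not loc:
            continue
        lastmod = fields.get("lastmod")
        # Only the date part is compared; lastmod may also carry a time component.
        if not lastmod or date.fromisoformat(lastmod[:10]) > last_sync:
            changed.append(loc)
    return changed
```

With the sitemap above and a last sync on 2024-01-18, only `/api-reference` would be re-fetched.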
Limitations:
Not all sites have sitemaps
Sitemaps may be:
→ Outdated (missing new pages)
→ Incomplete (manually maintained, forgotten pages)
→ Too large (split across multiple files)
→ Exclude authenticated pages
Hybrid Strategy:
1. Check for sitemap.xml
2. If found: Use as primary source
3. Also crawl from homepage
4. Compare: sitemap URLs vs discovered URLs
5. Union of both sets = complete coverage
But:
→ More complexity
→ Longer crawl time
→ Higher risk of duplicates
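Steps 4 and 5 reduce to set operations once both URL lists are in a canonical form. The sketch below assumes the URLs have already been normalized; the `merge_url_sources` name and its report keys are hypothetical.

```python
def merge_url_sources(sitemap_urls, crawled_urls):
    """Union both sources and report what each one missed."""
    sitemap_set, crawled_set = set(sitemap_urls), set(crawled_urls)
    return {
        "all": sorted(sitemap_set | crawled_set),
        # Pages reachable by links but absent from the sitemap (sitemap is outdated).
        "missing_from_sitemap": sorted(crawled_set - sitemap_set),
        # Pages listed in the sitemap that no crawled page links to.
        "not_linked": sorted(sitemap_set - crawled_set),
    }


report = merge_url_sources(
    ["https://docs.company.com/getting-started", "https://docs.company.com/api-reference"],
    ["https://docs.company.com/getting-started", "https://docs.company.com/guides/new-page"],
)
```

Because sets discard repeats, the union itself adds no duplicates; the remaining duplicate risk comes from URLs that were not normalized the same way in both sources.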
Rate Limiting and Politeness
Aggressive crawling can overload servers:
The Server Load Problem:
Naive crawler:
→ 10 concurrent requests
→ 500 pages total
→ Completes in 30 seconds
Server perspective:
→ 10 simultaneous connections
→ High CPU/memory usage
→ Looks like DDoS attack
→ May trigger rate limiting or IP ban
Polite Crawling:
Best practices:
→ 1 request at a time (or max 2-3)
→ 1-2 second delay between requests
→ Respect Crawl-delay in robots.txt
→ Use consistent User-Agent
→ Handle 429 (rate limit) with exponential backoff
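The backoff practice can be sketched as a small retry wrapper. Everything here is illustrative: `fetch` is a hypothetical callback returning a status code and body, the delays double from an assumed one-second base, and a production crawler would typically also honor a Retry-After header when the server sends one.

```python
import time


def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Retry on HTTP 429, doubling the wait each time up to `max_delay` seconds."""
    status, body = fetch(url)
    for attempt in range(max_retries):
        if status != 429:
            break
        sleep(min(max_delay, base_delay * 2 ** attempt))  # waits 1s, 2s, 4s, ...
        status, body = fetch(url)
    return status, body
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting, and the `max_delay` cap stops a long run of 429 responses from stalling the crawl indefinitely.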
But:
→ 500 pages × 2 seconds = 1,000 seconds ≈ 17 minutes
→ User sees "crawl taking forever"
→ Impatient user cancels
→ Incomplete knowledge base
How to Solve
Use a headless browser for JavaScript-rendered sites, respect robots.txt, deduplicate URLs, extract only the main content area, and rate limit requests. See Website Data Sources for configuration.
Last updated January 26, 2026


