Website Scraping Problems

The Problem

Your website data source fails to crawl, only scrapes the homepage, or extracts garbled text instead of clean content.

Symptoms

❌ Only 1-2 pages scraped from a 200-page site
❌ Content shows HTML tags mixed with text
❌ "Timeout" or "Connection Refused" errors
❌ JavaScript-rendered content missing
❌ Infinite crawl never completes

Real-World Example

Your documentation site: docs.company.com (300 pages)
After crawl: Only 8 pages in knowledge base

Pages scraped:
✓ /getting-started
✗ /api-reference (JavaScript-rendered)
✗ /guides/* (blocked by robots.txt)
✗ /admin/* (requires authentication)

Status: "Crawl completed with warnings"

Deep Technical Analysis

The JavaScript Rendering Problem

Modern documentation sites use JavaScript frameworks (React, Vue, Next.js) that render content client-side:

Traditional HTML (easy to scrape):

HTTP Request → Server returns:
<html>
  <body>
    <h1>Getting Started</h1>
    <p>This is the content...</p>
  </body>
</html>

Modern SPA (hard to scrape):

HTTP Request → Server returns:
<html>
  <body>
    <div id="root"></div>
    <script src="app.js"></script>
  </body>
</html>

Actual content lives in app.js:
→ Browser downloads JavaScript
→ JavaScript executes
→ React renders content dynamically
→ DOM populated with <h1>, <p>, etc.

Why This Breaks Scraping:

Standard HTTP Scraper:
1. GET https://docs.company.com/api-reference
2. Receive HTML response
3. Parse HTML with BeautifulSoup/Cheerio
4. Extract text from <body>

Result:
<div id="root"></div>  ← Empty!

The actual content never appears because:
→ Scraper doesn't execute JavaScript
→ No browser engine to render React components
→ Only sees the empty shell HTML
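
The failure above can be reproduced in a few lines. This is a minimal sketch that uses Python's standard-library HTML parser in place of BeautifulSoup so it runs without dependencies; SPA_SHELL, TextCollector, and visible_text are illustrative names, not part of any Twig API.

```python
from html.parser import HTMLParser

# The shell HTML a server returns for a client-rendered page (copied from the example above).
SPA_SHELL = """<html>
  <body>
    <div id="root"></div>
    <script src="app.js"></script>
  </body>
</html>"""


class TextCollector(HTMLParser):
    """Collects visible text, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text an HTTP-only scraper would extract from raw HTML."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The shell contains no text at all until app.js runs in a browser.
print(repr(visible_text(SPA_SHELL)))
```

An empty result from a page that clearly has content in a browser is a strong hint that the site renders client-side.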

The Headless Browser Requirement:

To scrape JavaScript sites:
1. Launch headless Chrome (Puppeteer/Playwright)
2. Navigate to URL
3. Wait for JavaScript to execute
4. Wait for DOM to fully render
5. Extract rendered HTML
6. Parse content

Cost:
→ Typically 50-100x slower than HTTP-only scraping
→ Requires a Chrome/Chromium binary
→ High memory usage (each concurrent page needs its own tab or browser context)
→ Prone to timeouts and crashes
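
The six steps above might look like the following with Playwright's synchronous Python API. This is a sketch, not Twig's implementation: render_page, the ready_selector default of "main", and the 30-second timeout are assumptions chosen for illustration, and Playwright plus a Chromium build must be installed separately.

```python
def render_page(url: str, ready_selector: str = "main", timeout_ms: int = 30_000) -> str:
    """Return the rendered HTML of `url` after client-side JavaScript has run."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # step 1: launch headless Chromium
        try:
            page = browser.new_page()
            # steps 2-3: navigate and let the page's scripts start executing
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            # step 4: wait for an element that only exists once the app has rendered
            page.wait_for_selector(ready_selector, timeout=timeout_ms)
            # step 5: serialize the rendered DOM; parsing (step 6) is left to the caller
            return page.content()
        finally:
            browser.close()
```

Waiting for a specific selector tends to be more reliable than waiting for network silence, because many documentation sites keep analytics or websocket connections open indefinitely.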

Robots.txt and Crawl Restrictions

Many sites explicitly block scrapers via robots.txt:

Example robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /api/
Crawl-delay: 10

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

The Compliance Dilemma:

Ethical Scraping:
→ Must respect robots.txt
→ If /api-reference blocked → skip it
→ If crawl-delay: 10 → wait 10 seconds between requests

But:
→ User expects ALL documentation in knowledge base
→ "Why isn't /api-reference indexed?"
→ "The scraper is broken!"

Reality: It's obeying robots.txt
User wants: Override restrictions
Twig must: Follow legal/ethical guidelines
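
To check what the example robots.txt actually permits, Python's standard urllib.robotparser can evaluate it offline. This is a sketch: the user agent "ExampleCrawler" is a placeholder (the user agent Twig sends is not stated here), and load_rules is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal/
Disallow: /api/
Crawl-delay: 10

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
AGENT = "ExampleCrawler"  # placeholder user agent, matched by the "*" group
BASE = "https://docs.company.com"

for path in ("/getting-started", "/api-reference", "/api/users", "/admin/settings"):
    print(path, "allowed" if rules.can_fetch(AGENT, BASE + path) else "blocked")

# Seconds to wait between requests, or None if no Crawl-delay applies.
print("crawl delay:", rules.crawl_delay(AGENT))
```

Note that "Disallow: /api/" is a prefix match, so /api/users is blocked while /api-reference is still allowed under this particular file.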

The Dynamic Robots.txt Problem:

Scenario:
1. Twig crawls site at 10:00 AM
2. robots.txt allows everything
3. 500 pages scraped successfully
4. Site owner adds a Disallow rule covering the crawler's user agent at 2:00 PM
5. Next sync at 3:00 PM
6. Twig re-crawls, now blocked

Result:
→ Previous 500 pages stay in vector DB
→ New pages can't be added
→ Knowledge base becomes stale
→ User sees "sync stopped working"

Infinite Crawl and Link Cycle Detection

Web crawlers can get stuck in infinite loops:

The Pagination Problem:

Blog with open-ended pagination:
/blog → /blog?page=2 → /blog?page=3 → ... → /blog?page=999

Each page has "Next →" link
Crawler follows every link
Never terminates

Or:
/search?q=test → /search?q=test&sort=date → /search?q=test&sort=date&page=2
Each URL is unique, but the content is the same result set reordered or re-paginated
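
One common guard against runaway pagination is a crawl budget: cap both the number of pages and the link depth. The sketch below is a breadth-first crawl over an injected get_links function so it can run without network access; bounded_crawl, endless_blog, and the limit values are illustrative assumptions, not Twig settings.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 500,
    max_depth: int = 5,
) -> list[str]:
    """Breadth-first crawl that stops at a page budget or a depth limit."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # index this page but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def endless_blog(url: str) -> list[str]:
    """Simulates a blog where every page links to a 'Next' page forever."""
    page = int(url.split("page=")[1]) if "page=" in url else 1
    return [f"/blog?page={page + 1}"]


pages = bounded_crawl("/blog", endless_blog, max_pages=50, max_depth=10)
print(len(pages), pages[-1])
```

With max_depth=10 the crawl stops after /blog plus ten follow-on pages, even though the simulated site never runs out of "Next" links.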

Cycle Detection Challenge:

URL normalization issues:
→ https://docs.company.com/guide
→ https://docs.company.com/guide/
→ https://docs.company.com/guide?ref=sidebar
→ https://docs.company.com/guide#overview

Are these the same page? To a crawler:
→ Different URLs (should crawl all)
To a human:
→ Same content (only crawl once)

Deduplication logic must:
1. Normalize URLs (strip trailing slashes, anchors, and tracking-only query params; keep params that change content, such as page=2)
2. Compare content hashes
3. Skip if already seen
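
A minimal sketch of the first two deduplication steps using only the standard library. The TRACKING_PARAMS set is an assumption; which parameters are safe to drop depends on the site, and parameters that select content (such as page) must be kept.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that never change page content.
TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    # The fragment (#overview) is dropped: it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


def content_hash(text: str) -> str:
    """Hash extracted text with whitespace collapsed, to catch duplicate content."""
    collapsed = " ".join(text.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


variants = [
    "https://docs.company.com/guide",
    "https://docs.company.com/guide/",
    "https://docs.company.com/guide?ref=sidebar",
    "https://docs.company.com/guide#overview",
]
print({normalize_url(u) for u in variants})
```

All four variants collapse to a single key, while /blog?page=2 and /blog?page=3 remain distinct because page is not in the tracking list. The content hash is the second line of defense for URLs that differ but serve the same text.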

The Subdomain Explosion:

Start URL: https://docs.company.com

Crawler finds links:
→ https://api.company.com (different subdomain)
→ https://blog.company.com (different subdomain)
→ https://status.company.com (different subdomain)
→ https://company.com (main site)

Should crawler follow these?
→ Yes? Might scrape entire company website (thousands of pages)
→ No? Might miss important API documentation on api.company.com

User expects: "Just crawl the docs"
Reality: "Docs" spans 5 subdomains
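
One way to make the scope decision explicit is an allowlist of hosts, optionally narrowed by a path prefix. This is a sketch; in_scope and the allowed_hosts values are hypothetical and would come from whatever the user configures.

```python
from urllib.parse import urlsplit


def in_scope(url: str, allowed_hosts: set[str], path_prefix: str = "/") -> bool:
    """Return True if `url` is on an allowed host and under the path prefix."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and (parts.hostname or "") in allowed_hosts
        and (parts.path or "/").startswith(path_prefix)
    )


# Hypothetical configuration: docs plus the API reference subdomain.
allowed = {"docs.company.com", "api.company.com"}

for link in (
    "https://api.company.com/v1/users",
    "https://blog.company.com/launch",
    "https://status.company.com",
    "https://company.com",
):
    print(link, in_scope(link, allowed))
```

An exact-host allowlist avoids the trap of matching on a suffix like "company.com", which would silently pull in every subdomain.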

Authentication and Gated Content

Many documentation sites require login:

The Auth Wall:

Public docs: https://docs.company.com/getting-started
Internal docs: https://docs.company.com/internal (401 Unauthorized)

HTTP Authentication Methods:

1. Basic Auth:
   Authorization: Basic base64(username:password)

2. Cookie-based session:
   → Login at /login
   → Receive session cookie
   → Include cookie in subsequent requests

3. Bearer token (for example, a JWT):
   Authorization: Bearer <token>

4. OAuth:
   → Complex flow with redirects
   → Hard to automate
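
For methods 1 and 3, the request header can be built directly. A small sketch with hypothetical helper names; the credentials shown are placeholders.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build an HTTP Basic Authorization header (base64 of 'username:password')."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_auth_header(token: str) -> dict[str, str]:
    """Build a Bearer Authorization header for an already-issued token."""
    return {"Authorization": f"Bearer {token}"}


print(basic_auth_header("user", "pass"))
print(bearer_auth_header("<token>"))
```

Base64 is an encoding, not encryption, so Basic credentials are only protected in transit when the request goes over HTTPS.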

Scraping Challenge:

To scrape authenticated content:
1. Provide credentials to crawler
2. Crawler performs login
3. Obtains session/token
4. Includes auth in all requests

But:
→ User must provide credentials
→ Security risk (storing passwords)
→ Sessions expire → re-login needed
→ MFA/2FA blocks automated login
→ Some sites detect bot behavior and block

The Mixed Auth Problem:

Site structure:
/getting-started (public)
/advanced-features (public)
/admin-guide (requires login)
/enterprise-setup (requires login + admin role)

Crawler behavior:
→ Crawls public pages fine
→ Hits /admin-guide → 401 error
→ Should it:
  a) Stop crawling? (incomplete knowledge base)
  b) Skip and continue? (missing content)
  c) Prompt user for credentials? (friction)
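
Option (b) becomes less surprising if the crawler records what it skipped and why, so the gap can be reported instead of hidden. A sketch under that assumption; CrawlReport and record are hypothetical names, and treating 403 like 401 is a design choice made here, not something the source specifies.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlReport:
    indexed: list[str] = field(default_factory=list)
    needs_auth: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)


def record(report: CrawlReport, url: str, status: int) -> None:
    """File a URL under the bucket that matches its HTTP status, then carry on."""
    if 200 <= status < 300:
        report.indexed.append(url)
    elif status in (401, 403):
        # Skip and continue, but remember the page so the user can be told.
        report.needs_auth.append(url)
    else:
        report.failed.append(url)


report = CrawlReport()
for url, status in [
    ("/getting-started", 200),
    ("/advanced-features", 200),
    ("/admin-guide", 401),
    ("/enterprise-setup", 403),
]:
    record(report, url, status)

print(f"indexed={len(report.indexed)} needs_auth={report.needs_auth}")
```

A summary such as "2 pages indexed, 2 pages require login" answers the "why is this page missing?" question before the user has to ask it.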

Content Extraction Accuracy

Extracting clean text from HTML is harder than it appears:

The Navigation/Footer Problem:

<html>
  <nav>Home | About | Contact | Blog | Docs</nav>
  <main>
    <h1>Getting Started</h1>
    <p>This is the actual content...</p>
  </main>
  <footer>© 2024 Company | Privacy | Terms</footer>
</html>

Naive extraction:

Extract all text from <body>:
"Home About Contact Blog Docs Getting Started This is the actual content... © 2024 Company Privacy Terms"

Problem: Navigation and footer mixed with content

Better extraction:

Use heuristics:
→ Identify <main>, <article>, or role="main"
→ Ignore <nav>, <header>, <footer>
→ Score elements by text density
→ Extract only high-density content areas

But:
→ Not all sites use semantic HTML
→ Heuristics fail on unusual layouts
→ May miss legitimate content in sidebars
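
The first two heuristics (prefer main or article, ignore nav, header, footer) can be sketched with the standard-library parser. This is illustrative only: MainContentExtractor and extract_main are hypothetical names, text-density scoring is not implemented, and a page without a semantic main region yields an empty string that a caller would need to handle with a fallback.

```python
from html.parser import HTMLParser

MAIN_TAGS = {"main", "article"}
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Keeps text inside <main>/<article>/role="main", drops nav/header/footer."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._stack = []  # (tag, kind) per open element; kind is "main", "skip", or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag
        if tag in SKIP_TAGS:
            kind = "skip"
        elif tag in MAIN_TAGS or dict(attrs).get("role") == "main":
            kind = "main"
        else:
            kind = None
        self._stack.append((tag, kind))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        kinds = [kind for _, kind in self._stack]
        if "main" in kinds and "skip" not in kinds and data.strip():
            self.chunks.append(" ".join(data.split()))


def extract_main(html: str) -> str:
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = """<html>
  <nav>Home | About | Contact | Blog | Docs</nav>
  <main>
    <h1>Getting Started</h1>
    <p>This is the actual content...</p>
  </main>
  <footer>© 2024 Company | Privacy | Terms</footer>
</html>"""

print(extract_main(PAGE))
```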

The Code Block Problem:

<pre><code>
npm install @company/sdk
npm start
</code></pre>

Extraction challenge:

Should crawler:
→ Include code blocks as-is?
→ Preserve formatting (indentation, newlines)?
→ Add metadata like "language: bash"?
→ Treat code differently than prose in embeddings?

Code in RAG:
→ User asks: "How do I install the SDK?"
→ Retrieved chunk: "npm install @company/sdk"
→ LLM needs to recognize this is a command, not prose
→ Formatting matters for code comprehension
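
One possible answer to these questions is to pull code blocks out separately, keep their newlines intact, and attach a language label when the markup provides one. The sketch below assumes the common class="language-..." convention on the code element; the example above has no such class, in which case the language is left as None. CodeBlockExtractor and extract_code_blocks are hypothetical names.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collects <pre> blocks verbatim, with an optional language label."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # list of {"language": str | None, "code": str}
        self._in_pre = False
        self._language = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre, self._language, self._buffer = True, None, []
        elif tag == "code" and self._in_pre:
            for cls in (dict(attrs).get("class") or "").split():
                if cls.startswith("language-"):
                    self._language = cls[len("language-"):]

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            code = "".join(self._buffer).strip("\n")  # keep inner newlines and indentation
            self.blocks.append({"language": self._language, "code": code})
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self._buffer.append(data)


def extract_code_blocks(html: str) -> list[dict]:
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


SNIPPET = '<pre><code class="language-bash">\nnpm install @company/sdk\nnpm start\n</code></pre>'
print(extract_code_blocks(SNIPPET))
```

Storing the language alongside the code lets the chunking step re-wrap the block in a labeled fence before embedding, so the LLM sees it as a command rather than prose.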

Sitemap vs Crawl Strategy

Sites may provide sitemaps for easier indexing:

Sitemap.xml:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.company.com/getting-started</loc>
    <lastmod>2024-01-15</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://docs.company.com/api-reference</loc>
    <lastmod>2024-01-20</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>

Benefits:

Sitemap-based crawl:
→ Explicit list of all pages
→ Includes lastmod → skip unchanged pages
→ Faster than link crawling
→ No infinite loop risk
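
The "skip unchanged pages" benefit can be sketched with the standard XML parser. This assumes the sitemap declares the standard sitemaps.org namespace; urls_changed_since is a hypothetical helper, and entries with a missing or unparseable lastmod are conservatively treated as changed.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.company.com/getting-started</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://docs.company.com/api-reference</loc>
    <lastmod>2024-01-20</lastmod>
  </url>
</urlset>"""


def urls_changed_since(sitemap_xml: str, last_crawl: date) -> list[str]:
    """Return sitemap URLs whose lastmod is after `last_crawl`."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        try:
            # lastmod may carry a time component; compare on the date part only.
            modified = date.fromisoformat(lastmod[:10])
        except ValueError:
            modified = None  # missing or partial date: re-crawl to be safe
        if modified is None or modified > last_crawl:
            changed.append(loc)
    return changed


print(urls_changed_since(SITEMAP_XML, date(2024, 1, 18)))
```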

Limitations:

Not all sites have sitemaps
Sitemaps may:
→ Be outdated (missing new pages)
→ Be incomplete (manually maintained, forgotten pages)
→ Be too large for one file (split across multiple files)
→ Exclude authenticated pages

Hybrid Strategy:

1. Check for sitemap.xml
2. If found: Use as primary source
3. Also crawl from homepage
4. Compare: sitemap URLs vs discovered URLs
5. Union of both sets = complete coverage

But:
→ More complexity
→ Longer crawl time
→ Higher risk of duplicates
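
Steps 4 and 5 reduce to set operations once both URL lists are normalized the same way. A small sketch with a hypothetical merge_url_sources helper; the example URLs are made up for illustration.

```python
def merge_url_sources(sitemap_urls: list[str], crawled_urls: list[str]) -> dict[str, list[str]]:
    """Compare sitemap URLs with crawl-discovered URLs and return their union."""
    sitemap_set, crawled_set = set(sitemap_urls), set(crawled_urls)
    return {
        "all": sorted(sitemap_set | crawled_set),            # complete coverage
        "sitemap_only": sorted(sitemap_set - crawled_set),   # pages no link points to
        "crawl_only": sorted(crawled_set - sitemap_set),     # pages missing from the sitemap
    }


result = merge_url_sources(
    ["/getting-started", "/api-reference"],
    ["/getting-started", "/guides/install"],
)
print(result)
```

The two difference sets are useful diagnostics in their own right: a large "crawl_only" list suggests a stale sitemap, while "sitemap_only" entries are often orphaned pages.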

Rate Limiting and Politeness

Aggressive crawling can overload servers:

The Server Load Problem:

Naive crawler:
→ 10 concurrent requests
→ 500 pages total
→ Completes in 30 seconds

Server perspective:
→ 10 simultaneous connections
→ High CPU/memory usage
→ Can look like a denial-of-service attack
→ May trigger rate limiting or IP ban

Polite Crawling:

Best practices:
→ 1 request at a time (or max 2-3)
→ 1-2 second delay between requests
→ Respect Crawl-delay in robots.txt
→ Use consistent User-Agent
→ Handle 429 (rate limit) with exponential backoff
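
The delay and backoff practices above might be combined as follows. This is a sketch: fetch is an injected callable that returns an HTTP status code (hypothetical, so the example runs without a network), and the delay and retry values are illustrative. A production crawler would also honor a Retry-After header when the server sends one.

```python
import time
from typing import Callable


def fetch_politely(
    url: str,
    fetch: Callable[[str], int],
    delay_s: float = 1.0,
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Fetch one URL, backing off exponentially on 429, then pause before returning."""
    for attempt in range(max_retries + 1):
        status = fetch(url)
        if status != 429:
            sleep(delay_s)  # politeness delay before the caller's next request
            return status
        if attempt < max_retries:
            # Rate limited: wait 2x, 4x, 8x ... the base delay before retrying.
            sleep(delay_s * 2 ** (attempt + 1))
    return 429  # still rate limited after all retries


# Simulated server: rate limits twice, then succeeds.
responses = iter([429, 429, 200])
waits = []
status = fetch_politely("/guide", lambda _url: next(responses), sleep=waits.append)
print(status, waits)
```

Injecting sleep makes the timing behavior testable without actually waiting, and calling this function sequentially keeps the crawl at one request at a time.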

But:
→ 500 pages × 2 seconds = 1,000 seconds ≈ 17 minutes (delays alone)
→ User sees "crawl taking forever"
→ Impatient user cancels
→ Incomplete knowledge base

How to Solve

Use a headless browser for JavaScript-rendered sites, respect robots.txt, deduplicate URLs, extract only the main content area, and rate limit requests. See Website Data Sources for configuration.
