Website Crawling

Crawl public websites and index HTML content.

Overview

Property	Value
Type	Dynamic (scheduled crawls)
Sync Schedule	Hourly, Daily, Weekly, Manual
Plan	All plans
Authentication	Public sites only (no login)
Max Pages	10,000 per data source
Crawl Depth	Configurable (default: 3 levels)

Use Cases

Public documentation: API docs, user guides
Help centers: Support articles, FAQs
Blog archives: Technical posts, announcements
Product pages: Features, pricing, comparisons

How Crawling Works

Process:

1. Fetch the start URL's HTML
2. Parse the HTML and extract text and links
3. Follow links within the same domain (respects robots.txt)
4. Repeat steps 1-3 up to maxDepth levels
5. Chunk pages (512 tokens/chunk)
6. Embed chunks (OpenAI ada-002)
7. Index to Pinecone
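
The fetch, parse, and follow loop above can be pictured as a breadth-first traversal bounded by depth and page count. The sketch below is illustrative only and is not the connector's implementation: `crawl` and the in-memory `links` map are hypothetical stand-ins for real fetching and parsing, and it assumes the start URL counts as depth 0.

```python
from collections import deque

def crawl(start_url, links, max_depth=3, max_pages=100):
    """Breadth-first traversal of a link graph, bounded by depth and page count.

    `links` is a hypothetical stand-in for fetching and parsing: it maps each
    URL to the list of URLs found on that page.
    """
    visited = [start_url]          # pages in the order they were discovered
    seen = {start_url}             # fast membership check for deduplication
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue               # do not follow links found at the depth limit
        for link in links.get(url, []):
            if link in seen:
                continue
            if len(visited) >= max_pages:
                return visited     # page budget exhausted
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited

# Hypothetical site structure used only for this example.
site = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://docs.example.com/b"],
    "https://docs.example.com/a": ["https://docs.example.com/a/1"],
    "https://docs.example.com/a/1": ["https://docs.example.com/a/1/x"],
}
pages = crawl("https://docs.example.com/", site, max_depth=2)
```

With `max_depth=2`, the page at `/a/1` is collected but its own links are not followed, so `/a/1/x` is left out.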

Scope rules:

Same domain only (e.g., a crawl that starts at docs.example.com won't follow links to api.example.com)
HTML pages only (skips PDFs, images, videos)
Respects robots.txt disallow rules
Deduplicates by URL (ignores query params by default)
Rate limited: 1 req/second to target site
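
To preview which links would fall inside these scope rules, a local check along the following lines may help. It is a sketch under assumptions: "same domain" is read as "same hostname" (matching the docs.example.com example), the robots.txt rules are made up for illustration, and `same_host` and `in_scope` are hypothetical helpers rather than part of the product. It uses Python's standard `urllib.robotparser`.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def same_host(start_url, candidate_url):
    """True when both URLs share the same hostname."""
    return urlsplit(start_url).hostname == urlsplit(candidate_url).hostname

# Hypothetical robots.txt content, parsed from memory (no network access).
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def in_scope(start_url, candidate_url, user_agent="Twig Bot"):
    """Apply the same-host rule, then the robots.txt disallow rules."""
    return same_host(start_url, candidate_url) and robots.can_fetch(user_agent, candidate_url)

start = "https://docs.example.com/"
print(in_scope(start, "https://docs.example.com/guide"))
print(in_scope(start, "https://api.example.com/guide"))
print(in_scope(start, "https://docs.example.com/private/notes"))
```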

How to Add a Website

Step 1: Navigate to Data Sources

Log in to your Twig AI account
Click Data in the main navigation menu
Click Add Data Source or the + button

Step 2: Select Website Connector

Choose Website from the list of connectors
The connector shows: "Reads a publicly accessible websites"

Step 3: Configure the Data Source

Basic Information

Name (required): Descriptive name for the website
- Example: "Product Documentation", "Support Knowledge Base", "Company Blog"
Description (optional): Additional context
- Example: "Official product documentation site with API reference and user guides"

URL Configuration

URL (required): The starting URL to crawl
- Must be a valid URL starting with http:// or https://
- Example: https://docs.example.com
- Example: https://help.example.com/en/
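
Before pasting a URL into the form, the requirement above can be checked locally. This is a minimal sketch; `is_valid_start_url` is a hypothetical helper, and the connector's own validation may be stricter.

```python
from urllib.parse import urlsplit

def is_valid_start_url(url):
    """Accept only http:// or https:// URLs that include a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

for candidate in ["https://docs.example.com", "docs.example.com", "ftp://example.com"]:
    print(candidate, is_valid_start_url(candidate))
```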

URL Selection Tips:

Start at the most relevant section (e.g., /docs/ instead of homepage)
Use URLs with clear structure and navigation
Avoid URLs with query parameters if possible

Advanced Parameters (JSON)

You can configure advanced crawling options using JSON in the Parameters field:

{
  "maxDepth": 3,
  "maxPages": 100,
  "includePatterns": ["/docs/", "/help/"],
  "excludePatterns": ["/blog/", "/news/"],
  "followExternalLinks": false
}

Available Parameters:

Parameter	Type	Description	Default
`maxDepth`	Number	Maximum link depth from start URL	3
`maxPages`	Number	Maximum pages to crawl	100
`includePatterns`	Array	URL patterns to include	All
`excludePatterns`	Array	URL patterns to exclude	None
`followExternalLinks`	Boolean	Crawl external domains	false
`respectRobotsTxt`	Boolean	Follow robots.txt rules	true
`userAgent`	String	Custom user agent string	Twig Bot
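
To see which values a given Parameters JSON would end up with, the defaults in the table can be merged with the supplied keys. The sketch below assumes omitted keys fall back to the listed defaults; `resolve_parameters` is a hypothetical local helper, and rejecting unknown keys is a pre-check added here to catch typos, not documented connector behavior.

```python
import json

# Defaults copied from the parameter table above.
DEFAULTS = {
    "maxDepth": 3,
    "maxPages": 100,
    "includePatterns": None,      # None stands for "All"
    "excludePatterns": [],        # empty list stands for "None"
    "followExternalLinks": False,
    "respectRobotsTxt": True,
    "userAgent": "Twig Bot",
}

def resolve_parameters(raw_json):
    """Parse the Parameters field and fill in defaults for omitted keys."""
    supplied = json.loads(raw_json) if raw_json.strip() else {}
    unknown = set(supplied) - set(DEFAULTS)
    if unknown:
        # Catches typos such as "maxdepth" before the config is saved.
        raise ValueError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **supplied}

params = resolve_parameters('{"maxDepth": 5, "maxPages": 500}')
print(params["maxDepth"], params["maxPages"], params["userAgent"])
```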

Refresh Frequency

Choose how often to re-crawl the website:

Never - Manual refresh only (static)
Daily - Refresh every day
Weekly - Refresh every week
Monthly - Refresh every month

Recommendation:

Daily: For frequently updated sites (news, blogs)
Weekly: For moderately updated sites (documentation)
Monthly: For stable content (marketing pages)

Tags (Optional)

Add tags for organization:

Examples: "documentation", "external", "support", "public"

Step 4: Save and Crawl

Click Save or Create
Initial crawl begins automatically
Monitor status in the data sources list

Step 5: Verify Crawl

Check record count (number of pages crawled)
Verify status shows "END_PROCESS"
Review process logs for any errors
Test knowledge with relevant questions

Examples

Example 1: Documentation Site

Name: Product Documentation
Description: Official API and user guide documentation
URL: https://docs.example.com
Parameters: 
{
  "maxDepth": 5,
  "maxPages": 500,
  "includePatterns": ["/docs/"],
  "excludePatterns": ["/blog/", "/changelog/"]
}
Refresh: Weekly
Tags: documentation, public, api

Example 2: Help Center

Name: Customer Support Articles
Description: Complete help center with FAQs and troubleshooting guides
URL: https://help.example.com/en/
Parameters:
{
  "maxDepth": 3,
  "maxPages": 200,
  "includePatterns": ["/en/articles/"],
  "excludePatterns": ["/community/"]
}
Refresh: Daily
Tags: support, help-center, customer-facing

Example 3: Company Blog

Name: Company Blog
Description: Technical blog posts and product announcements
URL: https://blog.example.com
Parameters:
{
  "maxDepth": 2,
  "maxPages": 100,
  "includePatterns": ["/technical/", "/products/"],
  "excludePatterns": ["/authors/", "/tags/"]
}
Refresh: Weekly
Tags: blog, marketing, technical

Best Practices

1. Choose the Right Starting URL

Good Starting Points:

/docs/ - Documentation root
/help/en/ - Help center in specific language
/api/reference/ - API documentation section
/kb/ - Knowledge base root

Avoid Starting From:

Homepage (too broad, many irrelevant links)
Login pages (can't be crawled)
Dynamic search results
Paginated archives without limit

2. Use Include/Exclude Patterns

Include Patterns - Only crawl these sections:

{
  "includePatterns": [
    "/docs/",
    "/api-reference/",
    "/getting-started/"
  ]
}

Exclude Patterns - Skip these sections:

{
  "excludePatterns": [
    "/changelog/",
    "/blog/",
    "/about/",
    "/careers/",
    "/login",
    "/signup"
  ]
}
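
To reason about how a pair of pattern lists filters URLs, a small local simulation can be useful. The matching rules are not specified in this guide, so the sketch assumes plain substring matching with exclude patterns taking priority; `should_crawl` is a hypothetical helper, and the real crawler may match differently (for example with globs or regular expressions).

```python
def should_crawl(url, include_patterns=None, exclude_patterns=None):
    """Assumed semantics: substring match, and exclusions win over inclusions."""
    if any(pattern in url for pattern in (exclude_patterns or [])):
        return False
    if not include_patterns:
        return True  # no include list means every non-excluded URL is allowed
    return any(pattern in url for pattern in include_patterns)

include = ["/docs/", "/api-reference/"]
exclude = ["/changelog/", "/login"]

print(should_crawl("https://example.com/docs/intro", include, exclude))
print(should_crawl("https://example.com/docs/changelog/v2", include, exclude))
print(should_crawl("https://example.com/careers/", include, exclude))
```

Running candidate URLs through a check like this before saving can reveal include lists that are too restrictive.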

3. Optimize Crawl Limits

For Large Sites:

{
  "maxDepth": 4,
  "maxPages": 500
}

For Small Sites:

{
  "maxDepth": 10,
  "maxPages": 100
}

Balance: A higher maxDepth gives more comprehensive coverage but a longer crawl time.
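
A rough lower bound on crawl time follows from the 1 request/second rate limit stated earlier. The sketch assumes one request per page and ignores fetch latency, retries, chunking, and embedding, so real crawls take longer; `min_crawl_seconds` is a hypothetical helper.

```python
def min_crawl_seconds(max_pages, requests_per_second=1.0):
    """Lower bound on crawl duration imposed by the request rate alone."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return max_pages / requests_per_second

for label, pages in [("small site", 100), ("large site", 500)]:
    minutes = min_crawl_seconds(pages) / 60
    print(f"{label}: at least {minutes:.1f} minutes for {pages} pages")
```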

4. Set Appropriate Refresh Schedules

Daily: News sites, rapidly changing content
Weekly: Documentation with regular updates
Monthly: Stable marketing content
Manual (Never): One-time imports, archived content

5. Monitor Crawl Health

Regularly check:

Number of pages crawled (is it increasing/decreasing?)
Last successful crawl date
Error logs for failed pages
Response time and crawl duration

Advanced Configuration

Handling Authentication

The Website connector only supports public websites. For authenticated content, use alternative connectors:

Confluence - For Atlassian Confluence
SharePoint - For Microsoft SharePoint
Google Drive - For Google Docs

Crawling Subdomains

To crawl multiple subdomains:

Option 1: Create separate data sources for each subdomain

Source 1: https://docs.example.com
Source 2: https://api.example.com
Source 3: https://help.example.com

Option 2: Enable external link following (use cautiously)

{
  "followExternalLinks": true,
  "includePatterns": [
    "docs.example.com",
    "api.example.com"
  ]
}

Handling Dynamic Content

JavaScript-Rendered Content: The crawler can handle basic JavaScript rendering but may miss:

Complex single-page applications (SPAs)
Content loaded after user interaction
Infinite scroll content

Solutions:

Check if site has a static HTML fallback
Look for sitemap.xml (use Sitemap connector)
Contact site owner about crawler-friendly version

Rate Limiting

The crawler automatically:

Waits between requests (polite crawling)
Respects server Retry-After headers
Backs off on errors
Distributes load over time
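
The general pattern behind honoring Retry-After and backing off on errors can be sketched as follows. This is not the crawler's actual algorithm: the base delay, the cap, and the doubling factor are assumptions, `retry_delay` is a hypothetical helper, and only the delay-in-seconds form of Retry-After is handled (the HTTP-date form falls through to backoff).

```python
def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After header value takes priority; otherwise the delay
    doubles with each attempt, up to `cap`.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a number (e.g. an HTTP-date); fall back to backoff
    return min(cap, base * 2 ** attempt)

print([retry_delay(n) for n in range(4)])
print(retry_delay(2, retry_after="30"))
```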

Troubleshooting

Few Pages Crawled

Problem: Only 1-2 pages crawled from a large site

Solutions:

Check maxDepth is sufficient
Verify maxPages limit isn't too low
Review includePatterns aren't too restrictive
Ensure site has proper internal linking
Check robots.txt isn't blocking crawler

Missing Content

Problem: Important pages not included

Solutions:

Verify pages are linked from start URL
Check pages aren't excluded by patterns
Ensure pages are within maxDepth limit
Look for pages in robots.txt disallow list
Check if pages require authentication

Crawl Timeout

Problem: Crawl stops before completing

Solutions:

Reduce maxPages limit
Decrease maxDepth
Add more specific includePatterns
Check if site is responding slowly
Try crawling specific sections separately

Duplicate Content

Problem: Same content appears multiple times

Solutions:

The crawler deduplicates by URL automatically (ignoring query parameters by default), so look for URLs that differ in other ways
Check for URL variations (with/without trailing slash)
Review query parameters in URLs
Add exclude patterns for redundant paths
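
To spot URL variations that point at the same page, it can help to reduce each URL to a canonical form and compare. The normalization rules below (lowercase scheme and host, drop the trailing slash, drop the fragment, drop the query string unless asked to keep it) are assumptions chosen for illustration; `canonical_url` is a hypothetical helper, and the crawler's own deduplication may differ.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url, keep_query=False):
    """Reduce a URL to a comparable form for duplicate detection."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"          # treat /guide and /guide/ alike
    query = parts.query if keep_query else ""     # query params ignored by default
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

variants = [
    "https://docs.example.com/guide",
    "https://docs.example.com/guide/",
    "https://Docs.Example.com/guide/?utm_source=mail#top",
]
unique = {canonical_url(u) for u in variants}
print(unique)
```

If several crawled URLs collapse to one canonical form, an exclude pattern for the redundant variant is a reasonable next step.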

Refresh Not Working

Problem: Scheduled refresh isn't updating content

Solutions:

Verify refresh frequency is not set to "NEVER"
Check last processed date
Review process logs for errors
Ensure website is accessible
Check for crawler blocking (robots.txt, firewall)

Performance Tips

1. Start Specific, Expand Later

Begin with a narrow scope:

{
  "includePatterns": ["/docs/getting-started/"],
  "maxPages": 50
}

Then expand as needed:

{
  "includePatterns": ["/docs/"],
  "maxPages": 200
}

2. Use Multiple Focused Sources

Instead of one broad crawl:

❌ Start URL: https://example.com (crawls entire site)

Create targeted sources:

✅ Source 1: https://docs.example.com/api/ (API docs only)
✅ Source 2: https://docs.example.com/guides/ (User guides only)
✅ Source 3: https://help.example.com/ (Support articles only)

3. Exclude Non-Essential Content

Common exclusions:

{
  "excludePatterns": [
    "/search",
    "/tags/",
    "/categories/",
    "/authors/",
    "/archive/",
    "/print/",
    "/download/",
    "/comments"
  ]
}

Monitoring & Maintenance

Regular Checks

Weekly:

Review page count trends
Check for crawl errors
Verify refresh is working

Monthly:

Audit included/excluded pages
Optimize crawl parameters
Remove outdated sources

Quarterly:

Review AI answer quality from website data
Update include/exclude patterns
Adjust refresh frequency based on site update patterns

Metrics to Track

Pages Crawled: Total number of indexed pages
Last Sync: When the last successful crawl completed
Error Rate: Percentage of failed page fetches
Crawl Duration: Time taken for full crawl
Usage: How often AI references this content

Next Steps

After setting up website crawling:

Test your AI agent with website content questions
Create specialized agents for different site sections
Monitor analytics to see which pages are most useful
Optimize crawl configuration based on usage patterns

Related Connectors

Sitemap - Import from sitemap.xml files
Confluence - For Confluence-hosted documentation
Files - Upload exported HTML or PDF documentation
Google Drive - For Google Docs-based documentation

