Website Crawling
Crawl public websites and index HTML content
Key Takeaways
- Overview
- Use Cases
- How Crawling Works
- How to Add a Website
- Examples
- Best Practices
Overview
| Property | Value |
|---|---|
| Type | Dynamic (scheduled crawls) |
| Sync Schedule | Hourly, Daily, Weekly, Manual |
| Plan | All plans |
| Authentication | Public sites only (no login) |
| Max Pages | 10,000 per data source |
| Crawl Depth | Configurable (default: 3 levels) |
Use Cases
- Public documentation: API docs, user guides
- Help centers: Support articles, FAQs
- Blog archives: Technical posts, announcements
- Product pages: Features, pricing, comparisons
How Crawling Works
Process:
1. Fetch start URL HTML
2. Parse HTML, extract text and links
3. Follow links within same domain (respects robots.txt)
4. Repeat steps 1-3 up to maxDepth levels
5. Chunk pages (512 tokens/chunk)
6. Embed chunks (OpenAI ada-002)
7. Index to Pinecone
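The fetch, follow, and chunk steps above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the in-memory SITE map stands in for real HTTP fetching and HTML parsing, the function names are hypothetical, and it assumes the start URL counts as depth 0 and that a "token" is simply a list item.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://docs.example.com/b"],
    "https://docs.example.com/a": ["https://docs.example.com/a/deep"],
    "https://docs.example.com/b": ["https://docs.example.com/"],
    "https://docs.example.com/a/deep": ["https://docs.example.com/a/deeper"],
    "https://docs.example.com/a/deeper": [],
}


def crawl(start_url, max_depth=3, max_pages=100, get_links=SITE.get):
    """Breadth-first crawl: visit pages level by level, up to max_depth link hops."""
    seen = {start_url}                # URLs already queued, to avoid revisiting
    queue = deque([(start_url, 0)])   # (url, depth); assumption: start URL is depth 0
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue                  # do not follow links beyond the depth limit
        for link in get_links(url) or []:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def chunk_tokens(tokens, size=512):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```

With maxDepth set to 1, this sketch visits only the start page and the pages it links to directly; raising the depth by one adds one more layer of links.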
Scope rules:
- Same domain only (e.g., a crawl that starts at docs.example.com won't follow links to api.example.com)
- HTML pages only (skips PDFs, images, videos)
- Respects robots.txt disallow rules
- Deduplicates by URL (ignores query params by default)
- Rate limited: 1 req/second to target site
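As a rough illustration of the scope rules, the sketch below uses only the Python standard library. The helper names are hypothetical, and the exact matching behavior of the real crawler (for example, how it treats fragments or "www." prefixes) is an assumption here.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def dedup_key(url):
    """Drop the query string and fragment so URL variants share one key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def in_scope(url, start_url):
    """Same-domain rule: the host must match the start URL's host exactly."""
    return urlsplit(url).hostname == urlsplit(start_url).hostname


def allowed_by_robots(robots_lines, url, user_agent="Twig Bot"):
    """Check a URL against robots.txt content supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

For example, `in_scope("https://api.example.com/v1", "https://docs.example.com/")` is false because the hosts differ, which matches the docs.example.com versus api.example.com case above.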
How to Add a Website
Step 1: Navigate to Data Sources
- Log in to your Twig AI account
- Click Data in the main navigation menu
- Click Add Data Source or the + button
Step 2: Select Website Connector
- Choose Website from the list of connectors
- The connector shows: "Reads a publicly accessible websites"
Step 3: Configure the Data Source
Basic Information
- Name (required): Descriptive name for the website
- Example: "Product Documentation", "Support Knowledge Base", "Company Blog"
- Description (optional): Additional context
- Example: "Official product documentation site with API reference and user guides"
URL Configuration
- URL (required): The starting URL to crawl
- Must be a valid URL starting with http:// or https://
- Example: https://docs.example.com
- Example: https://help.example.com/en/
URL Selection Tips:
- Start at the most relevant section (e.g., /docs/ instead of homepage)
- Use URLs with clear structure and navigation
- Avoid URLs with query parameters if possible
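Before saving a data source, it can help to sanity-check the start URL against the rules and tips above. The following is a minimal sketch; `check_start_url` is a hypothetical helper, not part of the product, and the messages are illustrative.

```python
from urllib.parse import urlsplit


def check_start_url(url):
    """Return a list of problems; an empty list means the URL looks usable."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
    if not parts.netloc:
        problems.append("URL has no host name")
    if parts.query:
        problems.append("URL contains query parameters; prefer a clean path")
    return problems
```

A URL such as `docs.example.com/docs` fails both of the first two checks because, without a scheme, the parser treats the whole string as a path.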
Advanced Parameters (JSON)
You can configure advanced crawling options using JSON in the Parameters field:
{
  "maxDepth": 3,
  "maxPages": 100,
  "includePatterns": ["/docs/", "/help/"],
  "excludePatterns": ["/blog/", "/news/"],
  "followExternalLinks": false
}
Available Parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| maxDepth | Number | Maximum link depth from start URL | 3 |
| maxPages | Number | Maximum pages to crawl | 100 |
| includePatterns | Array | URL patterns to include | All |
| excludePatterns | Array | URL patterns to exclude | None |
| followExternalLinks | Boolean | Crawl external domains | false |
| respectRobotsTxt | Boolean | Follow robots.txt rules | true |
| userAgent | String | Custom user agent string | Twig Bot |
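To catch typos before pasting JSON into the Parameters field, a local check along these lines may be useful. It is a sketch only: the parameter names, types, and defaults come from the table above, while `load_parameters` and its error messages are hypothetical, and the product may validate differently.

```python
import json

# Defaults and types taken from the parameter table; None stands for "All".
DEFAULTS = {
    "maxDepth": 3,
    "maxPages": 100,
    "includePatterns": None,
    "excludePatterns": [],
    "followExternalLinks": False,
    "respectRobotsTxt": True,
    "userAgent": "Twig Bot",
}
EXPECTED_TYPES = {
    "maxDepth": int,
    "maxPages": int,
    "includePatterns": list,
    "excludePatterns": list,
    "followExternalLinks": bool,
    "respectRobotsTxt": bool,
    "userAgent": str,
}


def load_parameters(text):
    """Parse a Parameters JSON string, check names and types, and fill defaults."""
    params = json.loads(text)
    if not isinstance(params, dict):
        raise ValueError("parameters must be a JSON object")
    for key, value in params.items():
        if key not in EXPECTED_TYPES:
            raise ValueError(f"unknown parameter: {key}")
        expected = EXPECTED_TYPES[key]
        # bool is a subclass of int in Python, so reject true/false for numbers.
        if isinstance(value, bool) and expected is not bool:
            raise ValueError(f"{key} must be of type {expected.__name__}")
        if not isinstance(value, expected):
            raise ValueError(f"{key} must be of type {expected.__name__}")
    return {**DEFAULTS, **params}
```

Calling `load_parameters('{"maxDepth": 5}')` returns a full configuration with maxDepth set to 5 and every other parameter at its documented default.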
Refresh Frequency
Choose how often to re-crawl the website:
- Never - Manual refresh only (static)
- Daily - Refresh every day
- Weekly - Refresh every week
- Monthly - Refresh every month
Recommendation:
- Daily: For frequently updated sites (news, blogs)
- Weekly: For moderately updated sites (documentation)
- Monthly: For stable content (marketing pages)
Tags (Optional)
Add tags for organization:
- Examples: "documentation", "external", "support", "public"
Step 4: Save and Crawl
- Click Save or Create
- Initial crawl begins automatically
- Monitor status in the data sources list
Step 5: Verify Crawl
- Check record count (number of pages crawled)
- Verify status shows "END_PROCESS"
- Review process logs for any errors
- Test knowledge with relevant questions
Examples
Example 1: Documentation Site
Name: Product Documentation
Description: Official API and user guide documentation
URL: https://docs.example.com
Parameters:
{
  "maxDepth": 5,
  "maxPages": 500,
  "includePatterns": ["/docs/"],
  "excludePatterns": ["/blog/", "/changelog/"]
}
Refresh: Weekly
Tags: documentation, public, api
Example 2: Help Center
Name: Customer Support Articles
Description: Complete help center with FAQs and troubleshooting guides
URL: https://help.example.com/en/
Parameters:
{
  "maxDepth": 3,
  "maxPages": 200,
  "includePatterns": ["/en/articles/"],
  "excludePatterns": ["/community/"]
}
Refresh: Daily
Tags: support, help-center, customer-facing
Example 3: Company Blog
Name: Company Blog
Description: Technical blog posts and product announcements
URL: https://blog.example.com
Parameters:
{
  "maxDepth": 2,
  "maxPages": 100,
  "includePatterns": ["/technical/", "/products/"],
  "excludePatterns": ["/authors/", "/tags/"]
}
Refresh: Weekly
Tags: blog, marketing, technical
Best Practices
1. Choose the Right Starting URL
Good Starting Points:
- /docs/ - Documentation root
- /help/en/ - Help center in specific language
- /api/reference/ - API documentation section
- /kb/ - Knowledge base root
Avoid Starting From:
- Homepage (too broad, many irrelevant links)
- Login pages (can't be crawled)
- Dynamic search results
- Paginated archives without limit
2. Use Include/Exclude Patterns
Include Patterns - Only crawl these sections:
{
  "includePatterns": [
    "/docs/",
    "/api-reference/",
    "/getting-started/"
  ]
}
Exclude Patterns - Skip these sections:
{
  "excludePatterns": [
    "/changelog/",
    "/blog/",
    "/about/",
    "/careers/",
    "/login",
    "/signup"
  ]
}
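To reason about which URLs a pattern set would keep, a small local model can help. This page does not say whether patterns are matched as substrings, globs, or regular expressions, so the sketch below assumes plain substring matching against the full URL and assumes that exclude patterns win over include patterns. `should_crawl` is a hypothetical helper.

```python
def should_crawl(url, include_patterns=None, exclude_patterns=()):
    """Decide whether a URL passes the include/exclude filters.

    Assumptions (not confirmed by the documentation): patterns are plain
    substrings of the URL, and an exclude match overrides an include match.
    """
    if any(pattern in url for pattern in exclude_patterns):
        return False
    if include_patterns:
        return any(pattern in url for pattern in include_patterns)
    return True  # no include patterns means "All"
```

Under these assumptions, a page such as https://example.com/docs/changelog/v2 is skipped when "/changelog/" is excluded, even though it also matches "/docs/".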
3. Optimize Crawl Limits
For Large Sites:
{
  "maxDepth": 4,
  "maxPages": 500
}
For Small Sites:
{
  "maxDepth": 10,
  "maxPages": 100
}
Balance: A higher depth gives more comprehensive coverage but a longer crawl time.
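Because requests to the target site are rate limited to 1 req/second, maxPages puts a floor on crawl duration. The sketch below computes that floor; it assumes one request per page and ignores fetch latency, chunking, and embedding time, so real crawls will take longer. The function name is hypothetical.

```python
def min_crawl_seconds(max_pages, requests_per_second=1.0):
    """Lower bound on crawl time implied by the request rate limit alone."""
    return max_pages / requests_per_second


# 500 pages need at least 500 seconds (a little over 8 minutes);
# the 10,000-page maximum needs at least about 2.78 hours.
large_site_minutes = min_crawl_seconds(500) / 60
max_source_hours = min_crawl_seconds(10_000) / 3600
```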
4. Set Appropriate Refresh Schedules
- Daily: News sites, rapidly changing content
- Weekly: Documentation with regular updates
- Monthly: Stable marketing content
- Manual (Never): One-time imports, archived content
5. Monitor Crawl Health
Regularly check:
- Number of pages crawled (is it increasing/decreasing?)
- Last successful crawl date
- Error logs for failed pages
- Response time and crawl duration
Advanced Configuration
Handling Authentication
The Website connector only supports public websites. For authenticated content, use alternative connectors:
- Confluence - For Atlassian Confluence
- SharePoint - For Microsoft SharePoint
- Google Drive - For Google Docs
Crawling Subdomains
To crawl multiple subdomains:
Option 1: Create separate data sources for each subdomain
Source 1: https://docs.example.com
Source 2: https://api.example.com
Source 3: https://help.example.com
Option 2: Enable external link following (use cautiously)
{
  "followExternalLinks": true,
  "includePatterns": [
    "docs.example.com",
    "api.example.com"
  ]
}
Handling Dynamic Content
JavaScript-Rendered Content: The crawler can handle basic JavaScript rendering but may miss:
- Complex single-page applications (SPAs)
- Content loaded after user interaction
- Infinite scroll content
Solutions:
- Check if site has a static HTML fallback
- Look for sitemap.xml (use Sitemap connector)
- Contact site owner about crawler-friendly version
Rate Limiting
The crawler automatically:
- Waits between requests (polite crawling)
- Respects server Retry-After headers
- Backs off on errors
- Distributes load over time
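The behaviors listed above are commonly implemented as "honor Retry-After if present, otherwise back off exponentially". The sketch below shows that general pattern; the base delay, the cap, and the function name are assumptions, and the crawler's actual policy is not documented on this page.

```python
def next_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before the next request.

    attempt: number of consecutive failures so far (0 means no failure).
    retry_after: raw Retry-After header value, if the server sent one.
    """
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds.
            return max(base, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    # Exponential backoff: base, 2x base, 4x base, ... up to the cap.
    return min(cap, base * (2 ** attempt))
```

With the default base of 1.0, the no-failure delay matches the 1 req/second pace, and three consecutive failures raise the wait to 8 seconds.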
Troubleshooting
Few Pages Crawled
Problem: Only 1-2 pages crawled from a large site
Solutions:
- Check that maxDepth is sufficient
- Verify that the maxPages limit isn't too low
- Check that includePatterns aren't too restrictive
- Ensure the site has proper internal linking
- Check robots.txt isn't blocking crawler
Missing Content
Problem: Important pages not included
Solutions:
- Verify pages are linked from start URL
- Check pages aren't excluded by patterns
- Ensure pages are within maxDepth limit
- Look for pages in robots.txt disallow list
- Check if pages require authentication
Crawl Timeout
Problem: Crawl stops before completing
Solutions:
- Reduce the maxPages limit
- Decrease maxDepth
- Add more specific includePatterns
- Check if site is responding slowly
- Try crawling specific sections separately
Duplicate Content
Problem: Same content appears multiple times
Solutions:
- The crawler should handle duplicates automatically
- Check for URL variations (with/without trailing slash)
- Review query parameters in URLs
- Add exclude patterns for redundant paths
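To spot URL variations in a list of crawled pages (for example, one exported from the process logs), you could group URLs by a normalized form. This is a hedged sketch: `find_url_variants` is a hypothetical helper, and treating trailing slashes and query strings as insignificant is an assumption that will not hold for every site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_url_variants(urls):
    """Group URLs that differ only by trailing slash, query string, or fragment."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        path = parts.path.rstrip("/") or "/"   # "/guide/" and "/guide" share a key
        groups[(parts.hostname, path)].append(url)
    # Only groups with more than one member are potential duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Each returned group is a candidate for an exclude pattern or for a closer look at how the site links to that page.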
Refresh Not Working
Problem: Scheduled refresh isn't updating content
Solutions:
- Verify refresh frequency is not set to "NEVER"
- Check last processed date
- Review process logs for errors
- Ensure website is accessible
- Check for crawler blocking (robots.txt, firewall)
Performance Tips
1. Start Specific, Expand Later
Begin with a narrow scope:
{
  "includePatterns": ["/docs/getting-started/"],
  "maxPages": 50
}
Then expand as needed:
{
  "includePatterns": ["/docs/"],
  "maxPages": 200
}
2. Use Multiple Focused Sources
Instead of one broad crawl:
❌ Start URL: https://example.com (crawls entire site)
Create targeted sources:
✅ Source 1: https://docs.example.com/api/ (API docs only)
✅ Source 2: https://docs.example.com/guides/ (User guides only)
✅ Source 3: https://help.example.com/ (Support articles only)
3. Exclude Non-Essential Content
Common exclusions:
{
  "excludePatterns": [
    "/search",
    "/tags/",
    "/categories/",
    "/authors/",
    "/archive/",
    "/print/",
    "/download/",
    "/comments"
  ]
}
Monitoring & Maintenance
Regular Checks
Weekly:
- Review page count trends
- Check for crawl errors
- Verify refresh is working
Monthly:
- Audit included/excluded pages
- Optimize crawl parameters
- Remove outdated sources
Quarterly:
- Review AI answer quality from website data
- Update include/exclude patterns
- Adjust refresh frequency based on site update patterns
Metrics to Track
- Pages Crawled: Total number of indexed pages
- Last Sync: When was the last successful crawl
- Error Rate: Percentage of failed page fetches
- Crawl Duration: Time taken for full crawl
- Usage: How often AI references this content
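Two of these metrics are simple ratios that you can compute yourself from the counts shown for a data source. The helpers below are hypothetical and only illustrate the arithmetic; how the product itself calculates or displays these values is not stated here.

```python
def error_rate(failed_fetches, attempted_fetches):
    """Error Rate: percentage of page fetches that failed."""
    if attempted_fetches == 0:
        return 0.0
    return 100.0 * failed_fetches / attempted_fetches


def page_count_change(previous_pages, current_pages):
    """Percentage change in Pages Crawled between two crawls."""
    if previous_pages == 0:
        return 0.0
    return 100.0 * (current_pages - previous_pages) / previous_pages
```

For instance, 5 failed fetches out of 200 is an error rate of 2.5%, and a drop from 200 to 150 pages is a change of -25%, which may point to new robots.txt rules or broken internal links.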
Next Steps
After setting up website crawling:
- Test your AI agent with website content questions
- Create specialized agents for different site sections
- Monitor analytics to see which pages are most useful
- Optimize crawl configuration based on usage patterns
Related Connectors
- Sitemap - Import from sitemap.xml files
- Confluence - For Confluence-hosted documentation
- Files - Upload exported HTML or PDF documentation
- Google Drive - For Google Docs-based documentation
Last updated January 26, 2026


