Sitemap Integration

Import website content using a sitemap.xml file to efficiently index large documentation sites, blogs, and other structured web content.

Key Takeaways

  • Overview
  • When to Use Sitemap Connector
  • What is a Sitemap.xml?
  • Finding a Website's Sitemap
  • How to Add a Sitemap
  • Creating Custom Sitemaps

Overview

| Property | Details |
|----------|---------|
| Type | Static |
| Refresh | Manual |
| Tier | 1 (All Plans) |
| Format | sitemap.xml file |
| Max URLs | Varies by plan |

When to Use Sitemap Connector

The Sitemap connector is ideal for:

  • Large Documentation Sites - Efficiently import hundreds of pages
  • Structured Content - Sites with well-organized sitemaps
  • Static Site Generators - Jekyll, Hugo, Docusaurus, etc.
  • Archived Content - One-time import of website snapshots
  • Selective Imports - When you want specific URLs from a site

What is a Sitemap.xml?

A sitemap.xml file is a list of URLs on a website, typically used to help search engines discover and index pages. It looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/getting-started</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://docs.example.com/api-reference</loc>
    <lastmod>2024-01-14</lastmod>
  </url>
  <url>
    <loc>https://docs.example.com/tutorials</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
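
To see which pages a sitemap would import, the file can be read with a few lines of Python. This is a minimal sketch using only the standard library; the function name `list_sitemap_entries` is illustrative and not part of Twig.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <urlset> in the example above.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_entries(xml_bytes):
    """Return (loc, lastmod) pairs for each <url> entry; lastmod is None if absent."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        if loc:
            entries.append((loc, lastmod.strip() if lastmod else None))
    return entries
```

Reading the file in binary mode (`open("sitemap.xml", "rb").read()`) lets the parser honor the encoding named in the XML declaration.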

Finding a Website's Sitemap

Common Sitemap Locations

Most websites place their sitemap at:

  • https://example.com/sitemap.xml
  • https://example.com/sitemap_index.xml
  • https://example.com/sitemap1.xml
  • https://docs.example.com/sitemap.xml

Check robots.txt

Many sites list their sitemap in robots.txt:

https://example.com/robots.txt

Look for:

Sitemap: https://example.com/sitemap.xml
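
If you already have the robots.txt body as text, the Sitemap lines can be pulled out programmatically. The sketch below is a hand-rolled parser under simple assumptions (one directive per line, `#` starts a comment); `sitemaps_from_robots` is a hypothetical helper name.

```python
def sitemaps_from_robots(robots_txt):
    """Collect the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split the directive name from its value.
        line = line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Python's standard library offers a comparable lookup through `urllib.robotparser.RobotFileParser.site_maps()` (Python 3.8+), which fetches and parses the file itself.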

Use Browser Tools

  1. Open the website in your browser
  2. Right-click → View Page Source
  3. Search (Ctrl+F / Cmd+F) for "sitemap"
  4. Look for a <link rel="sitemap" ...> tag

Ask the Website Owner

If you can't find the sitemap, contact the site administrator. They can:

  • Provide the sitemap URL
  • Generate a sitemap if one doesn't exist
  • Create a custom sitemap with specific pages

How to Add a Sitemap

Step 1: Download the Sitemap

  1. Navigate to the sitemap URL in your browser
  2. Right-click on the page
  3. Select "Save As" or "Save Page As"
  4. Save with filename: sitemap.xml

Alternatively, use the command line:

curl https://docs.example.com/sitemap.xml -o sitemap.xml

or

wget https://docs.example.com/sitemap.xml

Step 2: Navigate to Data Sources

  1. Log in to your Twig AI account
  2. Click Data in the main navigation menu
  3. Click Add Data Source or the + button

Step 3: Select Sitemap Connector

  1. Choose Sitemap.xml from the list
  2. The connector shows: "Publicly accessible websites from a sitemap.xml file"

Step 4: Configure the Data Source

Basic Information

  • Name (required): Descriptive name
    • Example: "Documentation Sitemap", "Blog Sitemap", "Help Center Pages"
  • Description (optional): Additional context
    • Example: "Complete documentation site from sitemap dated 2024-01-15"

File Upload

  1. Click Choose File or drag-and-drop
  2. Select your downloaded sitemap.xml file
  3. Wait for upload to complete

Tags (Optional)

  • Add organizational tags
  • Examples: "documentation", "external", "sitemap"

Step 5: Save and Process

  1. Click Save or Create
  2. The system will:
    • Parse the sitemap file
    • Fetch each URL listed
    • Extract and index content
  3. Monitor processing status

Step 6: Verify Import

  1. Check record count (number of URLs processed)
  2. Verify status shows "END_PROCESS"
  3. Review process logs for any failed URLs
  4. Test with relevant questions

Creating Custom Sitemaps

If you need a sitemap for a specific subset of pages, you can create one manually.

Basic Sitemap Structure

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
  </url>
  <url>
    <loc>https://example.com/page2</loc>
  </url>
  <url>
    <loc>https://example.com/page3</loc>
  </url>
</urlset>
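
For more than a handful of pages, writing the XML by hand gets error-prone. The following is a minimal sketch that generates the same structure from a list of URLs with Python's standard library; `build_sitemap` is an illustrative name, and `ET.indent` assumes Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Serialize a list of URLs as sitemap XML (bytes, UTF-8, with declaration)."""
    # Emit the sitemap namespace as the default xmlns instead of an ns0: prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for address in urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # ElementTree escapes characters such as & in the URL automatically.
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = address
    ET.indent(urlset)  # pretty-print; requires Python 3.9+
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

The returned bytes can be written straight to disk, for example with `open("sitemap.xml", "wb").write(build_sitemap(urls))`.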

With Optional Metadata

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2024-01-15</lastmod>
    <priority>1.0</priority>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2024-01-10</lastmod>
    <priority>0.8</priority>
    <changefreq>weekly</changefreq>
  </url>
</urlset>

Optional Tags:

  • <lastmod> - Last modified date (YYYY-MM-DD)
  • <priority> - Importance (0.0 to 1.0)
  • <changefreq> - Update frequency (daily, weekly, monthly)

Using Online Sitemap Generators

Several tools can generate sitemaps:

  • Screaming Frog SEO Spider - Desktop app
  • XML-Sitemaps.com - Online generator
  • Sitemap Writer Pro - Desktop app
  • Custom scripts - Python, Node.js, etc.

Examples

Example 1: Documentation Site

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/</loc>
  </url>
  <url>
    <loc>https://docs.example.com/getting-started</loc>
  </url>
  <url>
    <loc>https://docs.example.com/api-reference</loc>
  </url>
  <url>
    <loc>https://docs.example.com/tutorials</loc>
  </url>
  <url>
    <loc>https://docs.example.com/faq</loc>
  </url>
</urlset>

Name: Product Documentation
Description: Complete product documentation from sitemap
File: docs-sitemap.xml (5 URLs)
Tags: documentation, product, public

Example 2: Blog Posts

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://blog.example.com/2024/how-to-get-started</loc>
  </url>
  <url>
    <loc>https://blog.example.com/2024/advanced-tips</loc>
  </url>
  <url>
    <loc>https://blog.example.com/2023/year-in-review</loc>
  </url>
</urlset>

Name: Technical Blog Posts
Description: Selected technical blog posts
File: blog-sitemap.xml (3 URLs)
Tags: blog, technical, public

Example 3: Help Center

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://help.example.com/en/articles/account-setup</loc>
  </url>
  <url>
    <loc>https://help.example.com/en/articles/billing-faq</loc>
  </url>
  <url>
    <loc>https://help.example.com/en/articles/troubleshooting</loc>
  </url>
  <url>
    <loc>https://help.example.com/en/articles/api-integration</loc>
  </url>
</urlset>

Name: Help Center Articles
Description: Customer support articles in English
File: help-sitemap.xml (4 URLs)
Tags: support, help, customer-facing

Best Practices

1. Filter Sitemap Content

Before uploading, edit the sitemap to include only relevant pages:

Good:

<url><loc>https://example.com/docs/getting-started</loc></url>
<url><loc>https://example.com/docs/api-reference</loc></url>
<url><loc>https://example.com/docs/tutorials</loc></url>

Remove:

<url><loc>https://example.com/login</loc></url>
<url><loc>https://example.com/signup</loc></url>
<url><loc>https://example.com/checkout</loc></url>
<url><loc>https://example.com/privacy-policy</loc></url>
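
The same filtering can be scripted. The sketch below lists the URLs worth keeping, assuming pages are excluded by path prefix; `EXCLUDED_PATHS` and `keep_relevant_urls` are illustrative names, and the prefix list simply mirrors the "Remove" examples above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Path prefixes to drop; adjust to the site being imported.
EXCLUDED_PATHS = ("/login", "/signup", "/checkout", "/privacy-policy")


def keep_relevant_urls(xml_bytes, excluded=EXCLUDED_PATHS):
    """Return <loc> values whose URL path does not start with an excluded prefix."""
    root = ET.fromstring(xml_bytes)
    kept = []
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        url = (loc.text or "").strip()
        # str.startswith accepts a tuple, so one call checks every prefix.
        if url and not urlparse(url).path.startswith(tuple(excluded)):
            kept.append(url)
    return kept
```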

2. Keep Sitemaps Organized

Create separate data sources for different content types:

  • docs-sitemap.xml - Documentation pages
  • help-sitemap.xml - Support articles
  • blog-sitemap.xml - Blog posts

3. Version Your Sitemaps

When re-importing, keep versions:

docs-sitemap-2024-01.xml
docs-sitemap-2024-02.xml
docs-sitemap-2024-03.xml

4. Validate Before Upload

Use a sitemap validator, or any XML validator, to confirm the file is well-formed before uploading.
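
A basic local check is also possible without an external tool. This sketch covers only well-formedness, the root element, and absolute URLs; it is not a full schema validation, and `validate_sitemap` is a hypothetical helper rather than anything Twig provides.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def validate_sitemap(xml_bytes):
    """Return a list of human-readable problems; an empty list means no issues found."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    problems = []
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append(f"unexpected root element: {root.tag}")
    locs = [(e.text or "").strip() for e in root.iter(f"{{{SITEMAP_NS}}}loc")]
    if not locs:
        problems.append("no <loc> entries found")
    for loc in locs:
        if not loc.startswith(("http://", "https://")):
            problems.append(f"not an absolute http(s) URL: {loc!r}")
    return problems
```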

5. Check URL Accessibility

Ensure all URLs in the sitemap are:

  • Publicly accessible (no authentication required)
  • Returning a 200 status code (not a 404 or a redirect)
  • Containing actual content (not empty pages)
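
The status-code check can be automated before upload. Below is a sketch using `urllib` from the standard library; the names `fetch_status` and `find_problem_urls` and the User-Agent string are assumptions for illustration. It flags non-200 responses and redirects but does not judge whether a page has meaningful content.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return (status_code, final_url) for a GET request. Requires network access.

    Connection-level failures (DNS, timeouts) are not caught and will propagate.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "sitemap-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as err:
        return err.code, url


def find_problem_urls(urls, fetch=fetch_status):
    """Map each URL that is not a plain 200 response to a short reason."""
    problems = {}
    for url in urls:
        status, final_url = fetch(url)
        if status != 200:
            problems[url] = f"HTTP {status}"
        elif final_url != url:
            # urlopen follows redirects, so a changed final URL reveals one.
            problems[url] = f"redirects to {final_url}"
    return problems
```

The `fetch` parameter exists so the check can be exercised with a stub instead of live requests.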

Advantages Over Website Connector

| Feature | Sitemap | Website Crawler |
|---------|---------|-----------------|
| Speed | Fast (only listed URLs) | Slower (discovers links) |
| Precision | Exact pages you want | May miss or include extra pages |
| Control | Full control over URLs | Limited by crawler settings |
| Resources | Less server load | More server requests |
| Freshness | Manual update needed | Can auto-refresh |

Use Sitemap when:

  • You know exactly which pages to import
  • Site has a complete, up-to-date sitemap
  • You want a one-time import
  • You need to minimize server load

Use Website Crawler when:

  • You want automatic discovery
  • Site structure changes frequently
  • You want automatic updates
  • You're not sure which pages exist

Handling Large Sitemaps

Sitemap Index Files

Large sites may use sitemap index files:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap2.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap3.xml</loc>
  </sitemap>
</sitemapindex>

To import:

  1. Download each individual sitemap
  2. Then either:
    • Create a separate data source for each sitemap, or
    • Merge the sitemaps into one file before uploading
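
To find out which files an index points to, the child sitemap URLs can be extracted in Python. A minimal sketch; `child_sitemaps` is an illustrative name, and the function returns an empty list when the file is a regular sitemap rather than an index.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_bytes):
    """Return the <loc> of every <sitemap> entry in a sitemap index file."""
    root = ET.fromstring(index_bytes)
    if root.tag != "{%s}sitemapindex" % NS["sm"]:
        return []  # a plain <urlset> or some other document
    return [
        (loc.text or "").strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if (loc.text or "").strip()
    ]
```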

Merging Multiple Sitemaps

Combine multiple sitemaps into one:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- URLs from sitemap1.xml -->
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
  
  <!-- URLs from sitemap2.xml -->
  <url><loc>https://example.com/page3</loc></url>
  <url><loc>https://example.com/page4</loc></url>
</urlset>
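
Merging by hand works for small files; for larger ones a script avoids copy-paste mistakes. The sketch below combines several downloaded sitemaps into one and drops duplicate URLs. `merge_sitemaps` is a hypothetical helper, and it keeps only `<loc>` values, so optional metadata such as `<lastmod>` is discarded.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def merge_sitemaps(documents):
    """Merge sitemap documents (bytes) into one, keeping first-seen URL order."""
    seen = set()
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for doc in documents:
        for loc in ET.fromstring(doc).iter(f"{{{SITEMAP_NS}}}loc"):
            address = (loc.text or "").strip()
            if address and address not in seen:
                seen.add(address)
                url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
                ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = address
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```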

Updating Content

Because Sitemap is a static connector, updating content requires a re-import.

To Update:

  1. Download the updated sitemap from the website
  2. Edit your data source in Twig
  3. Upload the new sitemap file
  4. Save to reprocess all URLs

Automation Options:

  • Schedule periodic manual updates
  • Use Website connector for automatic updates
  • Set up external scripts to notify you of sitemap changes
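
One way an external script could detect sitemap changes is to compare the previously imported file with a freshly downloaded copy. This is only a sketch of the comparison step; `sitemap_changes` is an illustrative name, and how you get notified (email, chat, cron output) is left open.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def _locs(xml_bytes):
    """Return the set of non-empty <loc> values in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return {
        (e.text or "").strip()
        for e in root.iter(f"{{{SITEMAP_NS}}}loc")
        if (e.text or "").strip()
    }


def sitemap_changes(old_bytes, new_bytes):
    """Return (added, removed) URL lists, each sorted alphabetically."""
    old, new = _locs(old_bytes), _locs(new_bytes)
    return sorted(new - old), sorted(old - new)
```

If both lists come back empty, the set of URLs is unchanged, although individual pages may still have been edited.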

Troubleshooting

URLs Not Accessible

Problem: Some URLs fail to process

Solutions:

  • Verify URLs are publicly accessible
  • Check for authentication requirements
  • Test URLs in incognito browser window
  • Review process logs for specific error codes

Invalid Sitemap Format

Problem: Sitemap upload fails

Solutions:

  • Validate XML syntax using online validator
  • Check for proper XML declaration
  • Ensure proper namespace declaration
  • Verify file encoding is UTF-8

Empty Pages Imported

Problem: URLs processed but no content extracted

Solutions:

  • Check if pages contain actual text content
  • Verify pages aren't JavaScript-heavy SPAs
  • Look for content behind login walls
  • Test URL manually in browser

Partial Import

Problem: Only some URLs processed

Solutions:

  • Check plan limits on number of URLs
  • Review process logs for errors
  • Verify failed URLs are accessible
  • Split large sitemaps into multiple sources
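
Splitting a large sitemap amounts to chunking its URL list. In the sketch below, `max_per_file` is a placeholder for whatever URL limit applies to your plan (the actual limits are not stated here), and `split_urls` is an illustrative name.

```python
def split_urls(urls, max_per_file):
    """Split a URL list into consecutive chunks of at most max_per_file items."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be at least 1")
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]
```

Each chunk can then be written out as its own sitemap file and uploaded as a separate data source.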

Advanced Tips

1. Filtering with Text Editor

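Use find-and-replace in a text editor to quickly filter sitemaps. The pattern below assumes the editor's regular-expression mode is enabled and that each <url> entry sits on a single line.

Remove URLs containing "blog":

Find: .*<url>.*blog.*</url>.*\n
Replace: (empty)

Keep only "/docs/" URLs:

  • Copy the entire sitemap
  • Delete all <url> entries, keeping the XML declaration and the <urlset> opening and closing tags
  • Paste back only the <url> lines containing "/docs/"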

2. Combining Sitemaps from Different Sites

Create a consolidated sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Site 1 docs -->
  <url><loc>https://docs.site1.com/guide</loc></url>
  
  <!-- Site 2 docs -->
  <url><loc>https://docs.site2.com/guide</loc></url>
  
  <!-- Site 3 docs -->
  <url><loc>https://docs.site3.com/guide</loc></url>
</urlset>

3. Priority-Based Import

Create multiple data sources based on priority:

High Priority (daily refresh needed):

<url><loc>https://example.com/getting-started</loc></url>
<url><loc>https://example.com/pricing</loc></url>

Low Priority (rarely changes):

<url><loc>https://example.com/company-history</loc></url>
<url><loc>https://example.com/team</loc></url>

Next Steps

After importing from sitemap:

  1. Test knowledge coverage
  2. Create AI agents for specific content areas
  3. Monitor usage to see which pages are most referenced
  4. Plan periodic sitemap updates

Related connectors:

  • Website - Automated web crawling with refresh
  • Files - Upload HTML or PDF exports
  • Confluence - For wiki-based documentation
  • Google Drive - For cloud-hosted documentation


Last updated January 25, 2026