Chunking Strategies
Chunking is the process of breaking down large documents into smaller, manageable pieces that can be effectively processed and retrieved by your AI agents.
TL;DR
Chunking is the process of breaking down large documents into smaller, manageable pieces that can be effectively processed and retrieved by your AI agents. The right chunking strategy significantly impacts the quality and relevance of responses.
Key Takeaways
- What is Chunking?
- Why Chunking Matters
- Chunking Strategies
- Choosing the Right Strategy
- Advanced Techniques
- Implementation Guide
What is Chunking?
Chunking divides long documents into smaller segments (chunks) that:
- Fit within the context window of AI models
- Contain semantically coherent information
- Can be independently retrieved and understood
- Maintain sufficient context for accurate interpretation
Why Chunking Matters
Proper chunking affects several critical aspects of your AI system:
- Retrieval Precision: Smaller, focused chunks help retrieve exactly what's needed
- Context Preservation: Well-chunked content maintains meaning without the full document
- Performance: Optimally sized chunks improve processing speed
- Cost Efficiency: Smaller chunks reduce token usage in LLM calls
Chunking Strategies
Fixed-Size Chunking
Split documents into chunks of a predetermined size.
When to Use:
- Uniform content without clear structural boundaries
- Quick processing with minimal overhead
- Content where exact boundaries are less critical
Parameters:
- Chunk Size: Number of characters or tokens per chunk (e.g., 512, 1000, 2000)
- Overlap: Number of characters or tokens shared between adjacent chunks (e.g., 50-200)
Pros:
- Simple and fast to implement
- Predictable chunk sizes
- Low computational overhead
Cons:
- May split sentences or paragraphs mid-thought
- Doesn't respect document structure
- Can lose semantic coherence
Example Configuration:
{
  "strategy": "fixed-size",
  "chunkSize": 1000,
  "overlap": 100,
  "unit": "characters"
}
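A minimal Python sketch of what a configuration like the one above could translate to, assuming the unit is characters. The function name `fixed_size_chunks` is hypothetical and is not part of any product API; the parameters mirror `chunkSize` and `overlap`.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows; adjacent windows share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Note that the windows cut wherever the character count lands, which is exactly the mid-sentence splitting listed under Cons.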
Semantic Chunking
Split documents based on semantic meaning and topic boundaries.
When to Use:
- Content with clear topic transitions
- Technical documentation with distinct sections
- Articles and blog posts with well-defined structure
How It Works:
- Analyzes text for semantic similarity
- Identifies topic boundaries using embeddings or NLP
- Creates chunks around natural transition points
Pros:
- Preserves semantic coherence
- Natural, meaningful segments
- Better retrieval accuracy
Cons:
- More computationally intensive
- Variable chunk sizes
- May require tuning of the similarity threshold
Example Configuration:
{
  "strategy": "semantic",
  "similarityThreshold": 0.7,
  "minChunkSize": 200,
  "maxChunkSize": 2000
}
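The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: `bag_of_words` is a toy stand-in for a real embedding model, the sentence splitter is a simple regex, and `semantic_chunks` is a hypothetical name. The parameters mirror `similarityThreshold`, `minChunkSize`, and `maxChunkSize`.

```python
import math
import re
from collections import Counter
from typing import Callable


def bag_of_words(sentence: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[str], Counter] = bag_of_words,
    similarity_threshold: float = 0.7,
    min_chunk_size: int = 200,
    max_chunk_size: int = 2000,
) -> list[str]:
    """Start a new chunk where adjacent sentences stop being similar."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []

    chunks: list[str] = []
    current = [sentences[0]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        current_len = len(" ".join(current))
        topic_shift = cosine(prev_vec, vec) < similarity_threshold
        too_long = current_len + 1 + len(sentence) > max_chunk_size
        # Split on a topic shift only once the chunk has reached the minimum size.
        if too_long or (topic_shift and current_len >= min_chunk_size):
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_chunk_size` is kept whole in this sketch; a production pipeline would likely hand such a sentence to a secondary splitter.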
Structural Chunking
Split documents based on their inherent structure (headings, paragraphs, sections).
When to Use:
- Well-structured documents (Markdown, HTML)
- Technical manuals with clear hierarchies
- Documentation with consistent formatting
How It Works:
- Identifies structural elements (h1, h2, paragraphs)
- Chunks based on hierarchy levels
- Maintains document outline
Pros:
- Respects document organization
- Preserves hierarchical context
- Intuitive chunk boundaries
Cons:
- Requires structured input
- Variable chunk sizes
- May create very large or very small chunks
Example Configuration:
{
  "strategy": "structural",
  "splitLevel": "h2",
  "includeParentHeadings": true,
  "maxChunkSize": 3000
}
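As a rough illustration, the following Python sketch splits Markdown at a chosen heading level and records the heading path for each chunk. It assumes ATX-style headings (`#`, `##`), does not detect `#` lines inside fenced code blocks, and does not enforce `maxChunkSize`. The function name `structural_chunks` and the integer `split_level` (2 for `h2`) are assumptions made for this example.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def structural_chunks(
    markdown: str, split_level: int = 2, include_parent_headings: bool = True
) -> list[dict]:
    """Split Markdown at headings of `split_level` or higher in the hierarchy."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, title) of headings currently in scope
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            scope = path if include_parent_headings else path[-1:]
            chunks.append({"headings": [title for _, title in scope], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= split_level:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before entering the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)  # deeper headings stay inside the chunk body
    flush()
    return chunks
```

Keeping the heading path with each chunk is one way to preserve hierarchical context once the chunk is retrieved on its own.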
Recursive Character Splitting
Hierarchically split text using multiple separators in order of priority.
When to Use:
- Mixed content types
- When you want to maintain natural boundaries
- General-purpose chunking
How It Works:
- Try splitting by paragraph (\n\n) first
- If a chunk is still too large, split it by sentence
- If it is still too large, split it by word
- As a last resort, split by character
Pros:
- Flexible and adaptive
- Maintains natural boundaries when possible
- Good general-purpose strategy
Cons:
- More complex logic
- May still need manual tuning
- Variable performance
Example Configuration:
{
  "strategy": "recursive",
  "separators": ["\n\n", "\n", ". ", " "],
  "chunkSize": 1000,
  "overlap": 100
}
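The fallback order can be sketched in Python as below. This is a simplified illustration with a hypothetical function name, `recursive_split`. It omits the `overlap` setting for brevity, and the separator is dropped wherever a chunk boundary falls on it.

```python
def recursive_split(
    text: str,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
    chunk_size: int = 1000,
) -> list[str]:
    """Split on the coarsest separator first; fall back to finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Last resort: hard split by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, finer, chunk_size)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaaa bbbb\n\ncccc dddd eeee", chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

In the usage example, the first paragraph fits as is, while the second exceeds the limit and is split on spaces instead.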
Token-Based Chunking
Split documents based on token count rather than characters.
When to Use:
- When optimizing for LLM token limits
- Cost-sensitive applications
- Applications that need precise control over API usage
How It Works:
- Uses a tokenizer to count actual tokens
- Splits text so that each chunk stays within the token budget
- Accounts for model-specific tokenization
Pros:
- Precise token control
- Optimal for API cost management
- Model-aware chunking
Cons:
- Adds tokenizer overhead
- Model-specific implementation
- May not respect semantic boundaries
Example Configuration:
{
  "strategy": "token-based",
  "maxTokens": 512,
  "overlap": 50,
  "tokenizer": "gpt-4"
}
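A hedged Python sketch of token-budgeted splitting follows. To stay self-contained it defaults to whitespace tokens, which is an assumption and does not match any model's tokenizer. For real use, the `encode` and `decode` callables would be swapped for a model-specific tokenizer, for example one obtained from a library such as tiktoken. The function name `token_chunks` is hypothetical.

```python
from typing import Callable, Sequence


def token_chunks(
    text: str,
    max_tokens: int = 512,
    overlap: int = 50,
    encode: Callable[[str], Sequence] = str.split,  # stand-in: whitespace tokens
    decode: Callable[[Sequence], str] = " ".join,   # stand-in: rejoin with spaces
) -> list[str]:
    """Split text so that no chunk exceeds `max_tokens` under the given tokenizer."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")

    tokens = encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the final window reached the last token
    return chunks


print(token_chunks("one two three four five six seven", max_tokens=3, overlap=1))
# ['one two three', 'three four five', 'five six seven']
```

Counting with the same tokenizer the target model uses is what makes the budget precise; the whitespace default only approximates it.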
Choosing the Right Strategy
Content Type Considerations
| Content Type | Recommended Strategy | Reasoning |
|---|---|---|
| Technical Docs | Structural | Respects hierarchies and code blocks |
| Articles/Blogs | Semantic | Maintains topic coherence |
| FAQs | Structural | Each Q&A is a natural chunk |
| Legal Documents | Recursive | Preserves clauses and paragraphs |
| Code Files | Structural | Respects functions and classes |
| Conversational Data | Fixed-Size | Uniform structure |
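The table above can be encoded as a simple lookup when a pipeline needs to pick a strategy automatically. This is a minimal sketch: the name `recommend_strategy` is hypothetical, and falling back to `"recursive"` for unlisted content types is an assumption based on its description as a general-purpose strategy.

```python
# Recommendations copied from the table; keys are lowercased content types.
RECOMMENDED_STRATEGY = {
    "technical docs": "structural",
    "articles/blogs": "semantic",
    "faqs": "structural",
    "legal documents": "recursive",
    "code files": "structural",
    "conversational data": "fixed-size",
}


def recommend_strategy(content_type: str, default: str = "recursive") -> str:
    """Return the recommended strategy, or `default` for unlisted content types."""
    return RECOMMENDED_STRATEGY.get(content_type.strip().lower(), default)


print(recommend_strategy("FAQs"))
# structural
```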
Performance Considerations
- Small Chunks (200-500 tokens): Better retrieval precision, more API calls
- Medium Chunks (500-1000 tokens): Balanced approach for most use cases
- Large Chunks (1000-2000 tokens): More context, fewer retrievals, may be less precise
Advanced Techniques
Chunk Overlap
Include overlapping content between adjacent chunks to maintain context continuity.
Benefits:
- Prevents information loss at boundaries
- Improves retrieval of concepts spanning chunks
- Provides additional context
Best Practices:
- Use 10-20% overlap for fixed-size chunks
- Adjust based on content type and chunk size
- Consider computational cost vs. benefit
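To make the cost side concrete, the sketch below estimates how many chunks a document produces and how much extra text is stored for a given overlap ratio. It assumes fixed-size windows that advance by `chunk_size - overlap`; the function name `overlap_plan` is hypothetical.

```python
import math


def overlap_plan(doc_length: int, chunk_size: int, overlap_ratio: float = 0.15) -> dict:
    """Estimate chunk count and storage overhead for fixed-size chunks with overlap."""
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if doc_length == 0:
        n_chunks = 0
    elif doc_length <= chunk_size:
        n_chunks = 1
    else:
        n_chunks = math.ceil((doc_length - overlap) / step)
    # Each boundary between adjacent chunks stores `overlap` units twice.
    stored = doc_length + max(n_chunks - 1, 0) * overlap
    return {"overlap": overlap, "n_chunks": n_chunks, "stored": stored}


print(overlap_plan(10_000, chunk_size=1000, overlap_ratio=0.10))
# {'overlap': 100, 'n_chunks': 11, 'stored': 11000}
```

In this example a 10% overlap turns 10 chunks into 11 and adds 10% to the amount of text that is embedded and stored.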
Metadata Enrichment
Add metadata to chunks for better filtering and context:
{
  "chunk": "...",
  "metadata": {
    "source": "user-manual.pdf",
    "section": "Installation",
    "page": 15,
    "headings": ["Getting Started", "Installation"],
    "created_at": "2024-01-15",
    "doc_type": "manual"
  }
}
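One way such metadata might be used is to narrow the candidate set before retrieval. The sketch below is illustrative only: the `Chunk` class and `filter_chunks` helper are hypothetical names, and the filter supports exact-match criteria only.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_chunks(chunks: list[Chunk], **criteria: Any) -> list[Chunk]:
    """Keep chunks whose metadata matches every key=value pair in `criteria`."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in criteria.items())
    ]


chunks = [
    Chunk("Run the installer.", {"source": "user-manual.pdf", "section": "Installation", "doc_type": "manual"}),
    Chunk("Invoices are sent monthly.", {"source": "billing-faq.md", "section": "Billing", "doc_type": "faq"}),
]
manual_only = filter_chunks(chunks, doc_type="manual")
print([c.text for c in manual_only])
# ['Run the installer.']
```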
Parent-Child Chunking
Create hierarchical chunk relationships:
- Parent Chunks: Larger context chunks (e.g., full sections)
- Child Chunks: Smaller retrievable chunks
- Benefit: Retrieval matches on the specific child chunk, while the parent supplies the broader context
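The relationship can be sketched in Python as follows. This is a simplified illustration: keyword overlap stands in for vector search, children are cut by character count, and the names `ChildChunk`, `build_index`, and `retrieve_parent` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the section this child was cut from


def build_index(sections: list[str], child_size: int = 200) -> tuple[dict[int, str], list[ChildChunk]]:
    """Treat each section as a parent and cut it into small child chunks."""
    parents = dict(enumerate(sections))
    children = [
        ChildChunk(section[i:i + child_size], parent_id)
        for parent_id, section in parents.items()
        for i in range(0, len(section), child_size)
    ]
    return parents, children


def _words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def retrieve_parent(query: str, parents: dict[int, str], children: list[ChildChunk]) -> str:
    """Match the query against children, then return the matching child's parent."""
    best = max(children, key=lambda child: len(_words(query) & _words(child.text)))
    return parents[best.parent_id]


sections = [
    "Installation requires Python. Run the installer and follow the prompts.",
    "Billing is monthly. Invoices are emailed to the account owner.",
]
parents, children = build_index(sections, child_size=30)
print(retrieve_parent("invoices emailed", parents, children))
# Billing is monthly. Invoices are emailed to the account owner.
```

The small children are cut mid-word here, yet the returned parent is the full section, which is the point of the technique.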
Implementation Guide
Step 1: Analyze Your Content
- Review document structure
- Identify natural boundaries
- Consider content density
- Assess variability
Step 2: Select Initial Strategy
- Start with a recommended strategy for your content type
- Choose conservative chunk sizes
- Enable overlap initially
Step 3: Test and Measure
- Process sample documents
- Review chunk quality
- Test retrieval accuracy
- Measure performance metrics
Step 4: Iterate and Optimize
- Adjust chunk sizes based on results
- Try alternative strategies
- Fine-tune parameters
- Monitor ongoing performance
Common Pitfalls
Chunks Too Small
- Problem: Lost context, too many retrievals
- Solution: Increase chunk size or add overlap
Chunks Too Large
- Problem: Irrelevant information included, slow processing
- Solution: Decrease chunk size or use a more granular strategy
Ignoring Structure
- Problem: Chunks are split mid-sentence or mid-concept
- Solution: Use structural or semantic chunking
No Overlap
- Problem: Information loss at boundaries
- Solution: Add 10-20% overlap
One-Size-Fits-All
- Problem: Poor performance across different content types
- Solution: Use content-specific strategies
Monitoring Chunk Quality
Track these metrics to ensure optimal chunking:
- Average Chunk Size: Should be consistent with target
- Chunk Size Distribution: Watch for outliers
- Retrieval Accuracy: Measure relevance of retrieved chunks
- User Satisfaction: Track feedback on response quality
- Token Usage: Monitor API costs
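Two of these metrics are straightforward to compute offline. The sketch below is a starting point under stated assumptions: sizes are measured in characters, an outlier is any chunk outside a tolerance band around the target size, and retrieval accuracy is approximated as a hit rate over a labeled query set. The names `chunk_size_report` and `hit_rate` are hypothetical.

```python
import statistics


def chunk_size_report(chunks: list[str], target_size: int, tolerance: float = 0.5) -> dict:
    """Summarize chunk sizes and flag chunks far from the target size."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    sizes = [len(chunk) for chunk in chunks]
    low, high = target_size * (1 - tolerance), target_size * (1 + tolerance)
    return {
        "count": len(sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "outliers": [i for i, size in enumerate(sizes) if not low <= size <= high],
    }


def hit_rate(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Fraction of queries for which at least one relevant chunk ID was retrieved."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got))
    return hits / len(relevant)


report = chunk_size_report(["a" * 100, "b" * 90, "c" * 10], target_size=100)
print(report["outliers"])
# [2]
print(hit_rate([["c1", "c2"], ["c3"]], [{"c2"}, {"c9"}]))
# 0.5
```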
Best Practices
- Start Conservative: Begin with medium-sized chunks and adjust
- Respect Boundaries: Don't split sentences or code blocks midway
- Add Context: Include headings or section titles in chunks
- Use Metadata: Tag chunks with source, section, and category
- Test Thoroughly: Validate chunking with real queries
- Iterate Regularly: Refine based on performance data
- Document Decisions: Keep track of why you chose specific strategies
Next Steps
- Synthetic Data - Enhance your chunks with generated content
- Data Manipulations - Transform and enrich your chunks
Last updated January 26, 2026


