Chunking Strategies

Chunking is the process of breaking down large documents into smaller, manageable pieces that can be effectively processed and retrieved by your AI agents. The right chunking strategy significantly impacts the quality and relevance of responses.

What is Chunking?

Chunking divides long documents into smaller segments (chunks) that:

Fit within the context window of AI models
Contain semantically coherent information
Can be independently retrieved and understood
Maintain sufficient context for accurate interpretation

Why Chunking Matters

Proper chunking affects several critical aspects of your AI system:

Retrieval Precision: Smaller, focused chunks help retrieve exactly what's needed
Context Preservation: Well-chunked content maintains meaning without the full document
Performance: Chunk size determines how much text is embedded, indexed, and passed to the model per query, which directly affects processing speed
Cost Efficiency: Smaller chunks reduce token usage in LLM calls

Chunking Strategies

Fixed-Size Chunking

Split documents into chunks of a predetermined size.

When to Use:

Uniform content without clear structural boundaries
Quick processing with minimal overhead
Content where exact boundaries are less critical

Parameters:

Chunk Size: Number of characters or tokens per chunk (e.g., 512, 1000, 2000)
Overlap: Number of characters/tokens to overlap between chunks (e.g., 50-200)

Pros:

Simple and fast to implement
Predictable chunk sizes
Low computational overhead

Cons:

May split sentences or paragraphs mid-thought
Doesn't respect document structure
Can lose semantic coherence

Example Configuration:

{
  "strategy": "fixed-size",
  "chunkSize": 1000,
  "overlap": 100,
  "unit": "characters"
}
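
The configuration above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function name fixed_size_chunks is hypothetical, and the unit is assumed to be characters.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows; each window repeats the last
    `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

Note that the window cuts wherever the character count lands, which is exactly the mid-sentence splitting listed under Cons.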

Semantic Chunking

Split documents based on semantic meaning and topic boundaries.

When to Use:

Content with clear topic transitions
Technical documentation with distinct sections
Articles and blog posts with well-defined structure

How It Works:

Analyzes text for semantic similarity
Identifies topic boundaries using embeddings or NLP
Creates chunks around natural transition points

Pros:

Preserves semantic coherence
Natural, meaningful segments
Better retrieval accuracy

Cons:

More computationally intensive
Variable chunk sizes
May require fine-tuning

Example Configuration:

{
  "strategy": "semantic",
  "similarityThreshold": 0.7,
  "minChunkSize": 200,
  "maxChunkSize": 2000
}
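
As a rough sketch of how these parameters could interact, the code below starts a new chunk when the similarity between adjacent sentences drops below the threshold. Everything here is an assumption made for illustration: the function names are hypothetical, sizes are counted in characters, and embed is a toy word-count vector standing in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, similarity_threshold: float = 0.7,
                    min_chunk_size: int = 200, max_chunk_size: int = 2000) -> list[str]:
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current:
            size = len(" ".join(current))
            topic_shift = cosine(embed(current[-1]), embed(sentence)) < similarity_threshold
            too_big = size + 1 + len(sentence) > max_chunk_size
            # Break on a topic shift only once the chunk has reached the minimum size.
            if too_big or (topic_shift and size >= min_chunk_size):
                chunks.append(" ".join(current))
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than max_chunk_size is kept whole in this sketch. The threshold usually needs tuning per corpus, which is the fine-tuning cost listed under Cons.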

Structural Chunking

Split documents based on their inherent structure (headings, paragraphs, sections).

When to Use:

Well-structured documents (Markdown, HTML)
Technical manuals with clear hierarchies
Documentation with consistent formatting

How It Works:

Identifies structural elements (h1, h2, paragraphs)
Chunks based on hierarchy levels
Maintains document outline

Pros:

Respects document organization
Preserves hierarchical context
Intuitive chunk boundaries

Cons:

Requires structured input
Variable chunk sizes
May create very large or very small chunks

Example Configuration:

{
  "strategy": "structural",
  "splitLevel": "h2",
  "includeParentHeadings": true,
  "maxChunkSize": 3000
}
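
For Markdown input, the idea can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name structural_chunks is hypothetical, only ATX-style headings (lines starting with #) are recognized, and maxChunkSize is not enforced.

```python
import re


def structural_chunks(markdown: str, split_level: int = 2,
                      include_parent_headings: bool = True) -> list[dict]:
    """Start a new chunk at every heading of level <= split_level.
    Deeper headings stay inside the chunk they belong to."""
    heading_re = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
    path: dict[int, str] = {}  # heading level -> most recent title at that level
    chunks: list[dict] = []
    current = {"headings": [], "lines": []}

    def flush() -> None:
        text = "\n".join(current["lines"]).strip()
        if text:  # sections with a heading but no body produce no chunk
            chunks.append({"headings": current["headings"], "text": text})

    for line in markdown.splitlines():
        match = heading_re.match(line)
        if match and len(match.group(1)) <= split_level:
            level = len(match.group(1))
            flush()
            # Forget headings at the same or a deeper level, then record this one.
            path = {lvl: title for lvl, title in path.items() if lvl < level}
            path[level] = match.group(2)
            headings = [path[lvl] for lvl in sorted(path)]
            current = {
                "headings": headings if include_parent_headings else headings[-1:],
                "lines": [],
            }
        else:
            current["lines"].append(line)
    flush()
    return chunks
```

The sketch does not skip fenced code blocks, so a line starting with # inside a code sample would be treated as a heading. A production splitter would need to track fences.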

Recursive Character Splitting

Hierarchically split text using multiple separators in order of priority.

When to Use:

Mixed content types
When you want to maintain natural boundaries
General-purpose chunking

How It Works:

Try splitting by paragraph (\n\n)
If chunks are too large, split by line (\n), then by sentence
If they are still too large, split by word
As a last resort, split by character

Pros:

Flexible and adaptive
Maintains natural boundaries when possible
Good general-purpose strategy

Cons:

More complex logic
May still need manual tuning
Variable performance

Example Configuration:

{
  "strategy": "recursive",
  "separators": ["\n\n", "\n", ". ", " "],
  "chunkSize": 1000,
  "overlap": 100
}
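
The fallback order can be sketched as a short recursive function. This is a simplified illustration with a hypothetical name, recursive_split. It measures size in characters, omits overlap for brevity, and drops the separator at each point where it cuts.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator, merge pieces up to chunk_size, and
    recurse with the remaining separators on any piece that is still too large."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # keep merging small pieces together
            continue
        if buffer.strip():
            chunks.append(buffer)
        buffer = ""
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if buffer.strip():
        chunks.append(buffer)
    return chunks
```

Because coarse separators are tried first, paragraphs stay intact whenever they fit, and finer separators are used only on the pieces that exceed the size limit.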

Token-Based Chunking

Split documents based on token count rather than characters.

When to Use:

When optimizing for LLM token limits
Cost-sensitive applications
When you need precise control over API usage

How It Works:

Uses tokenizer to count actual tokens
Splits to maintain token budget
Accounts for model-specific tokenization

Pros:

Precise token control
Optimal for API cost management
Model-aware chunking

Cons:

Requires tokenizer overhead
Model-specific implementation
May not respect semantic boundaries

Example Configuration:

{
  "strategy": "token-based",
  "maxTokens": 512,
  "overlap": 50,
  "tokenizer": "gpt-4"
}
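
A sketch of the same idea in Python is shown below. The function name token_chunks and its encode/decode parameters are hypothetical. The default whitespace tokenizer is only a placeholder: real token counts depend on the model, so an actual setup would pass in the model's own tokenizer functions.

```python
from typing import Callable, Sequence


def token_chunks(
    text: str,
    max_tokens: int = 512,
    overlap: int = 50,
    encode: Callable[[str], Sequence] = str.split,    # placeholder tokenizer
    decode: Callable[[Sequence], str] = " ".join,     # placeholder detokenizer
) -> list[str]:
    """Split text so that every chunk holds at most max_tokens tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be at least 0 and smaller than max_tokens")

    tokens = encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window reached the end of the token list
    return chunks
```

Counting with the wrong tokenizer gives chunks that overshoot or undershoot the budget, which is why the approach is model-specific.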

Choosing the Right Strategy

Content Type Considerations

Content Type	Recommended Strategy	Reasoning
Technical Docs	Structural	Respects hierarchies and code blocks
Articles/Blogs	Semantic	Maintains topic coherence
FAQs	Structural	Each Q&A is a natural chunk
Legal Documents	Recursive	Preserves clauses and paragraphs
Code Files	Structural	Respects functions and classes
Conversational Data	Fixed-Size	Uniform structure
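
If strategies are selected in code, the table above could be encoded as a simple lookup. The names below are hypothetical, and falling back to the recursive strategy for unlisted content types is an assumption based on its description as a general-purpose strategy.

```python
# Recommendations copied from the table above; keys are lowercased content types.
RECOMMENDED_STRATEGY = {
    "technical docs": "structural",
    "articles/blogs": "semantic",
    "faqs": "structural",
    "legal documents": "recursive",
    "code files": "structural",
    "conversational data": "fixed-size",
}


def recommend_strategy(content_type: str, default: str = "recursive") -> str:
    """Return the recommended strategy, or the default for unknown content types."""
    return RECOMMENDED_STRATEGY.get(content_type.strip().lower(), default)
```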

Performance Considerations

Small Chunks (200-500 tokens): Better retrieval precision, more API calls
Medium Chunks (500-1000 tokens): Balanced approach for most use cases
Large Chunks (1000-2000 tokens): More context, fewer retrievals, may be less precise

Advanced Techniques

Chunk Overlap

Include overlapping content between adjacent chunks to maintain context continuity.

Benefits:

Prevents information loss at boundaries
Improves retrieval of concepts spanning chunks
Provides additional context

Best Practices:

Use 10-20% overlap for fixed-size chunks
Adjust based on content type and chunk size
Consider computational cost vs. benefit
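
To make the cost side concrete, the number of fixed-size chunks can be estimated before processing. With overlap, each chunk after the first advances by chunk size minus overlap, so more chunks are needed to cover the same text. The helper name chunk_count is hypothetical, and the arithmetic assumes a simple sliding window.

```python
import math


def chunk_count(total_length: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover total_length units."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    if total_length <= 0:
        return 0
    if total_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((total_length - chunk_size) / step)


# A 10,000-character document with 1,000-character chunks:
for overlap in (0, 100, 200):
    print(overlap, chunk_count(10_000, 1_000, overlap))
```

In this example, 10% overlap raises the chunk count from 10 to 11 and 20% overlap raises it to 13, which translates into proportionally more embedding and storage work.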

Metadata Enrichment

Add metadata to chunks for better filtering and context:

{
  "chunk": "...",
  "metadata": {
    "source": "user-manual.pdf",
    "section": "Installation",
    "page": 15,
    "headings": ["Getting Started", "Installation"],
    "created_at": "2024-01-15",
    "doc_type": "manual"
  }
}
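
One way metadata pays off is filtering before or after retrieval. The sketch below assumes chunks are stored as dictionaries shaped like the example above. The helper name filter_chunks and the sample records are hypothetical.

```python
def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every given key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in criteria.items())
    ]


# Illustrative records only; not real data.
sample_chunks = [
    {"chunk": "Run the installer...",
     "metadata": {"source": "user-manual.pdf", "section": "Installation", "doc_type": "manual"}},
    {"chunk": "Reset your password...",
     "metadata": {"source": "faq.md", "section": "Accounts", "doc_type": "faq"}},
]

manual_chunks = filter_chunks(sample_chunks, doc_type="manual")
```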

Parent-Child Chunking

Create hierarchical chunk relationships:

Parent Chunks: Larger context chunks (e.g., full sections)
Child Chunks: Smaller retrievable chunks
Benefit: Retrieval matches the small child chunk, while the linked parent chunk supplies the broader context
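
A minimal sketch of the relationship is shown below. The function names are hypothetical, child chunks are cut by character count, and a plain substring match stands in for whatever vector search the retrieval layer actually uses.

```python
def build_parent_child(sections: list[str], child_size: int = 200):
    """Treat each section as a parent and cut it into smaller child chunks."""
    parents = dict(enumerate(sections))  # parent_id -> full section text
    children = [
        {"parent_id": parent_id, "text": text[i:i + child_size]}
        for parent_id, text in parents.items()
        for i in range(0, len(text), child_size)
    ]
    return parents, children


def retrieve_with_context(query: str, parents: dict, children: list[dict]):
    """Match on the small child chunk, then return it with its parent section.
    The substring match is a toy stand-in for vector search."""
    for child in children:
        if query.lower() in child["text"].lower():
            return child["text"], parents[child["parent_id"]]
    return None
```

Only the child chunks need to be indexed for search. The parent text is looked up by ID after a match.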

Implementation Guide

Step 1: Analyze Your Content

Review document structure
Identify natural boundaries
Consider content density
Assess variability

Step 2: Select Initial Strategy

Start with a recommended strategy for your content type
Choose conservative chunk sizes
Enable overlap initially

Step 3: Test and Measure

Process sample documents
Review chunk quality
Test retrieval accuracy
Measure performance metrics
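
One simple way to quantify retrieval accuracy is a hit rate: the share of test queries for which the expected chunk appears among the top k results. The sketch below assumes you have a small set of queries with known correct chunk IDs. The function name and data shapes are hypothetical.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 3) -> float:
    """results maps each query to its ranked chunk IDs;
    expected maps each query to the chunk ID that should be found."""
    if not results:
        return 0.0
    hits = sum(1 for query, ranked in results.items() if expected[query] in ranked[:k])
    return hits / len(results)
```

Running the same query set after each change in chunk size or strategy gives a like-for-like comparison.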

Step 4: Iterate and Optimize

Adjust chunk sizes based on results
Try alternative strategies
Fine-tune parameters
Monitor ongoing performance

Common Pitfalls

Chunks Too Small

Problem: Lost context, too many retrievals
Solution: Increase chunk size or add overlap

Chunks Too Large

Problem: Irrelevant information included, slow processing
Solution: Decrease chunk size or use more granular strategy

Ignoring Structure

Problem: Split mid-sentence or mid-concept
Solution: Use structural or semantic chunking

No Overlap

Problem: Information loss at boundaries
Solution: Add 10-20% overlap

One-Size-Fits-All

Problem: Poor performance across different content types
Solution: Use content-specific strategies

Monitoring Chunk Quality

Track these metrics to ensure optimal chunking:

Average Chunk Size: Should be consistent with target
Chunk Size Distribution: Watch for outliers
Retrieval Accuracy: Measure relevance of retrieved chunks
User Satisfaction: Track feedback on response quality
Token Usage: Monitor API costs
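
The size-related metrics can be computed with the standard library. This is a sketch with a hypothetical function name. Sizes are measured in characters, and the 50% tolerance used to flag outliers is an arbitrary assumption to adjust for your content.

```python
from statistics import mean, median


def chunk_size_report(chunks: list[str], target_size: int, tolerance: float = 0.5) -> dict:
    """Summarize chunk sizes and flag chunks far from the target size."""
    if not chunks:
        return {"count": 0, "outliers": []}
    sizes = [len(chunk) for chunk in chunks]
    low, high = target_size * (1 - tolerance), target_size * (1 + tolerance)
    return {
        "count": len(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "min": min(sizes),
        "max": max(sizes),
        # Indexes of chunks outside the accepted size band.
        "outliers": [i for i, size in enumerate(sizes) if not low <= size <= high],
    }
```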

Best Practices

Start Conservative: Begin with medium-sized chunks and adjust
Respect Boundaries: Don't split sentences or code blocks midway
Add Context: Include headings or section titles in chunks
Use Metadata: Tag chunks with source, section, and category
Test Thoroughly: Validate chunking with real queries
Iterate Regularly: Refine based on performance data
Document Decisions: Keep track of why you chose specific strategies

Next Steps

Synthetic Data - Enhance your chunks with generated content
Data Manipulations - Transform and enrich your chunks

Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/product/data-prep/chunking-strategies.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
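
Because the question is passed in a query string, it needs to be URL-encoded. The sketch below only builds the request URL. The helper name is hypothetical, and the host docs.example.com is a placeholder, since the real base URL is not given on this page.

```python
from urllib.parse import urlencode


def build_ask_url(page_url: str, question: str) -> str:
    """Append the URL-encoded question as the `ask` query parameter."""
    return f"{page_url}?{urlencode({'ask': question})}"


# Placeholder host; substitute the actual documentation URL.
url = build_ask_url(
    "https://docs.example.com/dev/product/data-prep/chunking-strategies.md",
    "What overlap is recommended for fixed-size chunks?",
)
```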
