Optimizing Chunk Size

The Problem

Finding the right chunk size is difficult: chunks that are too small lose context, chunks that are too large dilute relevance, and no single size is optimal for all content types.

Symptoms

❌ Constant tuning needed for different documents
❌ Technical docs need different size than marketing content
❌ Retrieval quality varies wildly
❌ One-size-fits-all approach fails
❌ Can't balance coverage vs precision

Real-World Example

Current setting: 512 tokens

Works well for:
✓ FAQ entries (naturally ~300 tokens each)
✓ Blog posts (clear paragraphs)

Fails for:
✗ API reference (needs full function signature + examples = 800 tokens)
✗ Legal documents (single sentences = 200 tokens but need surrounding context)
✗ Code files (functions vary 50-2000 tokens)

One setting can't satisfy all content types

Deep Technical Analysis

The Fundamental Trade-Off

Chunk size optimization is inherently a multi-objective problem:

Competing Objectives:

Smaller chunks:
+ Higher precision (exact answer location)
+ More chunks fit in context window
+ Faster embedding generation
- Lose surrounding context
- Miss cross-paragraph relationships
- More storage (more chunks total)

Larger chunks:
+ More context preserved
+ Better paragraph coherence
+ Fewer chunks (less storage)
- Lower precision (answer buried in noise)
- Fewer chunks fit in context
- Diluted semantic signal

No single size optimizes both

Retrieval Metrics Conflict:

Precision: Retrieved relevant / Total retrieved
→ Maximized with small, focused chunks
→ Each chunk highly relevant

Recall: Retrieved relevant / Total relevant
→ Maximized with large chunks
→ Cast wide net, capture more info

F1 Score: Harmonic mean of precision and recall
→ Requires balancing both
→ Optimal point varies by use case
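The three metrics above can be computed per query from sets of chunk IDs. The following is a minimal sketch, assuming each chunk has a string ID and that a labeled set of relevant chunks exists for the query; the function name `retrieval_metrics` and the example IDs are illustrative, not part of any library.

```python
def retrieval_metrics(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 for a single query, given chunk IDs."""
    hits = len(retrieved & relevant)  # chunks that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical query: 4 chunks retrieved, 3 chunks labeled relevant, 2 in common.
metrics = retrieval_metrics({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c7"})
print(metrics)
```

Averaging these values over a test set for each candidate chunk size gives one way to compare sizes on the same footing.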

Content-Type Specific Requirements

Different document types have different optimal sizes:

Content Type Analysis:

Technical Documentation:
→ Dense information
→ Code examples (keep complete)
→ Step-by-step procedures
→ Optimal: 1024-1536 tokens

Marketing Content:
→ Conversational style
→ Shorter paragraphs
→ Standalone sections
→ Optimal: 256-512 tokens

Legal/Compliance:
→ Single sentences are important
→ But need clause context
→ Cross-references common
→ Optimal: 512-1024 tokens

Code Repositories:
→ Function-level (varies wildly)
→ 50 tokens (small helper) to 5000 (main class)
→ Optimal: Variable by AST node

Academic Papers:
→ Long-form argumentation
→ Section-level coherence matters
→ Figures/tables reference
→ Optimal: 1536-2048 tokens

The Multi-Dataset Problem:

Organization with mixed content:
→ 10,000 technical docs (want 1024 tokens)
→ 5,000 blog posts (want 512 tokens)
→ 2,000 code files (want AST-based)
→ 1,000 legal docs (want 768 tokens)

Current system: Global chunk_size setting

Options:
1. Use a middle value (768): Fits legal docs, suboptimal for the rest
2. Use smallest (512): Loses context in technical docs
3. Use largest (1024): Too coarse for blogs
4. Content-type detection + dynamic sizing (complex)
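Option 4 can be sketched as a lookup from detected content type to a chunking configuration. This is a rough illustration only: the `ChunkConfig` class, the `detect_content_type` heuristic (which looks at path hints and nothing else), and the fallback size are all assumptions made for the example. A real detector would likely inspect the content itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkConfig:
    strategy: str               # "fixed" (token count) or "ast" (syntax-aware)
    chunk_size: Optional[int]   # tokens; None when the strategy picks boundaries


# Sizes taken from the mixed-content example above.
CHUNK_CONFIGS = {
    "technical": ChunkConfig("fixed", 1024),
    "blog": ChunkConfig("fixed", 512),
    "legal": ChunkConfig("fixed", 768),
    "code": ChunkConfig("ast", None),
}
DEFAULT_CONFIG = ChunkConfig("fixed", 768)  # assumed fallback for unknown types


def detect_content_type(filename: str) -> str:
    """Hypothetical detector: guesses the type from path hints only."""
    name = filename.lower()
    if name.endswith((".py", ".js", ".ts", ".go", ".java")):
        return "code"
    if "legal" in name or "terms" in name:
        return "legal"
    if "blog" in name:
        return "blog"
    if "docs" in name or "api" in name:
        return "technical"
    return "unknown"


def config_for(filename: str) -> ChunkConfig:
    """Pick a chunking configuration per document instead of one global setting."""
    return CHUNK_CONFIGS.get(detect_content_type(filename), DEFAULT_CONFIG)


print(config_for("docs/api/auth.md"))
print(config_for("src/main.py"))
```

The complexity noted in option 4 lives mostly in the detector: misclassified documents silently get the wrong size, so the detection step deserves its own tests.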

Query-Dependent Optimal Size

Different queries benefit from different chunk sizes:

Query Type Variations:

Factual queries: "What's the API rate limit?"
→ Answer in single sentence
→ Optimal: Small chunks (256-512)
→ High precision needed

Explanatory queries: "How does OAuth work?"
→ Multi-paragraph explanation
→ Optimal: Large chunks (1024-2048)
→ Context and flow matter

Comparative queries: "Difference between Pro and Enterprise?"
→ Need multiple pieces of info
→ Optimal: Medium chunks (512-1024)
→ Retrieve multiple, compare

Code queries: "Example of API authentication"
→ Need complete code block
→ Optimal: AST-aware (variable)
→ Syntactic completeness required

The Static Configuration Problem:

Chunk size set at indexing time:
→ chunk_size = 512

Cannot adapt to query at retrieval time:
→ Factual query: Would benefit from 256
→ Explanatory query: Would benefit from 1536
→ But: Already embedded at 512
→ Must re-index to change

Retrieval-time re-chunking is impractical with a single index
(embeddings already generated at one fixed size)

Overlap Configuration Complexity

Overlap percentage interacts with chunk size:

Overlap Mathematics:

Chunk size: 512 tokens
Overlap: 10% = 51 tokens (stride: 512 - 51 = 461)

Document: 1432 tokens
Chunk 1: Tokens 1-512
Chunk 2: Tokens 462-973 (51 overlap with Chunk 1)
Chunk 3: Tokens 923-1432 (51 overlap with Chunk 2, cut short at document end)

Unique tokens: 1432
Embedded tokens: 512 + 512 + 510 = 1534
Redundancy: (2 × 51) / 1432 = 102 / 1432 = 7.1%

But:
Chunk size: 1024 tokens
Overlap: 10% = 102 tokens

Same document (1432 tokens):
Chunk 1: 1-1024
Chunk 2: 923-1432 (102 overlap)

Total chunks: 2 (vs 3 with 512)
Redundancy: 102 / 1432 = 7.1% (same percentage!)
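The span arithmetic is easy to get off by one, so it can help to compute it. The sketch below assumes 1-indexed inclusive token spans, a stride of `chunk_size - overlap`, and a final chunk that is cut short at the end of the document; `chunk_spans` and `redundancy` are illustrative helper names.

```python
def chunk_spans(doc_len: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return 1-indexed inclusive (start, end) token spans for fixed-size chunks."""
    if doc_len < 1:
        raise ValueError("doc_len must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each new chunk advances
    spans = []
    start = 1
    while True:
        end = min(start + chunk_size - 1, doc_len)  # last chunk may be shorter
        spans.append((start, end))
        if end == doc_len:
            break
        start += stride
    return spans


def redundancy(doc_len: int, spans: list[tuple[int, int]]) -> float:
    """Tokens embedded more than once, as a fraction of the document length."""
    embedded = sum(end - start + 1 for start, end in spans)
    return (embedded - doc_len) / doc_len


for size, overlap in [(512, 51), (1024, 102)]:
    spans = chunk_spans(1432, size, overlap)
    print(size, len(spans), spans, f"{redundancy(1432, spans):.1%}")
```

For the 1432-token document, both configurations come out at roughly 7.1% redundancy, even though the chunk counts differ.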

Semantic Boundary Awareness:

Fixed overlap may split mid-sentence:

Chunk 1 (512 tokens) ends at:
"...the authentication process requires three"

Chunk 2 (10% overlap) starts at:
"process requires three steps: validation, token generation, and verification."

Overlap captures "process requires three" (redundant)
But ideal overlap:
→ Start Chunk 2 at sentence boundary
→ Include full "three steps" sentence in both
→ Variable overlap (not fixed %)
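One way to get sentence-aligned, variable overlap is to pack whole sentences into chunks and carry the last sentence or two into the next chunk. The following is a minimal sketch under loud assumptions: tokens are approximated by whitespace-separated words, sentences are split with a simple regular expression, and `sentence_chunks` is an illustrative name. A production system would use the embedding model's tokenizer and a proper sentence splitter.

```python
import re


def sentence_chunks(text: str, max_tokens: int, overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks; repeat the last N sentences in the next chunk.

    Token counts are approximated by word counts (an assumption). A chunk can
    exceed max_tokens when the carried-over sentences plus one new sentence
    are already too long.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        length = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and length > max_tokens:
            chunks.append(" ".join(current))
            # Carry whole sentences forward, so overlap size varies with sentence length.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Authentication requires three steps: validation, token generation, and verification. "
    "Validation checks the credentials. "
    "Token generation issues a signed token. "
    "Verification confirms the token on each request."
)
for chunk in sentence_chunks(text, max_tokens=15):
    print(repr(chunk))
```

Every chunk here starts and ends on a sentence boundary, and the overlap is a complete sentence instead of a fixed number of tokens.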

Embedding Model Constraints

Models have inherent size preferences:

Model Context Windows:

OpenAI text-embedding-ada-002:
→ Max input: 8,191 tokens
→ Commonly used: 256-512 tokens (a rule of thumb; confirm with your own evaluation)
→ Retrieval quality can degrade with very long inputs

Sentence-BERT models:
→ Max input: 128-512 tokens (model-dependent)
→ Optimal: 64-256 tokens
→ Positional encoding strongest at beginning

Cohere embed-english-v3.0:
→ Max input: 512 tokens
→ Commonly used: around 256 tokens
→ Check the provider's current guidance before fixing a size

Chunk size should respect model's sweet spot

Positional Encoding Decay:

Transformer embeddings of long inputs:
→ One fixed-size vector must summarize every token
→ Some models represent earlier tokens better than later ones
→ The strength of this bias is model-dependent

Chunk size 2048 tokens (illustrative):
→ First 512: Well-represented
→ Middle 1024: Moderately represented
→ Last 512: Poorly represented

Query matching last 512 tokens:
→ Low similarity despite containing answer
→ Effectively wasted tokens

Better: Multiple smaller chunks
→ Each chunk's content well-represented

Hierarchical Chunking Strategies

Different granularities for different purposes:

Multi-Resolution Indexing:

Approach: Embed document at multiple chunk sizes

Document (5000 tokens) indexed as:
→ Small chunks (512): 10 chunks (high precision)
→ Medium chunks (1024): 5 chunks (balanced)
→ Large chunks (2048): 3 chunks (context-rich)
→ Full document: 1 chunk (maximum context)

Retrieval strategy:
1. Query against small chunks (precision)
2. If top matches ambiguous: Query medium chunks
3. If still unclear: Query large chunks
4. Return best-matching granularity

Cost: 19 embeddings (10+5+3+1) vs 10 (single size)
→ ~1.9x vectors stored
→ ~4x tokens embedded (the full document is embedded once per level)
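The cost of multi-resolution indexing can be estimated before committing to it. This sketch counts stored vectors and embedded tokens separately, assuming no overlap between chunks and one extra vector for the full document; `multi_resolution_cost` is an illustrative name.

```python
import math


def multi_resolution_cost(doc_tokens: int, chunk_sizes: list[int],
                          include_full_doc: bool = True) -> dict:
    """Count vectors and embedded tokens when indexing one document at several sizes."""
    chunks_per_size = {size: math.ceil(doc_tokens / size) for size in chunk_sizes}
    levels = len(chunk_sizes) + (1 if include_full_doc else 0)
    return {
        "chunks_per_size": chunks_per_size,
        "vectors": sum(chunks_per_size.values()) + (1 if include_full_doc else 0),
        # Without overlap, every level embeds each token of the document exactly once.
        "embedded_tokens": doc_tokens * levels,
    }


cost = multi_resolution_cost(5000, [512, 1024, 2048])
print(cost)
```

For the 5000-token document this yields 19 vectors but 20,000 embedded tokens, so the number of vectors stored and the embedding compute grow at different rates. A full-document level is also only possible when the document fits within the embedding model's input limit.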

Parent-Child Chunking:

Structure:
→ Parent: Section-level (2048 tokens)
→ Children: Paragraph-level (512 tokens)

Storage:
→ Embed and index children only (for retrieval)
→ Store parent metadata with each child

Retrieval:
1. Query matches child chunk (paragraph)
2. Return parent chunk (full section) to LLM
3. LLM sees broader context than matched paragraph

Benefits:
→ Retrieval precision (small chunks)
→ LLM context richness (large chunks)
→ Moderate cost (only index children)
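The parent-child flow can be sketched end to end with a toy similarity function. Everything here is an assumption made for illustration: `embed` is a bag-of-words stand-in for a real embedding model, paragraphs are split on blank lines, and the in-memory list stands in for a vector store.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_child_index(parents: dict[str, str]) -> list[dict]:
    """Embed paragraphs (children) only; each child keeps its parent's ID as metadata."""
    index = []
    for parent_id, section in parents.items():
        for paragraph in section.split("\n\n"):
            index.append({"parent_id": parent_id, "vector": embed(paragraph)})
    return index


def retrieve_parent(query: str, index: list[dict], parents: dict[str, str]) -> str:
    """Match the query against children, then return the full parent section."""
    query_vector = embed(query)
    best_child = max(index, key=lambda child: cosine(query_vector, child["vector"]))
    return parents[best_child["parent_id"]]


# Hypothetical sections; blank lines separate paragraphs.
parents = {
    "auth": "OAuth uses bearer tokens.\n\nTokens expire after one hour.",
    "limits": "The API allows 100 requests per minute.\n\nBursts above the limit are throttled.",
}
index = build_child_index(parents)
section = retrieve_parent("how many requests per minute", index, parents)
print(section)
```

The query matches only the first paragraph of the `limits` section, yet the text handed to the LLM also includes the paragraph about throttling.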

Evaluation and Measurement

Determining optimal size requires metrics:

Offline Evaluation:

Test dataset: 100 query-answer pairs

Experiment: Test chunk sizes 256, 512, 1024, 2048

For each size:
1. Re-chunk knowledge base
2. Re-embed all chunks
3. Run 100 test queries
4. Measure: 
   - Retrieval accuracy (answer in top-K?)
   - Context utilization (% of context window used)
   - Answer quality (BLEU/ROUGE score)

Compare results, pick best size

Challenge:
→ Expensive (4 full re-embeddings)
→ Test dataset may not represent real queries
→ Answer quality subjective
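The experiment loop above can be prototyped on a small sample before paying for full re-embeddings. The sketch below is a toy under stated assumptions: word overlap stands in for embedding similarity, tokens are words, and an answer counts as found when its words appear contiguously in a top-K chunk. The names `hit_rate` and `top_k` and the corpus are illustrative.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def chunk_words(words: list[str], size: int) -> list[list[str]]:
    """Step 1: re-chunk the corpus at a fixed size (no overlap)."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def top_k(query: str, chunks: list[list[str]], k: int) -> list[list[str]]:
    """Toy retriever: rank chunks by words shared with the query."""
    query_words = set(tokenize(query))
    ranked = sorted(chunks, key=lambda chunk: len(query_words & set(chunk)), reverse=True)
    return ranked[:k]


def hit_rate(corpus: str, test_set: list[tuple[str, str]], chunk_size: int, k: int = 1) -> float:
    """Fraction of test queries whose answer text appears in a top-k chunk."""
    chunks = chunk_words(tokenize(corpus), chunk_size)
    hits = 0
    for query, answer in test_set:
        answer_text = " ".join(tokenize(answer))
        if any(answer_text in " ".join(chunk) for chunk in top_k(query, chunks, k)):
            hits += 1
    return hits / len(test_set)


corpus = (
    "The API rate limit is 100 requests per minute. "
    "Refresh tokens expire after 30 days. "
    "Webhooks retry failed deliveries five times."
)
test_set = [
    ("What is the API rate limit?", "100 requests per minute"),
    ("When do refresh tokens expire?", "30 days"),
]
for size in (4, 8, 16):
    print(size, hit_rate(corpus, test_set, size))
```

On a corpus this small, larger chunks trivially win because nearly everything fits in one chunk; the point is the shape of the loop, not the ranking of sizes.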

A/B Testing in Production:

Approach: Split traffic

50% of users: chunk_size=512
50% of users: chunk_size=1024

Measure:
→ User satisfaction (thumbs up/down)
→ Query response quality
→ Retrieval latency

After 2 weeks: Compare metrics, choose winner

Risks:
→ Degrades experience for 50% during test
→ Requires significant traffic
→ Confounding variables (query difficulty varies)
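A 50/50 split is usually made sticky, so that a given user sees the same chunk size for the whole test. One common way to do this is to hash the user ID into a bucket. The sketch below assumes string user IDs and a hypothetical experiment name used as a salt; `assign_chunk_size` is an illustrative name.

```python
import hashlib


def assign_chunk_size(user_id: str, experiment: str = "chunk-size-v1") -> int:
    """Deterministically assign a user to the 512 or 1024 arm (about 50/50)."""
    # Salting with the experiment name keeps assignments independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return 512 if bucket < 50 else 1024


arms = [assign_chunk_size(f"user-{i}") for i in range(1000)]
print(arms.count(512), arms.count(1024))
```

Because assignment depends only on the user ID, it needs no stored state, and logging the arm with each feedback event is enough to compare metrics afterwards.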

How to Solve

Start with 512-1024 tokens as a baseline, implement content-type detection for variable sizing, use 10-15% overlap, evaluate with test queries, and consider hierarchical chunking for complex docs. See Chunk Size Optimization.
