RAG Scenarios and Solutions
Optimizing Chunk Size
Finding the right chunk size is difficult: chunks that are too small lose context, chunks that are too large dilute relevance, and no single size is optimal for all content types.
The Problem
A single global chunk size rarely suits every document in a knowledge base. Small chunks lose surrounding context and large chunks dilute relevance, so a fixed setting tends to produce the symptoms below.
Symptoms
- ❌ Constant tuning needed for different documents
- ❌ Technical docs need different size than marketing content
- ❌ Retrieval quality varies wildly
- ❌ One-size-fits-all approach fails
- ❌ Can't balance coverage vs precision
Real-World Example
Current setting: 512 tokens
Works well for:
✓ FAQ entries (naturally ~300 tokens each)
✓ Blog posts (clear paragraphs)
Fails for:
✗ API reference (needs full function signature + examples = 800 tokens)
✗ Legal documents (single sentences = 200 tokens but need surrounding context)
✗ Code files (functions vary 50-2000 tokens)
One setting can't satisfy all content types
Deep Technical Analysis
The Fundamental Trade-Off
Chunk size optimization is inherently a multi-objective problem:
Competing Objectives:
Smaller chunks:
+ Higher precision (exact answer location)
+ More chunks fit in context window
+ Faster to embed per chunk (though there are more chunks overall)
- Lose surrounding context
- Miss cross-paragraph relationships
- More storage (more chunks total)
Larger chunks:
+ More context preserved
+ Better paragraph coherence
+ Fewer chunks (less storage)
- Lower precision (answer buried in noise)
- Fewer chunks fit in context
- Diluted semantic signal
No single size optimizes both
Retrieval Metrics Conflict:
Precision: Relevant retrieved content / All retrieved content
→ Tends to rise with small, focused chunks
→ Each chunk highly relevant
Recall: Relevant retrieved content / All relevant content
→ Tends to rise with large chunks
→ Cast wide net, capture more info
F1 Score: Harmonic mean of precision and recall
→ Requires balancing both
→ Optimal point varies by use case
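The three metrics above can be computed per query from sets of chunk IDs. This is a minimal sketch; the function name and the example IDs are illustrative assumptions, not part of any library.

```python
def retrieval_metrics(retrieved: set, relevant: set) -> dict:
    """Compute precision, recall, and F1 for one query from sets of chunk IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical query: 4 chunks retrieved, 2 of them relevant, 3 relevant overall.
metrics = retrieval_metrics({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c9"})
```

Averaging these values over a test set for each candidate chunk size gives a single number to compare, though which metric to weight more heavily depends on the use case.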
Content-Type Specific Requirements
Different document types have different optimal sizes:
Content Type Analysis:
Technical Documentation:
→ Dense information
→ Code examples (keep complete)
→ Step-by-step procedures
→ Optimal: 1024-1536 tokens
Marketing Content:
→ Conversational style
→ Shorter paragraphs
→ Standalone sections
→ Optimal: 256-512 tokens
Legal/Compliance:
→ Single sentences are important
→ But need clause context
→ Cross-references common
→ Optimal: 512-1024 tokens
Code Repositories:
→ Function-level (varies wildly)
→ 50 tokens (small helper) to 5000 (main class)
→ Optimal: Variable by AST node
Academic Papers:
→ Long-form argumentation
→ Section-level coherence matters
→ Figures/tables reference
→ Optimal: 1536-2048 tokens
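For the code-repository case above, "variable by AST node" could look like the following sketch for Python sources. It uses the standard-library `ast` module; the function name `function_chunks` is an assumption, and other languages would need their own parser.

```python
import ast


def function_chunks(source: str) -> list:
    """Return one chunk per top-level function or class in a Python source string."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node
            # (decorators above the def/class line are not included).
            chunks.append(ast.get_source_segment(source, node))
    return chunks


example_source = (
    "import os\n\n"
    "def a():\n    return 1\n\n"
    "class B:\n    def m(self):\n        return 2\n"
)
code_chunks = function_chunks(example_source)
```

Each chunk is syntactically complete regardless of its token count. Very large classes may still need a second split at method level.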
The Multi-Dataset Problem:
Organization with mixed content:
→ 10,000 technical docs (want 1024 tokens)
→ 5,000 blog posts (want 512 tokens)
→ 2,000 code files (want AST-based)
→ 1,000 legal docs (want 768 tokens)
Current system: Global chunk_size setting
Options:
1. Use a middle value (768): Fits the legal docs, suboptimal for everything else
2. Use smallest (512): Loses context in technical docs
3. Use largest (1024): Too coarse for blogs
4. Content-type detection + dynamic sizing (complex)
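Option 4 could be sketched roughly as below. The sizes come from the mixed-content example above; the detector is a deliberately naive, hypothetical heuristic based on file extension and folder name, and a real system would likely need something more robust.

```python
from pathlib import Path
from typing import Optional

# None means "use structure-aware (AST) chunking instead of a fixed token count".
CHUNK_SIZE_BY_TYPE = {"technical": 1024, "blog": 512, "legal": 768, "code": None}
DEFAULT_CHUNK_SIZE = 768  # assumed fallback for unrecognized content


def detect_content_type(path: str) -> str:
    """Hypothetical detector: guess the content type from the path alone."""
    p = Path(path)
    if p.suffix in {".py", ".js", ".ts", ".go", ".java"}:
        return "code"
    parts = {part.lower() for part in p.parts}
    for folder, content_type in (("legal", "legal"), ("blog", "blog"), ("docs", "technical")):
        if folder in parts:
            return content_type
    return "unknown"


def chunk_size_for(path: str) -> Optional[int]:
    """Look up the chunk size for a document; None signals AST-based chunking."""
    return CHUNK_SIZE_BY_TYPE.get(detect_content_type(path), DEFAULT_CHUNK_SIZE)
```

For example, `chunk_size_for("docs/api/auth.md")` yields 1024, while a `.py` file yields `None` and would be routed to a structure-aware chunker.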
Query-Dependent Optimal Size
Different queries benefit from different chunk sizes:
Query Type Variations:
Factual queries: "What's the API rate limit?"
→ Answer in single sentence
→ Optimal: Small chunks (256-512)
→ High precision needed
Explanatory queries: "How does OAuth work?"
→ Multi-paragraph explanation
→ Optimal: Large chunks (1024-2048)
→ Context and flow matter
Comparative queries: "Difference between Pro and Enterprise?"
→ Need multiple pieces of info
→ Optimal: Medium chunks (512-1024)
→ Retrieve multiple, compare
Code queries: "Example of API authentication"
→ Need complete code block
→ Optimal: AST-aware (variable)
→ Syntactic completeness required
The Static Configuration Problem:
Chunk size set at indexing time:
→ chunk_size = 512
Cannot adapt to query at retrieval time:
→ Factual query: Would benefit from 256
→ Explanatory query: Would benefit from 1536
→ But: Already embedded at 512
→ Must re-index to change
Retrieval-time re-chunking is not possible with a single-size index
(embeddings were already generated per chunk)
Overlap Configuration Complexity
Overlap percentage interacts with chunk size:
Overlap Mathematics:
Chunk size: 512 tokens
Overlap: 10% = 51 tokens (stride: 461)
Chunk 1: Tokens 1-512
Chunk 2: Tokens 462-973 (51 overlap with Chunk 1)
Chunk 3: Tokens 923-1434 (51 overlap with Chunk 2)
Document length (unique tokens): 1434
Tokens embedded: 3 × 512 = 1536
Redundancy: 102 / 1536 = 6.6%
But:
Chunk size: 1024 tokens
Overlap: 10% = 102 tokens
Same document (1434 tokens):
Chunk 1: 1-1024
Chunk 2: 923-1434 (102 overlap)
Total chunks: 2 (vs 3 with 512)
Tokens embedded: 1024 + 512 = 1536
Redundancy: 102 / 1536 = 6.6% (same percentage)
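The overlap arithmetic is easy to get wrong by one token, so it can help to compute spans programmatically. The sketch below uses 1-based inclusive token positions and a 1434-token document, chosen so the last chunk ends exactly at the document end; both function names are assumptions.

```python
def chunk_spans(doc_len: int, chunk_size: int, overlap: int) -> list:
    """Return 1-based inclusive (start, end) token spans for fixed-size chunks."""
    if doc_len < 1:
        raise ValueError("doc_len must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    spans, start = [], 1
    while True:
        end = min(start + chunk_size - 1, doc_len)
        spans.append((start, end))
        if end == doc_len:
            return spans
        start += stride


def redundancy(spans: list, doc_len: int) -> float:
    """Fraction of embedded tokens that duplicate tokens in another chunk."""
    embedded = sum(end - start + 1 for start, end in spans)
    return (embedded - doc_len) / embedded


small = chunk_spans(1434, chunk_size=512, overlap=51)
large = chunk_spans(1434, chunk_size=1024, overlap=102)
```

Here `small` has three spans and `large` has two, and `redundancy` returns the same value for both, which illustrates that a fixed overlap percentage keeps relative redundancy roughly constant across chunk sizes.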
Semantic Boundary Awareness:
Fixed overlap may split mid-sentence:
Chunk 1 (512 tokens) ends at:
"...the authentication process requires three"
Chunk 2 (10% overlap) starts at:
"process requires three steps: validation, token generation, and verification."
Overlap captures "process requires three" (redundant)
But ideal overlap:
→ Start Chunk 2 at sentence boundary
→ Include full "three steps" sentence in both
→ Variable overlap (not fixed %)
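A sentence-aware variant might look like the following sketch. It makes two simplifying assumptions: whitespace-separated words stand in for tokens, and a regex on `.`, `!`, `?` stands in for a real sentence splitter. The function name is hypothetical.

```python
import re


def sentence_chunks(text: str, max_tokens: int, overlap_sentences: int = 1) -> list:
    """Pack whole sentences into chunks and repeat the last sentence(s) in the next chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        n_tokens = sum(len(s.split()) for s in candidate)
        if current and n_tokens > max_tokens:
            chunks.append(" ".join(current))
            # Carry whole sentences forward, so the overlap size varies with
            # sentence length. The carried text may push a chunk slightly
            # over max_tokens.
            carried = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current = carried + [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The authentication process requires three steps. First comes validation. "
    "Then a token is generated. Finally the token is verified."
)
chunks = sentence_chunks(text, max_tokens=10)
```

Every chunk in `chunks` starts and ends on a sentence boundary, and each boundary sentence appears in full in both neighboring chunks.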
Embedding Model Constraints
Models have inherent size preferences:
Model Context Windows:
OpenAI text-embedding-ada-002:
→ Max input: 8,191 tokens
→ Common working range: 256-512 tokens (a widely used rule of thumb; check OpenAI's current guidance)
→ Retrieval quality can degrade with very long inputs
Sentence-BERT models:
→ Max input: 128-512 tokens (model-dependent); longer input is truncated
→ Common working range: 64-256 tokens
→ Many of these models were trained on sentence- or paragraph-length text
Cohere embed-english-v3.0:
→ Max input: 512 tokens
→ Common working range: around 256 tokens
→ Confirm against the provider's current documentation
Chunk size should respect model's sweet spot
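One cheap safeguard is to clamp the configured chunk size to the model's maximum input before indexing. This sketch hard-codes the two fixed limits listed above as an assumption; they should be verified against each provider's current documentation, and the function name is hypothetical.

```python
# Max input lengths as listed above (assumed current; verify before relying on them).
MAX_INPUT_TOKENS = {
    "text-embedding-ada-002": 8191,
    "embed-english-v3.0": 512,
}


def clamp_chunk_size(model: str, requested: int) -> int:
    """Return the requested chunk size, capped at the model's max input length."""
    if model not in MAX_INPUT_TOKENS:
        raise ValueError(f"no known input limit for model {model!r}")
    return min(requested, MAX_INPUT_TOKENS[model])
```

Clamping only prevents truncation. It does not find the model's sweet spot, which still has to be measured.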
Positional Encoding Decay:
Transformer embeddings of long inputs:
→ The whole chunk is compressed into one fixed-size vector
→ Depending on the model and pooling method, some positions (often later ones) can contribute less
→ The longer the chunk, the more each individual passage is diluted
Chunk size 2048 tokens (illustrative, not measured):
→ First 512: Well-represented
→ Middle 1024: Moderately represented
→ Last 512: Poorly represented
Query matching last 512 tokens:
→ Low similarity despite containing answer
→ Effectively wasted tokens
Better: Multiple smaller chunks
→ Each chunk's content well-represented
Hierarchical Chunking Strategies
Different granularities for different purposes:
Multi-Resolution Indexing:
Approach: Embed document at multiple chunk sizes
Document (5000 tokens) indexed as:
→ Small chunks (512): 10 chunks (high precision)
→ Medium chunks (1024): 5 chunks (balanced)
→ Large chunks (2048): 3 chunks (context-rich)
→ Full document: 1 chunk (maximum context)
Retrieval strategy:
1. Query against small chunks (precision)
2. If top matches ambiguous: Query medium chunks
3. If still unclear: Query large chunks
4. Return best-matching granularity
Cost: 19 embeddings (10+5+3+1) vs 10 (single size)
→ About 2x the vectors to store
→ About 4x the embedding compute, since each of the 4 levels re-embeds all 5000 tokens
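The chunk counts and the escalation steps above can be sketched as follows. Treating "ambiguous" as "best similarity score below a threshold" is an assumption made for illustration, as are the function names and the example scores.

```python
import math


def multi_resolution_plan(doc_tokens: int, sizes: list) -> dict:
    """Number of non-overlapping chunks per level; the full document is one chunk."""
    plan = {size: math.ceil(doc_tokens / size) for size in sizes}
    plan[doc_tokens] = 1  # full-document level
    return plan


def pick_level(best_score_by_size: dict, min_score: float) -> int:
    """Walk from the smallest to the largest chunk size and stop at the first
    level whose best match clears min_score (a hypothetical confidence threshold)."""
    for size in sorted(best_score_by_size):
        if best_score_by_size[size] >= min_score:
            return size
    return max(best_score_by_size)  # nothing cleared the bar: use the coarsest level


plan = multi_resolution_plan(5000, [512, 1024, 2048])
total_vectors = sum(plan.values())
level = pick_level({512: 0.4, 1024: 0.7, 2048: 0.9}, min_score=0.6)
```

With these inputs `total_vectors` is 19, and `level` is 1024 because the 512-token level's best match falls below the threshold.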
Parent-Child Chunking:
Structure:
→ Parent: Section-level (2048 tokens)
→ Children: Paragraph-level (512 tokens)
Storage:
→ Embed and index children only (for retrieval)
→ Store parent metadata with each child
Retrieval:
1. Query matches child chunk (paragraph)
2. Return parent chunk (full section) to LLM
3. LLM sees broader context than matched paragraph
Benefits:
→ Retrieval precision (small chunks)
→ LLM context richness (large chunks)
→ Moderate cost (only index children)
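A toy version of the parent-child lookup is sketched below. Word overlap stands in for embedding similarity so the example runs without a model; the `Child` class, the scoring function, and the sample texts are all illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: str  # the "parent metadata" stored with each child


def score(query: str, text: str) -> int:
    """Stand-in for embedding similarity: number of shared lowercase words."""
    tokenize = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(tokenize(query) & tokenize(text))


def retrieve_parent(query: str, children: list, parents: dict) -> str:
    """Match against the small child chunks, then return the larger parent chunk."""
    best = max(children, key=lambda child: score(query, child.text))
    return parents[best.parent_id]


# Hypothetical toy data.
parents = {
    "auth": "Authentication overview. Tokens expire after one hour. Refresh tokens last thirty days.",
    "limits": "Rate limits overview. Each key allows 100 requests per minute.",
}
children = [
    Child("Tokens expire after one hour.", "auth"),
    Child("Refresh tokens last thirty days.", "auth"),
    Child("Each key allows 100 requests per minute.", "limits"),
]
context = retrieve_parent("how many requests per minute", children, parents)
```

Only the children would be embedded and indexed in a real system. The parents sit in a plain key-value store, which is why the extra cost stays moderate.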
Evaluation and Measurement
Determining optimal size requires metrics:
Offline Evaluation:
Test dataset: 100 query-answer pairs
Experiment: Test chunk sizes 256, 512, 1024, 2048
For each size:
1. Re-chunk knowledge base
2. Re-embed all chunks
3. Run 100 test queries
4. Measure:
- Retrieval accuracy (answer in top-K?)
- Context utilization (% of context window used)
- Answer quality (BLEU/ROUGE score)
Compare results, pick best size
Challenge:
→ Expensive (4 full re-embeddings)
→ Test dataset may not represent real queries
→ Answer quality subjective
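The sweep itself is a small loop once chunking and retrieval are wrapped in functions. The sketch below measures only the "answer in top-K?" metric, and it substitutes word-based chunking and word-overlap ranking for a real tokenizer and embedding model. All names and the tiny corpus are assumptions.

```python
import re


def words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def chunk_words(text: str, size: int) -> list:
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def top_k(query: str, chunks: list, k: int) -> list:
    """Rank chunks by word overlap with the query (stand-in for vector search)."""
    q = words(query)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


def hit_rate(corpus: str, test_set: list, chunk_size: int, k: int) -> float:
    """Fraction of test queries whose answer string appears in a top-k chunk."""
    chunks = chunk_words(corpus, chunk_size)  # steps 1-2: re-chunk (and re-embed)
    hits = sum(
        any(answer.lower() in c.lower() for c in top_k(query, chunks, k))
        for query, answer in test_set          # step 3: run the test queries
    )
    return hits / len(test_set)                # step 4: measure


corpus = (
    "The API rate limit is 100 requests per minute. "
    "OAuth uses access tokens that expire after one hour."
)
test_set = [
    ("What is the API rate limit?", "100 requests"),
    ("When do access tokens expire?", "one hour"),
]
results = {size: hit_rate(corpus, test_set, size, k=1) for size in (4, 8, 50)}
```

In this toy corpus the smallest size separates each question's keywords from its answer, so its hit rate is lowest. Real corpora will not behave this cleanly, which is why the sweep has to be run on representative data.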
A/B Testing in Production:
Approach: Split traffic
50% of users: chunk_size=512
50% of users: chunk_size=1024
Measure:
→ User satisfaction (thumbs up/down)
→ Query response quality
→ Retrieval latency
After 2 weeks: Compare metrics, choose winner
Risks:
→ Degrades experience for 50% during test
→ Requires significant traffic
→ Confounding variables (query difficulty varies)
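For the traffic split, hashing the user ID gives a stable assignment without storing any state. This is a sketch under the assumption that users have string IDs; the experiment name and function names are hypothetical.

```python
import hashlib

VARIANTS = {"A": 512, "B": 1024}  # chunk_size per variant, as in the split above


def assign_variant(user_id: str, experiment: str = "chunk-size-test") -> str:
    """Deterministically map a user to variant A or B (roughly 50/50)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def chunk_size_for_user(user_id: str) -> int:
    """Chunk size of the index this user's queries should be routed to."""
    return VARIANTS[assign_variant(user_id)]
```

Including the experiment name in the hash means a later experiment reshuffles users instead of reusing the same halves. Both indexes must exist side by side for the duration of the test.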
How to Solve
Start with a baseline of 512-1024 tokens and 10-15% overlap. Add content-type detection so chunk size can vary by document type, evaluate candidate sizes against a set of test queries, and consider hierarchical chunking for complex documents. See Chunk Size Optimization.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/chunking/optimize-chunks.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


