Chunks Too Small

The Problem

Your AI agent gives incomplete or fragmented answers because document chunks are too small and lack sufficient context.

Symptoms

❌ AI says "I don't have enough information" even though the answer exists in the docs
❌ Answers are partial or cut off mid-sentence
❌ References span multiple chunks but AI only cites one
❌ Code examples split across chunks, missing parts
❌ Tables broken, showing only headers without data

Real-World Example

Your documentation has a comprehensive setup guide,
but when asked "How do I set up the database?",
the AI mentions only steps 1 and 2 of 5.

Chunk size: 200 tokens
Setup guide: 800 tokens total
Split into: 4 chunks
AI retrieves: Only chunks 1-2
Result: Incomplete answer

Deep Technical Analysis

The Fundamental Chunking Dilemma

Chunking forces a trade-off in RAG systems:

Too Small Chunks:
→ Better semantic precision (exact match to query)
→ But loses context (incomplete information)
→ Retrieves accurate but insufficient pieces

Too Large Chunks:
→ Complete context (full information)
→ But lower semantic precision (diluted signal)
→ Retrieves irrelevant information alongside relevant

The Challenge:
There's no universal optimal chunk size
It depends on: content type, query patterns, retrieval strategy

Why Token-Based Chunking Fails for Technical Content

Token-based splitting assumptions:

Assumption: Language has uniform information density
Reality: Technical docs have variable density

Example - Dense section (API reference):
  "authenticate(token: string): Promise<User>
   Parameters:
   - token: JWT authentication token
   Returns: User object or throws AuthError"
  
  Information: 5 distinct concepts in 50 tokens
  Density: 0.1 concepts/token

Example - Sparse section (introduction):
  "Welcome to our platform. This guide will help you
   get started with the basics and understand the
   fundamental concepts you'll need to know."
  
  Information: 1 concept in 50 tokens
  Density: 0.02 concepts/token

Problem:
Fixed token chunks treat both equally
Dense sections get split mid-concept
Sparse sections waste chunk capacity
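
To see how a fixed-size splitter cuts a dense section mid-concept, here is a minimal sketch. It assumes whitespace-separated words stand in for tokens (a real pipeline would use the embedding model's tokenizer), and chunk_fixed is a hypothetical helper, not part of any library.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` whitespace tokens (assumption: words ~ tokens)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


# The dense API reference from the example above, flattened to one line.
api_reference = (
    "authenticate(token: string): Promise<User> "
    "Parameters: - token: JWT authentication token "
    "Returns: User object or throws AuthError"
)

chunks = chunk_fixed(api_reference, size=8)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```

With a size of 8, the boundary falls inside the parameter description: the first chunk ends with "JWT authentication" and the second begins with "token Returns:", so neither chunk describes the parameter completely.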

The Retrieval Mathematics Problem

Why top-K retrieval fails with small chunks:

Vector similarity search returns top-K chunks:
- Query: "How do I set up authentication?"
- K = 5 (typical)
- Chunk size: 200 tokens

Math:
Setup guide total: 1,000 tokens
Split into: 5 chunks of 200 tokens each

Chunks created:
1. "Introduction to auth + step 1"
2. "Step 2 + step 3"
3. "Step 4 + step 5"  
4. "Step 6 + common errors"
5. "Troubleshooting + FAQ"

Vector similarity scores:
Chunk 1: 0.89 (high - mentions "authentication")
Chunk 4: 0.84 (high - mentions "errors" which query implies)
Chunk 2: 0.76 (medium)
Chunk 5: 0.75 (medium - "troubleshooting" related)
Chunk 3: 0.68 (lowest)

Top-5 retrieval gets: 1, 4, 2, 5, 3
But logical reading order is: 1, 2, 3, 4, 5

AI sees: Step 1 → Step 6 and common errors → Steps 2-3 → Troubleshooting → Steps 4-5
Coherence: Destroyed
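
One common mitigation is to select chunks by score but present them to the LLM in document order. The sketch below assumes each stored chunk carries its position in the source document as metadata; the Retrieved class and top_k_in_document_order function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    position: int   # index of the chunk within its source document
    score: float    # similarity score returned by the vector search
    text: str


def top_k_in_document_order(hits: list[Retrieved], k: int) -> list[Retrieved]:
    """Pick the k best hits by score, then restore reading order."""
    best = sorted(hits, key=lambda h: h.score, reverse=True)[:k]
    return sorted(best, key=lambda h: h.position)


# Scores taken from the example above.
hits = [
    Retrieved(1, 0.89, "Introduction to auth + step 1"),
    Retrieved(4, 0.84, "Step 6 + common errors"),
    Retrieved(2, 0.76, "Step 2 + step 3"),
    Retrieved(5, 0.75, "Troubleshooting + FAQ"),
    Retrieved(3, 0.68, "Step 4 + step 5"),
]

ordered = top_k_in_document_order(hits, k=5)
print([h.position for h in ordered])  # [1, 2, 3, 4, 5]
```

Reordering restores flow only for chunks that were retrieved. With k=3 the same data yields positions 1, 2, 4, so steps 4-5 are still missing.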

Semantic Boundary Detection Complexity

The code block problem:

# Chunk 1 (200 tokens)
def authenticate_user(credentials):
    """
    Authenticates user with provided credentials.
    
    Args:
        credentials: Dict with 'username' and 'password'
    
    Returns:
        User object if successful
    
    Raises:
        AuthenticationError: If credentials invalid
    """
    # Validate input format
    if not isinstance(credentials, dict):
        raise ValueError("Credentials must be dict")
    
    if 'username' not in credentials:
        raise ValueError("Missing username")

# ← CHUNK BOUNDARY HERE ← 

# Chunk 2 (200 tokens)  
    if 'password' not in credentials:
        raise ValueError("Missing password")
    
    # Hash password
    hashed_pw = hash_password(credentials['password'])
    
    # Query database
    user = db.query(User).filter(
        User.username == credentials['username'],
        User.password_hash == hashed_pw
    ).first()

Why this breaks:

Function signature in chunk 1, implementation in chunk 2
Chunk 1 semantic: "This is about input validation"
Chunk 2 semantic: "This is about database querying"
Query "How to check database for user?" → Retrieves chunk 2 only
Missing context: What the 'credentials' parameter contains
AI can't reconstruct complete logic flow

The cascade effect:

Missing context in code →
AI makes wrong assumptions about parameters →
Generates incorrect usage examples →
User copies broken code →
Support tickets increase →
Trust in AI decreases
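
For Python sources, one way to avoid cutting a function in half is to split at definition boundaries instead of token counts. This is a minimal sketch using the standard library ast module; split_python_by_definition is a hypothetical helper, and it deliberately ignores decorators, module-level statements, and definitions too large for a single chunk.

```python
import ast


def split_python_by_definition(source: str) -> list[str]:
    """Return one chunk per top-level function or class, never cutting inside one."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


# Illustrative input, not taken from any real codebase.
source = '''
def validate(credentials):
    if "username" not in credentials:
        raise ValueError("Missing username")
    if "password" not in credentials:
        raise ValueError("Missing password")


def lookup(username):
    return {"username": username}
'''

chunks = split_python_by_definition(source)
for chunk in chunks:
    print(chunk, end="\n\n")
```

Each chunk now carries its own signature, so a query that matches the body of a function also retrieves the parameter names it depends on.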

Table Splitting Pathology

Markdown table structure:

| Endpoint | Method | Auth Required | Rate Limit |
|----------|--------|---------------|------------|
| /api/users | GET | Yes | 100/min |
| /api/users/:id | GET | Yes | 100/min |
| /api/users | POST | Yes | 20/min |

← CHUNK BOUNDARY AFTER 150 TOKENS ←

| /api/auth/login | POST | No | 10/min |
| /api/auth/refresh | POST | Yes | 50/min |
| /api/data | GET | Yes | 1000/min |

Retrieval scenarios:

Query: "What's the rate limit for user endpoints?"

Chunk 1 retrieved (has header + first 3 rows):
AI sees: User endpoints have 100/min or 20/min limits
Answer: Correct here, but only because all three user rows happen to fall before the boundary

Chunk 2 retrieved (no header, last 3 rows):
AI sees: Row-like data without context
Can't determine: What these rows mean
Answer: "I don't have clear information"

Both chunks retrieved:
AI must: Recognize these are parts of same table
And: Mentally reconstruct table structure
But: No explicit linking between chunks
Result: May still miss connection
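
A simple way to keep every table chunk interpretable on its own is to repeat the header in each chunk. The sketch below assumes a Markdown pipe table whose first two lines are the header row and the separator row; split_table_with_header is a hypothetical helper.

```python
def split_table_with_header(table: str, rows_per_chunk: int) -> list[str]:
    """Split a Markdown table into chunks that each repeat the header rows."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, rows = lines[:2], lines[2:]  # header row + separator row
    return [
        "\n".join(header + rows[i:i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]


table = """
| Endpoint | Method | Auth Required | Rate Limit |
|----------|--------|---------------|------------|
| /api/users | GET | Yes | 100/min |
| /api/users/:id | GET | Yes | 100/min |
| /api/users | POST | Yes | 20/min |
| /api/auth/login | POST | No | 10/min |
| /api/auth/refresh | POST | Yes | 50/min |
| /api/data | GET | Yes | 1000/min |
"""

chunks = split_table_with_header(table, rows_per_chunk=3)
print(chunks[1])
```

The second chunk now starts with the column names, so the rows for /api/auth/login and the others can be read without the first chunk.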

Context Window vs Chunk Size Trade-off

The retrieval stage dilemma:

LLM Context Window: 8,192 tokens (typical)

Allocation:
- System prompt: 500 tokens
- User query: 50 tokens  
- Memory/history: 500 tokens
- Reserved for response: 1,500 tokens
- Available for retrieved content: 5,642 tokens

If chunk size = 200 tokens:
  Max chunks that fit: 5,642 / 200 = 28 chunks
  But top-K usually = 5-10 chunks
  Utilization: 1,000-2,000 / 5,642 = 18-35%
  → Wasting context window capacity

If chunk size = 1,000 tokens:
  Max chunks that fit: 5,642 / 1,000 = 5 chunks
  Top-K = 5 chunks
  Utilization: 5,000 / 5,642 = 89%
  → Efficient use of context

But larger chunks mean:
- Lower retrieval precision (more noise per chunk)
- Potentially less relevant content included
- More tokens sent to the LLM per retrieved chunk (higher per-query cost)
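
The budget arithmetic above can be reproduced with a few lines. This is a sketch only; retrieval_budget and utilization are hypothetical helper names, and the allocation figures are the ones from the example.

```python
def retrieval_budget(context_window: int, system_prompt: int, query: int,
                     history: int, response_reserve: int) -> int:
    """Tokens left for retrieved chunks after the fixed allocations."""
    return context_window - (system_prompt + query + history + response_reserve)


def utilization(budget: int, chunk_size: int, top_k: int) -> float:
    """Fraction of the retrieval budget filled by top_k chunks of chunk_size tokens."""
    return (chunk_size * top_k) / budget


budget = retrieval_budget(8192, system_prompt=500, query=50,
                          history=500, response_reserve=1500)
print(budget)                                  # 5642
print(budget // 200)                           # 28 chunks of 200 tokens fit
print(round(utilization(budget, 200, 5), 2))   # 0.18
print(round(utilization(budget, 1000, 5), 2))  # 0.89
```

Running the same functions with your own context window and prompt sizes shows quickly whether a given chunk size and top-K combination fits at all.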

The Overlap Problem

Overlap seems like a solution but creates its own issues:

Document: 1,000 tokens
Chunk size: 400 tokens
Overlap: 100 tokens

Chunks created:
Chunk 1: tokens 0-399
Chunk 2: tokens 300-699 (100 token overlap with chunk 1)
Chunk 3: tokens 600-999 (100 token overlap with chunk 2)

Storage cost:
Without overlap: 1,000 tokens stored
With overlap: 1,200 tokens stored (20% increase)

Embedding cost:
Without overlap: 3 embeddings (1,000 tokens embedded)
With overlap: 3 embeddings (1,200 tokens embedded, so 20% more where pricing is per token)

Retrieval confusion:
Query matches overlapped region (tokens 300-399)
→ Both chunk 1 and chunk 2 score highly
→ Top-5 includes both chunks with similar content
→ Wasted retrieval slots on duplicate information
→ Other relevant unique chunks displaced
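
The spans and storage figures above, plus one possible remedy for duplicate retrieval, can be sketched as follows. Both chunk_spans and merge_overlapping are hypothetical helpers; merging assumes each chunk stores its token offsets as metadata.

```python
def chunk_spans(total_tokens: int, size: int, overlap: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) token spans for overlapping chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap
    spans = []
    start = 0
    while start < total_tokens:
        end = min(start + size, total_tokens) - 1
        spans.append((start, end))
        if end == total_tokens - 1:
            break  # the last chunk reached the end of the document
        start += stride
    return spans


def merge_overlapping(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge retrieved spans that overlap or touch, so shared tokens are sent once."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


spans = chunk_spans(1000, size=400, overlap=100)
print(spans)                              # [(0, 399), (300, 699), (600, 999)]
print(sum(e - s + 1 for s, e in spans))   # 1200 tokens stored
print(merge_overlapping(spans[:2]))       # [(0, 699)]
```

Merging removes the duplicated tokens from the prompt, but the two chunks still occupied two of the top-K slots during retrieval, so it addresses only part of the problem.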

Hierarchical Document Structure Loss

How chunking destroys document hierarchy:

Original document structure:
# Authentication Guide
## Prerequisites
### System Requirements
### Dependencies
## Setup
### Installation
#### Linux
#### macOS
#### Windows
### Configuration
#### Basic Config
#### Advanced Config
## Usage
### First Login
### Session Management

After fixed-size chunking:
Chunk 1: "# Authentication Guide\n## Prerequisites\n### System Requirements\nYou need..."
Chunk 2: "...Ubuntu 20.04 or later\n### Dependencies\nInstall these packages..."
Chunk 3: "...npm install auth-lib\n## Setup\n### Installation\n#### Linux..."

Lost information:
- Chunk 2 doesn't know it's about "Prerequisites"
- Chunk 3 opens with "npm install auth-lib" but doesn't know that line belongs to "Dependencies" under "Prerequisites"
- Hierarchical context evaporated
- Headings in middle of chunks lack parent context

Query implications:

Query: "What are the authentication prerequisites?"

Semantic match:
- "prerequisites" appears in chunk 1
- But details span chunks 1, 2
- Chunk 2 has most of the details but never mentions "prerequisites", so it scores lower
- Retrieval: Chunk 1 scored higher, retrieved alone
- Result: Incomplete list of prerequisites
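
One way to keep hierarchy in a flat index is to split at headings and prefix each chunk with its full heading path. The sketch below assumes Markdown ATX headings ("#" to "######") and does not handle headings inside fenced code blocks; chunk_by_heading is a hypothetical helper, and the "Run the installer." line in the sample is invented for illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[str]:
    """Emit one chunk per section, prefixed with its full heading path."""
    path: list[str] = []      # current heading stack, e.g. ["Guide", "Setup"]
    body: list[str] = []
    chunks: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(f"[{' > '.join(path)}]\n{text}")
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                       # close the previous section under the old path
            level = len(match.group(1))
            del path[level - 1:]          # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Authentication Guide
## Prerequisites
### System Requirements
Ubuntu 20.04 or later
### Dependencies
npm install auth-lib
## Setup
### Installation
#### Linux
Run the installer.
"""

chunks = chunk_by_heading(doc)
for chunk in chunks:
    print(chunk, end="\n\n")
```

The chunk containing "npm install auth-lib" now begins with "[Authentication Guide > Prerequisites > Dependencies]", so a query about prerequisites can match it even though the body never uses that word.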

How to Solve

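Increase chunk size to 1,024-2,048 tokens for technical content, add 10-20% overlap, and configure semantic boundary splitting so code blocks, tables, and sections stay intact. Larger chunks use up the retrieval budget faster, so lower top-K to match: with the 5,642-token budget from the context window example, only 2 chunks of 2,048 tokens fit. See Chunking Configuration.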

Why This Problem Showcases RAG Architecture Depth

This isn't just "make chunks bigger" - it reveals:

Semantic search limitations: Vector similarity doesn't understand document flow or logical dependencies
Information density variability: Technical content has non-uniform information distribution
Context reconstruction complexity: LLMs must infer structure from fragments
Trade-off mathematics: Chunk size optimization is multi-objective (precision vs recall vs cost vs context)
Structure preservation: Maintaining hierarchical relationships in flat vector space is fundamentally hard

Understanding these architectural constraints is essential for building production RAG systems.
