# Code Blocks Split Wrong
Code snippets are split mid-function or mid-block, breaking syntax and making the code incomprehensible in retrieval results.
## The Problem
Code snippets are split mid-function or mid-block, breaking syntax and making the code incomprehensible in retrieval results.
### Symptoms
- ❌ Retrieved code missing opening/closing braces
- ❌ Function definitions split from their bodies
- ❌ Indentation broken across chunks
- ❌ AI generates invalid code based on partial snippets
- ❌ Import statements separated from usage
### Real-World Example
Original documentation:
```python
def authenticate_user(username, password):
    """
    Authenticates a user with username and password.
    Returns JWT token on success.
    """
    if not username or not password:
        raise ValueError("Credentials required")

    user = db.query(User).filter_by(username=username).first()
    if not user or not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid credentials")

    token = generate_jwt(user.id)
    return {"token": token, "expires": 3600}
```
The 512-token chunk boundary falls just after the first `if` line.

Chunk 1 ends with:
```python
def authenticate_user(username, password):
    """
    Authenticates a user with username and password.
    Returns JWT token on success.
    """
    if not username or not password:
```
Chunk 2 starts with:
```python
        raise ValueError("Credentials required")

    user = db.query(User).filter_by(username=username).first()
```
Result: both chunks contain broken, syntactically invalid code.
---
## Deep Technical Analysis
### AST (Abstract Syntax Tree) Boundaries
Code has natural structural boundaries that must be respected:
**Programming Language Structure:**
- Module level: imports, class definitions, function definitions, global variables
- Class level: methods, properties, inner classes
- Function level: function signature, docstring, function body, return statement
- Block level: if/else blocks, try/except blocks, for/while loops
**Naive Text Chunking:**
A standard RAG chunker:
- Counts tokens
- Splits at token 512
- Repeats

It ignores:
- Is this mid-function?
- Is this inside a string literal?
- Are braces balanced?
- Is indentation preserved?

Result: syntactically broken code.
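The naive loop above can be sketched in a few lines to show the failure concretely. This is an illustration only: `naive_chunks` and `is_valid_python` are hypothetical helpers, whitespace-separated words stand in for real tokenizer tokens, and the limit is 8 instead of 512 so that the short sample gets split.

```python
import ast
import re


def naive_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text every `max_tokens` words, ignoring code structure."""
    # Assumption: a "token" is a run of non-whitespace plus trailing whitespace.
    tokens = re.findall(r"\S+\s*", text)
    return ["".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses as a standalone Python module."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False


source = (
    "def authenticate_user(username, password):\n"
    "    if not username or not password:\n"
    "        raise ValueError('Credentials required')\n"
    "    return True\n"
)

chunks = naive_chunks(source, max_tokens=8)
validity = [is_valid_python(chunk) for chunk in chunks]
```

The full `source` parses, but the boundary lands inside the `if` condition, so neither chunk is valid Python on its own.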
**AST-Aware Chunking:**
A better approach:
- Parse the code into an AST
- Identify top-level nodes (functions, classes)
- Chunk at node boundaries

Example (Python): `ast.parse(code)` returns a `Module` whose `body` lists the top-level nodes, for instance `FunctionDef(name='func1', ...)`, `FunctionDef(name='func2', ...)`, and `ClassDef(name='MyClass', ...)`. Each node becomes its own chunk (Chunks 1, 2, and 3), so each chunk contains a complete, valid syntax unit.
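A minimal sketch of chunking at top-level node boundaries with Python's standard `ast` module follows. `chunk_top_level` is a hypothetical helper, `@cache` is a placeholder decorator, and the sketch assumes Python 3.8+ (for `end_lineno`) and one statement per line. Comments and blank lines between nodes are dropped.

```python
import ast


def chunk_top_level(source: str) -> list[str]:
    """Return one chunk per top-level statement, based on node line spans."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks = []
    for node in tree.body:
        # Decorators sit above `node.lineno`, so start at the first decorator.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        chunks.append("".join(lines[start - 1:node.end_lineno]))
    return chunks


source = '''import os

def func1():
    return 1

@cache
def func2(x):
    if x:
        return x
    return 0

class MyClass:
    def method(self):
        return os.getcwd()
'''

chunks = chunk_top_level(source)
```

Every chunk produced this way parses on its own, because it never ends inside a function or class body. A node that exceeds the token budget still needs a second pass, for example splitting a class into its methods.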
**The Multi-Language Problem:**
Each language needs its own parser:
- Python: `ast` module
- JavaScript: esprima, acorn
- Java: Eclipse JDT, JavaParser
- C++: Clang, libclang
- Go: `go/parser`
- Rust: `syn`

Maintenance burden:
- 20+ languages to support
- Each with unique syntax
- Version-specific parsing (Python 2 vs 3)
- Dialect support (TypeScript, JSX)
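One way to contain that maintenance burden is a registry that maps file extensions to language-specific chunkers and falls back to a generic splitter for everything else. The sketch below is an assumption about how such a dispatcher could look, not a reference design: `CHUNKERS`, `python_chunker`, `fallback_chunker`, and `chunk_file` are hypothetical names, and only Python has a real parser registered.

```python
import ast
import re
from pathlib import Path
from typing import Callable

Chunker = Callable[[str], list[str]]


def python_chunker(source: str) -> list[str]:
    """One chunk per top-level statement (decorators are not included)."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


def fallback_chunker(source: str) -> list[str]:
    """Language-agnostic fallback: split on blank lines."""
    parts = re.split(r"\n\s*\n", source)
    return [part.strip("\n") for part in parts if part.strip()]


# Register further languages here as parsers become available.
CHUNKERS: dict[str, Chunker] = {".py": python_chunker}


def chunk_file(filename: str, source: str) -> list[str]:
    chunker = CHUNKERS.get(Path(filename).suffix.lower(), fallback_chunker)
    try:
        return chunker(source)
    except SyntaxError:
        # Unparseable input (wrong version, dialect, partial file): degrade gracefully.
        return fallback_chunker(source)


py_chunks = chunk_file("a.py", "x = 1\n\ndef f():\n    return x\n")
js_chunks = chunk_file("a.js", "function f() {}\n\nconst x = 1;\n")
bad_chunks = chunk_file("bad.py", "def f(:\n")
```

The fallback keeps ingestion working for unsupported languages while parsers are added one at a time.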
### Indentation and Whitespace Preservation
In some languages, code semantics depend on formatting:
**Python Indentation:**
```python
# Original (valid):
def process():
    if condition:
        result = compute()
        return result

# Chunked incorrectly:
# Chunk 1:
def process():
    if condition:

# Chunk 2:
        result = compute()
        return result  # ← Wrong indentation level!
```
**The Indentation Context Loss:**
Chunk 2 starts mid-block:
- Missing context: inside the `process` function
- Missing context: inside the `if` statement
- Indentation appears wrong without the parent context

The LLM sees Chunk 2 and:
- Asks "Why is this indented 8 spaces?"
- May normalize to 4 spaces (breaking it)
- Or assumes it is a standalone block (wrong)
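When a split inside a block cannot be avoided, the lost context can be carried along by recording the enclosing block headers with the chunk. The following sketch uses a simple indentation heuristic. `enclosing_headers` is a hypothetical helper, and it assumes space-indented Python with single-line block headers.

```python
def enclosing_headers(lines: list[str], start: int) -> list[str]:
    """Collect the block headers (lines ending in ':') that enclose lines[start]."""

    def indent(line: str) -> int:
        return len(line) - len(line.lstrip(" "))

    current = indent(lines[start])
    headers = []
    # Walk backwards; each header with a smaller indent is one scope further out.
    for line in reversed(lines[:start]):
        if not line.strip():
            continue
        if indent(line) < current and line.rstrip().endswith(":"):
            headers.append(line)
            current = indent(line)
        if current == 0:
            break
    return headers[::-1]


source_lines = [
    "def process():",
    "    if condition:",
    "        result = compute()",
    "        return result",
]

chunk_2 = source_lines[2:]
context = enclosing_headers(source_lines, 2)
# Prepend the scope as comments so the 8-space indent is explained.
annotated = "\n".join(
    ["# enclosing scope:"] + ["# " + header for header in context] + chunk_2
)
```

Stored this way, Chunk 2 keeps its original indentation and also tells the reader that it lives inside `process()` and the `if` block.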
**Language-Specific Rules:**
- Python: indentation is syntax. Mixing tabs and spaces matters, and inconsistent indentation raises `IndentationError` (a subclass of `SyntaxError`).
- JavaScript: indentation is style. Braces define blocks, so indentation does not affect semantics.
- YAML: indentation is structure. A 2-space indent is the common convention, tabs are not allowed for indentation, and indentation defines nesting.

Each requires different handling.
### Context and Dependencies
Code chunks need surrounding context:

**Import Statements:**
```python
# Top of file:
import os
import requests
from typing import List, Dict
from .models import User, Session

# ... 500 lines later ...

# Function that uses imports:
def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
```
**Chunking Problem:**
Chunk 1 (imports):
```python
import os
import requests
from typing import List, Dict
from .models import User, Session
```
Chunk 10 (function, 500 lines later):
```python
def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
```
Query: "How to fetch users from API?"
- Retrieves Chunk 10 (the function)
- Misses the imports (Chunk 1)
- The LLM doesn't know:
  - What is `requests`?
  - Where does `User` come from?
  - What is `os.getenv`?

Answer: incomplete or wrong imports suggested.
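One possible mitigation is to attach the imports a function actually uses to its chunk. The sketch below does this with the standard `ast` module. `imports_used_by` is a hypothetical helper, and the extra `import json` line is added here only to show that unused imports are filtered out.

```python
import ast


def imports_used_by(source: str, function_name: str) -> list[str]:
    """Return the module-level import statements a function refers to."""
    tree = ast.parse(source)
    # Map each name bound by an import to the text of its import statement.
    bound: dict[str, str] = {}
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            statement = ast.get_source_segment(source, node)
            for alias in node.names:
                bound[alias.asname or alias.name.split(".")[0]] = statement
    func = next(
        n for n in tree.body
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
        and n.name == function_name
    )
    used = {n.id for n in ast.walk(func) if isinstance(n, ast.Name)}
    needed: list[str] = []
    for name, statement in bound.items():  # keeps original import order
        if name in used and statement not in needed:
            needed.append(statement)
    return needed


source = '''import os
import requests
import json
from typing import List, Dict
from .models import User, Session

def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
'''

needed_imports = imports_used_by(source, "fetch_users")
```

Prepending `needed_imports` to the function's chunk (or storing them as chunk metadata) lets the retrieved snippet answer "what is `requests`?" without fetching Chunk 1.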
**The Dependency Chain:**
File: auth.py
```python
class AuthService:
    def __init__(self, secret_key: str):
        self.secret = secret_key

    def generate_token(self, user_id: int):
        return jwt.encode({"user": user_id}, self.secret)

# Later in file:
def login(username, password):
    service = AuthService(os.getenv("SECRET"))
    # ... auth logic ...
    token = service.generate_token(user.id)
```
Chunking:
- Chunk 1: `AuthService` class
- Chunk 2: `login` function

Query: "How does login work?"
- Retrieves Chunk 2 (`login` function)
- It references `AuthService`, but the definition is not in the chunk
- The LLM must infer or ask for more context
- It may hallucinate the `AuthService` implementation
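A retriever can only pull in `AuthService` alongside `login` if the link between them was recorded at indexing time. The sketch below shows one way to record it. `referenced_definitions` is a hypothetical helper that only looks at top-level definitions within a single file.

```python
import ast


def referenced_definitions(source: str) -> dict[str, list[str]]:
    """For each top-level def/class, list the other top-level names it uses."""
    tree = ast.parse(source)
    definitions = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }
    references = {}
    for name, node in definitions.items():
        used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        # Keep only names defined in this file, excluding the chunk itself.
        references[name] = sorted(used & (definitions.keys() - {name}))
    return references


source = '''class AuthService:
    def __init__(self, secret_key: str):
        self.secret = secret_key

    def generate_token(self, user_id: int):
        return jwt.encode({"user": user_id}, self.secret)

def login(username, password):
    service = AuthService(os.getenv("SECRET"))
    # ... auth logic ...
    token = service.generate_token(user.id)
'''

references = referenced_definitions(source)
```

Storing `references["login"]` as chunk metadata lets the retriever fetch the `AuthService` chunk whenever the `login` chunk is returned.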
### Documentation and Code Separation
Code examples in docs need special handling:

**Markdown Code Blocks:**
~~~markdown
To authenticate, use the following code:

```python
import requests

response = requests.post(
    "https://api.example.com/auth",
    json={"username": "user", "password": "pass"}
)
token = response.json()["token"]
```

Store the token securely and include it in subsequent requests.
~~~
**The Boundary Problem:**
Suppose the chunk boundary falls right after `response = requests.post(`.

Chunk 1:
~~~markdown
To authenticate, use the following code:

```python
import requests

response = requests.post(
~~~
Chunk 2:
~~~markdown
    "https://api.example.com/auth",
    json={"username": "user", "password": "pass"}
)
token = response.json()["token"]
```

Store the token securely...
~~~
Both chunks contain broken code blocks:
- Chunk 1: unclosed triple backticks
- Chunk 2: starts mid-code-block
**Fenced Code Block Detection:**
The chunker must recognize:
- `` ``` `` or `~~~` (code fence markers)
- Language identifiers (`` ```python ``, `` ```javascript ``)
- Code fence boundaries
- Nested code blocks (rare but possible)

Logic:
- Detect the opening `` ``` ``
- Don't chunk until the closing `` ``` ``
- Keep the entire code block together

But:
- A code block might be 2000 tokens
- That exceeds the chunk size limit
- So the chunker must allow splitting within a code block
- But intelligently (at function boundaries)
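The detection logic above can be sketched as a small state machine that splits Markdown on blank lines but never inside a fence. `split_markdown_blocks` is a hypothetical helper. It follows the usual rule that a closing fence uses the same character as the opening one, is at least as long, and carries no language identifier. The fence string is built with `"`" * 3` only so the sample can sit inside this page's own code block.

```python
import re

# Three or more backticks or tildes at the start of a (stripped) line.
FENCE = re.compile(r"^(`{3,}|~{3,})")


def split_markdown_blocks(markdown: str) -> list[str]:
    """Split on blank lines, keeping every fenced code block in one piece."""
    blocks: list[str] = []
    current: list[str] = []
    fence = None  # marker of the currently open fence, if any
    for line in markdown.splitlines():
        stripped = line.strip()
        match = FENCE.match(stripped)
        if fence is None:
            if match:
                fence = match.group(1)  # opening fence
            elif not stripped:
                # A blank line outside a fence ends the current block.
                if current:
                    blocks.append("\n".join(current))
                    current = []
                continue
        elif match and stripped.startswith(fence) and set(stripped) == {fence[0]}:
            fence = None  # closing fence
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


TICKS = "`" * 3
doc = (
    "To authenticate, use the following code:\n"
    "\n"
    f"{TICKS}python\n"
    "import requests\n"
    "\n"
    "response = requests.post(url)\n"
    f"{TICKS}\n"
    "\n"
    "Store the token securely.\n"
)

blocks = split_markdown_blocks(doc)
```

The blank line inside the code block does not cause a split, so the fence stays balanced. Blocks that exceed the token budget would still need a second, language-aware pass.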
### Multi-File Context
Code often references other files:
**Cross-File Dependencies:**
```python
# File: routes.py
from .auth import require_auth
from .models import User

@require_auth
def get_user_profile(user_id):
    return User.query.get(user_id)

# File: auth.py
def require_auth(func):
    # decorator implementation
    ...

# File: models.py
class User:
    # model definition
    ...
```
**The Single-File Chunk Limitation:**
Query: "How does the user profile route work?"

Retrieved chunk (from routes.py):
```python
from .auth import require_auth
from .models import User

@require_auth
def get_user_profile(user_id):
    return User.query.get(user_id)
```
Missing:
- What does `@require_auth` do? (in auth.py)
- What fields does `User` have? (in models.py)

The LLM must infer or hallucinate these details.
**Graph-Based Chunking (Advanced):**
Ideal approach:
- Build a dependency graph: routes.py → auth.py, routes.py → models.py
- When chunking routes.py, include summaries of auth.py and models.py, or retrieve related files automatically
- Embed with context: "get_user_profile uses @require_auth decorator from auth.py and User model from models.py"

Complexity:
- Must parse imports
- Resolve relative paths
- Handle circular dependencies
- Maintain the graph for the entire codebase
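The first step, building the dependency graph, can be sketched for a flat package as follows. `import_graph` is a hypothetical helper that only resolves single-dot relative imports such as `from .auth import ...` to sibling files. The bodies of `auth.py` and `models.py` are placeholders, and nested packages, absolute imports, and multi-hop traversal are out of scope.

```python
import ast


def import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each file to the sibling files it imports from."""
    graph: dict[str, set[str]] = {}
    for filename, source in files.items():
        dependencies = set()
        for node in ast.walk(ast.parse(source)):
            # level == 1 is a single-dot relative import (same package).
            if isinstance(node, ast.ImportFrom) and node.level == 1 and node.module:
                candidate = node.module.split(".")[0] + ".py"
                if candidate in files:
                    dependencies.add(candidate)
        graph[filename] = dependencies
    return graph


files = {
    "routes.py": (
        "from .auth import require_auth\n"
        "from .models import User\n"
        "\n"
        "@require_auth\n"
        "def get_user_profile(user_id):\n"
        "    return User.query.get(user_id)\n"
    ),
    "auth.py": "def require_auth(func):\n    return func\n",  # placeholder body
    "models.py": "class User:\n    pass\n",  # placeholder body
}

graph = import_graph(files)
related = sorted(graph["routes.py"])
```

Because this looks only one hop out, circular imports cannot cause infinite loops. A multi-hop traversal would need a visited set.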
### Inline Comments and Docstrings
Comments provide crucial context:
**Docstring Separation:**
```python
def complex_algorithm(data: List[int]) -> int:
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
    Space complexity: O(m)

    Args:
        data: Input array of integers

    Returns:
        Index of pattern match or -1 if not found
    """
    # Implementation details...
    ...
```
**Chunking Issue:**
A chunk boundary splits the signature from the docstring and implementation:

Chunk 1:
```python
def complex_algorithm(data: List[int]) -> int:
```
Chunk 2:
```python
    """
    Implements the Knuth-Morris-Pratt algorithm...
    """
    # Implementation...
```
Or worse:

Chunk 1:
```python
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
```
Chunk 2:
```python
    Space complexity: O(m)

    Args:
        data: Input array of integers
    """
```
The docstring is split across chunks, so it loses coherence.
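One way to keep the signature and docstring together is to extract them as a single header with `ast`, then prepend that header to every chunk cut from the function body. `function_header` is a hypothetical helper, and the sketch assumes Python 3.8+ for `end_lineno`.

```python
import ast


def function_header(source: str, name: str) -> str:
    """Return the signature plus docstring of the named function as one unit."""
    tree = ast.parse(source)
    func = next(
        node for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name == name
    )
    lines = source.splitlines()
    first_statement = func.body[0]
    if ast.get_docstring(func) is not None:
        # The docstring is the first body statement; include all of its lines.
        end = first_statement.end_lineno
    else:
        # No docstring: stop just before the first body statement.
        end = first_statement.lineno - 1
    end = max(end, func.lineno)  # covers one-line functions
    return "\n".join(lines[func.lineno - 1:end])


source = '''def complex_algorithm(data: List[int]) -> int:
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
    """
    # Implementation details...
    ...
'''

header = function_header(source, "complex_algorithm")
```

The header ends at the closing quotes of the docstring, so the complexity notes can never be separated from the signature they describe.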
## How to Solve
- Implement AST-based chunking for code blocks
- Detect the language with a syntax highlighter
- Keep function and class definitions intact
- Preserve indentation context
- Include parent scope metadata

See Code Chunking.
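As a sketch of how parent scope metadata might be attached, the example below splits a class into one chunk per method and records the enclosing class on each chunk. `CodeChunk` and `chunk_class_methods` are hypothetical names, and the field layout is an assumption rather than a prescribed schema.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    text: str        # source of the method, original indentation preserved
    parent: str      # enclosing scope, e.g. "class AuthService"
    start_line: int  # 1-based line number in the original file


def chunk_class_methods(source: str) -> list[CodeChunk]:
    """Emit one chunk per method, each tagged with its parent class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if not isinstance(node, ast.ClassDef):
            continue
        for item in node.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                text = "\n".join(lines[item.lineno - 1:item.end_lineno])
                chunks.append(CodeChunk(text, f"class {node.name}", item.lineno))
    return chunks


source = '''class AuthService:
    def __init__(self, secret_key: str):
        self.secret = secret_key

    def generate_token(self, user_id: int):
        return jwt.encode({"user": user_id}, self.secret)
'''

chunks = chunk_class_methods(source)
```

Each method stays intact and keeps its indentation, while `parent` records which class the 4-space indent belongs to.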
## Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the `ask` query parameter:
`GET /dev/rag-scenarios-and-solutions/chunking/code-splitting.md?ask=<question>`
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


