# Code Blocks Split Wrong

## The Problem

Code snippets are split mid-function or mid-block, which breaks their syntax and makes the code incomprehensible in retrieval results.

## Symptoms

❌ Retrieved code missing opening/closing braces
❌ Function definitions split from their bodies
❌ Indentation broken across chunks
❌ AI generates invalid code based on partial snippets
❌ Import statements separated from usage

## Real-World Example

Original documentation:
```python
def authenticate_user(username, password):
    """
    Authenticates a user with username and password.
    Returns JWT token on success.
    """
    if not username or not password:
        raise ValueError("Credentials required")

    user = db.query(User).filter_by(username=username).first()
    if not user or not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid credentials")

    token = generate_jwt(user.id)
    return {"token": token, "expires": 3600}
```

The chunk boundary at 512 tokens falls inside the function.

Chunk 1 ends with:
```python
def authenticate_user(username, password):
    """
    Authenticates a user with username and password.
    Returns JWT token on success.
    """
    if not username or not password:
```

Chunk 2 starts with:
```python
        raise ValueError("Credentials required")

    user = db.query(User).filter_by(username=username).first()
```

Result: Neither chunk is syntactically valid on its own.


---

## Deep Technical Analysis

### AST (Abstract Syntax Tree) Boundaries

Code has natural structural boundaries that must be respected:

**Programming Language Structure:**

Module level:
→ Imports
→ Class definitions
→ Function definitions
→ Global variables

Class level:
→ Methods
→ Properties
→ Inner classes

Function level:
→ Function signature
→ Docstring
→ Function body
→ Return statement

Block level:
→ if/else blocks
→ try/except blocks
→ for/while loops


**Naive Text Chunking:**

Standard RAG chunker:

1. Count tokens
2. Split at token 512
3. Repeat

Ignores:
→ Is this mid-function?
→ Is this inside a string literal?
→ Are braces balanced?
→ Is indentation preserved?

Result: Syntactically broken code
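A small experiment makes the failure concrete. The sketch below is a simplified stand-in for a real chunker, not an actual RAG pipeline: it splits on a fixed character count rather than tokens, and the helper names `naive_chunks` and `is_valid_python` are hypothetical. Each piece is checked with `ast.parse` to see whether it still parses on its own.

```python
import ast
from typing import List


def naive_chunks(text: str, size: int) -> List[str]:
    """Split text every `size` characters, ignoring code structure entirely."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def is_valid_python(chunk: str) -> bool:
    """True if the chunk parses as Python on its own."""
    try:
        ast.parse(chunk)
        return True
    except SyntaxError:  # also covers IndentationError and TabError
        return False


SOURCE = (
    "def process():\n"
    "    if condition:\n"
    "        result = compute()\n"
    "        return result\n"
)

# Assumption: 30 characters stands in for the token limit.
pieces = naive_chunks(SOURCE, 30)
validity = [is_valid_python(p) for p in pieces]
```

The full source parses, but the first piece ends in the middle of the word `condition`, so it fails to parse even though no content was lost.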


**AST-Aware Chunking:**

Better approach:

1. Parse code into AST
2. Identify top-level nodes (functions, classes)
3. Chunk at node boundaries

Example (Python):

    ast.parse(code) → Module(
        body=[
            FunctionDef(name='func1', ...),  ← Chunk 1
            FunctionDef(name='func2', ...),  ← Chunk 2
            ClassDef(name='MyClass', ...),   ← Chunk 3
        ]
    )

Each chunk contains a complete, syntactically valid unit.
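The approach above can be sketched with Python's standard `ast` module. This is a minimal illustration rather than a production chunker: the function name `chunk_top_level` is hypothetical, it relies on `end_lineno` (Python 3.8+), and it ignores token limits as well as comments that sit between top-level nodes.

```python
import ast
from typing import List


def chunk_top_level(source: str) -> List[str]:
    """Return one chunk per top-level statement, split at AST node boundaries."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above the def/class line, so start from the first one.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        # lineno and end_lineno are 1-based and inclusive.
        chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks


SAMPLE = '''\
import os

def func1():
    return os.getcwd()

class MyClass:
    def method(self):
        return 1
'''

chunks = chunk_top_level(SAMPLE)
for chunk in chunks:
    ast.parse(chunk)  # each chunk parses on its own
```

A real chunker would still need to merge small nodes and split oversized ones, but the split points would remain node boundaries.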


**The Multi-Language Problem:**

Need different parsers for each language:
→ Python: ast module
→ JavaScript: esprima, acorn
→ Java: Eclipse JDT, JavaParser
→ C++: Clang, libclang
→ Go: go/parser
→ Rust: syn

Maintenance burden:
→ 20+ languages to support
→ Each with unique syntax
→ Version-specific parsing (Python 2 vs 3)
→ Dialect support (TypeScript, JSX)


### Indentation and Whitespace Preservation

Code semantics depend on formatting:

**Python Indentation:**
```python
# Original (valid):
def process():
    if condition:
        result = compute()
        return result
```

Chunked incorrectly:
```python
# Chunk 1:
def process():
    if condition:

# Chunk 2:
        result = compute()
        return result  # indentation looks wrong without the enclosing blocks
```

**The Indentation Context Loss:**

Chunk 2 starts mid-block:
→ Missing context: Inside "process" function
→ Missing context: Inside "if" statement
→ Indentation appears wrong without parent context

LLM sees chunk 2:
→ "Why is this indented 8 spaces?"
→ May normalize to 4 spaces (breaking it)
→ Or assume it's a standalone block (wrong)

**Language-Specific Rules:**

Python: Indentation is syntax
→ Mixing tabs and spaces inconsistently raises TabError
→ Inconsistent indent = IndentationError (a subclass of SyntaxError)

JavaScript: Indentation is style
→ Braces define blocks
→ Indentation doesn't affect semantics

YAML: Indentation is structure
→ 2-space indent is a common convention (tabs are not allowed)
→ Indentation defines nesting

Each requires different handling.
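One way to restore the lost parent context for Python is to walk backwards from the first line of a chunk and collect every line that is indented less than anything seen so far. The sketch below is a heuristic under stated assumptions: the names `indent_of` and `enclosing_headers` are hypothetical, indentation is assumed to use spaces only, and multi-line strings or continuation lines are not handled.

```python
from typing import List


def indent_of(line: str) -> int:
    """Number of leading spaces (assumes spaces-only indentation)."""
    return len(line) - len(line.lstrip(" "))


def enclosing_headers(lines: List[str], start: int) -> List[str]:
    """Return the block headers that enclose lines[start], outermost first."""
    headers = []
    current = indent_of(lines[start])
    for line in reversed(lines[:start]):
        if current == 0:
            break  # already at module level, nothing encloses this line
        if not line.strip():
            continue  # blank lines carry no indentation information
        if indent_of(line) < current:
            headers.append(line)
            current = indent_of(line)
    return list(reversed(headers))


SOURCE = """\
def process():
    if condition:
        result = compute()
        return result
""".splitlines()

# Chunk 2 from the example above starts at line index 2.
context = enclosing_headers(SOURCE, 2)
```

Storing `context` as metadata, or prepending it to the chunk, tells the reader that the 8-space indentation belongs to an `if` inside `process`.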

### Context and Dependencies

Code chunks need surrounding context:

**Import Statements:**

```python
# Top of file:
import os
import requests
from typing import List, Dict
from .models import User, Session

# ... 500 lines later ...

# Function that uses imports:
def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
```

**Chunking Problem:**

Chunk 1 (imports):
```python
import os
import requests
from typing import List, Dict
from .models import User, Session
```

Chunk 10 (function, 500 lines later):
```python
def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
```

Query: "How to fetch users from API?"
→ Retrieves Chunk 10 (function)
→ Missing imports (Chunk 1)
→ LLM doesn't know:
  - What's "requests"?
  - Where does "User" come from?
  - What's "os.getenv"?

Result: The answer suggests incomplete or wrong imports.
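A possible mitigation is to attach the imports a function actually uses to its chunk. The sketch below uses the standard `ast` module to do that. It is illustrative only: `imports_used_by` is a hypothetical helper, it looks at top-level imports only, and it matches on bare names, so it would miss names reached through `import *` or dynamic imports. The `import json` line is added here purely to show that unused imports are filtered out.

```python
import ast
from typing import List


def imports_used_by(source: str, func_name: str) -> List[str]:
    """Return the top-level import statements that `func_name` relies on."""
    tree = ast.parse(source)
    func = next(
        n for n in tree.body
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
        and n.name == func_name
    )
    # Every bare name the function mentions (for os.getenv this yields "os").
    used = {n.id for n in ast.walk(func) if isinstance(n, ast.Name)}

    needed = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Name bound by the import: the alias if given, else the first dotted part.
            bound = {(a.asname or a.name).split(".")[0] for a in node.names}
            if bound & used:
                needed.append(ast.get_source_segment(source, node))
    return needed


FILE = '''\
import json
import os
import requests
from typing import List, Dict
from .models import User, Session


def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
'''

header = imports_used_by(FILE, "fetch_users")
chunk_with_context = "\n".join(header) + "\n\n" + "def fetch_users() ..."
```

Whole import statements are kept, so `Dict` and `Session` travel along even though the function does not use them; trimming individual names would be a further refinement.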

**The Dependency Chain:**

File: auth.py

```python
class AuthService:
    def __init__(self, secret_key: str):
        self.secret = secret_key

    def generate_token(self, user_id: int):
        return jwt.encode({"user": user_id}, self.secret)

# Later in file:

def login(username, password):
    service = AuthService(os.getenv("SECRET"))
    # ... auth logic ...
    token = service.generate_token(user.id)
```

Chunking:
→ Chunk 1: AuthService class
→ Chunk 2: login function

Query: "How does login work?"
→ Retrieves Chunk 2 (login function)
→ References "AuthService" but definition not in chunk
→ LLM must infer or ask for more context
→ May hallucinate AuthService implementation

### Documentation and Code Separation

Code examples in docs need special handling:

**Markdown Code Blocks:**

To authenticate, use the following code:

```python
import requests

response = requests.post(
    "https://api.example.com/auth",
    json={"username": "user", "password": "pass"}
)
token = response.json()["token"]
```

Store the token securely and include it in subsequent requests.

**The Boundary Problem:**

The chunk boundary falls inside the code block.

Chunk 1:
````markdown
To authenticate, use the following code:

```python
import requests

response = requests.post(
````

Chunk 2:
````markdown
    "https://api.example.com/auth",
    json={"username": "user", "password": "pass"}
)
token = response.json()["token"]
```

Store the token securely...
````

Both chunks: Broken code blocks
→ Chunk 1: Unclosed triple-backticks
→ Chunk 2: Starts mid-code-block


**Fenced Code Block Detection:**

Chunker must recognize:
→ ``` or ~~~ (code fence markers)
→ Language identifier (```python, ```javascript)
→ Code fence boundaries
→ Nested code blocks (rare but possible)

Logic:

1. Detect opening ```
2. Don't chunk until closing ```
3. Keep entire code block together

But:
→ Code block might be 2000 tokens
→ Exceeds chunk size limit
→ Must allow splitting WITHIN code block
→ But intelligently (at function boundaries)
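The fence-detection logic above could look roughly like the following. It is a simplified sketch, not a full Markdown parser: `split_keeping_fences` is a hypothetical name, leading indentation is ignored, and nested fences are only handled through the usual rule that a closing fence uses the same character, is at least as long as the opening one, and carries no language identifier. The sample document builds its fence marker indirectly so the example can itself live inside a fenced block.

```python
import re
from typing import List, Optional, Tuple

FENCE = re.compile(r"^(`{3,}|~{3,})")


def split_keeping_fences(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into ("text" | "code", block) segments, never cutting a fence."""
    segments: List[Tuple[str, str]] = []
    buffer: List[str] = []
    open_fence: Optional[str] = None  # marker that opened the current code block

    def flush(kind: str) -> None:
        if buffer:
            segments.append((kind, "\n".join(buffer)))
            buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        match = FENCE.match(stripped)
        if open_fence is None:
            if match:
                flush("text")  # close the prose segment before the code starts
                open_fence = match.group(1)
            buffer.append(line)
        else:
            buffer.append(line)
            closes = (
                match is not None
                and stripped == match.group(1)  # no info string after the marker
                and stripped[0] == open_fence[0]
                and len(stripped) >= len(open_fence)
            )
            if closes:
                flush("code")
                open_fence = None
    flush("code" if open_fence else "text")
    return segments


TICKS = "`" * 3
DOC = "\n".join([
    "To authenticate, use the following code:",
    "",
    TICKS + "python",
    "import requests",
    "",
    "token = 'demo'",
    TICKS,
    "",
    "Store the token securely.",
])

segments = split_keeping_fences(DOC)
kinds = [kind for kind, _ in segments]
```

Code segments that exceed the chunk size would still need a second pass that splits them at function or class boundaries instead of at an arbitrary token count.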


### Multi-File Context

Code often references other files:

**Cross-File Dependencies:**

```python
# File: routes.py
from .auth import require_auth
from .models import User

@require_auth
def get_user_profile(user_id):
    return User.query.get(user_id)

# File: auth.py
def require_auth(func):
    # decorator implementation
    ...

# File: models.py
class User:
    # model definition
    ...
```


**The Single-File Chunk Limitation:**

Query: "How does user profile route work?"

Retrieved chunk (from routes.py):
```python
from .auth import require_auth
from .models import User

@require_auth
def get_user_profile(user_id):
    return User.query.get(user_id)
```

Missing:
→ What does @require_auth do? (in auth.py)
→ What fields does User have? (in models.py)

LLM must infer or hallucinate these details


**Graph-Based Chunking (Advanced):**

Ideal approach:

1. Build dependency graph:
   → routes.py → auth.py
   → routes.py → models.py
2. When chunking routes.py:
   → Include summaries of auth.py and models.py
   → Or: Retrieve related files automatically
3. Embed with context: "get_user_profile uses @require_auth decorator from auth.py and User model from models.py"

Complexity:
→ Must parse imports
→ Resolve relative paths
→ Handle circular dependencies
→ Maintain graph for entire codebase
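The first step, building the dependency graph, can be approximated for a single flat package with the standard `ast` module. The sketch below is deliberately narrow: `build_import_graph` is a hypothetical helper, it only resolves same-package imports of the form `from .module import name`, and the bodies of `require_auth` and `User` are placeholders because the original example elides them.

```python
import ast
from typing import Dict, Set


def build_import_graph(files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each file to the sibling files it imports via `from .module import ...`."""
    graph: Dict[str, Set[str]] = {}
    for filename, source in files.items():
        deps: Set[str] = set()
        for node in ast.walk(ast.parse(source)):
            # level == 1 means a single leading dot, i.e. the same package.
            if isinstance(node, ast.ImportFrom) and node.level == 1 and node.module:
                candidate = node.module.split(".")[0] + ".py"
                if candidate in files:
                    deps.add(candidate)
        graph[filename] = deps
    return graph


FILES = {
    "routes.py": (
        "from .auth import require_auth\n"
        "from .models import User\n"
        "\n"
        "@require_auth\n"
        "def get_user_profile(user_id):\n"
        "    return User.query.get(user_id)\n"
    ),
    # Placeholder bodies: the real implementations are not shown in the example.
    "auth.py": "def require_auth(func):\n    return func\n",
    "models.py": "class User:\n    pass\n",
}

graph = build_import_graph(FILES)
```

With the graph in hand, a chunker could look up `graph["routes.py"]` and attach summaries of those files to every chunk cut from `routes.py`. Circular imports do not break this sketch because it never follows edges recursively.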


### Inline Comments and Docstrings

Comments provide crucial context:

**Docstring Separation:**
```python
def complex_algorithm(data: List[int]) -> int:
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
    Space complexity: O(m)

    Args:
        data: Input array of integers

    Returns:
        Index of pattern match or -1 if not found
    """
    # Implementation details...
    ...
```

**Chunking Issue:**

Chunk boundary splits docstring from implementation:

Chunk 1:
```python
def complex_algorithm(data: List[int]) -> int:
```

Chunk 2:
```python
    """
    Implements the Knuth-Morris-Pratt algorithm...
    """
    # Implementation...
```

Or worse:

Chunk 1:
```python
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
```

Chunk 2:
```python
    Space complexity: O(m)

    Args:
        data: Input array of integers
    """
```

Docstring split across chunks → loses coherence
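To keep a signature and its docstring together, a chunker can ask the parser where the docstring ends instead of counting tokens. The following sketch assumes Python 3.8+ and uses a hypothetical helper name, `signature_and_docstring`. The sample function is a shortened stand-in for the example above.

```python
import ast


def signature_and_docstring(source: str, func_name: str) -> str:
    """Return the def line(s) plus the full docstring of `func_name` as one unit."""
    tree = ast.parse(source)
    func = next(
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
        and n.name == func_name
    )
    lines = source.splitlines()
    if ast.get_docstring(func) is not None:
        # The docstring is the first body statement; keep through its last line.
        end = func.body[0].end_lineno
    else:
        # No docstring: keep everything up to the line before the body starts.
        end = max(func.lineno, func.body[0].lineno - 1)
    return "\n".join(lines[func.lineno - 1 : end])


SAMPLE = '''\
def complex_algorithm(data):
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
    Space complexity: O(m)
    """
    result = -1
    return result
'''

header = signature_and_docstring(SAMPLE, "complex_algorithm")
```

If a long function body has to be split, `header` can be prepended to each sub-chunk so that no chunk contains half a docstring or code without its description.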

## How to Solve

Implement AST-based chunking for code blocks, detect the language with a syntax highlighter, keep function and class definitions intact, preserve indentation context, and include parent scope metadata with each chunk. See Code Chunking.

