Rag Scenarios And Solutions
Tables Breaking Across Chunks
Tables are split mid-row or mid-column, making the data incomprehensible and breaking the semantic relationships between table headers and values.
TL;DR
Tables are split mid-row or mid-column, making the data incomprehensible and breaking the semantic relationships between table headers and values.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
Tables are split mid-row or mid-column, making the data incomprehensible and breaking the semantic relationships between table headers and values.
Symptoms
- ❌ Retrieved chunks show table rows without headers
- ❌ Table columns split across chunks
- ❌ AI can't answer questions about tabular data
- ❌ Pricing tables incomplete in responses
- ❌ Comparison tables broken and meaningless
Real-World Example
Original table in documentation:
| Plan | Price | Users | Storage | API Calls |
|-----------|-------|-------|---------|-----------|
| Free | $0 | 1 | 1GB | 100/day |
| Pro | $49 | 5 | 50GB | 10K/day |
| Enterprise| $299 | 50 | 500GB | 100K/day |
Chunk boundary falls here ↓
Chunk 1 contains:
| Plan | Price | Users | Storage | API Calls |
|-----------|-------|-------|---------|-----------|
| Free | $0 | 1 | 1GB |
Chunk 2 contains:
| 100/day |
| Pro | $49 | 5 | 50GB | 10K/day |
| Enterprise| $299 | 50 | 500GB | 100K/day |
Result: Headers separated from data, columns misaligned
User query: "What's included in Pro plan?"
AI can't determine which values belong to Pro
Deep Technical Analysis
Table Structure and Boundaries
Tables have inherent structural units:
Table Components:
Header row: Column names
Separator row: Visual divider (Markdown)
Data rows: Actual values
Footer row: Totals/notes (optional)
Semantic unit: Entire table
→ Headers define meaning of values
→ Rows are independent records
→ Columns have consistent types
The Header-Data Dependency:
Without headers, data is meaningless:
Headers: | Name | Age | City |
Data: | John | 30 | NYC |
If chunk contains only data row:
→ "John | 30 | NYC"
→ LLM doesn't know: Is "John" a name, product, or category?
→ Is "30" an age, price, or quantity?
→ Context completely lost
Row-Level vs Table-Level Semantics:
Row-level chunking:
→ Each row is self-contained
→ But loses comparative context
Table-level chunking:
→ Keep entire table together
→ But large tables exceed chunk size
Example: Pricing comparison table
→ User asks: "Which plan has most storage?"
→ Needs ALL rows to compare
→ Partial table = incomplete answer
Markdown Table Parsing
Markdown tables have specific syntax:
Format Variations:
Standard Markdown:
| Col1 | Col2 | Col3 |
|------|------|------|
| A | B | C |
No outer pipes:
Col1 | Col2 | Col3
-----|------|-----
A | B | C
Alignment markers:
| Left | Center | Right |
|:-----|:------:|------:|
| A | B | C |
Detection Challenges:
Chunker must recognize:
→ Lines starting/ending with |
→ Separator row with dashes
→ Variable spacing/padding
→ Escaped pipes (\|) inside cells
Naive text chunker:
→ Sees lines of text
→ Doesn't recognize table structure
→ Splits mid-table
Result: Broken table syntax
The Cell Content Problem:
Table cell with long content:
| Feature | Description |
|---------|-------------|
| API | Provides RESTful endpoints for data access. Supports JSON and XML formats. Rate limited to 1000 requests per hour. Requires authentication via OAuth 2.0 or API key. |
Cell content: 200+ characters
→ Might trigger chunk split mid-cell
→ Breaks table row
→ Loses context
HTML Table Complexity
HTML tables add structural depth:
Nested Structure:
<table>
<thead>
<tr>
<th>Plan</th>
<th>Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>Free</td>
<td>$0</td>
</tr>
<tr>
<td>Pro</td>
<td>$49</td>
</tr>
</tbody>
</table>
Parsing Requirements:
Must track:
→ <table> boundaries
→ <thead> vs <tbody> distinction
→ <tr> row boundaries
→ <td> vs <th> cells
→ rowspan and colspan attributes
→ Nested tables
Chunk boundary must not:
→ Split <table>...</table>
→ Separate <thead> from <tbody>
→ Break <tr>...</tr> rows
→ Cut off cells mid-tag
The Rowspan/Colspan Problem:
<table>
<tr>
<th rowspan="2">Feature</th>
<th colspan="2">Limits</th>
</tr>
<tr>
<th>Free</th>
<th>Pro</th>
</tr>
<tr>
<td>Storage</td>
<td>1GB</td>
<td>50GB</td>
</tr>
</table>
Rowspan="2": Cell spans 2 rows
Colspan="2": Cell spans 2 columns
If chunk splits after first <tr>:
→ Loses rowspan context
→ Misaligned columns in chunk 2
→ Table structure incomprehensible
Table Linearization for Embeddings
Tables must be converted to text:
Flattening Strategies:
Strategy 1: Keep markdown format
→ Embed as-is with pipes and dashes
→ Preserves structure visually
→ But: Less semantic for embeddings
Strategy 2: Linearize to prose
"The Free plan costs $0, includes 1 user, 1GB storage.
The Pro plan costs $49, includes 5 users, 50GB storage."
→ More natural language
→ Better semantic matching
→ But: Loses tabular structure
Strategy 3: Hybrid
"Pricing table: Free plan ($0), Pro plan ($49), Enterprise ($299).
Storage: Free (1GB), Pro (50GB), Enterprise (500GB)."
→ Structured but readable
→ Preserves comparisons
The Comparison Loss Problem:
Original table enables comparison:
→ "Which plan has better value?"
→ Can see $0 vs $49 vs $299 side-by-side
Linearized text:
→ Comparison harder for LLM
→ Must parse multiple sentences
→ Reconstruct table mentally
→ More error-prone
Responsive and Complex Tables
Modern tables have dynamic layouts:
Multi-Header Tables:
Nested headers:
| | Q1 Results | Q2 Results |
| | Revenue | Profit | Revenue | Profit |
|-----------|---------|--------|---------|--------|
| Product A | $100K | $20K | $120K | $25K |
| Product B | $80K | $15K | $90K | $18K |
Header hierarchy:
→ Q1/Q2 (parent headers)
→ Revenue/Profit (child headers)
If chunk splits at Q1/Q2 boundary:
→ Loses hierarchical context
→ Revenue/Profit undefined
Pivot Tables and Aggregations:
Summarized data table:
| | North | South | East | West | Total |
|----------|-------|-------|------|------|-------|
| Q1 | 100 | 80 | 90 | 110 | 380 |
| Q2 | 120 | 85 | 95 | 105 | 405 |
| Q3 | 110 | 90 | 100 | 115 | 415 |
| Q4 | 130 | 95 | 105 | 120 | 450 |
| Total | 460 | 350 | 390 | 450 | 1650 |
"Total" row depends on all data rows
→ If chunk excludes totals: Incomplete picture
→ If chunk only has totals: No detail breakdown
Large Table Strategies
Tables exceeding chunk size need special handling:
Vertical Splitting (by rows):
Large table: 100 rows × 5 columns
Approach: Keep headers, split rows
Chunk 1:
| Col1 | Col2 | Col3 | Col4 | Col5 |
|------|------|------|------|------|
| Row1 | ... | ... | ... | ... |
| Row2 | ... | ... | ... | ... |
...
| Row25| ... | ... | ... | ... |
Chunk 2:
| Col1 | Col2 | Col3 | Col4 | Col5 | ← Repeat headers!
|------|------|------|------|------|
| Row26| ... | ... | ... | ... |
...
Pros: Each chunk has headers
Cons: Header repetition, storage overhead
Horizontal Splitting (by columns):
Wide table: 20 columns × 10 rows
Approach: Split columns, keep rows together
Chunk 1:
| ID | Name | Age | City |
|----|------|-----|------|
| 1 | John | 30 | NYC |
| 2 | Jane | 25 | LA |
Chunk 2:
| ID | Email | Phone | Country |
|----|-------|-------|---------|
| 1 | j@... | 555.. | USA |
| 2 | jane@ | 444.. | USA |
Requires ID column duplication for joins
Semantic Chunking by Table:
Keep each table as atomic unit:
Document with 5 tables:
→ Chunk 1: Introduction + Table 1
→ Chunk 2: Table 2
→ Chunk 3: Table 3
→ Chunk 4: Tables 4 + 5 (if small enough)
Ensures no table is split
But: Variable chunk sizes
Table Context and Captions
Tables need surrounding context:
Caption and Title:
Document structure:
### Pricing Comparison
The following table shows our pricing tiers:
| Plan | Price |
|------|-------|
| Free | $0 |
| Pro | $49 |
All plans include 24/7 support.
If chunk only contains table:
→ Missing title "Pricing Comparison"
→ Missing intro context
→ Missing footer note about support
→ LLM doesn't know what table represents
Reference Text:
Scenario:
"As shown in Table 3, Pro plan offers best value."
[500 lines later]
Table 3:
| Plan | Price | Features |
|------|-------|----------|
...
If chunked separately:
→ Reference text in one chunk
→ Table in another chunk
→ "Table 3" reference broken
→ Can't follow cross-reference
How to Solve
Implement table-aware chunking that detects table boundaries (Markdown and HTML) + repeat headers when splitting large tables + keep tables with their captions + linearize to structured text for embeddings. See Table Handling.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/chunking/table-splitting.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Related Pages
Last updated January 26, 2026


