Tables Breaking Across Chunks

The Problem

Tables are split mid-row or mid-column, making the data incomprehensible and breaking the semantic relationships between table headers and values.

Symptoms

❌ Retrieved chunks show table rows without headers
❌ Table columns split across chunks
❌ AI can't answer questions about tabular data
❌ Pricing tables incomplete in responses
❌ Comparison tables broken and meaningless

Real-World Example

Original table in documentation:

| Plan      | Price | Users | Storage | API Calls |
|-----------|-------|-------|---------|-----------|
| Free      | $0    | 1     | 1GB     | 100/day   |
| Pro       | $49   | 5     | 50GB    | 10K/day   |
| Enterprise| $299  | 50    | 500GB   | 100K/day  |

The chunk boundary falls mid-row, inside the Free row ↓

Chunk 1 contains:
| Plan      | Price | Users | Storage | API Calls |
|-----------|-------|-------|---------|-----------|
| Free      | $0    | 1     | 1GB     |

Chunk 2 contains:
| 100/day   |
| Pro       | $49   | 5     | 50GB    | 10K/day   |
| Enterprise| $299  | 50    | 500GB   | 100K/day  |

Result: Headers separated from data, and the Free row is cut mid-row
User query: "What's included in Pro plan?"
Chunk 2 holds the Pro row but no headers, so the AI can't tell what $49, 5, 50GB, or 10K/day refer to

Deep Technical Analysis

Table Structure and Boundaries

Tables have inherent structural units:

Table Components:

Header row: Column names
Separator row: Visual divider (Markdown)
Data rows: Actual values
Footer row: Totals/notes (optional)

Semantic unit: Entire table
→ Headers define meaning of values
→ Rows are independent records
→ Columns have consistent types

The Header-Data Dependency:

Without headers, data is meaningless:

Headers: | Name | Age | City |
Data:    | John | 30  | NYC  |

If chunk contains only data row:
→ "John | 30 | NYC"
→ LLM doesn't know: Is "John" a name, product, or category?
→ Is "30" an age, price, or quantity?
→ Context completely lost
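
The dependency above can be made concrete by pairing each value with its column name before a row is stored on its own. This is a minimal sketch, not a full Markdown parser; `split_cells` and `row_to_record` are hypothetical helper names introduced here for illustration.

```python
import re


def split_cells(row):
    # Drop the optional outer pipes, then split on pipes that are not escaped.
    row = row.strip()
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|") and not row.endswith("\\|"):
        row = row[:-1]
    return [cell.strip().replace("\\|", "|") for cell in re.split(r"(?<!\\)\|", row)]


def row_to_record(header_line, data_line):
    # Pair every value with its column name so the row is meaningful alone.
    return dict(zip(split_cells(header_line), split_cells(data_line)))


record = row_to_record("| Name | Age | City |", "| John | 30  | NYC  |")
print(record)  # {'Name': 'John', 'Age': '30', 'City': 'NYC'}
```

With the header attached, "30" is unambiguously an age, even if the row is retrieved without the rest of the table.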

Row-Level vs Table-Level Semantics:

Row-level chunking:
→ Each row is self-contained
→ But loses comparative context

Table-level chunking:
→ Keep entire table together
→ But large tables exceed chunk size

Example: Pricing comparison table
→ User asks: "Which plan has most storage?"
→ Needs ALL rows to compare
→ Partial table = incomplete answer

Markdown Table Parsing

Markdown tables have specific syntax:

Format Variations:

Standard Markdown:
| Col1 | Col2 | Col3 |
|------|------|------|
| A    | B    | C    |

No outer pipes:
Col1 | Col2 | Col3
-----|------|-----
A    | B    | C

Alignment markers:
| Left | Center | Right |
|:-----|:------:|------:|
| A    | B      | C     |

Detection Challenges:

Chunker must recognize:
→ Lines starting/ending with |
→ Separator row with dashes
→ Variable spacing/padding
→ Escaped pipes (\|) inside cells

Naive text chunker:
→ Sees lines of text
→ Doesn't recognize table structure
→ Splits mid-table

Result: Broken table syntax

The Cell Content Problem:

Table cell with long content:

| Feature | Description |
|---------|-------------|
| API     | Provides RESTful endpoints for data access. Supports JSON and XML formats. Rate limited to 1000 requests per hour. Requires authentication via OAuth 2.0 or API key. |

Cell content: roughly 160 characters
→ Might trigger chunk split mid-cell
→ Breaks table row
→ Loses context
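
A simple guard against mid-cell cuts is to split only at line boundaries and let an oversized row occupy a chunk of its own. The following is a sketch under that assumption; `split_on_line_boundaries` is a hypothetical name, and it deliberately does not repeat headers, which is a separate concern.

```python
def split_on_line_boundaries(text, max_chars):
    """Greedy splitter that never breaks inside a line (and so never inside a row)."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Close the current chunk before it would overflow, never inside a line.
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line  # an oversized line becomes its own oversized chunk
    if current:
        chunks.append(current)
    return chunks
```

The trade-off is that a chunk may exceed `max_chars` when a single row is longer than the limit, which is usually preferable to a row cut in half.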

HTML Table Complexity

HTML tables add structural depth:

Nested Structure:

<table>
  <thead>
    <tr>
      <th>Plan</th>
      <th>Price</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Free</td>
      <td>$0</td>
    </tr>
    <tr>
      <td>Pro</td>
      <td>$49</td>
    </tr>
  </tbody>
</table>

Parsing Requirements:

Must track:
→ <table> boundaries
→ <thead> vs <tbody> distinction
→ <tr> row boundaries
→ <td> vs <th> cells
→ rowspan and colspan attributes
→ Nested tables

Chunk boundary must not:
→ Split <table>...</table>
→ Separate <thead> from <tbody>
→ Break <tr>...</tr> rows
→ Cut off cells mid-tag
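
As an illustration of boundary tracking, the sketch below uses Python's standard `html.parser` to record the line range of each outermost `<table>`, counting nesting depth so that inner tables do not end the range early. `TableSpanFinder` and `find_html_tables` are hypothetical names, and well-formed markup is assumed.

```python
from html.parser import HTMLParser


class TableSpanFinder(HTMLParser):
    """Record (start_line, end_line), 1-based inclusive, for each outermost table."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.spans = []
        self._start_line = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                self._start_line = self.getpos()[0]
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.spans.append((self._start_line, self.getpos()[0]))


def find_html_tables(html):
    finder = TableSpanFinder()
    finder.feed(html)
    finder.close()
    return finder.spans
```

Any chunk boundary that falls strictly inside one of the returned line ranges would split a table and can be moved to the range's start or end.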

The Rowspan/Colspan Problem:

<table>
  <tr>
    <th rowspan="2">Feature</th>
    <th colspan="2">Limits</th>
  </tr>
  <tr>
    <th>Free</th>
    <th>Pro</th>
  </tr>
  <tr>
    <td>Storage</td>
    <td>1GB</td>
    <td>50GB</td>
  </tr>
</table>

Rowspan="2": Cell spans 2 rows
Colspan="2": Cell spans 2 columns

If chunk splits after first <tr>:
→ Loses rowspan context
→ Misaligned columns in chunk 2
→ Table structure incomprehensible
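
One way to make such a table safe to split is to expand every spanning cell into the grid positions it covers, so each row becomes complete on its own. This is a simplified sketch using the standard `html.parser`; `CellCollector` and `expand_spans` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect cells per <tr> as [text, rowspan, colspan]."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]
            self.rows[-1].append(self._cell)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None


def expand_spans(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # skip slots already filled by a rowspan above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

For the table above this yields `["Feature", "Limits", "Limits"]`, `["Feature", "Free", "Pro"]`, and `["Storage", "1GB", "50GB"]`, so the second header row no longer depends on the first for column alignment.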

Table Linearization for Embeddings

Tables must be converted to text:

Flattening Strategies:

Strategy 1: Keep markdown format
→ Embed as-is with pipes and dashes
→ Preserves structure visually
→ But: Pipe and dash characters carry little semantic signal for embeddings

Strategy 2: Linearize to prose
"The Free plan costs $0, includes 1 user, 1GB storage.
 The Pro plan costs $49, includes 5 users, 50GB storage."
→ More natural language
→ Better semantic matching
→ But: Loses tabular structure

Strategy 3: Hybrid
"Pricing table: Free plan ($0), Pro plan ($49), Enterprise ($299).
Storage: Free (1GB), Pro (50GB), Enterprise (500GB)."
→ Structured but readable
→ Preserves comparisons
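
Strategies 2 and 3 can be sketched with two small functions. The output wording here is generic ("Header: value") rather than the hand-written prose shown above, and both function names are hypothetical; the first column is assumed to be the row label.

```python
def linearize_rows(headers, rows):
    """Row-wise: one sentence per row, each value labeled with its header."""
    return [
        "; ".join(f"{h}: {v}" for h, v in zip(headers, row)) + "."
        for row in rows
    ]


def linearize_columns(headers, rows, title):
    """Hybrid: one sentence per column, listing every row's value side by side."""
    labels = [row[0] for row in rows]
    sentences = []
    for col, header in enumerate(headers[1:], start=1):
        values = ", ".join(f"{label} ({row[col]})" for label, row in zip(labels, rows))
        sentences.append(f"{header}: {values}.")
    return f"{title}: " + " ".join(sentences)


headers = ["Plan", "Price", "Storage"]
rows = [["Free", "$0", "1GB"], ["Pro", "$49", "50GB"]]
print(linearize_rows(headers, rows)[1])
# Plan: Pro; Price: $49; Storage: 50GB.
print(linearize_columns(headers, rows, "Pricing table"))
# Pricing table: Price: Free ($0), Pro ($49). Storage: Free (1GB), Pro (50GB).
```

The column-wise form keeps comparable values adjacent, which is the property the hybrid strategy is meant to preserve.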

The Comparison Loss Problem:

Original table enables comparison:
→ "Which plan has better value?"
→ Can see $0 vs $49 vs $299 side-by-side

Linearized text:
→ Comparison harder for LLM
→ Must parse multiple sentences
→ Reconstruct table mentally
→ More error-prone

Responsive and Complex Tables

Complex tables go beyond a single header row and flat data rows:

Multi-Header Tables:

Nested headers:

|           | Q1 Results       | Q2 Results       |
|           | Revenue | Profit | Revenue | Profit |
|-----------|---------|--------|---------|--------|
| Product A | $100K   | $20K   | $120K   | $25K   |
| Product B | $80K    | $15K   | $90K    | $18K   |

Header hierarchy:
→ Q1/Q2 (parent headers)
→ Revenue/Profit (child headers)

If chunk splits at Q1/Q2 boundary:
→ Loses hierarchical context
→ Revenue/Profit undefined

Pivot Tables and Aggregations:

Summarized data table:

|          | North | South | East | West | Total |
|----------|-------|-------|------|------|-------|
| Q1       | 100   | 80    | 90   | 110  | 380   |
| Q2       | 120   | 85    | 95   | 105  | 405   |
| Q3       | 110   | 90    | 100  | 115  | 415   |
| Q4       | 130   | 95    | 105  | 120  | 450   |
| Total    | 460   | 350   | 390  | 450  | 1650  |

"Total" row depends on all data rows
→ If chunk excludes totals: Incomplete picture
→ If chunk only has totals: No detail breakdown

Large Table Strategies

Tables exceeding chunk size need special handling:

Splitting by Rows:

Large table: 100 rows × 5 columns

Approach: Keep headers, split rows

Chunk 1:
| Col1 | Col2 | Col3 | Col4 | Col5 |
|------|------|------|------|------|
| Row1 | ...  | ...  | ...  | ...  |
| Row2 | ...  | ...  | ...  | ...  |
...
| Row25| ...  | ...  | ...  | ...  |

Chunk 2:
| Col1 | Col2 | Col3 | Col4 | Col5 |  ← Repeat headers!
|------|------|------|------|------|
| Row26| ...  | ...  | ...  | ...  |
...

Pros: Each chunk has headers
Cons: Header repetition, storage overhead
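
The header-repetition approach can be sketched in a few lines, assuming the table arrives as a list of Markdown lines whose first two entries are the header and separator rows. `split_rows_with_headers` is a hypothetical name, and the row count per chunk stands in for whatever size limit the pipeline actually uses.

```python
def split_rows_with_headers(table_lines, rows_per_chunk):
    """Split data rows into groups, prefixing each group with header + separator."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    header, separator, *rows = table_lines
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        chunks.append([header, separator, *rows[start:start + rows_per_chunk]])
    return chunks


table = ["| Col1 | Col2 |", "|------|------|"]
table += [f"| Row{i} | ... |" for i in range(1, 101)]
chunks = split_rows_with_headers(table, 25)
print(len(chunks), chunks[1][2])  # 4 | Row26 | ... |
```

Every chunk is a valid standalone table, at the cost of two repeated lines per chunk.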

Splitting by Columns:

Wide table: 20 columns × 10 rows

Approach: Split columns, keep rows together

Chunk 1:
| ID | Name | Age | City |
|----|------|-----|------|
| 1  | John | 30  | NYC  |
| 2  | Jane | 25  | LA   |

Chunk 2:
| ID | Email | Phone | Country |
|----|-------|-------|---------|
| 1  | j@... | 555.. | USA     |
| 2  | jane@ | 444.. | USA     |

Requires ID column duplication for joins
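
A sketch of column splitting with the key column duplicated into every chunk follows. It assumes the table is already parsed into a header list and row lists; `split_columns` and its parameters are hypothetical names.

```python
def split_columns(headers, rows, cols_per_chunk, key_index=0):
    """Split a wide table into column groups, repeating the key column in each."""
    other = [i for i in range(len(headers)) if i != key_index]
    chunks = []
    for start in range(0, len(other), cols_per_chunk):
        idx = [key_index] + other[start:start + cols_per_chunk]
        chunks.append({
            "headers": [headers[i] for i in idx],
            "rows": [[row[i] for i in idx] for row in rows],
        })
    return chunks


headers = ["ID", "Name", "Age", "City", "Email", "Phone", "Country"]
rows = [["1", "John", "30", "NYC", "j@...", "555..", "USA"]]
for chunk in split_columns(headers, rows, cols_per_chunk=3):
    print(chunk["headers"])
# ['ID', 'Name', 'Age', 'City']
# ['ID', 'Email', 'Phone', 'Country']
```

Because each chunk keeps the ID, values from different chunks can be matched back to the same record at query time.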

Semantic Chunking by Table:

Keep each table as atomic unit:

Document with 5 tables:
→ Chunk 1: Introduction + Table 1
→ Chunk 2: Table 2
→ Chunk 3: Table 3
→ Chunk 4: Tables 4 + 5 (if small enough)

Ensures no table is split
But: Variable chunk sizes
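
The atomic-table approach amounts to greedy packing over blocks that are never split. In this sketch a block is a string holding either one paragraph or one whole table; `pack_blocks` is a hypothetical name, and size is measured in characters purely for simplicity.

```python
def pack_blocks(blocks, max_chars):
    """Group blocks into chunks up to max_chars without ever splitting a block."""
    chunks, current, size = [], [], 0
    for block in blocks:
        # Start a new chunk if adding this block would overflow the current one.
        if current and size + len(block) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(block)  # an oversized block still goes in whole
        size += len(block)
    if current:
        chunks.append(current)
    return chunks
```

Small neighbors are merged and large tables stand alone, which is where the variable chunk sizes come from.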

Table Context and Captions

Tables need surrounding context:

Caption and Title:

Document structure:

### Pricing Comparison
The following table shows our pricing tiers:

| Plan | Price |
|------|-------|
| Free | $0    |
| Pro  | $49   |

All plans include 24/7 support.

If chunk only contains table:
→ Missing title "Pricing Comparison"
→ Missing intro context
→ Missing footer note about support
→ LLM doesn't know what table represents
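
One possible remedy is to widen the table's line range to take in the nearest heading above and the paragraph directly below. The sketch assumes Markdown headings start with `#` and that the table's `start`/`end` indices (end exclusive) are already known; `expand_to_context` and `max_lines` are hypothetical names.

```python
def expand_to_context(lines, start, end, max_lines=5):
    """Return a wider (start, end) range covering heading, intro, and trailing note."""
    # Walk upward to the nearest heading, but no further than max_lines.
    ctx_start = start
    for i in range(start - 1, max(start - 1 - max_lines, -1), -1):
        ctx_start = i
        if lines[i].lstrip().startswith("#"):
            break
    # Walk downward over blank lines, then take the paragraph that follows.
    ctx_end = end
    while ctx_end < len(lines) and not lines[ctx_end].strip():
        ctx_end += 1
    while (ctx_end < len(lines) and lines[ctx_end].strip()
           and not lines[ctx_end].lstrip().startswith("#")):
        ctx_end += 1
    return ctx_start, ctx_end
```

Applied to the example above, the range grows from the four table lines to everything from "### Pricing Comparison" through the 24/7 support note.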

Reference Text:

Scenario:

"As shown in Table 3, Pro plan offers best value."

[500 lines later]

Table 3:
| Plan | Price | Features |
|------|-------|----------|
...

If chunked separately:
→ Reference text in one chunk
→ Table in another chunk
→ "Table 3" reference broken
→ Can't follow cross-reference
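
Cross-references of this kind can be repaired with metadata that links a mentioning chunk to the chunk holding the table. The sketch below assumes chunks are dicts with `id` and `text`, and that a table is introduced by a line starting with "Table N:" as in the scenario above; `link_table_references` is a hypothetical name.

```python
import re

TABLE_REF_RE = re.compile(r"\bTable\s+(\d+)\b")
TABLE_DEF_RE = re.compile(r"^Table\s+(\d+):", re.MULTILINE)


def link_table_references(chunks):
    """Map each chunk id to the ids of other chunks that define tables it mentions."""
    definitions = {}
    for chunk in chunks:
        for number in TABLE_DEF_RE.findall(chunk["text"]):
            definitions[number] = chunk["id"]
    links = {}
    for chunk in chunks:
        targets = {definitions[n] for n in TABLE_REF_RE.findall(chunk["text"])
                   if n in definitions}
        targets.discard(chunk["id"])  # a chunk does not need a link to itself
        if targets:
            links[chunk["id"]] = sorted(targets)
    return links


chunks = [
    {"id": "c1", "text": "As shown in Table 3, Pro plan offers best value."},
    {"id": "c9", "text": "Table 3:\n| Plan | Price | Features |"},
]
print(link_table_references(chunks))  # {'c1': ['c9']}
```

At retrieval time, a hit on `c1` can then pull in `c9` so the referenced table travels with the sentence that cites it.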

How to Solve

Implement table-aware chunking: detect table boundaries in both Markdown and HTML, repeat the header rows when a large table must be split, keep each table with its caption and surrounding notes, and linearize tables to structured text before embedding. See Table Handling.

