RAG Scenarios and Solutions
Multi-Column Layout Issues
Documents with multi-column layouts (newspapers, academic papers, brochures) often have their text extracted in the wrong order: lines from adjacent columns are mixed together, and the result is unreadable.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
When a multi-column document is extracted without layout analysis, the extractor reads across the page instead of down each column, so content from separate columns is merged into one stream.
Symptoms
- ❌ Text reads across columns instead of down
- ❌ Sentences interleaved from different columns
- ❌ Sidebars mixed into main content
- ❌ Reading order completely wrong
- ❌ Incomprehensible extracted text
Real-World Example
Two-column academic paper:
Column 1:              Column 2:
"The algorithm         "performance of 95%
processes data in      accuracy with low
O(n log n) time,       latency. Future work
achieving              will explore..."
Naive left-to-right extraction:
"The algorithm performance of 95%
processes data in accuracy with low
O(n log n) time, latency. Future work
achieving will explore..."
Sentences from different columns interleaved
Completely unreadable
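The example above can be reproduced with a few lines of Python. This is a minimal sketch: the coordinates and the `split_x` column boundary are made-up values for illustration, not output from a real extractor.

```python
# Each fragment is (x, y, text); y grows downward. Coordinates are illustrative.
fragments = [
    (50, 100, "The algorithm"), (300, 100, "performance of 95%"),
    (50, 120, "processes data in"), (300, 120, "accuracy with low"),
    (50, 140, "O(n log n) time,"), (300, 140, "latency. Future work"),
    (50, 160, "achieving"), (300, 160, "will explore..."),
]


def naive_order(frags):
    """Sort top-to-bottom, then left-to-right: reads across both columns."""
    return " ".join(t for _, _, t in sorted(frags, key=lambda f: (f[1], f[0])))


def column_order(frags, split_x):
    """Read everything left of split_x top-to-bottom, then everything right of it."""
    left = sorted((f for f in frags if f[0] < split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= split_x), key=lambda f: f[1])
    return " ".join(t for _, _, t in left + right)


print(naive_order(fragments))        # interleaved text
print(column_order(fragments, 200))  # column 1, then column 2
```

Here the boundary is passed in by hand; finding it automatically is the hard part.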
Deep Technical Analysis
Column Detection Algorithms
Identifying columns from layout is heuristic-based:
White Space Analysis:
Document page scan:
→ Detect vertical white space strips
→ Width > threshold (e.g., 20px)
→ Height spans most of page
→ Infer: Column boundary
Fails when:
→ Columns have justified text (minimal gaps)
→ Images span columns
→ Uneven column lengths
→ Narrow margins between columns
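A rough sketch of the white space analysis, assuming text and image regions are already available as `(x0, y0, x1, y1)` bounding boxes in pixel units. The function name `find_gutters` and its parameters are hypothetical; the 20px width and "mostly empty" threshold mirror the description above.

```python
def find_gutters(boxes, page_width, page_height, min_width=20, min_empty_frac=0.8):
    """Return (start_x, end_x) strips that are mostly empty and wide enough.

    boxes: iterable of (x0, y0, x1, y1); assumed not to overlap each other.
    """
    # For every x position, accumulate how much of the page height is covered.
    covered = [0.0] * page_width
    for x0, y0, x1, y1 in boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            covered[x] += y1 - y0

    max_covered = (1 - min_empty_frac) * page_height
    empty = [c <= max_covered for c in covered]

    gutters, start = [], None
    # The trailing False is a sentinel that closes a run reaching the right edge.
    for x, is_empty in enumerate(empty + [False]):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            touches_margin = start == 0 or x == page_width
            if x - start >= min_width and not touches_margin:
                gutters.append((start, x))
            start = None
    return gutters


two_columns = [(50, 50, 280, 750), (320, 50, 550, 750)]
print(find_gutters(two_columns, 600, 800))

# An image spanning both columns covers part of the gutter and hides it.
with_image = two_columns + [(50, 300, 550, 500)]
print(find_gutters(with_image, 600, 800))
```

The second call illustrates one of the failure modes listed above: a spanning image fills enough of the strip that it no longer counts as white space.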
Text Block Clustering:
Algorithm:
1. Extract all text blocks with coordinates
2. Cluster by X-position similarity
3. Group into column regions
4. Sort by Y-position within each column
Challenges:
→ Indented paragraphs look like new column
→ Block quotes offset from main text
→ Footnotes at page bottom
→ Headers/footers span all columns
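The four clustering steps can be sketched as follows. This assumes blocks have already been extracted with a left edge `x0` and top edge `y0`; the `Block` type and the `x_tolerance` parameter are illustrative names, and the simple gap-based grouping stands in for whatever clustering method a real pipeline would use.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float
    y0: float
    text: str


def cluster_columns(blocks, x_tolerance=40.0):
    """Group blocks whose left edges are within x_tolerance, then sort by Y."""
    columns = []
    for block in sorted(blocks, key=lambda b: b.x0):
        # Compare with the right-most left edge seen so far in the current column.
        if columns and block.x0 - columns[-1][-1].x0 <= x_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])
    for column in columns:
        column.sort(key=lambda b: b.y0)
    return columns


blocks = [
    Block(50, 100, "A"), Block(65, 200, "B (indented)"), Block(50, 300, "C"),
    Block(320, 100, "D"), Block(320, 200, "E"),
]
for column in cluster_columns(blocks):
    print([b.text for b in column])
```

With a tolerance that is too tight (for example `x_tolerance=10`), the indented block is split off as its own "column", which is the first challenge listed above.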
The Reading Order Problem:
Three possible reading orders for 2 columns:
Order 1: Down Column 1, then Down Column 2
[A] [D]
[B] [E]
[C] [F]
Reading: A→B→C→D→E→F ✓
Order 2: Across then Down (wrong!)
[A] [B]
[C] [D]
[E] [F]
Reading: A→B→C→D→E→F ✗
Order 3: Serpentine, across then back (very wrong!)
[A] [B]
[C] [D]
Reading: A→B→D→C ✗
Must detect correct pattern
Sidebar and Inset Handling
Additional content boxes complicate layout:
Sidebar Positioning:
Layout:
┌──────────────┬──────┐
│ Main content │ Side │
│              │ bar  │
│              ├──────┤
│              │ Ad   │
└──────────────┴──────┘
Reading order should be:
1. All main content
2. Sidebar
3. Ad
Not:
1. Main paragraph 1
2. Sidebar (interrupts!)
3. Main paragraph 2
4. Ad
5. Main paragraph 3
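One way to get the intended order, assuming an earlier layout-detection step has already labeled each block with a role. The role names and the `ROLE_PRIORITY` table are hypothetical; the sketch only shows the ordering step, not how roles are detected.

```python
# Lower number = emitted earlier. Role labels are assumed to come from
# a separate layout-detection step that is not shown here.
ROLE_PRIORITY = {"main": 0, "sidebar": 1, "ad": 2}


def order_by_role(blocks):
    """blocks: list of (role, y, text). Main flow first, then side content."""
    ranked = sorted(blocks, key=lambda b: (ROLE_PRIORITY[b[0]], b[1]))
    return [text for _, _, text in ranked]


page = [
    ("main", 100, "Main paragraph 1"),
    ("sidebar", 120, "Sidebar"),
    ("main", 300, "Main paragraph 2"),
    ("ad", 400, "Ad"),
    ("main", 500, "Main paragraph 3"),
]
print(order_by_role(page))
```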
Inset Boxes:
Text flow with callout box:
Main text flows around  │ ┌─────────┐ │
the callout box that    │ │ CALLOUT │ │
appears to the right    │ │   box   │ │
and continues after it  │ └─────────┘ │
Extraction challenge:
→ Callout in middle of paragraph
→ Should it be inline or separate?
→ Does text flow around it or is it independent?
Wrong extraction:
"Main text flows around CALLOUT box the callout..."
→ Callout inserted mid-sentence
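If the callout's bounding box is known (for instance from a drawn rectangle on the page), its lines can be pulled out before ordering the main text. This is a sketch under that assumption; `separate_callout` and the coordinates are illustrative, and it treats the callout as independent content rather than inline text.

```python
def separate_callout(lines, callout_box):
    """Split lines into main text and callout text.

    lines: list of (x, y, text); callout_box: (x0, y0, x1, y1).
    """
    x0, y0, x1, y1 = callout_box

    def inside(line):
        x, y, _ = line
        return x0 <= x <= x1 and y0 <= y <= y1

    def reading_key(line):
        return (line[1], line[0])

    main = sorted((l for l in lines if not inside(l)), key=reading_key)
    callout = sorted((l for l in lines if inside(l)), key=reading_key)
    return (" ".join(t for _, _, t in main), " ".join(t for _, _, t in callout))


lines = [
    (50, 100, "Main text flows around"), (300, 120, "CALLOUT"),
    (50, 120, "the callout box that"), (300, 140, "box"),
    (50, 140, "appears to the right"), (50, 160, "and continues after it"),
]
main_text, callout_text = separate_callout(lines, (280, 100, 400, 170))
print(main_text)
print(callout_text)
```

Whether the callout is then appended after the paragraph or stored as a separate chunk is a design choice that depends on the downstream use.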
Academic Paper Layouts
Scientific papers have complex structures:
Two-Column Abstract:
┌─────────────────────────┐
│          TITLE          │
│  Authors, Affiliation   │
├────────────┬────────────┤
│ Abstract   │ (Abstract  │
│ (spans     │ continued  │
│ 2 cols)    │ here)      │
├────────────┴────────────┤
│ Intro │ Methods │Results│
Extraction Issues:
Abstract spans 2 columns:
→ Must read left column fully
→ Then right column
→ Not line-by-line across
Sections (Intro, Methods, Results):
→ May be in single column each
→ Or each spans 2 columns
→ Layout varies by paper
Cannot use simple heuristic
→ Need per-document analysis
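One possible per-document heuristic is to treat full-width blocks (title, author line) as separators that split the page into horizontal bands, and to read each band column by column. The sketch below assumes a two-column page and a hypothetical `span_frac` threshold for "full width"; neither comes from a specific library.

```python
def order_with_spanning_blocks(blocks, page_width, span_frac=0.6):
    """blocks: list of (x0, y0, x1, text). Returns texts in reading order.

    Blocks wider than span_frac * page_width are emitted on their own and
    close the current two-column band.
    """
    mid = page_width / 2
    ordered, band = [], []

    def flush():
        # band is already in top-to-bottom order; split it at the page centre.
        left = [b for b in band if (b[0] + b[2]) / 2 < mid]
        right = [b for b in band if (b[0] + b[2]) / 2 >= mid]
        ordered.extend(b[3] for b in left + right)
        band.clear()

    for block in sorted(blocks, key=lambda b: b[1]):
        if block[2] - block[0] >= span_frac * page_width:
            flush()
            ordered.append(block[3])
        else:
            band.append(block)
    flush()
    return ordered


page = [
    (310, 100, 550, "R1"), (50, 50, 550, "TITLE"), (50, 100, 290, "L1"),
    (50, 200, 290, "L2"), (310, 200, 550, "R2"), (50, 700, 550, "FOOTNOTES"),
]
print(order_with_spanning_blocks(page, page_width=600))
```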
Footnotes and References:
Main text in 2 columns:
[Content... ¹]
Bottom of page (spanning both columns):
────────────────────────
¹ Footnote text here
Extraction must:
→ Detect footnote marker in main text
→ Find matching footnote at bottom
→ Associate reference with marker
→ Not treat footnote as 3rd column
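A minimal sketch of the marker-to-footnote association, assuming markers survive extraction as Unicode superscript digits. That is an assumption: many PDFs encode markers as ordinary digits in a smaller raised font, which would need font-size information instead. The function name and sample text are illustrative.

```python
import re

SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
MARKER = re.compile(f"[{SUPERSCRIPTS}]+")


def link_footnotes(body, footnote_lines):
    """Return [(marker, footnote_text or None)] for each marker found in body."""
    notes = {}
    for line in footnote_lines:
        stripped = line.strip()
        match = MARKER.match(stripped)
        if match:
            notes[match.group()] = stripped[match.end():].strip()
    return [(m.group(), notes.get(m.group())) for m in MARKER.finditer(body)]


body = "The method converges quickly¹ and scales well²."
footnotes = ["¹ Footnote text here", "² See the appendix"]
print(link_footnotes(body, footnotes))
```

Markers without a matching footnote map to `None`, which makes unmatched references easy to flag.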
Magazine and Newsletter Layouts
Creative layouts are unpredictable:
Non-Uniform Columns:
┌─────┬─────────┬───────┐
│     │         │ Side  │
│ Img │ Content │ bar   │
│     │         ├───────┤
├─────┴─────────┤ Ad    │
│ Caption       │       │
└───────────────┴───────┘
Columns of different widths
Image caption spans 2 columns
Sidebar changes mid-page
The Unpredictability:
Each page may have:
→ Different number of columns (1, 2, 3)
→ Variable column widths
→ Images breaking grid
→ Text wrapping around shapes
No consistent pattern
→ Per-page layout detection needed
→ Or: Give up on perfect ordering
→ Accept some errors
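Per-page detection can start with something as simple as estimating the column count of each page independently. The sketch below counts large jumps between sorted left edges of text blocks; `count_columns` and the `x_tolerance` value are illustrative assumptions, and irregular grids like the one above will still defeat it.

```python
def count_columns(left_edges, x_tolerance=40.0):
    """Estimate the number of columns from the left edges of a page's blocks."""
    edges = sorted(left_edges)
    if not edges:
        return 0
    # Every jump larger than the tolerance is taken as the start of a new column.
    jumps = sum(1 for a, b in zip(edges, edges[1:]) if b - a > x_tolerance)
    return jumps + 1


# One list of block left edges per page (made-up values).
pages = [
    [50, 52, 50],
    [50, 320, 50, 320],
    [40, 230, 420, 40],
]
print([count_columns(page) for page in pages])
```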
PDF Coordinate Systems
PDFs use absolute positioning:
Text Positioning:
PDF stores:
→ "Hello" at (x=100, y=200)
→ "World" at (x=400, y=200)
No inherent reading order
→ Just x,y coordinates
→ Must infer order from positions
The Sorting Problem:
Sort by Y (top to bottom):
→ Reads across page first
→ Wrong for multi-column
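Sort by X, then Y:
→ Roughly reads column by column
→ But: raw X values vary within a column (indents, centered headings)
→ Lines with a larger X are pushed after the rest of their column
→ Doesn't handle column boundaries or full-width blocks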
Hybrid approach:
→ Cluster by X (identify columns)
→ Sort each column by Y
→ Concatenate columns in order
→ But: Clustering non-trivial
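The three strategies can be compared on a tiny made-up page with one indented line. In this sketch the column boundaries are supplied by hand, which sidesteps the clustering step; the block names and coordinates are illustrative.

```python
import bisect

# (name, x, y): column 1 holds A, B (indented), C; column 2 holds D, E.
blocks = [("A", 50, 100), ("B", 70, 120), ("C", 50, 140),
          ("D", 320, 100), ("E", 320, 120)]


def sort_by_y(blocks):
    """Top-to-bottom, then left-to-right: reads across the page."""
    return [n for n, _, _ in sorted(blocks, key=lambda b: (b[2], b[1]))]


def sort_by_x_then_y(blocks):
    """Raw X first: an indented line drops below the rest of its column."""
    return [n for n, _, _ in sorted(blocks, key=lambda b: (b[1], b[2]))]


def hybrid(blocks, boundaries):
    """Assign each block to a column via sorted X boundaries, then sort by Y."""
    def key(block):
        _, x, y = block
        return (bisect.bisect_right(boundaries, x), y, x)
    return [n for n, _, _ in sorted(blocks, key=key)]


print(sort_by_y(blocks))
print(sort_by_x_then_y(blocks))
print(hybrid(blocks, boundaries=[200]))
```

Only the hybrid version keeps the indented line in place, and only because it compares column membership instead of raw X values.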
Text Reflow and Reflowable PDFs
Some PDFs support reflow:
Reflowable vs Fixed Layout:
Reflowable (tagged) PDF:
→ Contains logical structure (tags)
→ Text can adapt to window width
→ Extraction can follow the logical order, provided the extractor reads the tags
Fixed Layout PDF:
→ Absolute positioning only
→ No logical structure
→ Must infer reading order
Many PDFs in practice: Fixed layout, no tags
→ Harder to extract correctly
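Before inferring reading order, it can be worth checking whether a PDF declares itself as tagged. In the PDF format, the document catalog may carry a `/MarkInfo` dictionary with a `/Marked` flag and a `/StructTreeRoot` entry. The sketch below assumes the catalog has already been parsed into a plain `dict` by whichever PDF library is in use; `is_tagged` is a hypothetical helper, and a positive result does not guarantee that the tags are correct.

```python
def is_tagged(catalog):
    """Return True if a parsed PDF catalog claims to contain logical structure.

    catalog: the document catalog as a plain dict (an assumption about how
    the surrounding code represents it).
    """
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked")) and "/StructTreeRoot" in catalog


tagged = {"/MarkInfo": {"/Marked": True}, "/StructTreeRoot": {}}
untagged = {"/Pages": {}}
print(is_tagged(tagged), is_tagged(untagged))
```

A pipeline could use the tag order when this returns `True` and fall back to layout inference otherwise.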
Column Break Indicators
Documents may hint at columns:
Column Break Characters:
Some documents include:
→ Column break marker (rare)
→ Page break indicator
More common:
→ Continuous text, no markers
→ Must infer from layout alone
The Continuation Problem:
Sentence split across columns:
Column 1 ends: "The results indicate that performance"
Column 2 starts: "improvements are statistically significant."
Must recognize:
→ Sentence continues in next column
→ Not a new sentence
→ Join with a single space, without adding a period
Wrong joining:
"The results indicate that performance. Improvements are statistically significant."
→ Added period (wrong!)
Or:
"The results indicate that performanceimprovements are..."
→ Missing space (wrong!)
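A minimal joining rule that avoids both mistakes. The hyphen handling is an addition not covered above and is a simplifying assumption: it merges a trailing hyphen with a lowercase continuation, which would wrongly merge genuine hyphenated compounds split at the break.

```python
def join_across_break(end_of_column, start_of_next):
    """Join text that continues from the bottom of one column into the next."""
    left, right = end_of_column.rstrip(), start_of_next.lstrip()
    if not left or not right:
        return left + right
    # Assumption: a trailing hyphen before a lowercase letter is a line-break hyphen.
    if left.endswith("-") and right[0].islower():
        return left[:-1] + right
    # Otherwise insert exactly one space and no extra punctuation.
    return f"{left} {right}"


print(join_across_break(
    "The results indicate that performance",
    "improvements are statistically significant.",
))
```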
How to Solve
Combine the following steps:
1. Implement X-axis clustering to detect columns
2. Use white space analysis to find column boundaries
3. Sort by column first, then by Y-position within each column
4. Handle sidebars separately from the main flow
5. Preserve sentence continuity across column breaks
See Multi-Column Layout.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/chunking/multi-column.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


