Markdown Formatting Lost

The Problem

Markdown formatting (bold, italic, links, code) is stripped during chunking, losing semantic emphasis and making retrieved content less useful.

Symptoms

❌ code blocks render as plain text
❌ bold emphasis lost in answers
❌ Links converted to bare URLs or text
❌ > Blockquotes not distinguished from body text
❌ Headers (##) indistinguishable from paragraphs

Real-World Example

Original markdown:
## Authentication Methods

We support **three authentication methods**:

1. **API Key**: Include `X-API-Key` header
2. **OAuth 2.0**: Use `/oauth/token` endpoint  
3. **JWT**: Bearer token in `Authorization` header

See [Authentication Guide](/docs/auth) for details.

After naive text extraction:
Authentication Methods We support three authentication methods: API Key: Include X-API-Key header OAuth 2.0: Use /oauth/token endpoint JWT: Bearer token in Authorization header See Authentication Guide for details.

Lost:
- Header hierarchy (##)
- Bold emphasis (**important**)
- Inline code (`code`)
- List structure (1, 2, 3)
- Link destination ([text](url))

Deep Technical Analysis

Semantic Information in Formatting

Markdown formatting carries meaning:

Format Types and Semantics:

**bold** / __bold__:
→ Emphasis, importance, key terms
→ "The **API rate limit** is 1000/hour"
→ "API rate limit" is the critical information

*italic* / _italic_:
→ Subtle emphasis, definitions, foreign terms
→ "The *Bearer* token goes in the header"

`code`:
→ Technical terms, literals, exact syntax
→ "Set `DEBUG=true` in config"
→ Indicates must use exact string

[link text](url):
→ Reference to external resource
→ Anchor text vs destination
→ "See [API docs](https://api.example.com)"

The Loss of Signal:

Original: "Use the **POST** method, not GET"
Extracted: "Use the POST method, not GET"

Without bold:
→ LLM can't tell POST is emphasized
→ Might treat "POST" and "GET" equally
→ User wanted explicit guidance: "Use POST (important!)"

Similarly:
"Configure `max_connections=100` in settings"
→ Becomes: "Configure max_connections=100 in settings"
→ Lost signal that it's a literal config key

Conversion Strategies

Different approaches to handling markdown:

Strategy 1: Strip All Formatting (current)

Markdown → Plain text

Input: "The **API key** goes in the `X-API-Key` header"
Output: "The API key goes in the X-API-Key header"

Pros:
+ Simple, fast
+ Smaller text (fewer tokens)

Cons:
- Loses semantic emphasis
- Code not distinguished from prose
- Links lose destination

Strategy 2: Convert to Natural Language:

Markdown → Descriptive text

Input: "Use the **POST** method"
Output: "Use the POST method (emphasized)"

Input: "Set `DEBUG=true`"
Output: "Set the configuration value DEBUG to true"

Input: "[API docs](https://api.example.com)"
Output: "See API docs at https://api.example.com"

Pros:
+ Preserves semantic meaning
+ More verbose but clearer

Cons:
- Significantly more tokens
- Conversions may be awkward
- Not standard format

Strategy 3: Keep Markdown Syntax:

Markdown → Markdown (as-is)

Input: "Use the **POST** method"
Output: "Use the **POST** method"

Pros:
+ Preserves all formatting
+ LLMs trained on markdown (GPT-4, etc.)
+ Reversible (can render to HTML later)

Cons:
- Extra characters (**, `, etc.)
- May confuse simpler models
- Embeddings see formatting chars as content

Strategy 4: Convert to HTML:

Markdown → HTML

Input: "Use the **POST** method"
Output: "Use the <strong>POST</strong> method"

Input: "Set `DEBUG=true`"
Output: "Set <code>DEBUG=true</code>"

Pros:
+ Standard format
+ Preserves structure
+ Can render in UI

Cons:
- Verbose (<strong>, </strong>, etc.)
- More tokens than markdown
- HTML tags in embeddings

Inline Code vs Code Blocks

Different code representations need different handling:

Inline Code:

Use the `Authorization` header with value `Bearer {token}`.

Preservation Importance:

Without backticks:
"Use the Authorization header with value Bearer {token}."

Problems:
→ "Bearer {token}" might be interpreted as prose
→ LLM might rephrase as "Bearer token"
→ User needs exact literal: "Bearer {token}"

With backticks preserved:
"Use the `Authorization` header with value `Bearer {token}`."

LLM recognizes:
→ These are literal strings
→ Don't paraphrase
→ Use exactly as written

Code Blocks:

```python
def authenticate(api_key):
    headers = {"Authorization": f"Bearer {api_key}"}
    return requests.post(url, headers=headers)
```

Extraction Challenge:

Should we:
1. Keep triple-backticks and language identifier?
   → "```python\ndef authenticate..."
   → Preserves language context
   → But: Extra tokens

2. Strip fences, keep code?
   → "def authenticate(api_key):..."
   → Cleaner, fewer tokens
   → But: Lost language info

3. Add language metadata separately?
   → Text: "def authenticate..."
   → Metadata: {language: "python"}
   → Clean text, structured metadata
   → Requires metadata support

Link Handling and References

Links have both anchor text and destination:

Link Structure:

[GitHub Repository](https://github.com/company/repo)

Extraction Options:

Option 1: Keep anchor text only
→ "GitHub Repository"
→ Readable, concise
→ But: URL lost, can't navigate

Option 2: Keep URL only
→ "https://github.com/company/repo"
→ URL preserved
→ But: Lost descriptive text

Option 3: Expand to "text (url)"
→ "GitHub Repository (https://github.com/company/repo)"
→ Both preserved
→ But: More verbose

Option 4: Keep markdown syntax
→ "[GitHub Repository](https://github.com/company/repo)"
→ Full information
→ But: Extra characters

Reference Link Problem:

See the [API Guide][1] for details.

[1]: https://docs.example.com/api

Reference Resolution:

Reference defined elsewhere in document:
→ May be in different chunk
→ Must resolve before chunking
→ Or: Store unresolved references

Without resolution:
Chunk 1: "See the [API Guide][1] for details."
Chunk 2: "[1]: https://docs.example.com/api"

LLM sees Chunk 1:
→ "[1]" reference not resolved
→ Incomplete information

List Structure Preservation

Lists have hierarchical structure:

Ordered Lists:

1. First step
2. Second step
   1. Sub-step A
   2. Sub-step B
3. Third step

Flattening Problem:

Naive extraction:
"First step Second step Sub-step A Sub-step B Third step"

Lost:
→ Numbering (sequence matters)
→ Hierarchy (2.1, 2.2 are children of 2)
→ Step boundaries

Better extraction:
"Step 1: First step. Step 2: Second step (Sub-step A, Sub-step B). Step 3: Third step."

Preserves:
→ Sequence
→ Hierarchy (parenthetical sub-steps)
→ Boundaries

Unordered Lists:

Features:
- Fast performance
- Easy to use
- Highly scalable

Structure Loss:

Flattened:
"Features: Fast performance Easy to use Highly scalable"

Better:
"Features: (1) Fast performance, (2) Easy to use, (3) Highly scalable"

Or:
"Features include: Fast performance. Easy to use. Highly scalable."

Blockquotes and Callouts

Special blocks carry semantic meaning:

Blockquote:

> **Important**: Always validate user input before processing.

Semantic Significance:

Without formatting:
"Important: Always validate user input before processing."

Lost: Blockquote indicates this is advisory/warning
→ Not just normal prose
→ Carries weight, should be emphasized

With preservation:
"Note (blockquote): Important: Always validate user input before processing."

Or keep markdown:
"> **Important**: Always validate user input before processing."

Callout Boxes (GitHub-flavored):

> [!WARNING]
> This action cannot be undone!

> [!NOTE]
> This is useful information.

Semantic Type Loss:

Extracted as plain text:
"This action cannot be undone! This is useful information."

Lost distinction:
→ WARNING (critical) vs NOTE (informational)
→ Both treated equally

Better:
"Warning: This action cannot be undone."
"Note: This is useful information."

Header Hierarchy

Headers define document structure:

Markdown Headers:

# Main Title
## Section
### Subsection
#### Sub-subsection

Structural Information:

Flattened:
"Main Title Section Subsection Sub-subsection"

All run together, no hierarchy

Better approach:
"Document: Main Title > Section > Subsection > Sub-subsection"

Or preserve markdown:
"# Main Title\n## Section\n### Subsection"

Headers tell LLM:
→ Document structure
→ Topic hierarchy
→ Importance (# > ## > ###)

Embedding Model Considerations

How models handle formatted text:

Training Data Exposure:

LLMs (GPT-4, Claude) trained on:
→ Lots of markdown (GitHub, docs)
→ Understand **bold**, `code`, etc.
→ Can interpret formatting natively

Embedding models (ada-002) trained on:
→ Mixed text (markdown + plain)
→ Uncertain how well they handle formatting

Question:
→ Does "Use the **POST** method" embed differently than "Use the POST method"?
→ Does ** add noise or helpful signal?
→ No clear answer, model-dependent

Token Efficiency:

Plain text: "Set DEBUG to true" (4 tokens)
Markdown: "Set `DEBUG` to `true`" (8 tokens with backticks)

2x tokens for same semantic content
→ Higher embedding cost
→ Fills context window faster

Trade-off:
→ Semantic precision vs token efficiency

How to Solve

Keep markdown syntax intact for LLM consumption + convert special blocks (warnings, notes) to natural language labels + preserve inline code backticks + expand links to "text (URL)" format + maintain list numbering with explicit labels. See Markdown Handling.

Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/rag-scenarios-and-solutions/chunking/markdown-lost.md?ask=&lt;question&gt;

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.

Markdown Formatting Lost

Key Takeaways

The Problem

Symptoms

Real-World Example

Deep Technical Analysis

Semantic Information in Formatting

Conversion Strategies

Inline Code vs Code Blocks

Link Handling and References

List Structure Preservation

Blockquotes and Callouts

Header Hierarchy

Embedding Model Considerations

How to Solve

Agent Instructions: Querying This Documentation

Related Pages

Integrations

Industries

Comparisons

Compliance

Investors

Industry