Rag Scenarios And Solutions
Markdown Formatting Lost
Markdown formatting (bold, italic, links, code) is stripped during chunking, losing semantic emphasis and making retrieved content less useful.
TL;DR
Markdown formatting (bold, italic, links, code) is stripped during chunking, losing semantic emphasis and making retrieved content less useful.
Key Takeaways
- The Problem
- Authentication Methods
- Deep Technical Analysis
- Section
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
Markdown formatting (bold, italic, links, code) is stripped during chunking, losing semantic emphasis and making retrieved content less useful.
Symptoms
- ❌
codeblocks render as plain text - ❌ bold emphasis lost in answers
- ❌ Links converted to bare URLs or text
- ❌ > Blockquotes not distinguished from body text
- ❌ Headers (##) indistinguishable from paragraphs
Real-World Example
Original markdown:
## Authentication Methods
We support **three authentication methods**:
1. **API Key**: Include `X-API-Key` header
2. **OAuth 2.0**: Use `/oauth/token` endpoint
3. **JWT**: Bearer token in `Authorization` header
See [Authentication Guide](/docs/auth) for details.
After naive text extraction:
Authentication Methods We support three authentication methods: API Key: Include X-API-Key header OAuth 2.0: Use /oauth/token endpoint JWT: Bearer token in Authorization header See Authentication Guide for details.
Lost:
- Header hierarchy (##)
- Bold emphasis (**important**)
- Inline code (`code`)
- List structure (1, 2, 3)
- Link destination ([text](url))
Deep Technical Analysis
Semantic Information in Formatting
Markdown formatting carries meaning:
Format Types and Semantics:
**bold** / __bold__:
→ Emphasis, importance, key terms
→ "The **API rate limit** is 1000/hour"
→ "API rate limit" is the critical information
*italic* / _italic_:
→ Subtle emphasis, definitions, foreign terms
→ "The *Bearer* token goes in the header"
`code`:
→ Technical terms, literals, exact syntax
→ "Set `DEBUG=true` in config"
→ Indicates must use exact string
[link text](url):
→ Reference to external resource
→ Anchor text vs destination
→ "See [API docs](https://api.example.com)"
The Loss of Signal:
Original: "Use the **POST** method, not GET"
Extracted: "Use the POST method, not GET"
Without bold:
→ LLM can't tell POST is emphasized
→ Might treat "POST" and "GET" equally
→ User wanted explicit guidance: "Use POST (important!)"
Similarly:
"Configure `max_connections=100` in settings"
→ Becomes: "Configure max_connections=100 in settings"
→ Lost signal that it's a literal config key
Conversion Strategies
Different approaches to handling markdown:
Strategy 1: Strip All Formatting (current)
Markdown → Plain text
Input: "The **API key** goes in the `X-API-Key` header"
Output: "The API key goes in the X-API-Key header"
Pros:
+ Simple, fast
+ Smaller text (fewer tokens)
Cons:
- Loses semantic emphasis
- Code not distinguished from prose
- Links lose destination
Strategy 2: Convert to Natural Language:
Markdown → Descriptive text
Input: "Use the **POST** method"
Output: "Use the POST method (emphasized)"
Input: "Set `DEBUG=true`"
Output: "Set the configuration value DEBUG to true"
Input: "[API docs](https://api.example.com)"
Output: "See API docs at https://api.example.com"
Pros:
+ Preserves semantic meaning
+ More verbose but clearer
Cons:
- Significantly more tokens
- Conversions may be awkward
- Not standard format
Strategy 3: Keep Markdown Syntax:
Markdown → Markdown (as-is)
Input: "Use the **POST** method"
Output: "Use the **POST** method"
Pros:
+ Preserves all formatting
+ LLMs trained on markdown (GPT-4, etc.)
+ Reversible (can render to HTML later)
Cons:
- Extra characters (**, `, etc.)
- May confuse simpler models
- Embeddings see formatting chars as content
Strategy 4: Convert to HTML:
Markdown → HTML
Input: "Use the **POST** method"
Output: "Use the <strong>POST</strong> method"
Input: "Set `DEBUG=true`"
Output: "Set <code>DEBUG=true</code>"
Pros:
+ Standard format
+ Preserves structure
+ Can render in UI
Cons:
- Verbose (<strong>, </strong>, etc.)
- More tokens than markdown
- HTML tags in embeddings
Inline Code vs Code Blocks
Different code representations need different handling:
Inline Code:
Use the `Authorization` header with value `Bearer {token}`.
Preservation Importance:
Without backticks:
"Use the Authorization header with value Bearer {token}."
Problems:
→ "Bearer {token}" might be interpreted as prose
→ LLM might rephrase as "Bearer token"
→ User needs exact literal: "Bearer {token}"
With backticks preserved:
"Use the `Authorization` header with value `Bearer {token}`."
LLM recognizes:
→ These are literal strings
→ Don't paraphrase
→ Use exactly as written
Code Blocks:
```python
def authenticate(api_key):
headers = {"Authorization": f"Bearer {api_key}"}
return requests.post(url, headers=headers)
```
Extraction Challenge:
Should we:
1. Keep triple-backticks and language identifier?
→ "```python\ndef authenticate..."
→ Preserves language context
→ But: Extra tokens
2. Strip fences, keep code?
→ "def authenticate(api_key):..."
→ Cleaner, fewer tokens
→ But: Lost language info
3. Add language metadata separately?
→ Text: "def authenticate..."
→ Metadata: {language: "python"}
→ Clean text, structured metadata
→ Requires metadata support
Link Handling and References
Links have both anchor text and destination:
Link Structure:
[GitHub Repository](https://github.com/company/repo)
Extraction Options:
Option 1: Keep anchor text only
→ "GitHub Repository"
→ Readable, concise
→ But: URL lost, can't navigate
Option 2: Keep URL only
→ "https://github.com/company/repo"
→ URL preserved
→ But: Lost descriptive text
Option 3: Expand to "text (url)"
→ "GitHub Repository (https://github.com/company/repo)"
→ Both preserved
→ But: More verbose
Option 4: Keep markdown syntax
→ "[GitHub Repository](https://github.com/company/repo)"
→ Full information
→ But: Extra characters
Reference Link Problem:
See the [API Guide][1] for details.
[1]: https://docs.example.com/api
Reference Resolution:
Reference defined elsewhere in document:
→ May be in different chunk
→ Must resolve before chunking
→ Or: Store unresolved references
Without resolution:
Chunk 1: "See the [API Guide][1] for details."
Chunk 2: "[1]: https://docs.example.com/api"
LLM sees Chunk 1:
→ "[1]" reference not resolved
→ Incomplete information
List Structure Preservation
Lists have hierarchical structure:
Ordered Lists:
1. First step
2. Second step
1. Sub-step A
2. Sub-step B
3. Third step
Flattening Problem:
Naive extraction:
"First step Second step Sub-step A Sub-step B Third step"
Lost:
→ Numbering (sequence matters)
→ Hierarchy (2.1, 2.2 are children of 2)
→ Step boundaries
Better extraction:
"Step 1: First step. Step 2: Second step (Sub-step A, Sub-step B). Step 3: Third step."
Preserves:
→ Sequence
→ Hierarchy (parenthetical sub-steps)
→ Boundaries
Unordered Lists:
Features:
- Fast performance
- Easy to use
- Highly scalable
Structure Loss:
Flattened:
"Features: Fast performance Easy to use Highly scalable"
Better:
"Features: (1) Fast performance, (2) Easy to use, (3) Highly scalable"
Or:
"Features include: Fast performance. Easy to use. Highly scalable."
Blockquotes and Callouts
Special blocks carry semantic meaning:
Blockquote:
> **Important**: Always validate user input before processing.
Semantic Significance:
Without formatting:
"Important: Always validate user input before processing."
Lost: Blockquote indicates this is advisory/warning
→ Not just normal prose
→ Carries weight, should be emphasized
With preservation:
"Note (blockquote): Important: Always validate user input before processing."
Or keep markdown:
"> **Important**: Always validate user input before processing."
Callout Boxes (GitHub-flavored):
> [!WARNING]
> This action cannot be undone!
> [!NOTE]
> This is useful information.
Semantic Type Loss:
Extracted as plain text:
"This action cannot be undone! This is useful information."
Lost distinction:
→ WARNING (critical) vs NOTE (informational)
→ Both treated equally
Better:
"Warning: This action cannot be undone."
"Note: This is useful information."
Header Hierarchy
Headers define document structure:
Markdown Headers:
# Main Title
## Section
### Subsection
#### Sub-subsection
Structural Information:
Flattened:
"Main Title Section Subsection Sub-subsection"
All run together, no hierarchy
Better approach:
"Document: Main Title > Section > Subsection > Sub-subsection"
Or preserve markdown:
"# Main Title\n## Section\n### Subsection"
Headers tell LLM:
→ Document structure
→ Topic hierarchy
→ Importance (# > ## > ###)
Embedding Model Considerations
How models handle formatted text:
Training Data Exposure:
LLMs (GPT-4, Claude) trained on:
→ Lots of markdown (GitHub, docs)
→ Understand **bold**, `code`, etc.
→ Can interpret formatting natively
Embedding models (ada-002) trained on:
→ Mixed text (markdown + plain)
→ Uncertain how well they handle formatting
Question:
→ Does "Use the **POST** method" embed differently than "Use the POST method"?
→ Does ** add noise or helpful signal?
→ No clear answer, model-dependent
Token Efficiency:
Plain text: "Set DEBUG to true" (4 tokens)
Markdown: "Set `DEBUG` to `true`" (8 tokens with backticks)
2x tokens for same semantic content
→ Higher embedding cost
→ Fills context window faster
Trade-off:
→ Semantic precision vs token efficiency
How to Solve
Keep markdown syntax intact for LLM consumption + convert special blocks (warnings, notes) to natural language labels + preserve inline code backticks + expand links to "text (URL)" format + maintain list numbering with explicit labels. See Markdown Handling.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/chunking/markdown-lost.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Related Pages
Integrations
Industries
Comparisons
Last updated January 26, 2026


