Broken Cross-References
Links and references between documents break during ingestion, causing AI to cite non-existent pages or fail to follow related content.
Key Takeaways
- The Problem
- Deep Technical Analysis
- How to Solve
- Agent Instructions: Querying This Documentation
The Problem
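When an ingestion pipeline converts documents to plain text and splits them into chunks, hyperlinks, anchors, and page structure are often discarded. The AI then cites pages by name with no way to reach them, or cannot follow references to related content.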
Symptoms
- ❌ "See section 3.2" appears in answers, but the section is not linked
- ❌ Hyperlinks become plain text
- ❌ "Click here" survives with no actual link behind it
- ❌ Cross-document references are lost
- ❌ Users cannot navigate to related content
Real-World Example
Source HTML documentation:
"For authentication details, see <a href="/docs/auth">Authentication Guide</a>"
After ingestion:
"For authentication details, see Authentication Guide"
→ Link lost, just plain text
AI response:
"See Authentication Guide for details"
→ User: "Where is Authentication Guide?"
→ No way to navigate
Deep Technical Analysis
Link Extraction Failure
HTML to Text Conversion:
HTML: <a href="/docs/setup">setup instructions</a>
Naive extraction: "setup instructions"
→ Link URL lost
Better extraction:
"setup instructions (/docs/setup)"
→ Preserve URL in text
Or metadata:
{
  "text": "setup instructions",
  "link": "/docs/setup",
  "link_type": "internal"
}
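The metadata approach above can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not a complete extractor: the class name `LinkPreservingExtractor`, the helper `extract`, and the rule that any href without a scheme counts as internal are all assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkPreservingExtractor(HTMLParser):
    """Collects visible text while recording each <a href> it passes through."""

    def __init__(self):
        super().__init__()
        self.parts = []      # text fragments in document order
        self.links = []      # one dict per hyperlink
        self._href = None    # href of the <a> currently open, if any
        self._anchor = []    # text collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        self.parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({
                "text": "".join(self._anchor).strip(),
                "link": self._href,
                # Assumption: an href without a scheme points inside the knowledge base.
                "link_type": "external" if "://" in self._href else "internal",
            })
            self._href = None


def extract(html):
    """Return (plain_text, links) for an HTML fragment."""
    parser = LinkPreservingExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts), parser.links


text, links = extract('See the <a href="/docs/setup">setup instructions</a> first.')
```

The plain text stays clean for embedding, while `links` carries the URL that naive extraction would have dropped.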
Relative vs Absolute URLs:
Relative: href="/docs/auth"
→ Needs base URL to resolve
→ Without base: Broken link
Absolute: href="https://example.com/docs/auth"
→ Self-contained
→ But: May be external (outside knowledge base)
Relative URLs must be normalized to absolute URLs at ingestion, using the source page's URL as the base
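Normalization can be sketched with `urllib.parse.urljoin` from the standard library. The function name `normalize_link` and the constant `KNOWLEDGE_BASE_HOST` are hypothetical, and treating "same host" as "internal" is an assumption that may not fit every deployment.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical host of the knowledge base; substitute your own.
KNOWLEDGE_BASE_HOST = "example.com"


def normalize_link(href, page_url):
    """Resolve href against the page it came from and classify the result."""
    absolute = urljoin(page_url, href)
    host = urlparse(absolute).netloc
    # Assumption: a link is internal when it resolves to the knowledge base host.
    link_type = "internal" if host == KNOWLEDGE_BASE_HOST else "external"
    return {"url": absolute, "link_type": link_type}


page = "https://example.com/docs/setup"
internal = normalize_link("/docs/auth", page)
external = normalize_link("https://other.example.org/faq", page)
```

Because `urljoin` needs the page URL, the source URL of every document has to be recorded before chunking.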
Internal Reference Resolution
Section References:
"See section 3.2 for details"
→ Implicit reference
→ Which document's section 3.2?
Without context:
→ Cannot resolve
→ Link broken
Need: Document structure metadata
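One way to use document structure metadata is to keep a per-document index from section number to the chunk that holds it, then resolve mentions against the index of the chunk's own document. The sketch below assumes such an index exists; `SECTION_INDEX`, the document ID `api-guide`, and the chunk IDs are hypothetical.

```python
import re

# Matches "section 3", "Section 3.2", "section 3.2.1", and so on.
SECTION_REF = re.compile(r"\b[Ss]ection\s+(\d+(?:\.\d+)*)")

# Hypothetical structure metadata captured at ingestion time:
# document ID -> section number -> ID of the chunk holding that section.
SECTION_INDEX = {
    "api-guide": {"3.1": "api-guide#c7", "3.2": "api-guide#c8"},
}


def resolve_section_refs(chunk_text, doc_id):
    """Return (section_number, target_chunk_id or None) for each mention."""
    sections = SECTION_INDEX.get(doc_id, {})
    return [(num, sections.get(num)) for num in SECTION_REF.findall(chunk_text)]


refs = resolve_section_refs("See section 3.2 for details", "api-guide")
```

A `None` target marks a reference that could not be resolved, which is worth logging rather than silently dropping.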
Anchor Links:
"<a href="#troubleshooting">Jump to troubleshooting</a>"
→ Same-page anchor
→ Page context lost after chunking
Chunk 5: "Jump to troubleshooting"
→ Where is "troubleshooting" section?
→ In chunk 12 of same document
Need: Intra-document link mapping
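An intra-document link map can be as simple as a dictionary from anchor ID to chunk ID, built while chunking. The sketch below assumes each chunk record lists the `id` attributes found inside it; the field names `chunk_id` and `anchors` are hypothetical.

```python
def build_anchor_map(chunks):
    """Map each anchor ID to the chunk that contains its target element."""
    anchor_map = {}
    for chunk in chunks:
        for anchor_id in chunk["anchors"]:
            anchor_map[anchor_id] = chunk["chunk_id"]
    return anchor_map


def resolve_anchor(href, anchor_map):
    """Resolve a same-page href such as '#troubleshooting' to a chunk ID."""
    if not href.startswith("#"):
        return None  # not a same-page anchor
    return anchor_map.get(href[1:])


# Hypothetical chunk records for one document.
chunks = [
    {"chunk_id": 5, "anchors": []},
    {"chunk_id": 12, "anchors": ["troubleshooting"]},
]
anchor_map = build_anchor_map(chunks)
target = resolve_anchor("#troubleshooting", anchor_map)
```

With the map stored alongside the document, a retriever that returns chunk 5 can also fetch chunk 12 when the anchor is relevant.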
Citation Accuracy
"See Also" Links:
Documentation: "See also: Rate Limiting, Authentication"
→ Related topics listed
After ingestion:
→ Just plain text
→ No links to those topics
AI can mention them:
→ But cannot provide direct access
→ User must search manually
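If ingestion keeps an index from page title to URL, "See also" lines can be turned back into links. This is a rough sketch under that assumption; `resolve_see_also`, the `title_index` contents, and the URLs in it are made up for illustration.

```python
def resolve_see_also(line, title_index):
    """Turn 'See also: A, B' into (title, url or None) pairs."""
    _, _, titles = line.partition(":")
    pairs = []
    for title in titles.split(","):
        title = title.strip()
        if title:
            # Titles are matched case-insensitively against the index.
            pairs.append((title, title_index.get(title.lower())))
    return pairs


# Hypothetical title index built during ingestion.
title_index = {
    "rate limiting": "/docs/rate-limiting",
    "authentication": "/docs/auth",
}
related = resolve_see_also("See also: Rate Limiting, Authentication", title_index)
```

Exact title matching is brittle; titles that do not match come back with `None` so they can be reviewed.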
Page Numbers:
PDF: "See page 42 for details"
→ Page numbers lost in text extraction
→ PDF converted to continuous text
"See page 42" meaningless without page structure
→ Need: Map page numbers to chunk IDs
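Mapping page numbers to chunk IDs requires that the PDF extractor record which pages each chunk spans. Assuming it does, a page index can be sketched as follows; the field names `page_start` and `page_end` and the chunk IDs are hypothetical.

```python
import re

PAGE_REF = re.compile(r"\b[Pp]age\s+(\d+)")


def build_page_index(chunks):
    """Map each page number to the IDs of chunks that overlap that page."""
    index = {}
    for chunk in chunks:
        for page in range(chunk["page_start"], chunk["page_end"] + 1):
            index.setdefault(page, []).append(chunk["chunk_id"])
    return index


def resolve_page_refs(text, page_index):
    """Return {page_number: [chunk IDs]} for every 'page N' mention in text."""
    return {int(p): page_index.get(int(p), []) for p in PAGE_REF.findall(text)}


# Hypothetical chunk records; a chunk may straddle a page boundary.
chunks = [
    {"chunk_id": "c30", "page_start": 41, "page_end": 42},
    {"chunk_id": "c31", "page_start": 42, "page_end": 43},
]
page_index = build_page_index(chunks)
resolved = resolve_page_refs("See page 42 for details", page_index)
```

A page usually maps to more than one chunk, so the index stores a list per page rather than a single ID.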
Link Preservation Strategies
Markdown Format:
Store as Markdown with links:
"For details, see [Authentication Guide](/docs/auth)"
Benefits:
→ Links preserved
→ Can render as HTML
→ AI can cite with link
Metadata:
{
  "markdown": "...with [link](url)...",
  "links": ["/docs/auth"]
}
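Once content is stored as Markdown, the `links` field can be derived from the text itself. The sketch below uses a simple regular expression for inline links; it is an approximation (it also matches image syntax and ignores reference-style links), and `chunk_record` is a hypothetical helper name.

```python
import re

# Inline Markdown links of the form [text](url). Approximate by design.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def chunk_record(markdown):
    """Build the stored record: the Markdown itself plus its link targets."""
    return {
        "markdown": markdown,
        "links": [url for _text, url in MD_LINK.findall(markdown)],
    }


record = chunk_record("For details, see [Authentication Guide](/docs/auth)")
```

For production use, a real Markdown parser would be a safer choice than a regular expression.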
Hyperlink Metadata:
Each chunk:
{
  "text": "...",
  "outbound_links": [
    {"url": "/docs/auth", "anchor_text": "Authentication Guide"},
    {"url": "#section-3", "anchor_text": "section 3"}
  ],
  "inbound_links": [...]
}
Enables:
→ Link graph construction
→ Related content discovery
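Inbound links do not need to be extracted separately: they can be computed by inverting the outbound links across all chunks. A minimal sketch, assuming each chunk record also carries the `url` of its source page (a field not shown in the structure above):

```python
from collections import defaultdict


def build_inbound_links(chunks):
    """Invert outbound_links so each URL knows which pages point at it."""
    inbound = defaultdict(list)
    for chunk in chunks:
        for link in chunk["outbound_links"]:
            inbound[link["url"]].append(
                {"from": chunk["url"], "anchor_text": link["anchor_text"]}
            )
    return dict(inbound)


# Hypothetical chunk records with already-normalized URLs.
chunks = [
    {"url": "/docs/setup", "outbound_links": [
        {"url": "/docs/auth", "anchor_text": "Authentication Guide"}]},
    {"url": "/docs/rate-limits", "outbound_links": [
        {"url": "/docs/auth", "anchor_text": "auth"}]},
    {"url": "/docs/auth", "outbound_links": []},
]
inbound = build_inbound_links(chunks)
```

Pages with many inbound links are good candidates to surface as related content when any of their referrers is retrieved.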
How to Solve
- Preserve links during extraction (convert to Markdown or metadata)
- Resolve relative URLs to absolute
- Extract and store hyperlink metadata with chunks
- Implement a document graph (cross-references)
- Map PDF page numbers to chunk IDs
- Include source URLs in AI citations
- Test link integrity post-ingestion
See Link Management.
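A post-ingestion link integrity test can be sketched as a check that every internal link points at a page that was actually ingested. The example assumes outbound links are stored per chunk as internal paths and that the set of ingested page URLs is available; `find_broken_links` and the sample data are hypothetical.

```python
def find_broken_links(chunks, known_urls):
    """Return (source_url, target_url) for links whose target was not ingested."""
    broken = []
    for chunk in chunks:
        for link in chunk["outbound_links"]:
            target = link["url"].split("#", 1)[0]  # compare without the fragment
            # Same-page anchors have an empty path and are skipped here.
            if target and target not in known_urls:
                broken.append((chunk["url"], link["url"]))
    return broken


# Hypothetical ingested data.
chunks = [
    {"url": "/docs/setup", "outbound_links": [
        {"url": "/docs/auth#tokens", "anchor_text": "tokens"},
        {"url": "/docs/billing", "anchor_text": "billing"},
        {"url": "#section-3", "anchor_text": "section 3"},
    ]},
]
known_urls = {"/docs/setup", "/docs/auth"}
broken = find_broken_links(chunks, known_urls)
```

Running a check like this after each ingestion makes missing targets visible before users encounter them in citations.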
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/data-quality/broken-links.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


