RAG Scenarios and Solutions

Broken Cross-References

Links and references between documents break during ingestion, causing AI to cite non-existent pages or fail to follow related content.

The Problem

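During ingestion, hyperlinks are typically flattened to their anchor text and cross-references are left unresolved. The retrieved text still mentions other pages and sections but carries no way to reach them, so the AI cites pages it cannot point to or fails to follow related content.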

Symptoms

  • ❌ "See section 3.2" appears, but the section is not linked
  • ❌ Hyperlinks become plain text
  • ❌ "Click here" survives with no actual link
  • ❌ Cross-document references are lost
  • ❌ Users cannot navigate to related content

Real-World Example

Source HTML documentation:
"For authentication details, see <a href="/docs/auth">Authentication Guide</a>"

After ingestion:
"For authentication details, see Authentication Guide"
→ Link lost, just plain text

AI response:
"See Authentication Guide for details"
→ User: "Where is Authentication Guide?"
→ No way to navigate

Deep Technical Analysis

HTML to Text Conversion:

HTML: <a href="/docs/setup">setup instructions</a>

Naive extraction: "setup instructions"
→ Link URL lost

Better extraction:
"setup instructions (/docs/setup)"
→ Preserve URL in text

Or metadata:
{
  text: "setup instructions",
  link: "/docs/setup",
  link_type: "internal"
}
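
As an illustration of the "better extraction" and metadata options above, here is a minimal sketch using Python's standard html.parser module. The class name LinkPreservingExtractor and the function extract_with_links are hypothetical, and the sketch assumes simple, non-nested <a> tags; a production extractor would need to handle more HTML edge cases.

```python
from html.parser import HTMLParser


class LinkPreservingExtractor(HTMLParser):
    """Collects text, rewriting <a href=...> elements as Markdown links."""

    def __init__(self):
        super().__init__()
        self.parts = []     # output text fragments, in document order
        self.links = []     # one metadata dict per link found
        self._href = None   # href of the <a> currently open, if any
        self._anchor = []   # text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        # Text inside a link is buffered; everything else is emitted as-is.
        if self._href is not None:
            self._anchor.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._anchor)
            self.parts.append(f"[{text}]({self._href})")
            self.links.append({"text": text, "link": self._href})
            self._href = None


def extract_with_links(html):
    """Return (text_with_markdown_links, link_metadata) for an HTML string."""
    parser = LinkPreservingExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts), parser.links
```

Calling extract_with_links on the HTML from the real-world example keeps the URL both inline and in the returned metadata, so either storage option remains available downstream.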

Relative vs Absolute URLs:

Relative: href="/docs/auth"
→ Needs base URL to resolve
→ Without base: Broken link

Absolute: href="https://example.com/docs/auth"
→ Self-contained
→ But: May be external (outside knowledge base)

Normalize every link to an absolute URL at ingestion time, using the source page's URL as the base.
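
A minimal sketch of this normalization using urljoin from Python's standard library. The function name normalize_link and the internal_hosts parameter are assumptions made for illustration; how a knowledge base decides what counts as "internal" will vary.

```python
from urllib.parse import urljoin, urlparse


def normalize_link(href, page_url, internal_hosts):
    """Resolve href against the page it appeared on and classify it."""
    absolute = urljoin(page_url, href)  # absolute hrefs pass through unchanged
    host = urlparse(absolute).netloc
    link_type = "internal" if host in internal_hosts else "external"
    return {"link": absolute, "link_type": link_type}


# Hypothetical usage: the page URL is whatever the crawler fetched.
resolved = normalize_link(
    "/docs/auth", "https://example.com/docs/setup", {"example.com"}
)
```

The same call also resolves same-page anchors such as "#troubleshooting" against the page URL, which keeps the page context that chunking would otherwise discard.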

Internal Reference Resolution

Section References:

"See section 3.2 for details"
→ Implicit reference
→ Which document's section 3.2?

Without context:
→ Cannot resolve
→ Link broken

Need: Document structure metadata
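
One possible way to use document structure metadata is sketched below. It assumes a hypothetical section_index that maps each document ID to its section numbers and the chunk holding each section; the regular expression only covers the "section N.N" phrasing and is not a general reference parser.

```python
import re

# Matches phrases like "section 3" or "Section 3.2.1".
SECTION_REF = re.compile(r"\bsection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def resolve_section_refs(chunk_text, doc_id, section_index):
    """Resolve implicit section references within the chunk's own document."""
    sections = section_index.get(doc_id, {})
    resolved = []
    for match in SECTION_REF.finditer(chunk_text):
        number = match.group(1)
        resolved.append({
            "reference": match.group(0),
            "section": number,
            # None marks a reference that could not be resolved.
            "target_chunk": sections.get(number),
        })
    return resolved
```

Because the lookup is keyed by doc_id, "section 3.2" is always interpreted relative to the document the chunk came from, which answers the "which document's section 3.2?" question.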

Anchor Links:

"<a href="#troubleshooting">Jump to troubleshooting</a>"
→ Same-page anchor
→ Page context lost after chunking

Chunk 5: "Jump to troubleshooting"
→ Where is "troubleshooting" section?
→ In chunk 12 of same document

Need: Intra-document link mapping
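
A rough sketch of such a mapping, assuming each chunk dictionary records the anchor IDs it contains under a hypothetical "anchors" key, alongside the "outbound_links" it holds. All field names here are illustrative.

```python
def build_anchor_map(chunks):
    """Map each anchor ID to the first chunk of the document that defines it."""
    anchor_map = {}
    for chunk in chunks:
        for anchor in chunk.get("anchors", []):
            anchor_map.setdefault(anchor, chunk["chunk_id"])
    return anchor_map


def resolve_anchor_links(chunks):
    """Annotate same-page links ("#...") with the chunk they point to."""
    anchor_map = build_anchor_map(chunks)
    for chunk in chunks:
        for link in chunk.get("outbound_links", []):
            url = link["url"]
            if url.startswith("#"):
                # None means the anchor was not found in this document.
                link["target_chunk"] = anchor_map.get(url[1:])
    return chunks
```

The function is meant to run once per document, over all of that document's chunks, so that anchors from different pages are never mixed.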

Citation Accuracy

"See Also" Links:

Documentation: "See also: Rate Limiting, Authentication"
→ Related topics listed

After ingestion:
→ Just plain text
→ No links to those topics

AI can mention them:
→ But cannot provide direct access
→ User must search manually
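
One way to restore these links, sketched under the assumption that the ingestion pipeline keeps a title index (lowercased page title to URL). The function name resolve_see_also and the URLs in the usage example are hypothetical.

```python
def resolve_see_also(line, title_index):
    """Turn a 'See also: A, B' line into link entries via a title lookup."""
    prefix, _, rest = line.partition(":")
    if prefix.strip().lower() != "see also":
        return []
    results = []
    for title in rest.split(","):
        title = title.strip()
        if title:
            # url is None when no page with that title was ingested.
            results.append(
                {"anchor_text": title, "url": title_index.get(title.lower())}
            )
    return results


# Hypothetical title index built while ingesting the knowledge base.
links = resolve_see_also(
    "See also: Rate Limiting, Authentication",
    {"rate limiting": "/docs/rate-limiting", "authentication": "/docs/auth"},
)
```

Exact title matching is a deliberate simplification; titles that differ from the listed topic names would need fuzzy matching or manual aliases.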

Page Numbers:

PDF: "See page 42 for details"
→ Page numbers lost in text extraction
→ PDF converted to continuous text

"See page 42" meaningless without page structure
→ Need: Map page numbers to chunk IDs
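
A small sketch of that mapping, assuming the PDF extractor records the first and last page each chunk covers under hypothetical "page_start" and "page_end" keys.

```python
from collections import defaultdict


def build_page_index(chunks):
    """Map each PDF page number to the IDs of the chunks that overlap it."""
    page_index = defaultdict(list)
    for chunk in chunks:
        # A chunk may span several pages, so it is listed under each one.
        for page in range(chunk["page_start"], chunk["page_end"] + 1):
            page_index[page].append(chunk["chunk_id"])
    return dict(page_index)
```

With this index, a reference such as "See page 42" can be resolved by looking up key 42 and retrieving the listed chunks.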

Markdown Format:

Store as Markdown with links:
"For details, see [Authentication Guide](/docs/auth)"

Benefits:
→ Links preserved
→ Can render as HTML
→ AI can cite with link

Metadata:
{
  markdown: "...with [link](url)...",
  links: ["/docs/auth"]
}
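
If chunks are stored as Markdown, the links list can be derived from the text itself. The sketch below uses a deliberately simple regular expression, not a full Markdown parser: it handles inline links of the form [text](url) only, and it would also pick up image syntax. The function name is hypothetical.

```python
import re

# Inline Markdown links: [anchor text](url). Reference-style links are ignored.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def markdown_chunk_metadata(markdown):
    """Build the chunk record shown above from a Markdown string."""
    links = [url for _text, url in MD_LINK.findall(markdown)]
    return {"markdown": markdown, "links": links}
```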

Hyperlink Metadata:

Each chunk:
{
  text: "...",
  outbound_links: [
    {url: "/docs/auth", anchor_text: "Authentication Guide"},
    {url: "#section-3", anchor_text: "section 3"}
  ],
  inbound_links: [...]
}

Enables:
→ Link graph construction
→ Related content discovery
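
As a sketch of the link graph construction, inbound links can be derived by inverting the outbound_links stored on each chunk. The "chunk_id" field and the function name are assumptions; the other field names follow the structure above.

```python
from collections import defaultdict


def build_inbound_links(chunks):
    """Invert outbound links: target URL -> the chunks that link to it."""
    inbound = defaultdict(list)
    for chunk in chunks:
        for link in chunk["outbound_links"]:
            inbound[link["url"]].append({
                "from_chunk": chunk["chunk_id"],
                "anchor_text": link["anchor_text"],
            })
    return dict(inbound)
```

Targets with many inbound entries are natural candidates for related content suggestions, and the anchor texts collected for a target describe how other pages refer to it.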

How to Solve

To keep cross-references intact through ingestion:

  • Preserve links during extraction (convert to Markdown or store as metadata)
  • Resolve relative URLs to absolute
  • Extract and store hyperlink metadata with chunks
  • Implement a document graph of cross-references
  • Map PDF page numbers to chunk IDs
  • Include source URLs in AI citations
  • Test link integrity post-ingestion

See Link Management.
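
The link integrity test could look roughly like the sketch below. It assumes links have already been normalized to absolute URLs and that known_urls is the set of page URLs actually ingested; both names, and the function itself, are hypothetical.

```python
from urllib.parse import urldefrag


def find_broken_links(chunks, known_urls):
    """Return (chunk_id, url) pairs whose target page was never ingested."""
    broken = []
    for chunk in chunks:
        for link in chunk.get("outbound_links", []):
            # Compare page URLs only; the "#fragment" part is dropped.
            page_url, _fragment = urldefrag(link["url"])
            if page_url not in known_urls:
                broken.append((chunk["chunk_id"], link["url"]))
    return broken
```

External links will also be reported here unless they are filtered out first or their hosts are added to known_urls; whether that is desirable depends on how the knowledge base treats external sources.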


Last updated January 26, 2026