Footnotes and References Lost
Footnotes, endnotes, and citations are separated from their reference markers during chunking, losing critical context and academic/legal citations.
The Problem
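When a chunk boundary falls between a reference marker and the note it points to, the marker ends up in one chunk and the footnote, endnote, or bibliography entry in another. The retrieved chunk then carries a marker that cannot be resolved, and the context or academic/legal citation behind it is lost.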
Symptoms
- ❌ Reference markers ([1], *, †) appear without footnotes
- ❌ Footnotes separated from main text
- ❌ "See note 5" but note 5 not in chunk
- ❌ Academic citations incomplete
- ❌ Legal references missing source
Real-World Example
Original document:
The API rate limit¹ is 1000 requests per hour for free users².
────────────────────────
¹ Rate limits reset at midnight UTC
² Enterprise plans have higher limits
Chunk boundary falls between text and footnotes ↓
Chunk 1:
"The API rate limit¹ is 1000 requests per hour for free users²."
Chunk 2:
"¹ Rate limits reset at midnight UTC
² Enterprise plans have higher limits"
A user who retrieves only Chunk 1 asks: "What does ¹ mean?"
The model cannot resolve the marker, because the footnote text is in Chunk 2.
Deep Technical Analysis
Footnote Types and Formats
Different notation systems:
Numbering Systems:
Numeric: [1], [2], [3] or ¹, ², ³
Alphabetic: [a], [b], [c]
Symbolic: *, †, ‡, §, ‖, ¶
Roman: [i], [ii], [iii]
Placement Variations:
Bottom of page (traditional):
Main text with marker¹
────────────────────
¹ Footnote content
End of section:
Main text with marker¹
[Section continues...]
Notes:
¹ Footnote content
End of document (endnotes):
Main text with marker¹
[Many pages later...]
Endnotes:
¹ Footnote content
Detection Challenges:
Superscript numbers:
→ Could be footnote: "value¹"
→ Or exponent: "x²"
→ Or chemical: "CO₂" (a subscript, which extraction can flatten into the same plain or raised digit)
Context needed to distinguish
Square brackets:
→ Could be footnote: "research [1] shows..."
→ Or citation: "see [Smith, 2020]"
→ Or array notation: "array[1]"
→ Or just brackets: "[optional] parameter"
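The ambiguities above can be narrowed with simple context rules. The following is a minimal sketch, not a complete detector: the function name `find_footnote_markers` and both heuristics (a superscript counts as a marker only after a word of two or more letters; a bracketed number counts only when it is not attached to an identifier) are assumptions made for illustration and will misclassify some real text.

```python
import re

SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
TO_ASCII = str.maketrans(SUPERSCRIPTS, "0123456789")

# Heuristic (assumption): a superscript run directly after a word of two or
# more letters is a footnote marker ("value¹"); after a single letter it is
# more likely an exponent ("x²"). Subscripts such as "₂" are never matched.
SUPERSCRIPT_MARKER = re.compile(r"(?<=[A-Za-z]{2})([⁰¹²³⁴⁵⁶⁷⁸⁹]+)")

# Heuristic (assumption): "[1]" is a marker only when it is not glued to an
# identifier, which excludes "array[1]". Non-numeric brackets are ignored.
BRACKET_MARKER = re.compile(r"(?<![A-Za-z0-9_\]])\[(\d+)\]")


def find_footnote_markers(text: str) -> list[tuple[int, str]]:
    """Return (offset, number) pairs for likely footnote markers, in text order."""
    markers = [(m.start(1), m.group(1).translate(TO_ASCII))
               for m in SUPERSCRIPT_MARKER.finditer(text)]
    markers += [(m.start(), m.group(1)) for m in BRACKET_MARKER.finditer(text)]
    return sorted(markers)


sample = "The value¹ is x² for CO₂, array[1] is skipped, but research [2] shows more."
print([number for _, number in find_footnote_markers(sample)])  # ['1', '2']
```

Author-year brackets such as "[Smith, 2020]" are deliberately left alone here; they are citations rather than numbered notes and need their own handling.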
Marker-to-Note Matching
Associating references with definitions:
Matching Algorithm:
1. Scan document for markers: ¹, ², ³
2. Scan document for footnote definitions:
→ Look for "¹ " at start of line
→ Or "──" separator followed by notes
3. Match markers to definitions by number/symbol
4. Store associations
Challenges:
→ Multiple numbering systems in same doc
→ Nested footnotes (footnotes referencing footnotes)
→ Reused numbers across chapters
→ Non-contiguous numbering (1, 2, 5, 7 - some missing)
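The four matching steps can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it handles only superscript-digit markers, assumes each definition sits on its own line starting with the marker, and the name `match_footnotes` is hypothetical. Non-contiguous numbering is tolerated because matching is done by number rather than by position.

```python
import re

TO_ASCII = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")

# Step 2: a definition is a line that starts with a superscript number.
DEFINITION = re.compile(r"^([⁰¹²³⁴⁵⁶⁷⁸⁹]+)[ \t]+(.+)$", re.MULTILINE)
# Step 1: a marker is a superscript run attached to preceding text.
MARKER = re.compile(r"(?<=\S)([⁰¹²³⁴⁵⁶⁷⁸⁹]+)")


def match_footnotes(document: str) -> dict:
    """Match markers to definitions by number and report markers with no note."""
    definitions = {m.group(1).translate(TO_ASCII): m.group(2).strip()
                   for m in DEFINITION.finditer(document)}
    markers = [m.group(1).translate(TO_ASCII) for m in MARKER.finditer(document)]
    # Steps 3 and 4: repeated markers collapse onto the same stored entry.
    resolved = {n: definitions[n] for n in markers if n in definitions}
    unresolved = sorted(set(markers) - set(definitions), key=int)
    return {"resolved": resolved, "unresolved": unresolved}


doc = (
    "The API rate limit¹ is 1000 requests per hour for free users².\n"
    "────────────────────────\n"
    "¹ Rate limits reset at midnight UTC\n"
    "² Enterprise plans have higher limits\n"
)
result = match_footnotes(doc)
print(result["resolved"]["1"])  # Rate limits reset at midnight UTC
```

Reporting unresolved markers separately makes it possible to flag chunks whose notes were lost upstream instead of silently dropping them.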
The Multiple Reference Problem:
Main text:
"This concept¹ was explored by several researchers¹."
Same footnote referenced twice
→ Footnote 1 appears once at bottom
→ Two markers in text
→ Must link both to same footnote
Cross-Chapter Footnotes:
Chapter 1 footnotes: 1-15
Chapter 2 footnotes: 1-12 (numbering restarts!)
Marker "2" in Chapter 2 ≠ Marker "2" in Chapter 1
→ Must track context (chapter/section)
→ Avoid mixing footnotes across chapters
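One way to track that context is to key every footnote by its chapter as well as its number. The sketch below assumes the footnotes have already been extracted per chapter; `build_scoped_index` and `resolve_marker` are hypothetical helper names.

```python
from typing import Optional


def build_scoped_index(
    notes_by_chapter: dict[str, dict[int, str]]
) -> dict[tuple[str, int], str]:
    """Flatten per-chapter footnotes into one index keyed by (chapter, number)."""
    return {
        (chapter, number): text
        for chapter, notes in notes_by_chapter.items()
        for number, text in notes.items()
    }


def resolve_marker(
    index: dict[tuple[str, int], str], chapter: str, number: int
) -> Optional[str]:
    """Look up a marker in the chapter where it appears; None if it has no note."""
    return index.get((chapter, number))


index = build_scoped_index({
    "chapter-1": {1: "First note of chapter 1", 2: "Second note of chapter 1"},
    "chapter-2": {1: "First note of chapter 2", 2: "Second note of chapter 2"},
})
print(resolve_marker(index, "chapter-2", 2))  # Second note of chapter 2
```

The same composite key can be stored in chunk metadata so that a marker retrieved later is resolved against the right chapter.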
Academic Citations
Scholarly documents use formal citations:
Citation Formats:
IEEE: [1], [2], [3] (numeric)
APA: (Smith, 2020) (author-year)
MLA: (Smith 24) (author-page)
Chicago: Superscript¹ (note-based)
Harvard: (Smith 2020, p.15) (author-year-page)
Inline vs Bibliography:
Inline citation:
"Previous research [1] demonstrated..."
Bibliography (separate section):
[1] Smith, J. (2020). Title of Paper. Journal Name, 15(3), 123-145.
Chunking challenge:
→ Citation [1] in main text
→ Full reference in bibliography (different location)
→ May be in different chunk entirely
LLM sees [1] but can't resolve to full citation
Citation Clustering:
Multiple citations together:
"This is well documented [1,2,3,5-8,12]."
Represents:
→ 8 separate citations (1, 2, 3, 5, 6, 7, 8, 12)
→ Must expand ranges (5-8)
→ Link all to bibliography
If chunk contains this line:
→ Need ALL 8 bibliography entries
→ But they may span multiple pages
→ Impractical to include all
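Expanding a cluster into individual reference numbers is mechanical. The sketch below assumes IEEE-style numeric clusters with commas and hyphen or en-dash ranges; the names `expand_cluster` and `citations_in` are illustrative.

```python
import re

# One bracketed cluster of numbers and ranges, e.g. "[1,2,3,5-8,12]".
CLUSTER = re.compile(
    r"\[(\d+(?:\s*[-–]\s*\d+)?(?:\s*,\s*\d+(?:\s*[-–]\s*\d+)?)*)\]"
)


def expand_cluster(cluster_body: str) -> list[int]:
    """Turn "1,2,3,5-8,12" into the individual reference numbers."""
    numbers: list[int] = []
    for part in cluster_body.split(","):
        bounds = re.split(r"\s*[-–]\s*", part.strip())
        if len(bounds) == 2:
            numbers.extend(range(int(bounds[0]), int(bounds[1]) + 1))
        else:
            numbers.append(int(bounds[0]))
    return numbers


def citations_in(text: str) -> list[int]:
    """Collect every cited reference number in text, in first-seen order."""
    found: list[int] = []
    for match in CLUSTER.finditer(text):
        for number in expand_cluster(match.group(1)):
            if number not in found:
                found.append(number)
    return found


print(citations_in("This is well documented [1,2,3,5-8,12]."))
# [1, 2, 3, 5, 6, 7, 8, 12]
```

With the numbers expanded, a chunk can store just the reference IDs in metadata and fetch the full bibliography entries only when they are needed, rather than copying every entry into the chunk.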
Legal Citations
Legal documents have specific citation requirements:
Legal Citation Format:
Case law: Smith v. Jones, 123 F.3d 456 (9th Cir. 2020)
Statute: 42 U.S.C. § 1983
Regulation: 17 C.F.R. § 240.10b-5
Components:
→ Case name
→ Reporter volume & page
→ Court
→ Year
The String Citation:
Legal writing uses "string citations":
"This principle is established. See Smith v. Jones, 123 F.3d 456, 460 (9th Cir. 2020); Doe v. Roe, 789 F.2d 123, 125 (2d Cir. 2019); Johnson v. Williams, 456 F.Supp. 789 (S.D.N.Y. 2018)."
Single sentence with 3 citations
→ Must keep together
→ Splitting loses context
→ "See Smith v. Jones" alone is incomplete
Abbreviated Citations:
First reference (full):
"Smith v. Jones, 123 F.3d 456 (9th Cir. 2020)"
Later references (short):
"Smith, 123 F.3d at 460"
or just: "Id. at 461" (same case as previous)
"Id." depends on previous citation
→ Must track citation history
→ If chunk starts mid-document: "Id." unresolvable
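Tracking citation history can be sketched as a single pass that remembers the last full citation. This example assumes the citations have already been split into an ordered list of strings and handles only the "Id." form; short-name forms such as "Smith, 123 F.3d at 460" pass through unchanged. The regular expressions and the name `resolve_id_citations` are assumptions for illustration and do not cover the full range of legal citation formats.

```python
import re

# Full case citation: "<Party> v. <Party>, <volume> <reporter> <page> ..."
FULL = re.compile(
    r"(?P<name>.+? v\. .+?), (?P<volume>\d+) (?P<reporter>[A-Za-z0-9. ]+?) (?P<page>\d+)"
)
# "Id." optionally followed by a pin cite: "Id. at 461"
ID_FORM = re.compile(r"Id\.(?: at (?P<pin>\d+))?$")


def resolve_id_citations(citations: list[str]) -> list[str]:
    """Replace "Id." forms with a short form of the most recent full citation."""
    resolved: list[str] = []
    last = None  # fields of the most recent full citation, if any
    for citation in citations:
        full = FULL.match(citation)
        short = ID_FORM.match(citation)
        if full:
            last = full.groupdict()
            resolved.append(citation)
        elif short and last:
            base = f"{last['name']}, {last['volume']} {last['reporter']}"
            pin = short.group("pin")
            resolved.append(f"{base} at {pin}" if pin else f"{base} {last['page']}")
        else:
            # No history available (e.g. chunk starts mid-document): leave as-is.
            resolved.append(citation)
    return resolved


print(resolve_id_citations(
    ["Smith v. Jones, 123 F.3d 456 (9th Cir. 2020)", "Id. at 461"]
))
# ['Smith v. Jones, 123 F.3d 456 (9th Cir. 2020)', 'Smith v. Jones, 123 F.3d at 461']
```

Running this over the whole document before chunking means each chunk receives citations that are already self-contained.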
Footnote Content Length
Footnotes vary from brief to extensive:
Short Footnotes:
Main text: "The API¹ supports JSON."
Footnote: ¹ Application Programming Interface
Brief definition: 5-10 words
→ Easy to include with main text
Long Footnotes:
Main text: "The algorithm¹ is efficient."
Footnote: ¹ The algorithm is based on dynamic programming principles first introduced by Bellman (1957) and later refined by Dijkstra (1959). Modern implementations typically use a priority queue for efficiency. For a comprehensive treatment of the theoretical foundations, see Cormen et al. (2009), Chapter 24. Note that the worst-case time complexity is O(n²) for dense graphs but can be reduced to O(n log n) with appropriate data structures. [200 words...]
Footnote longer than main text!
→ Including footnote inflates chunk size
→ Excluding footnote loses critical detail
The Inclusion Decision:
Options:
1. Always include footnotes in chunk
→ Pros: Complete context
→ Cons: Very large chunks, repetition
2. Never include footnotes
→ Pros: Smaller chunks
→ Cons: Incomplete information
3. Include short footnotes (<50 words)
→ Pros: Balance
→ Cons: Arbitrary threshold, inconsistent
4. Include footnotes as separate chunks with back-references
→ Pros: Modular
→ Cons: Complex linking required
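Options 3 and 4 can be combined: inline short notes and link long ones. The following sketch assumes footnotes are already matched to their markers; the `Chunk` class, the `attach_footnotes` function, the `[note: ...]` inline format, and the 50-word default are all illustrative choices rather than a fixed recommendation.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Long footnotes kept out of the text but linked by marker.
    linked_footnotes: dict[str, str] = field(default_factory=dict)


def attach_footnotes(
    text: str, footnotes: dict[str, str], max_inline_words: int = 50
) -> Chunk:
    """Inline footnotes under the word limit; attach longer ones as metadata."""
    chunk = Chunk(text=text)
    for marker, note in footnotes.items():
        if marker not in text:
            continue
        if len(note.split()) < max_inline_words:
            # Short note: replace the marker with the note itself.
            chunk.text = chunk.text.replace(marker, f" [note: {note}]")
        else:
            # Long note: keep the marker in the text and link the note.
            chunk.linked_footnotes[marker] = note
    return chunk


short = attach_footnotes(
    "The API¹ supports JSON.", {"¹": "Application Programming Interface"}
)
print(short.text)  # The API [note: Application Programming Interface] supports JSON.
```

The plain string replacement assumes single-character markers that do not overlap (for example "¹" and "¹²"); multi-digit markers would need the longest marker to be replaced first.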
Inline Notes vs Margin Notes
Different annotation styles:
Inline Parenthetical:
"The result (as shown in Figure 3) demonstrates..."
Not a footnote, but similar
→ Interruptive aside
→ Could be moved to footnote
→ But author chose inline
Should chunk preserve parenthetical positioning?
Or normalize: "The result demonstrates... See Figure 3."
Margin Notes:
Document layout:
┌────────────────┬──────────┐
│ Main text here │ Note: Im-│
│ continues with │ portant │
│ more content. │ detail! │
└────────────────┴──────────┘
Margin note parallel to main text
→ Not clearly "after" any specific paragraph
→ Extraction order ambiguous
→ Associate with nearby text? Which paragraph?
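If the extraction step provides layout coordinates, one possible answer to "which paragraph?" is the paragraph with the greatest vertical overlap. The sketch below assumes each block comes with top and bottom coordinates on the same page, with y increasing downward; the `Box` type and `assign_margin_notes` are hypothetical and not tied to any particular PDF library.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    top: float
    bottom: float


def vertical_overlap(a: Box, b: Box) -> float:
    """Length of the vertical span shared by two boxes (0 if they do not overlap)."""
    return max(0.0, min(a.bottom, b.bottom) - max(a.top, b.top))


def assign_margin_notes(
    paragraphs: list[Box], notes: list[Box]
) -> dict[int, list[str]]:
    """Map paragraph index -> margin notes that overlap it the most vertically."""
    assignments: dict[int, list[str]] = {}
    if not paragraphs:
        return assignments
    for note in notes:
        overlaps = [vertical_overlap(p, note) for p in paragraphs]
        best = max(range(len(paragraphs)), key=overlaps.__getitem__)
        if overlaps[best] > 0:  # notes with no overlap stay unassigned
            assignments.setdefault(best, []).append(note.text)
    return assignments


paragraphs = [Box("Main text here", 0, 100), Box("more content.", 110, 200)]
notes = [Box("Note: Important detail!", 120, 160)]
print(assign_margin_notes(paragraphs, notes))  # {1: ['Note: Important detail!']}
```

Notes that overlap no paragraph are left unassigned on purpose, so they can be reviewed or emitted as standalone chunks instead of being attached to the wrong text.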
Reference Loops and Nested Notes
Complex referencing structures:
Footnote Referencing Footnote:
Main text: "The concept¹ is fundamental."
Footnote 1: "See also related work²"
Footnote 2: "Smith (2020) provides comprehensive review."
Multi-level reference chain
→ Resolving ¹ requires ²
→ Deep linking required
Circular References:
Footnote A: "See Footnote B for details"
Footnote B: "As mentioned in Footnote A..."
Circular dependency
→ Cannot resolve independently
→ Must include both together
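Both chains and cycles can be handled by collecting every footnote reachable from a starting note while remembering which ones were already visited. This is a minimal sketch; it assumes the references between footnotes have already been extracted into a mapping, and `collect_footnote_closure` is an illustrative name.

```python
def collect_footnote_closure(
    start: str, references: dict[str, list[str]]
) -> list[str]:
    """Return the start footnote plus every footnote reachable from it.

    `references` maps a footnote id to the ids of footnotes it mentions.
    The visited list stops circular references from looping forever.
    """
    visited: list[str] = []
    stack = [start]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.append(current)
        # Reverse so that references are visited in their written order.
        stack.extend(reversed(references.get(current, [])))
    return visited


# Multi-level chain: footnote 1 points to footnote 2.
print(collect_footnote_closure("1", {"1": ["2"], "2": []}))  # ['1', '2']
# Circular pair: A and B reference each other.
print(collect_footnote_closure("A", {"A": ["B"], "B": ["A"]}))  # ['A', 'B']
```

The returned list is the set of notes that should travel together with the chunk containing the original marker.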
Embedding and Retrieval Impact
Footnotes affect semantic search:
Footnote Content in Embeddings:
Query: "API rate limit reset time"
Option 1: Embed main text only
"The API rate limit is 1000 requests per hour."
→ Doesn't match query (no "reset" mentioned)
Option 2: Embed main text + footnotes
"The API rate limit is 1000 requests per hour. Rate limits reset at midnight UTC."
→ Matches query! Footnote has answer
Conclusion: Embed footnote text together with the main text it annotates; otherwise a query answered only by the footnote will not match the chunk.
Citation Noise:
Academic paper full of citations:
"This approach [1,2,3] outperforms baseline [4,5] with significant improvements [6,7,8,9]."
Embedding includes: [1,2,3,4,5,6,7,8,9]
→ Citation numbers are noise
→ No semantic value
→ Dilute actual content signal
Should the markers [1] through [9] be stripped before embedding?
→ Stripping alone loses traceability
→ Sources can no longer be cited unless the markers are stored separately
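One way to get both clean embeddings and traceability is to strip the markers from the text that is embedded while keeping them as metadata. The sketch below assumes numeric bracket citations only; `split_citations` is a hypothetical helper, and the pattern would also strip index expressions such as "array[1]", so code-heavy text needs a stricter rule.

```python
import re

# Numeric citation groups such as "[4,5]" or "[6-9]", with any leading space.
NUMERIC_CITATION = re.compile(r"\s*\[(\d+(?:\s*[-–,]\s*\d+)*)\]")


def split_citations(text: str) -> tuple[str, list[str]]:
    """Return (text without citation markers, the markers that were removed)."""
    citations: list[str] = []

    def _collect(match: re.Match) -> str:
        citations.append(match.group(1))
        return ""

    clean = NUMERIC_CITATION.sub(_collect, text)
    return clean, citations


clean, cited = split_citations(
    "This approach [1,2,3] outperforms baseline [4,5] "
    "with significant improvements [6,7,8,9]."
)
print(clean)  # This approach outperforms baseline with significant improvements.
print(cited)  # ['1,2,3', '4,5', '6,7,8,9']
```

The clean string goes to the embedding model, while the original text and the extracted markers stay in the chunk's metadata so that answers can still cite their sources.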
How to Solve
- Detect footnote markers (superscripts, brackets)
- Match markers to footnote definitions at page/section end
- Inline short footnotes (<50 words) directly
- Link long footnotes as metadata
- Resolve "Id." and abbreviated citations
- Strip citation brackets from embeddings but store them separately
See Footnote Handling.
Last updated January 26, 2026


