Inconsistent Document Metadata
Missing, incorrect, or inconsistent metadata across documents prevents effective filtering, search, and access control in RAG retrieval.
The Problem
When metadata is missing, incorrect, or inconsistent across documents, metadata filters silently exclude relevant documents, search facets become unreliable, and access-control rules cannot be enforced during RAG retrieval.
Symptoms
- ❌ Some docs have metadata, others don't
- ❌ Same field, different names ("author" vs "created_by")
- ❌ Dates in multiple formats
- ❌ Cannot filter by category/department
- ❌ Access control metadata missing
Real-World Example
Document A metadata:
{
"author": "john.smith@company.com",
"created": "2024-01-15",
"department": "Engineering",
"sensitivity": "internal"
}
Document B metadata:
{
"created_by": "Jane Doe",
"date": "Jan 15, 2024",
"dept": "Eng"
}
Document C metadata:
{
// No metadata at all
}
Query with filter: WHERE department = "Engineering"
→ Matches Doc A only
→ Doc B uses "dept" (different field)
→ Doc C has no metadata
→ Incomplete results despite relevant content in B and C
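The failure above can be reproduced with a few lines of Python. This is only an illustrative sketch: the `docs` dictionary and the `filter_by` helper are hypothetical stand-ins for a real vector store's metadata filter.

```python
# Metadata for the three example documents (content omitted).
docs = {
    "A": {
        "author": "john.smith@company.com",
        "created": "2024-01-15",
        "department": "Engineering",
        "sensitivity": "internal",
    },
    "B": {"created_by": "Jane Doe", "date": "Jan 15, 2024", "dept": "Eng"},
    "C": {},  # no metadata at all
}


def filter_by(docs, field, value):
    """Return the IDs of documents whose metadata has field == value."""
    return [doc_id for doc_id, meta in docs.items() if meta.get(field) == value]


# Equivalent of: WHERE department = "Engineering"
matches = filter_by(docs, "department", "Engineering")
print(matches)  # ['A']
```

Documents B and C are dropped before relevance is ever considered, so the retriever never sees their content.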
Deep Technical Analysis
Schema Inconsistency
Field Name Variations:
Same concept, different names:
→ "author", "created_by", "owner", "contributor"
→ "date", "created", "timestamp", "published_date"
→ "category", "type", "classification", "tag"
Queries break:
→ Filter by "author" misses "created_by"
→ Manual mapping required
Data Type Mismatches:
Date field (three differently formatted sources):
→ Doc A: "2024-01-15" (ISO 8601)
→ Doc B: "January 15, 2024" (text)
→ Doc C: 1705276800 (Unix timestamp)
Cannot compare:
→ WHERE date > "2024-01-01"
→ Only Doc A (ISO format) compares correctly
→ Text and numeric values either fail to match or compare incorrectly
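One way to make such values comparable is to convert everything to ISO 8601 at ingestion. The sketch below assumes that numeric values are Unix timestamps in seconds (UTC), that month names are in English, and that the formats in `TEXT_FORMATS` cover the corpus; the function name `to_iso_date` is illustrative.

```python
from datetime import datetime, timezone

# Assumed set of text formats seen in the corpus; extend as needed.
TEXT_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y")


def to_iso_date(value):
    """Convert a date string or Unix timestamp to an ISO 8601 date string."""
    if isinstance(value, (int, float)):
        # Assumption: seconds since the epoch, interpreted as UTC.
        return datetime.fromtimestamp(value, tz=timezone.utc).date().isoformat()
    for fmt in TEXT_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {value!r}")


print(to_iso_date("2024-01-15"))        # 2024-01-15
print(to_iso_date("January 15, 2024"))  # 2024-01-15
print(to_iso_date(1705276800))          # 2024-01-15
```

Once every value is an ISO 8601 string, lexicographic comparison agrees with chronological order, so a filter such as `date > "2024-01-01"` behaves the same for all three documents.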
Missing Metadata
Incomplete Extraction:
Source: Confluence page
→ Has author, date, labels (tags)
Extraction:
→ Captures title, body
→ Misses labels (not in API response)
Result: Metadata incomplete
Legacy Documents:
Old docs imported:
→ Created before metadata standards
→ Missing required fields
→ Cannot retroactively add without manual review
Metadata gaps persist
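Before any manual review, it helps to know which documents have gaps and what is missing. A minimal audit sketch follows; the `REQUIRED_FIELDS` tuple and the `audit_missing` helper are assumptions chosen for illustration, not part of any particular tool.

```python
# Assumed set of required fields; adjust to your own standard.
REQUIRED_FIELDS = ("author", "created_at", "department", "sensitivity")


def audit_missing(docs):
    """Map each document ID to the list of required fields it lacks."""
    report = {}
    for doc_id, meta in docs.items():
        # Treat absent, None, and empty-string values as missing.
        missing = [field for field in REQUIRED_FIELDS if not meta.get(field)]
        if missing:
            report[doc_id] = missing
    return report


legacy_docs = {
    "handbook-2015": {"author": "unknown"},
    "runbook-2023": {
        "author": "jane.doe@company.com",
        "created_at": "2023-06-01",
        "department": "Engineering",
        "sensitivity": "internal",
    },
}
print(audit_missing(legacy_docs))
# {'handbook-2015': ['created_at', 'department', 'sensitivity']}
```

The report can then be used to prioritize which legacy documents need manual remediation first.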
Normalization Strategies
Schema Standardization:
Define canonical schema:
{
"author": "string",
"created_at": "ISO 8601 datetime",
"department": "string (controlled vocabulary)",
"sensitivity": "enum: public|internal|confidential",
"document_type": "enum: policy|guide|api_doc"
}
Map all inputs to this schema
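The canonical schema can also be expressed in code so that invalid records fail loudly at ingestion. The following is a sketch using only the Python standard library; the class names, the `from_dict` helper, and the department list (borrowed from the controlled-vocabulary example on this page) are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Sensitivity(str, Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


class DocumentType(str, Enum):
    POLICY = "policy"
    GUIDE = "guide"
    API_DOC = "api_doc"


# Assumed controlled vocabulary for the department field.
DEPARTMENTS = frozenset({"Engineering", "Sales", "Support", "HR"})


@dataclass(frozen=True)
class DocumentMetadata:
    author: str
    created_at: datetime
    department: str
    sensitivity: Sensitivity
    document_type: DocumentType

    def __post_init__(self):
        if self.department not in DEPARTMENTS:
            raise ValueError(f"Unknown department: {self.department!r}")

    @classmethod
    def from_dict(cls, raw):
        """Build a validated record; raises KeyError or ValueError on bad input."""
        return cls(
            author=raw["author"],
            created_at=datetime.fromisoformat(raw["created_at"]),
            department=raw["department"],
            sensitivity=Sensitivity(raw["sensitivity"]),
            document_type=DocumentType(raw["document_type"]),
        )
```

A record such as `{"author": "jane", "created_at": "2024-01-15T09:30:00", "department": "Engineering", "sensitivity": "internal", "document_type": "guide"}` passes, while an unknown sensitivity or department raises `ValueError` instead of reaching the index.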
Field Mapping:
Ingestion pipeline:
→ Detect source schema
→ Map to canonical:
- "created_by" → "author"
- "dept" → "department"
- Normalize: "Eng" → "Engineering"
Ensures consistency
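The mapping step can be sketched as two lookup tables, one for field names and one for field values. The alias tables below are assumptions based on the examples on this page, and `to_canonical` is a hypothetical helper name.

```python
# Source field name -> canonical field name (assumed aliases).
FIELD_ALIASES = {
    "created_by": "author",
    "owner": "author",
    "dept": "department",
    "created": "created_at",
    "date": "created_at",
}

# Per-field value normalization (assumed aliases).
VALUE_ALIASES = {
    "department": {"Eng": "Engineering"},
}


def to_canonical(meta):
    """Rename fields and normalize known value aliases."""
    canonical = {}
    for key, value in meta.items():
        field = FIELD_ALIASES.get(key, key)
        if isinstance(value, str):
            value = VALUE_ALIASES.get(field, {}).get(value, value)
        # If two source fields map to the same canonical field, keep the first.
        canonical.setdefault(field, value)
    return canonical


doc_b = {"created_by": "Jane Doe", "date": "Jan 15, 2024", "dept": "Eng"}
print(to_canonical(doc_b))
# {'author': 'Jane Doe', 'created_at': 'Jan 15, 2024', 'department': 'Engineering'}
```

Note that this step only renames fields and maps known values; converting the date string itself to ISO 8601 is a separate normalization step.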
Default Values:
Required fields:
→ If missing, use default
→ "author": "unknown"
→ "sensitivity": "internal" (safe default)
Prevents null/missing values breaking queries
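A small sketch of default filling, using the two defaults named above. The `apply_defaults` helper is hypothetical, and whether "internal" is restrictive enough as a fallback depends on your access-control policy.

```python
# Defaults taken from the example above.
DEFAULTS = {"author": "unknown", "sensitivity": "internal"}


def apply_defaults(meta):
    """Return a copy of meta with missing or empty required fields filled in."""
    filled = dict(meta)
    for field, default in DEFAULTS.items():
        if filled.get(field) in (None, ""):
            filled[field] = default
    return filled


print(apply_defaults({"department": "Engineering"}))
# {'department': 'Engineering', 'author': 'unknown', 'sensitivity': 'internal'}
```

Existing values are never overwritten, so a document already marked "confidential" keeps that label.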
Controlled Vocabularies
Department Field:
Problem: Free text
→ "Engineering", "Eng", "engineering", "ENGINEERING", "R&D"
Solution: Enum
→ Valid values: ["Engineering", "Sales", "Support", "HR"]
→ Reject or map invalid values
Enables reliable filtering
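The "reject or map" rule can be sketched as follows. The synonym table is an assumption (in particular, treating "R&D" as Engineering is a policy decision, not a given), and `normalize_department` is an illustrative name.

```python
VALID_DEPARTMENTS = ["Engineering", "Sales", "Support", "HR"]

# Case-insensitive lookup of valid values.
_BY_LOWER = {name.lower(): name for name in VALID_DEPARTMENTS}

# Assumed synonyms; review these with the owning teams.
SYNONYMS = {"eng": "Engineering", "r&d": "Engineering"}


def normalize_department(raw):
    """Map free text to a valid department, or raise ValueError."""
    key = raw.strip().lower()
    if key in _BY_LOWER:
        return _BY_LOWER[key]
    if key in SYNONYMS:
        return SYNONYMS[key]
    raise ValueError(f"Unknown department: {raw!r}")


for value in ["Engineering", "Eng", "engineering", "ENGINEERING", "R&D"]:
    print(normalize_department(value))  # Engineering (each time)
```

Raising on unknown values keeps bad data out of the index; a pipeline that must not stop could instead route such documents to a review queue.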
Tag Standardization:
Tags: ["api", "API", "rest-api", "REST API", "restapi"]
→ All mean same thing
Normalize:
→ Lowercase: "api"
→ Canonical form: "rest-api"
Consistent tagging
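Tag normalization can be sketched in two steps: reduce each tag to a lowercase slug, then map known synonyms to a canonical form. The synonym table below is an assumption drawn from the example (it treats a bare "api" as meaning "rest-api"), and the function names are illustrative.

```python
import re

# Assumed synonym table: slug -> canonical tag.
TAG_SYNONYMS = {"api": "rest-api", "restapi": "rest-api"}


def normalize_tag(tag):
    """Lowercase, replace spaces/underscores with hyphens, then map synonyms."""
    slug = re.sub(r"[\s_]+", "-", tag.strip().lower())
    return TAG_SYNONYMS.get(slug, slug)


def normalize_tags(tags):
    """Normalize every tag and drop duplicates, preserving first-seen order."""
    return list(dict.fromkeys(normalize_tag(tag) for tag in tags))


print(normalize_tags(["api", "API", "rest-api", "REST API", "restapi"]))
# ['rest-api']
```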
How to Solve
To resolve inconsistent metadata:
- Define a canonical metadata schema upfront
- Implement field mapping during ingestion (source schema → canonical)
- Normalize data types (all dates to ISO 8601)
- Use controlled vocabularies for categories and departments
- Set safe defaults for missing required fields
- Validate metadata at ingestion
- Audit and remediate legacy docs
See Metadata Standards.
Last updated January 26, 2026


