RAG Scenarios and Solutions

Inconsistent Document Metadata

Missing, incorrect, or inconsistent metadata across documents prevents effective filtering, search, and access control in RAG retrieval.

Key Takeaways

  • The Problem
  • Deep Technical Analysis
  • How to Solve
  • Agent Instructions: Querying This Documentation

The Problem

Symptoms

  • ❌ Some docs have metadata, others don't
  • ❌ Same field under different names ("author" vs "created_by")
  • ❌ Dates in multiple formats
  • ❌ Cannot filter by category/department
  • ❌ Access control metadata missing

Real-World Example

Document A metadata:
{
  "author": "john.smith@company.com",
  "created": "2024-01-15",
  "department": "Engineering",
  "sensitivity": "internal"
}

Document B metadata:
{
  "created_by": "Jane Doe",
  "date": "Jan 15, 2024",
  "dept": "Eng"
}

Document C metadata (none at all):
{}

Query with filter: WHERE department = "Engineering"
→ Matches Doc A only
→ Doc B uses "dept" (different field)
→ Doc C has no metadata
→ Incomplete results despite relevant content in B and C
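
The failure above can be reproduced with a minimal in-memory sketch. The `docs` list and the `filter_by` helper are hypothetical stand-ins for a vector store's metadata filter, assuming the store does exact matching on a named field.

```python
# Hypothetical in-memory corpus mirroring Documents A, B, and C above.
docs = [
    {"id": "A", "metadata": {"author": "john.smith@company.com",
                             "created": "2024-01-15",
                             "department": "Engineering",
                             "sensitivity": "internal"}},
    {"id": "B", "metadata": {"created_by": "Jane Doe",
                             "date": "Jan 15, 2024",
                             "dept": "Eng"}},
    {"id": "C", "metadata": {}},
]


def filter_by(docs, field, value):
    """Exact-match filter on one metadata field (illustrative helper)."""
    return [d["id"] for d in docs if d["metadata"].get(field) == value]


matches = filter_by(docs, "department", "Engineering")
print(matches)  # ['A']
```

Documents B and C are dropped before any similarity ranking happens, so relevant content in them can never reach the model.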

Deep Technical Analysis

Schema Inconsistency

Field Name Variations:

Same concept, different names:
→ "author", "created_by", "owner", "contributor"
→ "date", "created", "timestamp", "published_date"
→ "category", "type", "classification", "tag"

Queries break:
→ Filter by "author" misses "created_by"
→ Manual mapping required
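
Before writing a mapping, it helps to know which field names actually occur. The sketch below counts metadata keys across a sample of records; the `records` list and the `field_usage` helper are illustrative assumptions, not part of any particular store's API.

```python
from collections import Counter

# Hypothetical sample of raw metadata records from different sources.
records = [
    {"author": "john.smith@company.com", "created": "2024-01-15"},
    {"created_by": "Jane Doe", "date": "Jan 15, 2024"},
    {"owner": "ops-team", "timestamp": 1705276800},
    {"author": "a.lee@company.com", "published_date": "2024-02-01"},
]


def field_usage(records):
    """Count how many records use each metadata key."""
    return Counter(key for record in records for key in record)


usage = field_usage(records)
for name, count in usage.most_common():
    print(f"{name}: {count}")
```

Keys that appear in only a handful of records are good candidates for aliases of a more common field.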

Data Type Mismatches:

Date field:
→ Doc A: "2024-01-15" (ISO 8601)
→ Doc B: "January 15, 2024" (text)
→ Doc C: 1705276800 (Unix timestamp)

Cannot compare:
→ WHERE date > "2024-01-01"
→ Only Doc A compares reliably (ISO 8601 strings sort in chronological order)
→ Doc B (text) and Doc C (number) fail or compare incorrectly
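
One way to make the three values comparable is to convert each to an ISO 8601 date string at ingestion. This is a minimal sketch: the `TEXT_FORMATS` list is an assumption to be extended per source, numbers are assumed to be Unix timestamps in seconds (UTC), and month-name parsing assumes an English locale.

```python
from datetime import datetime, timezone

# Text formats to try, in order. This list is an assumption; extend it per source.
TEXT_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y")


def to_iso_date(value):
    """Return a YYYY-MM-DD string, or None if the value cannot be parsed."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int; never treat it as a timestamp
    if isinstance(value, (int, float)):
        # Assumption: numbers are Unix timestamps in seconds, read as UTC.
        return datetime.fromtimestamp(value, tz=timezone.utc).date().isoformat()
    if isinstance(value, str):
        for fmt in TEXT_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
    return None


normalized = [to_iso_date(v) for v in ("2024-01-15", "January 15, 2024", 1705276800)]
print(normalized)  # ['2024-01-15', '2024-01-15', '2024-01-15']
```

After normalization, a plain string comparison such as `date > "2024-01-01"` behaves the same for all three documents.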

Missing Metadata

Incomplete Extraction:

Source: Confluence page
→ Has author, date, labels (tags)

Extraction:
→ Captures title, body
→ Misses labels (not in the default API response; they must be requested explicitly)

Result: Metadata incomplete

Legacy Documents:

Old docs imported:
→ Created before metadata standards
→ Missing required fields
→ Cannot retroactively add without manual review

Metadata gaps persist
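
Gaps like these are easier to remediate once they are listed. The sketch below reports which required fields each document lacks; the `REQUIRED_FIELDS` tuple and the sample `corpus` are assumptions chosen for illustration.

```python
# Assumption: these fields are required; adjust to your own schema.
REQUIRED_FIELDS = ("author", "created_at", "department", "sensitivity")


def missing_fields(metadata):
    """Return the required fields that are absent or empty in one record."""
    return [f for f in REQUIRED_FIELDS if metadata.get(f) in (None, "")]


def audit(corpus):
    """Map document id -> missing required fields, omitting complete documents."""
    report = {}
    for doc_id, metadata in corpus.items():
        gaps = missing_fields(metadata)
        if gaps:
            report[doc_id] = gaps
    return report


# Hypothetical corpus: one legacy document, one complete document.
corpus = {
    "legacy-001": {"title": "Old runbook"},
    "doc-002": {"author": "Jane Doe", "created_at": "2024-01-15",
                "department": "Engineering", "sensitivity": "internal"},
}
report = audit(corpus)
print(report)
```

The resulting report can serve as the work queue for manual review of legacy documents.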

Normalization Strategies

Schema Standardization:

Define canonical schema:
{
  "author": "string",
  "created_at": "ISO 8601 datetime",
  "department": "string (controlled vocabulary)",
  "sensitivity": "enum: public|internal|confidential",
  "document_type": "enum: policy|guide|api_doc"
}

Map all inputs to this schema
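
The canonical schema above could be expressed in code roughly as follows. This is a sketch using only the standard library; the class names and the `from_dict` constructor are illustrative, and it expects keys that have already been mapped to canonical names.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Sensitivity(str, Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


class DocumentType(str, Enum):
    POLICY = "policy"
    GUIDE = "guide"
    API_DOC = "api_doc"


@dataclass(frozen=True)
class CanonicalMetadata:
    author: str
    created_at: datetime
    department: str
    sensitivity: Sensitivity
    document_type: DocumentType

    @classmethod
    def from_dict(cls, raw):
        """Build a record from canonical keys; raises KeyError or ValueError on bad input."""
        return cls(
            author=raw["author"],
            created_at=datetime.fromisoformat(raw["created_at"]),
            department=raw["department"],
            sensitivity=Sensitivity(raw["sensitivity"]),
            document_type=DocumentType(raw["document_type"]),
        )


# Hypothetical input record.
example = {"author": "john.smith@company.com", "created_at": "2024-01-15T09:30:00",
           "department": "Engineering", "sensitivity": "internal",
           "document_type": "guide"}
meta = CanonicalMetadata.from_dict(example)
```

Using enums for `sensitivity` and `document_type` means an unexpected value fails loudly at ingestion instead of silently breaking a filter later.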

Field Mapping:

Ingestion pipeline:
→ Detect source schema
→ Map to canonical:
  - "created_by" → "author"
  - "dept" → "department"
  - Normalize: "Eng" → "Engineering"

Ensures consistency
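
The mapping step can be sketched with two lookup tables, one for field names and one for values. The tables below contain only the mappings listed above; the `to_canonical` helper is a hypothetical name, and a real pipeline would likely keep one table per source system.

```python
# Field-name and value aliases taken from the mapping described above.
FIELD_ALIASES = {"created_by": "author", "dept": "department"}
VALUE_ALIASES = {"department": {"Eng": "Engineering"}}


def to_canonical(raw):
    """Rename aliased fields and normalize aliased values; unknown keys pass through."""
    canonical = {}
    for key, value in raw.items():
        field = FIELD_ALIASES.get(key, key)
        if isinstance(value, str):
            value = VALUE_ALIASES.get(field, {}).get(value, value)
        # If two source keys map to the same field, the first one seen wins.
        canonical.setdefault(field, value)
    return canonical


doc_b = {"created_by": "Jane Doe", "date": "Jan 15, 2024", "dept": "Eng"}
mapped = to_canonical(doc_b)
print(mapped)
```

Date values are left untouched here; converting them to ISO 8601 is a separate normalization step.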

Default Values:

Required fields:
→ If missing, use default
→ "author": "unknown"
→ "sensitivity": "internal" (safe default)

Prevents null/missing values breaking queries
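
Applying the defaults above is a small step at the end of ingestion. In this sketch the `apply_defaults` helper is a hypothetical name, and treating empty strings the same as missing values is an assumption.

```python
# Defaults taken from the list above.
DEFAULTS = {"author": "unknown", "sensitivity": "internal"}


def apply_defaults(metadata, defaults=DEFAULTS):
    """Return a copy with defaults filled in for missing, None, or empty fields."""
    filled = dict(metadata)
    for field, default in defaults.items():
        if filled.get(field) in (None, ""):
            filled[field] = default
    return filled


original = {"author": "Jane Doe", "sensitivity": None}
filled = apply_defaults(original)
print(filled)  # {'author': 'Jane Doe', 'sensitivity': 'internal'}
```

The function returns a copy, so the raw record stays available for auditing which values were defaulted.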

Controlled Vocabularies

Department Field:

Problem: Free text
→ "Engineering", "Eng", "engineering", "ENGINEERING", "R&D"

Solution: Enum
→ Valid values: ["Engineering", "Sales", "Support", "HR"]
→ Reject or map invalid values

Enables reliable filtering
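
A map-or-reject step for the department field might look like the sketch below. The valid values come from the list above; the synonym table, including the choice to map "R&D" to "Engineering", is an assumption for illustration.

```python
VALID_DEPARTMENTS = ("Engineering", "Sales", "Support", "HR")
# Synonyms that are not simple case variants; this table is an assumption.
DEPARTMENT_SYNONYMS = {"eng": "Engineering", "r&d": "Engineering"}

_BY_LOWER = {name.lower(): name for name in VALID_DEPARTMENTS}


def normalize_department(value):
    """Map free text to a valid department, or raise ValueError."""
    key = value.strip().lower()
    if key in _BY_LOWER:
        return _BY_LOWER[key]
    if key in DEPARTMENT_SYNONYMS:
        return DEPARTMENT_SYNONYMS[key]
    raise ValueError(f"unknown department: {value!r}")


variants = ["Engineering", "Eng", "engineering", "ENGINEERING", "R&D"]
normalized = {normalize_department(v) for v in variants}
print(normalized)  # {'Engineering'}
```

Raising on unknown values keeps bad data out of the index; a pipeline that prefers to keep ingesting could catch the error and route the document to a review queue instead.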

Tag Standardization:

Tags: ["api", "API", "rest-api", "REST API", "restapi"]
→ All mean the same thing

Normalize:
→ Lowercase: "api"
→ Canonical form: "rest-api"

Consistent tagging
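
Tag normalization can be sketched as lowercasing, stripping separators to build a lookup key, and then mapping to a canonical form. The `CANONICAL_TAGS` table follows the example above; the fallback of hyphenating unknown tags is an assumption.

```python
import re

# Canonical tag for each separator-free lowercase key, following the example above.
CANONICAL_TAGS = {"api": "rest-api", "restapi": "rest-api"}


def normalize_tag(tag):
    """Return the canonical tag, or a lowercase hyphenated form for unknown tags."""
    lowered = tag.strip().lower()
    key = re.sub(r"[\s_-]+", "", lowered)
    return CANONICAL_TAGS.get(key, re.sub(r"[\s_]+", "-", lowered))


def normalize_tags(tags):
    """Normalize every tag and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(normalize_tag(t) for t in tags))


tags = normalize_tags(["api", "API", "rest-api", "REST API", "restapi"])
print(tags)  # ['rest-api']
```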

How to Solve

  • Define a canonical metadata schema upfront
  • Implement field mapping during ingestion (source schema → canonical)
  • Normalize data types (all dates to ISO 8601)
  • Use controlled vocabularies for categories and departments
  • Set safe defaults for missing required fields
  • Validate metadata at ingestion
  • Audit and remediate legacy docs

See Metadata Standards.
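
Validation at ingestion can be sketched as a function that returns a list of problems rather than raising, so the pipeline can decide whether to reject, default, or queue the document. The `REQUIRED` and `ALLOWED` tables are assumptions based on the canonical schema on this page. Note that `datetime.fromisoformat` accepts only a subset of ISO 8601 on Python versions before 3.11.

```python
from datetime import datetime

# Assumed required fields and allowed values, based on the canonical schema.
REQUIRED = ("author", "created_at", "department", "sensitivity", "document_type")
ALLOWED = {
    "department": ("Engineering", "Sales", "Support", "HR"),
    "sensitivity": ("public", "internal", "confidential"),
    "document_type": ("policy", "guide", "api_doc"),
}


def validate(metadata):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in metadata]
    for field, allowed in ALLOWED.items():
        if field in metadata and metadata[field] not in allowed:
            problems.append(f"invalid {field}: {metadata[field]!r}")
    if "created_at" in metadata:
        try:
            datetime.fromisoformat(metadata["created_at"])
        except (TypeError, ValueError):
            problems.append(f"created_at is not ISO 8601: {metadata['created_at']!r}")
    return problems


# Hypothetical record that skipped mapping and normalization.
bad = {"author": "Jane Doe", "created_at": "Jan 15, 2024",
       "department": "Eng", "sensitivity": "internal"}
for problem in validate(bad):
    print(problem)
```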


Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/rag-scenarios-and-solutions/data-quality/metadata-inconsistent.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.

Last updated January 26, 2026