Entity Resolution Errors

The Problem

The same real-world entity is referenced inconsistently (different names, IDs, spellings), which fragments information and breaks connections across documents.

Symptoms

❌ "John Smith" and "J. Smith" treated as different people
❌ Cannot find all docs mentioning same entity
❌ Relationships broken by naming variations
❌ Duplicate entity entries
❌ Cross-reference failures

Real-World Example

Knowledge base references:
→ Doc A: "Project Phoenix led by Alice Johnson"
→ Doc B: "A. Johnson manages infrastructure"
→ Doc C: "Contact alice.johnson@company.com for access"
→ Doc D: "User ID 12345 owns this repository"

All refer to SAME person (Alice Johnson, user_12345)
But treated as 4 different entities:
→ "Alice Johnson"
→ "A. Johnson"
→ "alice.johnson@company.com"
→ "User ID 12345"

Query: "What projects does Alice Johnson lead?"
→ Only finds Doc A (exact match "Alice Johnson")
→ Misses B, C, D (different representations)
→ Incomplete answer

Deep Technical Analysis

Entity Variation Types

Name Variations:

Same person:
→ Full: "Robert James Smith"
→ Common: "Bob Smith"
→ Formal: "R. J. Smith"
→ Email: "robert.smith"
→ Nickname: "Bobby"

Without resolution:
→ 5 separate entities in knowledge base
→ Information fragmented

Organizational Variations:

Same company:
→ "International Business Machines"
→ "IBM"
→ "IBM Corporation"
→ "IBM Corp."

Same product:
→ "Microsoft Office 365"
→ "Office 365"
→ "O365"
→ "M365"

Canonical Entity Mapping

Entity ID Assignment:

Create canonical identifiers:
→ Person: user_id (from auth system)
→ Company: domain or LEI (Legal Entity Identifier)
→ Product: SKU or product_id

Map all variations:
{
  "Alice Johnson": "user_12345",
  "A. Johnson": "user_12345",
  "alice.johnson@company.com": "user_12345"
}

All references point to same canonical ID
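A minimal sketch of such a mapping table in Python, assuming mentions are normalized (case, accents, whitespace) before lookup. The names `ALIASES`, `normalize`, and `resolve` are hypothetical and chosen only for illustration:

```python
import unicodedata
from typing import Optional

# Hypothetical alias table: normalized surface form -> canonical ID.
ALIASES = {
    "alice johnson": "user_12345",
    "a. johnson": "user_12345",
    "alice.johnson@company.com": "user_12345",
}


def normalize(mention: str) -> str:
    """Strip accents, case-fold, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", mention)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def resolve(mention: str) -> Optional[str]:
    """Return the canonical ID for a mention, or None if it is unknown."""
    return ALIASES.get(normalize(mention))


print(resolve("  ALICE   Johnson "))
print(resolve("Bob Smith"))
```

Returning `None` for unknown mentions keeps unresolved entities visible so they can be routed to fuzzy matching or manual review rather than silently dropped.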

Entity Linking:

During ingestion:
1. Extract entities from text (NER - Named Entity Recognition)
2. Resolve to canonical ID
3. Tag chunk with canonical entities

Chunk metadata:
{
  "text": "Project Phoenix led by Alice Johnson",
  "entities": [
    {"name": "Alice Johnson", "canonical_id": "user_12345", "type": "person"},
    {"name": "Project Phoenix", "canonical_id": "project_789", "type": "project"}
  ]
}

Retrieval by entity:
WHERE entities CONTAINS "user_12345"
→ Finds all chunks mentioning Alice (any variation)
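The ingestion and retrieval steps above can be sketched as follows. This is an illustration only: naive substring matching against a hypothetical alias table stands in for a real NER step, and an in-memory dictionary stands in for the metadata filter of a vector store.

```python
from collections import defaultdict

# Hypothetical alias table: surface form -> canonical ID.
ALIASES = {
    "Alice Johnson": "user_12345",
    "A. Johnson": "user_12345",
    "alice.johnson@company.com": "user_12345",
    "User ID 12345": "user_12345",
    "Project Phoenix": "project_789",
}


def tag_chunk(text):
    """Tag a chunk with the canonical ID of every known alias it contains."""
    lowered = text.lower()
    ids = sorted({cid for alias, cid in ALIASES.items() if alias.lower() in lowered})
    return {"text": text, "entities": ids}


def build_index(chunks):
    """Inverted index: canonical ID -> positions of chunks that mention it."""
    index = defaultdict(list)
    for position, chunk in enumerate(chunks):
        for cid in chunk["entities"]:
            index[cid].append(position)
    return index


docs = [
    "Project Phoenix led by Alice Johnson",
    "A. Johnson manages infrastructure",
    "Contact alice.johnson@company.com for access",
    "User ID 12345 owns this repository",
]
chunks = [tag_chunk(doc) for doc in docs]
index = build_index(chunks)
print(index["user_12345"])  # [0, 1, 2, 3]
```

Because the lookup key is the canonical ID rather than the surface string, all four documents from the earlier example are found by a single query.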

Fuzzy Matching

String Similarity:

Detect likely matches:
→ "Alice Johnson" vs "Alyce Jonson" (typo)
→ Levenshtein distance: 2 edits
→ Likely same entity (needs verification)
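For reference, the edit distance quoted above can be computed with the standard dynamic-programming recurrence. The `is_candidate` helper and its `max_edits=2` default are assumptions for illustration, not a recommended setting:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


def is_candidate(a: str, b: str, max_edits: int = 2) -> bool:
    """Flag a pair for verification when it is within max_edits edits."""
    return levenshtein(a.casefold(), b.casefold()) <= max_edits


print(levenshtein("Alice Johnson", "Alyce Jonson"))  # 2
```

A small edit distance only nominates a pair for review. Short names in particular can be close in edit distance and still belong to different people.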

→ "Bob Smith" vs "Robert Smith"
→ First names share almost no characters, so string similarity alone misses the match
→ But: Bob = common nickname for Robert
→ Requires nickname dictionary
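A nickname dictionary can be as simple as a lookup applied to the first name before comparison. The entries in `NICKNAMES` below are a tiny hypothetical sample; a real deployment would need a much larger, curated list:

```python
# Hypothetical sample: nickname -> formal first name.
NICKNAMES = {
    "bob": "robert",
    "bobby": "robert",
    "rob": "robert",
    "bill": "william",
    "liz": "elizabeth",
}


def canonical_name(full_name: str) -> str:
    """Lower-case the name and expand a leading nickname to its formal form."""
    parts = full_name.casefold().split()
    if not parts:
        return ""
    parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def same_person_candidate(a: str, b: str) -> bool:
    """True when both names agree after nickname expansion."""
    return canonical_name(a) == canonical_name(b)


print(same_person_candidate("Bob Smith", "Robert Smith"))  # True
```

As with edit distance, agreement here marks a candidate pair, since two different people can share the same formal name.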

Probabilistic Matching:

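Dedupe library (Python):
→ Learns a matching model from labeled example pairs
→ Assigns each candidate pair a match probability (e.g., 85% same entity)
→ Pairs above a chosen threshold (e.g., >80%) are treated as matches

This automates entity resolution, with the threshold trading missed matches against false merges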
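The thresholding idea can be illustrated without the Dedupe library. The sketch below is not Dedupe's API: it combines per-field similarities from the standard-library `difflib.SequenceMatcher` using hand-picked, hypothetical weights. The resulting score is a heuristic in [0, 1], not a calibrated probability.

```python
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.6, "email": 0.4}  # hypothetical, hand-picked weights
THRESHOLD = 0.8                        # example cut-off from the text above


def field_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two field values."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-field similarities."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in WEIGHTS.items()
    )


def is_match(rec_a: dict, rec_b: dict) -> bool:
    return match_score(rec_a, rec_b) > THRESHOLD


alice = {"name": "Alice Johnson", "email": "alice.johnson@company.com"}
typo = {"name": "Alyce Jonson", "email": "alice.johnson@company.com"}
other = {"name": "Bob Smith", "email": "bob@other.org"}

print(is_match(alice, typo), is_match(alice, other))
```

Scoring several fields together lets a strong signal, such as an identical email address, compensate for a misspelled name.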

Named Entity Recognition (NER)

Entity Extraction:

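spaCy, Stanford NER:
→ Identify built-in types such as PERSON, ORG, PRODUCT, LOCATION (exact label names vary by tool)
→ "Alice Johnson manages Project Phoenix"
  - PERSON: Alice Johnson
  - PROJECT: Project Phoenix (custom label; not built in, so it needs added rules or training)

Extracted entities are the input for linking mentions across documents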

Entity Disambiguation:

Challenge: Same name, different entities
→ "Apple" (company) vs "apple" (fruit)
→ "Paris" (city) vs "Paris Hilton" (person)

Context-based disambiguation:
→ "Apple released iPhone" → ORG
→ "I ate an apple" → FOOD

Requires context analysis
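A toy illustration of context-based disambiguation, assuming a hand-written set of cue words per sense. `CONTEXT_CUES` and the sense labels are hypothetical. Production systems typically rely on a trained model rather than keyword overlap.

```python
import re

# Hypothetical cue words that hint at each sense of "apple".
CONTEXT_CUES = {
    "ORG": {"released", "iphone", "shares", "ceo", "announced"},
    "FOOD": {"ate", "eat", "pie", "juice", "fruit"},
}


def disambiguate(sentence: str, default: str = "UNKNOWN") -> str:
    """Pick the sense whose cue words overlap most with the sentence."""
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    scores = {sense: len(tokens & cues) for sense, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)  # ties go to the first sense listed
    return best if scores[best] > 0 else default


print(disambiguate("Apple released iPhone"))  # ORG
print(disambiguate("I ate an apple"))         # FOOD
```

Falling back to `UNKNOWN` when no cue matches avoids forcing a sense onto sentences that carry no usable context.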

Co-Reference Resolution

Pronoun Resolution:

Text: "Alice manages the project. She reports to Bob."
→ "She" = "Alice"

Co-reference resolution:
→ Replace pronouns with entities
→ "Alice manages the project. Alice reports to Bob."

Clearer entity relationships
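The replacement step can be sketched with a deliberately naive heuristic: substitute a subject pronoun with the most recently mentioned person who uses that pronoun. `KNOWN_PEOPLE` is a hypothetical lookup. Real co-reference resolution needs a trained model and also covers object pronouns, possessives, and ambiguous antecedents.

```python
import re

# Hypothetical lookup: person name -> subject pronoun used for them.
KNOWN_PEOPLE = {"Alice": "she", "Bob": "he"}


def resolve_pronouns(text: str) -> str:
    """Replace subject pronouns with the most recent matching known name."""
    last_seen = {}  # pronoun -> most recently mentioned matching name
    output = []
    # Split into word and non-word runs so joining restores the original spacing.
    for token in re.findall(r"\w+|\W+", text):
        if token in KNOWN_PEOPLE:
            last_seen[KNOWN_PEOPLE[token]] = token
            output.append(token)
        elif token.lower() in last_seen:
            output.append(last_seen[token.lower()])
        else:
            output.append(token)
    return "".join(output)


print(resolve_pronouns("Alice manages the project. She reports to Bob."))
# Alice manages the project. Alice reports to Bob.
```

Pronouns with no earlier antecedent are left unchanged, so the function never invents a referent.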

How to Solve

Combine the techniques above:
1. Implement NER (spaCy, Stanford NER) to extract entities
2. Create canonical entity IDs (user_id, product_id)
3. Build an entity mapping table (all variations → canonical)
4. Use fuzzy string matching (Dedupe library) for likely matches
5. Tag chunks with canonical entity IDs in metadata
6. Enable entity-based retrieval (find all chunks mentioning user_12345)
7. Implement co-reference resolution for pronouns

See Entity Resolution.

Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/rag-scenarios-and-solutions/data-quality/entity-resolution.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
