Multi-Source Sync Conflicts
When the same content is synced from multiple data sources, conflicts arise that cause duplicates, inconsistent updates, or lost changes.
The Problem
Symptoms
- ❌ Same document appears twice in search results
- ❌ Conflicting information from Confluence vs Notion
- ❌ An update in one source isn't reflected in the AI agent's answers
- ❌ Can't determine which source has the latest version
- ❌ Duplicate embeddings wasting storage
Real-World Example
"API Authentication Guide" exists in:
1. Confluence: updated 2 days ago (v1.2)
2. Notion: updated yesterday (v1.3)
3. Website /docs/auth: updated today (v1.4)
User asks: "How do I authenticate?"
AI response includes a mix of v1.2, v1.3, and v1.4 steps
→ Inconsistent, confusing answer
→ User doesn't know which is correct
Deep Technical Analysis
Content Duplication Across Sources
Same logical content exists in multiple systems:
The Multi-System Reality:
Modern organizations:
→ Documentation in Confluence
→ Same content copied to Notion (team wiki)
→ Published version on website (public docs)
→ Synced to Zendesk (support articles)
→ Discussed in Slack threads
Result:
→ 5 copies of "same" content
→ Each with slight variations
→ Different update timestamps
→ Different authors
Semantic Duplication Detection:
Simple approach: Exact match
→ Compare title + content hash
→ Only catches identical copies
→ Misses: rewording, formatting differences
Advanced: Semantic similarity
→ Embed both documents
→ Compute cosine similarity
→ If > 0.95: Consider duplicates
But:
→ Expensive (every document must be embedded, and naive pairwise comparison grows quadratically with document count)
→ Threshold tuning (0.95 vs 0.90 vs 0.85?)
→ False positives ("Getting Started" guides for different products)
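A minimal sketch of both detection approaches, assuming embeddings are already available as plain lists of floats. The function names and the 0.95 default are illustrative only and are not part of any product API.

```python
import hashlib
import math
import re


def content_fingerprint(title: str, body: str) -> str:
    """Exact-match fingerprint: SHA-256 of case- and whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", f"{title}\n{body}").strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_semantic_duplicate(a: list[float], b: list[float], threshold: float = 0.95) -> bool:
    """Treat two documents as duplicates when similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold
```

Normalizing before hashing catches formatting-only differences cheaply; rewording still requires the similarity check, and the threshold needs tuning against real data.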
The Partial Overlap Problem:
Confluence doc: "Full API Reference" (10,000 words)
Website doc: "API Quick Start" (1,000 words)
Website doc is subset of Confluence doc:
→ Not duplicates
→ But overlapping content
→ Should both be in knowledge base?
Or:
→ Confluence: Sections A, B, C
→ Notion: Sections B, C, D
50% overlap, 50% unique: B and C are shared, A and D each exist in only one source
→ Keep both? (redundancy)
→ Merge? (how?)
→ Prefer one? (which?)
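One way to quantify partial overlap before deciding, sketched under the assumption that each document has already been split into sections identified by some stable key (a heading slug or a per-section fingerprint). The `section_overlap` helper is hypothetical.

```python
def section_overlap(a: set[str], b: set[str]) -> dict:
    """Compare two documents by their section keys."""
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Share of all distinct sections that appear in both documents.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Share of document A that is also present in document B.
        "a_covered_by_b": len(shared) / len(a) if a else 0.0,
    }


confluence_sections = {"A", "B", "C"}
notion_sections = {"B", "C", "D"}
report = section_overlap(confluence_sections, notion_sections)
```

A coverage of 1.0 in one direction signals the subset case (a quick start fully contained in a full reference), while a mid-range Jaccard score signals two documents that each hold unique sections.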
Conflict Resolution Strategies
When the same content differs across sources:
Last-Write-Wins (LWW):
Strategy: Use most recently updated version
Example:
→ Confluence: updated 2024-01-15
→ Notion: updated 2024-01-20
→ Keep: Notion version (newer)
Pros:
+ Simple logic
+ Respects recency
Cons:
- Ignores authority (maybe Confluence is canonical)
- Timestamp accuracy issues (clock skew)
- May prefer minor edit over substantial content
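A minimal sketch of last-write-wins, assuming each source reports a timezone-aware last-updated timestamp. The `SourceVersion` type is hypothetical, and the sketch does nothing to correct clock skew between systems.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceVersion:
    source: str
    updated_at: datetime  # must be timezone-aware
    content: str


def last_write_wins(versions: list[SourceVersion]) -> SourceVersion:
    """Return the most recently updated version."""
    if not versions:
        raise ValueError("no versions to compare")
    for v in versions:
        # Naive timestamps cannot be compared safely across systems.
        if v.updated_at.tzinfo is None:
            raise ValueError(f"{v.source}: timestamp must be timezone-aware")
    return max(versions, key=lambda v: v.updated_at)


winner = last_write_wins([
    SourceVersion("confluence", datetime(2024, 1, 15, tzinfo=timezone.utc), "older text"),
    SourceVersion("notion", datetime(2024, 1, 20, tzinfo=timezone.utc), "newer text"),
])
```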
Source Priority:
Strategy: Assign priority to sources
Configuration:
1. Website (canonical, public)
2. Confluence (internal docs)
3. Notion (team notes)
4. Slack (informal)
Conflict resolution:
→ Same content in Website + Confluence
→ Keep: Website (higher priority)
→ Discard: Confluence duplicate
Pros:
+ Respects organizational hierarchy
+ Deterministic
Cons:
- Requires manual priority configuration
- Lower-priority sources may have newer info
- Not always clear which is canonical
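The priority rule can be sketched as an ordered list, assuming source names are normalized to lowercase strings. `SOURCE_PRIORITY` and `pick_by_priority` are illustrative names, not an existing configuration format.

```python
# Highest priority first; mirrors the example configuration above.
SOURCE_PRIORITY = ["website", "confluence", "notion", "slack"]


def pick_by_priority(sources_with_copy: list[str],
                     priority: list[str] = SOURCE_PRIORITY) -> str:
    """Return the highest-priority source that holds a copy of the content."""
    rank = {name: position for position, name in enumerate(priority)}
    known = [s for s in sources_with_copy if s in rank]
    if not known:
        raise ValueError("none of the sources has a configured priority")
    return min(known, key=rank.__getitem__)
```

For example, `pick_by_priority(["confluence", "website"])` returns `"website"`, so the Confluence duplicate would be the one discarded.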
Multi-Version Storage:
Strategy: Keep all versions, tag by source
Vector DB:
→ "API Auth Guide" from Confluence (chunk_conf_1)
→ "API Auth Guide" from Notion (chunk_notion_1)
→ "API Auth Guide" from Website (chunk_web_1)
Retrieval:
→ User query matches all 3
→ LLM sees all versions
→ Synthesizes answer or highlights conflicts
Pros:
+ No information loss
+ LLM can resolve conflicts
+ User sees full picture
Cons:
- 3x storage cost
- Retrieval noise (too many chunks)
- LLM may get confused by conflicts
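A rough sketch of how retrieved chunks could be grouped so the LLM sees competing versions side by side rather than interleaved. It assumes every chunk carries a `doc_key` shared by all copies of the same logical document; that field and the `Chunk` type are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_key: str  # hypothetical key shared by all copies of one logical document
    source: str
    text: str


def group_versions(chunks: list[Chunk]) -> dict[str, dict[str, list[Chunk]]]:
    """Group chunks as doc_key -> source -> chunks."""
    grouped: dict = defaultdict(lambda: defaultdict(list))
    for chunk in chunks:
        grouped[chunk.doc_key][chunk.source].append(chunk)
    # Convert to plain dicts so missing keys raise instead of being created.
    return {key: dict(by_source) for key, by_source in grouped.items()}


retrieved = [
    Chunk("chunk_conf_1", "api-auth-guide", "confluence", "Confluence copy"),
    Chunk("chunk_notion_1", "api-auth-guide", "notion", "Notion copy"),
    Chunk("chunk_web_1", "api-auth-guide", "website", "Website copy"),
]
versions = group_versions(retrieved)
```

Any `doc_key` with more than one source in the result is a candidate for a conflict note in the prompt.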
Update Propagation and Consistency
Changes in one source don't auto-propagate to others:
The Update Lag Problem:
Timeline:
10:00 AM: User updates Confluence (adds new section)
10:30 AM: Twig syncs Confluence → knowledge base updated
12:00 PM: User asks AI question → Gets new info ✓
But:
→ Notion still has old version
→ Website still has old version
→ No sync triggered for these sources
Next day:
→ Notion syncs (still old content)
→ Now knowledge base has conflicting chunks:
- Confluence chunks (new, correct)
- Notion chunks (old, stale)
AI retrieval may return a mix of both
The Content Drift Problem:
Initial state (all synchronized):
→ Confluence: "Use API key in header"
→ Notion: "Use API key in header"
→ Website: "Use API key in header"
Month 1: Confluence updated to "Use Bearer token"
Month 2: Website updated to "Use OAuth 2.0"
Month 3: Notion never updated (abandoned)
Current state:
→ Confluence: "Bearer token"
→ Website: "OAuth 2.0"
→ Notion: "API key" (stale)
All three in knowledge base, all retrieved
→ AI gives an inconsistent answer that mixes all three methods
→ User can't tell which one to use
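Drift of this kind can be flagged by comparing per-source timestamps. The sketch below assumes a last-updated timestamp is tracked for each copy; the `find_stale_sources` helper, the dates, and the 45-day threshold are illustrative.

```python
from datetime import datetime, timedelta, timezone


def find_stale_sources(last_updated: dict[str, datetime],
                       max_lag: timedelta) -> list[str]:
    """Return sources whose copy lags the newest copy by more than max_lag."""
    if not last_updated:
        return []
    newest = max(last_updated.values())
    return sorted(
        source for source, ts in last_updated.items()
        if newest - ts > max_lag
    )


# Hypothetical dates following the drift timeline above.
copies = {
    "notion": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "confluence": datetime(2024, 2, 1, tzinfo=timezone.utc),
    "website": datetime(2024, 3, 1, tzinfo=timezone.utc),
}
stale = find_stale_sources(copies, max_lag=timedelta(days=45))
```

Flagged sources could be down-weighted at retrieval time or surfaced to a content owner for review; which action fits depends on the deployment.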
Bidirectional Sync Impossibility
Most integrations are unidirectional:
The Read-Only Problem:
Twig's integration:
→ Reads from Confluence ✓
→ Writes to Confluence ✗ (not implemented)
Ideal bidirectional sync:
1. User updates Confluence → Twig syncs
2. Twig updates Notion with same change
3. Twig updates Website
4. All sources stay consistent
Reality:
→ Each source has its own auth/permissions
→ Each has different write APIs
→ Each has unique content structure
→ Automated writes risk data corruption
→ Twig is read-only by design (safer)
The Manual Reconciliation:
Current workflow:
1. User updates Confluence
2. Twig syncs Confluence
3. User must manually:
→ Copy changes to Notion
→ Update website repo (commit + deploy)
→ Update Zendesk article
4. Twig syncs each source independently
Human in the loop:
→ Error-prone
→ Time-consuming
→ Often forgotten
→ Leads to divergence over time
Metadata Conflicts and Merging
Beyond content, metadata can conflict:
Author Conflicts:
Same document:
→ Confluence: author = john@company.com
→ Notion: author = sarah@company.com
→ Website: author = docs-bot@company.com
Which to use in RAG metadata?
→ First author (John)?
→ Last author (docs-bot)?
→ All authors (John, Sarah, docs-bot)?
→ Source-specific (depends on where chunk came from)?
Tag Conflicts:
Confluence tags: ["api", "authentication", "v2"]
Notion tags: ["auth", "security", "oauth"]
Website categories: ["developers", "guides"]
Merging strategies:
1. Union: ["api", "authentication", "v2", "auth", "security", "oauth", "developers", "guides"]
→ Comprehensive but noisy
2. Intersection: [] (no common tags)
→ Too strict, loses all metadata
3. Normalize and merge: ["api", "authentication", "security"]
→ Requires tag mapping logic
4. Keep source-specific: {confluence: [...], notion: [...], website: [...]}
→ Preserves all, but complex queries
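The four merging strategies can be sketched as follows. The `CANONICAL` alias table is an assumption made for illustration: it is one possible mapping that yields the normalized list shown above, not a documented one.

```python
CONFLUENCE_TAGS = ["api", "authentication", "v2"]
NOTION_TAGS = ["auth", "security", "oauth"]
WEBSITE_TAGS = ["developers", "guides"]

# Hypothetical alias table: maps raw tags onto a small canonical vocabulary.
# Tags missing from the table are dropped by normalize_and_merge.
CANONICAL = {
    "api": "api",
    "authentication": "authentication",
    "auth": "authentication",
    "oauth": "authentication",
    "security": "security",
}


def union_tags(*tag_lists: list[str]) -> list[str]:
    """Strategy 1: every tag from every source."""
    return sorted(set().union(*tag_lists))


def intersect_tags(*tag_lists: list[str]) -> list[str]:
    """Strategy 2: only tags present in all sources."""
    sets = [set(tags) for tags in tag_lists]
    return sorted(set.intersection(*sets)) if sets else []


def normalize_and_merge(*tag_lists: list[str],
                        canonical: dict[str, str] = CANONICAL) -> list[str]:
    """Strategy 3: map tags to canonical names, drop unmapped tags."""
    return sorted({canonical[t] for tags in tag_lists for t in tags if t in canonical})


def keep_source_specific(**tags_by_source: list[str]) -> dict[str, list[str]]:
    """Strategy 4: keep each source's tags under its own key."""
    return {source: list(tags) for source, tags in tags_by_source.items()}
```

The alias table is where the maintenance cost of strategy 3 lives: every new tag in any source needs a mapping decision.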
Deletion Conflicts
One source deletes, others don't:
The Partial Deletion:
User deletes Confluence page (outdated)
→ Twig removes Confluence chunks from vector DB
But:
→ Notion copy still exists
→ Website copy still published
→ Twig keeps those chunks
AI behavior:
→ Query matches Notion/Website chunks
→ AI cites "deleted" content (from other sources)
→ User thinks: "I deleted this!"
→ Confusion about what's authoritative
Cascading Deletion Decision:
Question: Should deleting from one source delete from all?
Option A: Cascade delete
→ Delete Confluence → remove all duplicates
→ Risk: Loses content from other valid sources
Option B: Independent deletion
→ Delete Confluence → only remove Confluence chunks
→ Other sources unaffected
→ Current behavior
Option C: Soft-delete with prompt
→ Detect deletion in one source
→ Notify user: "Also in Notion and Website, delete those too?"
→ User decides
→ Requires UI/workflow changes
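The three options differ only in which chunks are removed and whether a user is asked first. A minimal sketch over an in-memory index, assuming chunks are tracked as doc key → source → chunk IDs; the `DeletionPolicy` enum and `plan_deletion` function are hypothetical and do not describe Twig's internals.

```python
from enum import Enum


class DeletionPolicy(Enum):
    CASCADE = "cascade"          # Option A
    INDEPENDENT = "independent"  # Option B
    PROMPT = "prompt"            # Option C


def plan_deletion(index: dict[str, dict[str, list[str]]],
                  doc_key: str,
                  deleted_source: str,
                  policy: DeletionPolicy) -> tuple[list[str], list[str]]:
    """Return (chunk IDs to remove now, other sources needing user confirmation).

    The index is not modified; the caller applies the plan.
    """
    copies = index.get(doc_key, {})
    to_remove = list(copies.get(deleted_source, []))
    other_sources = sorted(s for s in copies if s != deleted_source)

    if policy is DeletionPolicy.CASCADE:
        for source in other_sources:
            to_remove.extend(copies[source])
        return to_remove, []
    if policy is DeletionPolicy.PROMPT:
        # Remove only the deleted source's chunks; ask about the rest.
        return to_remove, other_sources
    return to_remove, []
```

Returning a plan instead of deleting in place keeps the cascade option reviewable before any content from other valid sources is lost.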
Source-of-Truth Ambiguity
No clear canonical source:
The Authority Problem:
Engineering team:
→ Considers Confluence canonical
→ "If it's not in Confluence, it's not official"
Marketing team:
→ Considers Website canonical
→ "Public docs are the source of truth"
Support team:
→ Uses Zendesk as primary
→ "Zendesk is what customers see"
No organization-wide agreement:
→ Twig doesn't know which to prefer
→ Treats all sources equally
→ Conflicts unresolved
Cross-Source Search and Attribution
Users need to know source of information:
Source Attribution in Responses:
User query: "API authentication"
Retrieved chunks from:
1. Confluence: "Use Bearer tokens" (2 days old)
2. Website: "Use OAuth 2.0" (1 day old)
3. Notion: "Use API keys" (2 weeks old)
AI response must indicate:
→ "According to the Website (latest): Use OAuth 2.0.
Note: Confluence mentions Bearer tokens, and
older Notion docs reference API keys."
Requires:
→ Source metadata in every chunk
→ Timestamp comparison
→ LLM prompt engineering to cite sources
→ UI to display source badges
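The metadata and timestamp requirements can be sketched as a context builder that labels each chunk with its source and date, newest first, before it reaches the LLM. The `RetrievedChunk` type, the label format, and the dates are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetrievedChunk:
    source: str
    updated_on: date
    text: str


def build_context(chunks: list[RetrievedChunk]) -> str:
    """Render chunks newest first, each prefixed with source and update date."""
    ordered = sorted(chunks, key=lambda c: c.updated_on, reverse=True)
    lines = []
    for position, chunk in enumerate(ordered):
        marker = " (latest)" if position == 0 else ""
        lines.append(
            f"[{chunk.source}, updated {chunk.updated_on.isoformat()}{marker}] {chunk.text}"
        )
    return "\n".join(lines)


# Hypothetical dates matching the relative ages in the example above.
context = build_context([
    RetrievedChunk("Confluence", date(2024, 1, 18), "Use Bearer tokens"),
    RetrievedChunk("Website", date(2024, 1, 19), "Use OAuth 2.0"),
    RetrievedChunk("Notion", date(2024, 1, 6), "Use API keys"),
])
```

A system prompt can then instruct the model to prefer the chunk marked as latest and to mention disagreeing sources by name.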
The Version Confusion:
No version tracking across sources:
→ Confluence: No version field
→ Notion: Version = "1.3" (manual)
→ Website: Git commit hash = "abc123"
Can't automatically determine:
→ Which is newest version semantically
→ Which represents production vs draft
→ Version lineage (is v1.4 based on v1.3?)
How to Solve
Combine the following measures:
- Implement content fingerprinting for duplicate detection
- Configure source priority
- Track last-updated-at per source
- Display source attribution in responses
- Implement periodic cross-source reconciliation
See Multi-Source Configuration.
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/rag-scenarios-and-solutions/data-integration/sync-conflicts.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


