Synthetic Data

Synthetic data generation enhances your knowledge base by creating additional training examples, questions, and variations of existing content. This improves your AI agent's ability to understand and respond to a wider range of user queries.

What is Synthetic Data?

Synthetic data refers to artificially generated content that supplements your original data. In the context of AI agents, this typically includes:

Question-Answer Pairs: Generated questions based on your documents
Paraphrased Content: Alternative phrasings of existing information
Edge Cases: Variations covering uncommon query patterns
Expanded Examples: Additional context and use cases

Why Use Synthetic Data?

Coverage Enhancement

Original documentation often doesn't cover all possible ways users might ask questions. Synthetic data fills these gaps by:

Generating multiple question variations for each concept
Creating questions for different expertise levels
Covering different phrasings and terminology
Addressing implicit questions not explicitly stated in docs

Improved Retrieval

More diverse data improves semantic search:

Better embedding coverage of the semantic space
Higher likelihood of matching user queries
Reduced dependency on exact keyword matches
Improved ranking of relevant results

Training and Testing

Synthetic data enables better evaluation:

Create test sets for measuring accuracy
Generate scenarios for stress testing
Validate coverage of key topics
Benchmark performance improvements

Types of Synthetic Data

Question Generation

Automatically generate questions from your existing content.

How It Works:

Analyze document chunks
Identify key concepts and facts
Generate relevant questions using LLMs
Associate questions with source content

Example:

Original content:

Twig AI supports integration with Slack, Microsoft Teams, 
and Zendesk. Each integration requires OAuth authentication 
and can be configured from the Integrations page.

Generated questions:

"What integrations does Twig AI support?"
"How do I connect Twig AI to Slack?"
"What authentication method is used for integrations?"
"Where can I configure integrations in Twig AI?"

Benefits:

Improves retrieval for question-style queries
Covers different ways of asking the same thing
Explicitly links questions to answers
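
The How It Works steps for question generation can be sketched in Python roughly as follows. This is a minimal illustration, not Twig AI's implementation: the `Chunk` and `GeneratedQuestion` types and the `generate_questions` callable (a stand-in for whatever LLM call you use) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    chunk_id: str  # hypothetical identifier for a document chunk
    text: str


@dataclass
class GeneratedQuestion:
    question: str
    source_chunk_id: str  # explicit link back to the source content


def questions_for_chunks(
    chunks: List[Chunk],
    generate_questions: Callable[[str], List[str]],
) -> List[GeneratedQuestion]:
    """Run each chunk through the generator and keep the link to its source."""
    results = []
    for chunk in chunks:
        for question in generate_questions(chunk.text):
            question = question.strip()
            if question:  # skip empty generator output
                results.append(GeneratedQuestion(question, chunk.chunk_id))
    return results


def fake_llm(text: str) -> List[str]:
    # Placeholder for a real LLM call; returns canned questions.
    return [
        "What integrations does Twig AI support?",
        "Where can I configure integrations in Twig AI?",
    ]


chunk = Chunk(
    "integrations-1",
    "Twig AI supports integration with Slack, Microsoft Teams, and Zendesk.",
)
linked = questions_for_chunks([chunk], fake_llm)
```

Keeping the chunk identifier on every generated question is what makes later validation and regeneration possible.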

Answer Generation

Create complete Q&A pairs from documentation.

How It Works:

Generate questions as described under Question Generation
Extract or generate concise answers
Include source references
Validate accuracy

Example Structure:

{
  "question": "How do I connect Twig AI to Slack?",
  "answer": "To connect Twig AI to Slack, go to the Integrations page and select Slack. You'll be prompted to authenticate using OAuth. Once authenticated, you can configure which channels to monitor.",
  "source": "integrations-guide.md",
  "section": "Slack Integration"
}

Benefits:

Provides ready-to-use Q&A format
Optimized for direct answering
Easier to validate and edit
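
Before a Q&A record in the structure above is stored, it can help to check that every field is present and non-empty. The sketch below assumes the four fields shown in the example; the `QAPair` class and `parse_qa_pair` function are hypothetical names, not part of any Twig AI API.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("question", "answer", "source", "section")


@dataclass
class QAPair:
    question: str
    answer: str
    source: str
    section: str


def parse_qa_pair(raw: str) -> QAPair:
    """Parse a JSON Q&A record, rejecting missing or empty fields."""
    data = json.loads(raw)
    for field in REQUIRED_FIELDS:
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field}")
    return QAPair(**{field: data[field] for field in REQUIRED_FIELDS})


record = json.dumps({
    "question": "How do I connect Twig AI to Slack?",
    "answer": "Go to the Integrations page and select Slack.",
    "source": "integrations-guide.md",
    "section": "Slack Integration",
})
pair = parse_qa_pair(record)
```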

Paraphrasing and Variation

Generate alternative phrasings of existing content.

Use Cases:

Different terminology (technical vs. layman)
Various language styles (formal vs. casual)
Different expertise levels (beginner vs. advanced)
Regional variations (US vs. UK English)

Example:

Original:

"Initialize the SDK by providing your API key and organization ID."

Variations:

"Start using the SDK by entering your API key and org ID."
"Set up the SDK with your credentials: API key and organization ID."
"To begin, configure the SDK using your API key and organization identifier."

Scenario Expansion

Create examples and use cases that illustrate concepts.

How It Works:

Identify abstract concepts
Generate concrete examples
Create step-by-step scenarios
Include expected outcomes

Example:

Original concept:

"You can filter data sources by category and date range."

Expanded scenario:

Example: Filtering Customer Support Tickets

1. Navigate to Data Sources
2. Select "Category" filter
3. Choose "Customer Support"
4. Set date range: Last 30 days
5. Click "Apply Filters"

Result: Only customer support tickets from the past month will be displayed, 
making it easier to train your agent on recent support interactions.

Edge Case Generation

Create examples covering unusual or complex scenarios.

Types of Edge Cases:

Error conditions
Unusual input formats
Boundary conditions
Complex multi-step workflows
Integration failure scenarios

Example:

Standard case:

"How do I upload a document?"

Edge cases:

"What happens if my document upload fails?"
"Can I upload documents larger than 10MB?"
"What if my PDF is password-protected?"
"How do I handle upload timeouts?"

Implementation Strategies

Automated Generation

Use LLMs to automatically generate synthetic data.

Process:

Configure generation rules and templates
Process document chunks through LLM
Generate questions, answers, or variations
Review and validate output
Add to knowledge base

Pros:

Fast and scalable
Consistent format
Covers large volumes

Cons:

Can generate incorrect information
Requires validation before use
Needs ongoing quality control

Semi-Automated Generation

Combine AI generation with human review.

Process:

Auto-generate candidates
Have reviewers check and edit each candidate
Route edited candidates through an approval workflow
Integrate approved items into the knowledge base

Pros:

Better quality control
Maintains accuracy
Allows expert refinement

Cons:

More time-intensive
Requires human resources
Slower to scale
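
The review-and-approval process for semi-automated generation can be modeled as a small state machine. This is a sketch under assumptions: the `Status` values, the `Candidate` class, and the `review` function are hypothetical and only illustrate the idea that nothing reaches the knowledge base until a reviewer approves it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Candidate:
    text: str
    status: Status = Status.PENDING


def review(candidate: Candidate, approve: bool, edited_text: str = "") -> Candidate:
    """Record a reviewer decision, applying an optional edit first."""
    if candidate.status is not Status.PENDING:
        raise ValueError("candidate has already been reviewed")
    if edited_text:
        candidate.text = edited_text
    candidate.status = Status.APPROVED if approve else Status.REJECTED
    return candidate


def approved_items(candidates: List[Candidate]) -> List[str]:
    """Only approved candidates are passed on to the knowledge base."""
    return [c.text for c in candidates if c.status is Status.APPROVED]


queue = [
    Candidate("How do I conect Twig AI to Slack?"),  # typo fixed by the reviewer
    Candidate("Does Twig AI support fax?"),          # not supported by any source
]
review(queue[0], approve=True, edited_text="How do I connect Twig AI to Slack?")
review(queue[1], approve=False)
```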

Manual Curation

Manually create synthetic examples based on real usage.

Process:

Analyze user queries
Identify gaps in coverage
Manually create Q&A pairs
Add examples and scenarios

Pros:

Highest quality
Addresses real user needs
Expert-validated

Cons:

Time-consuming
Limited scalability
Requires domain expertise

Best Practices

Quality Over Quantity

Focus on high-quality, accurate synthetic data
Validate generated content before adding to knowledge base
Remove or fix incorrect synthetic data
Run regular quality audits

Maintain Source Attribution

Link synthetic data to original sources
Track generation method and date
Enable easy updating when source changes
Allow filtering by data type (original vs. synthetic)
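
One way to carry this attribution is to store it as metadata on every knowledge base entry. The sketch below is illustrative only: the `Entry` fields and helper functions are hypothetical names, and the date is an arbitrary example value.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Entry:
    text: str
    data_type: str                           # "original" or "synthetic"
    source: str                              # document the entry comes from
    generation_method: Optional[str] = None  # e.g. "question_generation"
    generated_on: Optional[date] = None


def filter_by_type(entries: List[Entry], data_type: str) -> List[Entry]:
    """Select entries of one data type (original vs. synthetic)."""
    return [e for e in entries if e.data_type == data_type]


def synthetic_from(entries: List[Entry], source: str) -> List[Entry]:
    """Synthetic entries to revisit when the given source document changes."""
    return [e for e in entries if e.data_type == "synthetic" and e.source == source]


entries = [
    Entry("Each integration requires OAuth authentication.", "original",
          "integrations-guide.md"),
    Entry("What authentication method is used for integrations?", "synthetic",
          "integrations-guide.md", "question_generation", date(2024, 1, 15)),
]
```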

Balance Original and Synthetic

Don't let synthetic data overwhelm original content
As a rough guideline, keep about 60-70% original and 30-40% synthetic content
Prioritize original content in retrieval
Use synthetic data to enhance, not replace
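
The balance between original and synthetic content is simple arithmetic and easy to check automatically. The sketch below assumes a 40% ceiling on synthetic content, taken from the upper end of the guideline above; the function names are hypothetical.

```python
def synthetic_share(original_count: int, synthetic_count: int) -> float:
    """Fraction of the knowledge base that is synthetic."""
    total = original_count + synthetic_count
    if total == 0:
        return 0.0
    return synthetic_count / total


def within_budget(original_count: int, synthetic_count: int,
                  max_share: float = 0.40) -> bool:
    """True when synthetic content stays at or below the assumed ceiling."""
    return synthetic_share(original_count, synthetic_count) <= max_share


# 700 original and 300 synthetic items: 300 / 1000 = 30% synthetic.
share = synthetic_share(700, 300)
```

A check like this could run before each batch of generated items is added, so the limit is enforced rather than audited after the fact.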

Version Control

Track versions of synthetic data
Link to source document versions
Update when sources change
Archive outdated synthetic data
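
Linking synthetic data to source versions can be done by storing a hash of the source text at generation time and comparing it later. This is a minimal sketch, assuming sources are available as plain text keyed by document name; `SyntheticItem` and `stale_items` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a source document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class SyntheticItem:
    text: str
    source: str
    source_hash: str  # hash of the source at generation time


def stale_items(items: List[SyntheticItem],
                current_sources: Dict[str, str]) -> List[SyntheticItem]:
    """Items whose source changed or disappeared since they were generated."""
    stale = []
    for item in items:
        current = current_sources.get(item.source)
        if current is None or content_hash(current) != item.source_hash:
            stale.append(item)
    return stale
```

Items returned by `stale_items` are candidates for regeneration or archiving.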

Continuous Improvement

Monitor which synthetic data gets used
Remove unused synthetic examples
Generate new data based on gaps
A/B test with and without synthetic data

Configuration Examples

Question Generation Config

{
  "syntheticData": {
    "questionGeneration": {
      "enabled": true,
      "questionsPerChunk": 3,
      "questionTypes": ["what", "how", "why", "when"],
      "difficultyLevels": ["basic", "intermediate"],
      "includeContext": true
    }
  }
}
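
A configuration like this is worth validating before use. The sketch below reads the question generation settings shown above and applies a few sanity checks; the checks themselves (positive `questionsPerChunk`, non-empty `questionTypes`) are assumptions made for illustration, not documented constraints.

```python
import json

RAW_CONFIG = """
{
  "syntheticData": {
    "questionGeneration": {
      "enabled": true,
      "questionsPerChunk": 3,
      "questionTypes": ["what", "how", "why", "when"],
      "difficultyLevels": ["basic", "intermediate"],
      "includeContext": true
    }
  }
}
"""


def load_question_config(raw: str) -> dict:
    """Parse the config and reject values that cannot be meaningful."""
    cfg = json.loads(raw)["syntheticData"]["questionGeneration"]
    if not isinstance(cfg["enabled"], bool):
        raise ValueError("enabled must be true or false")
    count = cfg["questionsPerChunk"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count < 1:
        raise ValueError("questionsPerChunk must be a positive integer")
    if not cfg["questionTypes"]:
        raise ValueError("questionTypes must not be empty")
    return cfg


config = load_question_config(RAW_CONFIG)
```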

Paraphrasing Config

{
  "syntheticData": {
    "paraphrasing": {
      "enabled": true,
      "variationsPerChunk": 2,
      "styles": ["formal", "casual"],
      "preserveTechnicalTerms": true
    }
  }
}
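
The `preserveTechnicalTerms` option implies that paraphrases should keep key terms intact. How this is enforced is not described here, so the following is only a sketch of one possible check: it flags variations that drop any required term. The term list and the third, deliberately lossy variation are made up for illustration.

```python
from typing import Iterable, List


def missing_terms(variation: str, required_terms: Iterable[str]) -> List[str]:
    """Return required terms absent from the variation (case-insensitive)."""
    lowered = variation.lower()
    return [term for term in required_terms if term.lower() not in lowered]


required = ["SDK", "API key"]  # terms assumed to be non-negotiable
variations = [
    "Start using the SDK by entering your API key and org ID.",
    "Set up the SDK with your credentials: API key and organization ID.",
    "Plug in your credentials to get started.",  # drops both terms
]
flagged = {v: missing_terms(v, required) for v in variations}
```

Variations with a non-empty result can be rejected or sent for human review.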

Measuring Impact

Metrics to Track

Coverage: Percentage of queries that retrieve relevant synthetic data
Retrieval Improvement: Increase in retrieval accuracy with synthetic data enabled
User Satisfaction: Feedback on responses using synthetic data
Usage Rate: How often synthetic vs. original data is retrieved
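
Coverage and usage rate can both be derived from a retrieval log. The sketch below assumes a hypothetical log format in which each query is recorded as the list of data types ("original" or "synthetic") of the items it retrieved; relevance judgments are out of scope here.

```python
from typing import List

RetrievalLog = List[List[str]]  # one inner list of data types per query


def coverage(log: RetrievalLog) -> float:
    """Fraction of queries that retrieved at least one synthetic item."""
    if not log:
        return 0.0
    hits = sum(1 for retrieved in log if "synthetic" in retrieved)
    return hits / len(log)


def synthetic_usage_rate(log: RetrievalLog) -> float:
    """Fraction of all retrieved items that were synthetic."""
    total = sum(len(retrieved) for retrieved in log)
    if total == 0:
        return 0.0
    synthetic = sum(retrieved.count("synthetic") for retrieved in log)
    return synthetic / total


log = [
    ["original", "synthetic"],
    ["original"],
    ["synthetic", "synthetic"],
    [],  # query that retrieved nothing
]
```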

A/B Testing

Run experiments to validate synthetic data value:

Control Group: Users served from original content only
Test Group: Users served from original plus synthetic content
Compare: Response quality, user satisfaction, retrieval accuracy
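
A minimal sketch of such an experiment follows. It assumes users are identified by a string ID and that each group yields numeric satisfaction scores; both function names are hypothetical. Hashing the ID keeps each user in the same group across sessions. A real experiment would also need a significance test, which is omitted here.

```python
import hashlib
from statistics import mean
from typing import Sequence


def assign_group(user_id: str) -> str:
    """Deterministically place a user in the control or test group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return "test" if digest[0] % 2 == 0 else "control"


def mean_difference(control_scores: Sequence[float],
                    test_scores: Sequence[float]) -> float:
    """Positive values mean the test group scored higher on average."""
    return mean(test_scores) - mean(control_scores)
```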

Quality Metrics

Accuracy Rate: Percentage of accurate synthetic data
Relevance Score: How relevant synthetic data is to queries
Freshness: Age of synthetic data vs. source documents

Common Pitfalls

Over-Generation

Problem: Too much synthetic data dilutes quality
Solution: Set limits and focus on high-value additions

Inaccuracy

Problem: Generated content contradicts source material
Solution: Implement validation and review processes

Staleness

Problem: Synthetic data becomes outdated
Solution: Regenerate synthetic data whenever its source is updated

Loss of Context

Problem: Generated content lacks necessary context
Solution: Include surrounding information and metadata

Hallucination

Problem: LLMs generate plausible but false information
Solution: Strict validation against source material

Tools and Techniques

LLM Prompts for Question Generation

Given the following document excerpt, generate 3 relevant questions 
that users might ask about this content. Ensure questions are:
- Specific and answerable from the excerpt
- Varied in type (what, how, why)
- Natural and conversational

Document excerpt:
[Your content here]

Format your response as:
1. [Question 1]
2. [Question 2]
3. [Question 3]
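
The prompt above can be filled in and its numbered response parsed with a few lines of Python. This sketch covers only the string handling; sending the prompt to an LLM is left out, and `build_prompt` and `parse_numbered` are hypothetical helper names.

```python
import re
from typing import List

PROMPT_TEMPLATE = """Given the following document excerpt, generate {n} relevant questions
that users might ask about this content. Ensure questions are:
- Specific and answerable from the excerpt
- Varied in type (what, how, why)
- Natural and conversational

Document excerpt:
{excerpt}

Format your response as:
{slots}"""

NUMBERED_LINE = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")


def build_prompt(excerpt: str, n: int = 3) -> str:
    """Fill the template with the excerpt and n numbered answer slots."""
    slots = "\n".join(f"{i}. [Question {i}]" for i in range(1, n + 1))
    return PROMPT_TEMPLATE.format(n=n, excerpt=excerpt, slots=slots)


def parse_numbered(response: str) -> List[str]:
    """Extract the text of lines shaped like '1. ...', ignoring other lines."""
    questions = []
    for line in response.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            questions.append(match.group(1))
    return questions
```

Ignoring lines that do not match the numbered format keeps stray preamble text from the model out of the knowledge base.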

Validation Prompts

Review the following generated question and answer pair. 
Check if the answer is:
- Accurate according to the source
- Complete and helpful
- Free from hallucinations

Source: [Original content]
Question: [Generated question]
Answer: [Generated answer]

Is this Q&A pair accurate? (Yes/No)
If No, explain the issue:
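
The model's reply to a validation prompt still has to be turned into a decision. The sketch below assumes the reply starts with "Yes" or "No" as the prompt requests; anything else raises an error so the pair can be routed to a human instead of being silently accepted. `parse_verdict` is a hypothetical helper name.

```python
import re
from typing import Tuple

VERDICT = re.compile(r"\s*(yes|no)\b[\s.:,-]*(.*)", re.IGNORECASE | re.DOTALL)


def parse_verdict(response: str) -> Tuple[bool, str]:
    """Return (is_accurate, explanation) from a Yes/No validation reply."""
    match = VERDICT.match(response)
    if match is None:
        raise ValueError("reply does not start with Yes or No")
    is_accurate = match.group(1).lower() == "yes"
    explanation = match.group(2).strip()
    return is_accurate, explanation
```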

Advanced Techniques

Multi-Document Synthesis

Generate synthetic data that combines information from multiple sources:

Cross-reference related concepts
Create comprehensive Q&A from scattered info
Build workflows combining multiple docs

Adaptive Generation

Automatically generate synthetic data based on query patterns:

Monitor failed queries
Identify coverage gaps
Generate targeted synthetic content
Close knowledge base gaps
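
Gap detection from failed queries can start as simple counting. The sketch below assumes you already log failed queries as text and maintain a list of topics to watch; the substring matching, the `min_count` threshold, and the sample topics are illustrative assumptions, and a production system would more likely cluster queries by embedding.

```python
from collections import Counter
from typing import List


def coverage_gaps(failed_queries: List[str], topics: List[str],
                  min_count: int = 2) -> List[str]:
    """Topics that recur in failed queries, most frequent first."""
    counts = Counter()
    for query in failed_queries:
        lowered = query.lower()
        for topic in topics:
            if topic.lower() in lowered:
                counts[topic] += 1
    return [topic for topic, count in counts.most_common() if count >= min_count]


failed = [
    "How do I export to CSV?",
    "csv export broken",
    "Can I use SSO?",
    "export as csv please",
]
gaps = coverage_gaps(failed, ["csv", "sso"])
```

Topics returned here are the ones to target with new synthetic content.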

Persona-Based Generation

Create variations for different user types:

{
  "personas": [
    {
      "type": "technical",
      "tone": "formal",
      "detail": "high",
      "terminology": "technical"
    },
    {
      "type": "business",
      "tone": "professional",
      "detail": "medium",
      "terminology": "layman"
    }
  ]
}
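
A persona definition like the one above can be turned into a generation instruction for an LLM. This is a sketch only: the wording of the instruction and the `persona_instruction` helper are assumptions, and the field names are taken from the example.

```python
import json

PERSONAS_JSON = """
{
  "personas": [
    {"type": "technical", "tone": "formal", "detail": "high",
     "terminology": "technical"},
    {"type": "business", "tone": "professional", "detail": "medium",
     "terminology": "layman"}
  ]
}
"""


def persona_instruction(persona: dict) -> str:
    """Build a one-paragraph rewriting instruction from a persona record."""
    return (
        f"Rewrite the content for a {persona['type']} reader. "
        f"Use a {persona['tone']} tone, a {persona['detail']} level of detail, "
        f"and {persona['terminology']} terminology."
    )


personas = json.loads(PERSONAS_JSON)["personas"]
instructions = [persona_instruction(p) for p in personas]
```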

Next Steps

Chunking Strategies - Optimize how your source data is split
Data Manipulations - Transform and enrich your data further

Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET /dev/product/data-prep/synthetic-data.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
