Synthetic Data
TL;DR
Synthetic data generation enhances your knowledge base by creating additional training examples, questions, and variations of existing content. This improves your AI agent's ability to understand and respond to a wider range of user queries.
Key Takeaways
- What is Synthetic Data?
- Why Use Synthetic Data?
- Types of Synthetic Data
- Implementation Strategies
- Best Practices
- Configuration Examples
What is Synthetic Data?
Synthetic data refers to artificially generated content that supplements your original data. In the context of AI agents, this typically includes:
- Question-Answer Pairs: Generated questions based on your documents
- Paraphrased Content: Alternative phrasings of existing information
- Edge Cases: Variations covering uncommon query patterns
- Expanded Examples: Additional context and use cases
Why Use Synthetic Data?
Coverage Enhancement
Original documentation often doesn't cover all possible ways users might ask questions. Synthetic data fills these gaps by:
- Generating multiple question variations for each concept
- Creating questions for different expertise levels
- Covering different phrasings and terminology
- Addressing implicit questions not explicitly stated in docs
Improved Retrieval
More diverse data improves semantic search:
- Better embedding coverage of the semantic space
- Higher likelihood of matching user queries
- Reduced dependency on exact keyword matches
- Improved ranking of relevant results
Training and Testing
Synthetic data enables better evaluation:
- Create test sets for measuring accuracy
- Generate scenarios for stress testing
- Validate coverage of key topics
- Benchmark performance improvements
Types of Synthetic Data
Question Generation
Automatically generate questions from your existing content.
How It Works:
- Analyze document chunks
- Identify key concepts and facts
- Generate relevant questions using LLMs
- Associate questions with source content
Example:
Original content:
Twig AI supports integration with Slack, Microsoft Teams,
and Zendesk. Each integration requires OAuth authentication
and can be configured from the Integrations page.
Generated questions:
- "What integrations does Twig AI support?"
- "How do I connect Twig AI to Slack?"
- "What authentication method is used for integrations?"
- "Where can I configure integrations in Twig AI?"
Benefits:
- Improves retrieval for question-style queries
- Covers different ways of asking the same thing
- Explicitly links questions to answers
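The four steps above can be sketched as a small pipeline. This is a minimal illustration, not Twig AI's implementation: `Chunk`, `GeneratedQuestion`, `generate_questions`, and the stand-in `fake_llm` are all hypothetical names, and the model call is injected as a plain function so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    chunk_id: str
    text: str


@dataclass
class GeneratedQuestion:
    question: str
    source_chunk_id: str  # keeps the question linked to its source content


def generate_questions(
    chunks: List[Chunk],
    ask_llm: Callable[[str], List[str]],
) -> List[GeneratedQuestion]:
    """Run every chunk through the generator and keep the source link."""
    results: List[GeneratedQuestion] = []
    for chunk in chunks:
        seen = set()
        for raw in ask_llm(chunk.text):
            question = raw.strip()
            key = question.lower()
            # Skip blanks and case-insensitive duplicates within one chunk.
            if not question or key in seen:
                continue
            seen.add(key)
            results.append(GeneratedQuestion(question, chunk.chunk_id))
    return results


def fake_llm(text: str) -> List[str]:
    """Hypothetical stand-in for a real model call; returns canned output."""
    return [
        "What integrations does Twig AI support?",
        "what integrations does twig ai support?",  # duplicate, dropped
        "   ",  # blank, dropped
        "Where can I configure integrations in Twig AI?",
    ]


chunks = [
    Chunk(
        "integrations-001",
        "Twig AI supports integration with Slack, Microsoft Teams, and Zendesk. "
        "Each integration requires OAuth authentication and can be configured "
        "from the Integrations page.",
    )
]
questions = generate_questions(chunks, fake_llm)
for q in questions:
    print(q.source_chunk_id, "->", q.question)
```

Storing the chunk ID with each question is what makes the "Explicitly links questions to answers" benefit possible: a retrieved question can always be traced back to the content that answers it.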
Answer Generation
Create complete Q&A pairs from documentation.
How It Works:
- Generate questions as above
- Extract or generate concise answers
- Include source references
- Validate accuracy
Example Structure:
{
  "question": "How do I connect Twig AI to Slack?",
  "answer": "To connect Twig AI to Slack, go to the Integrations page and select Slack. You'll be prompted to authenticate using OAuth. Once authenticated, you can configure which channels to monitor.",
  "source": "integrations-guide.md",
  "section": "Slack Integration"
}
Benefits:
- Provides ready-to-use Q&A format
- Optimized for direct answering
- Easier to validate and edit
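As a rough sketch, the example structure can be modeled as a typed record with a few cheap structural checks before a pair is stored. The field names come from the example above; the `QAPair` class, its `problems` method, and the specific checks are assumptions for illustration, and they do not replace an accuracy review against the source.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source: str   # file the answer was drawn from
    section: str  # section within that file

    def problems(self) -> List[str]:
        """Return structural issues; an empty list means the record is well formed."""
        issues = []
        for name, value in asdict(self).items():
            if not value.strip():
                issues.append(f"{name} is empty")
        if self.question.strip() and not self.question.rstrip().endswith("?"):
            issues.append("question does not end with '?'")
        return issues


pair = QAPair(
    question="How do I connect Twig AI to Slack?",
    answer="To connect Twig AI to Slack, go to the Integrations page and select Slack.",
    source="integrations-guide.md",
    section="Slack Integration",
)
incomplete = QAPair(question="Connect to Slack", answer="", source="", section="Slack Integration")

print(pair.problems())
print(incomplete.problems())

# Serialize to the same JSON shape as the example structure.
pair_json = json.dumps(asdict(pair), indent=2)
print(pair_json)
```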
Paraphrasing and Variation
Generate alternative phrasings of existing content.
Use Cases:
- Different terminology (technical vs. layman)
- Various language styles (formal vs. casual)
- Different expertise levels (beginner vs. advanced)
- Regional variations (US vs. UK English)
Example:
Original:
"Initialize the SDK by providing your API key and organization ID."
Variations:
- "Start using the SDK by entering your API key and org ID."
- "Set up the SDK with your credentials: API key and organization ID."
- "To begin, configure the SDK using your API key and organization identifier."
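One risk with paraphrasing is that a variation silently drops or alters a technical term. The sketch below shows one way to check for that, assuming you maintain your own list of protected terms; the `missing_terms` helper and the term list are hypothetical and chosen to match the example above.

```python
import re
from typing import List


def missing_terms(variation: str, protected_terms: List[str]) -> List[str]:
    """Return the protected terms that do not appear as whole words in the variation."""
    missing = []
    for term in protected_terms:
        # Word boundaries avoid false matches inside longer words.
        pattern = r"\b" + re.escape(term) + r"\b"
        if not re.search(pattern, variation, flags=re.IGNORECASE):
            missing.append(term)
    return missing


protected = ["SDK", "API key"]  # assumption: terms that must survive paraphrasing

variations = [
    "Start using the SDK by entering your API key and org ID.",
    "Set up the SDK with your credentials: API key and organization ID.",
    "To begin, configure the SDK using your API key and organization identifier.",
]

for text in variations:
    print(missing_terms(text, protected), "<-", text)

# A paraphrase that drifted too far from the original wording.
drifted = "Set up the toolkit with your access token and organization ID."
print(missing_terms(drifted, protected))
```

A variation with a non-empty result is a candidate for regeneration or manual editing rather than automatic rejection, since some rewordings (such as "org ID") may be acceptable.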
Scenario Expansion
Create examples and use cases that illustrate concepts.
How It Works:
- Identify abstract concepts
- Generate concrete examples
- Create step-by-step scenarios
- Include expected outcomes
Example:
Original concept:
"You can filter data sources by category and date range."
Expanded scenario:
Example: Filtering Customer Support Tickets
1. Navigate to Data Sources
2. Select "Category" filter
3. Choose "Customer Support"
4. Set date range: Last 30 days
5. Click "Apply Filters"
Result: Only customer support tickets from the past month will be displayed,
making it easier to train your agent on recent support interactions.
Edge Case Generation
Create examples covering unusual or complex scenarios.
Types of Edge Cases:
- Error conditions
- Unusual input formats
- Boundary conditions
- Complex multi-step workflows
- Integration failure scenarios
Example:
Standard case:
"How do I upload a document?"
Edge cases:
- "What happens if my document upload fails?"
- "Can I upload documents larger than 10MB?"
- "What if my PDF is password-protected?"
- "How do I handle upload timeouts?"
Implementation Strategies
Automated Generation
Use LLMs to automatically generate synthetic data.
Process:
- Configure generation rules and templates
- Process document chunks through LLM
- Generate questions, answers, or variations
- Review and validate output
- Add to knowledge base
Pros:
- Fast and scalable
- Consistent format
- Covers large volumes
Cons:
- May require validation
- Can generate incorrect information
- Needs quality control
Semi-Automated Generation
Combine AI generation with human review.
Process:
- Auto-generate candidates
- Human review and editing
- Approval workflow
- Integration into knowledge base
Pros:
- Better quality control
- Maintains accuracy
- Allows expert refinement
Cons:
- More time-intensive
- Requires human resources
- Slower to scale
Manual Curation
Manually create synthetic examples based on real usage.
Process:
- Analyze user queries
- Identify gaps in coverage
- Manually create Q&A pairs
- Add examples and scenarios
Pros:
- Highest quality
- Addresses real user needs
- Expert-validated
Cons:
- Time-consuming
- Limited scalability
- Requires domain expertise
Best Practices
Quality Over Quantity
- Focus on high-quality, accurate synthetic data
- Validate generated content before adding to knowledge base
- Remove or fix incorrect synthetic data
- Run regular quality audits
Maintain Source Attribution
- Link synthetic data to original sources
- Track generation method and date
- Enable easy updating when source changes
- Allow filtering by data type (original vs. synthetic)
Balance Original and Synthetic
- Don't let synthetic data overwhelm original content
- Aim for roughly 60-70% original and 30-40% synthetic content
- Prioritize original content in retrieval
- Use synthetic data to enhance, not replace
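The ratio guideline translates into a simple cap. If synthetic items may make up at most p percent of the total, then for N original items the synthetic count S must satisfy S / (N + S) <= p / 100, which gives S <= N * p / (100 - p). A minimal sketch with hypothetical helper names:

```python
def max_synthetic(n_original: int, max_synthetic_percent: int) -> int:
    """Largest synthetic count that keeps synthetic share at or below the cap."""
    if not 0 <= max_synthetic_percent < 100:
        raise ValueError("max_synthetic_percent must be in the range 0-99")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return n_original * max_synthetic_percent // (100 - max_synthetic_percent)


def synthetic_share(n_original: int, n_synthetic: int) -> float:
    """Fraction of the knowledge base that is synthetic."""
    total = n_original + n_synthetic
    return n_synthetic / total if total else 0.0


# With 600 original items, a 40% cap allows 400 synthetic items
# and a 30% cap allows 257.
print(max_synthetic(600, 40))
print(max_synthetic(600, 30))
print(synthetic_share(600, 400))
```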
Version Control
- Track versions of synthetic data
- Link to source document versions
- Update when sources change
- Archive outdated synthetic data
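One way to tie synthetic data to source versions is to store a hash of the source text at generation time and compare it with the current text later. The sketch below assumes that approach; `SyntheticRecord`, its fields, and `find_stale` are hypothetical names, and the record values are placeholders.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a source document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class SyntheticRecord:
    record_id: str
    source: str        # original document the record was generated from
    source_hash: str   # hash of that document at generation time
    generated_on: str  # ISO date of generation
    method: str        # e.g. "question-generation" or "paraphrasing"


def find_stale(records: List[SyntheticRecord], current_sources: Dict[str, str]) -> List[str]:
    """Return IDs of records whose source changed or no longer exists."""
    stale = []
    for record in records:
        current_text = current_sources.get(record.source)
        if current_text is None or content_hash(current_text) != record.source_hash:
            stale.append(record.record_id)
    return stale


records = [
    SyntheticRecord("qa-1", "integrations-guide.md", content_hash("old text"), "2026-01-10", "question-generation"),
    SyntheticRecord("qa-2", "sdk-guide.md", content_hash("unchanged text"), "2026-01-10", "paraphrasing"),
    SyntheticRecord("qa-3", "removed-page.md", content_hash("gone"), "2026-01-10", "question-generation"),
]
current_sources = {
    "integrations-guide.md": "new text",
    "sdk-guide.md": "unchanged text",
}

stale_ids = find_stale(records, current_sources)
print(stale_ids)
```

Records flagged this way can be regenerated or archived, which also addresses the staleness pitfall described later on this page.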
Continuous Improvement
- Monitor which synthetic data gets used
- Remove unused synthetic examples
- Generate new data based on gaps
- A/B test with and without synthetic data
Configuration Examples
Question Generation Config
{
  "syntheticData": {
    "questionGeneration": {
      "enabled": true,
      "questionsPerChunk": 3,
      "questionTypes": ["what", "how", "why", "when"],
      "difficultyLevels": ["basic", "intermediate"],
      "includeContext": true
    }
  }
}
Paraphrasing Config
{
  "syntheticData": {
    "paraphrasing": {
      "enabled": true,
      "variationsPerChunk": 2,
      "styles": ["formal", "casual"],
      "preserveTechnicalTerms": true
    }
  }
}
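Before enabling these settings, it can help to estimate how many synthetic items they would produce. The sketch below reads the `enabled`, `questionsPerChunk`, and `variationsPerChunk` keys shown above. Combining both fragments into one object and multiplying by the chunk count are assumptions for illustration, not a description of how the product computes output volume.

```python
from typing import Any, Dict

# Both fragments from above merged into one object (assumed to be valid together).
config: Dict[str, Any] = {
    "syntheticData": {
        "questionGeneration": {"enabled": True, "questionsPerChunk": 3},
        "paraphrasing": {"enabled": True, "variationsPerChunk": 2},
    }
}

# Which per-chunk setting controls the volume of each feature.
PER_CHUNK_KEYS = {
    "questionGeneration": "questionsPerChunk",
    "paraphrasing": "variationsPerChunk",
}


def planned_items(cfg: Dict[str, Any], n_chunks: int) -> int:
    """Upper-bound estimate of synthetic items for a given number of chunks."""
    settings = cfg.get("syntheticData", {})
    total = 0
    for feature, per_chunk_key in PER_CHUNK_KEYS.items():
        feature_cfg = settings.get(feature, {})
        if feature_cfg.get("enabled", False):
            total += n_chunks * int(feature_cfg.get(per_chunk_key, 0))
    return total


# 200 chunks * (3 questions + 2 variations) = 1000 items
print(planned_items(config, 200))
```

Comparing this estimate with the size of your original content is a quick way to spot over-generation before it happens.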
Measuring Impact
Metrics to Track
- Coverage: Percentage of queries that retrieve relevant synthetic data
- Retrieval Improvement: Accuracy increase with synthetic data
- User Satisfaction: Feedback on responses using synthetic data
- Usage Rate: How often synthetic vs. original data is retrieved
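Coverage and usage rate can both be computed from a retrieval log. The sketch below assumes a hypothetical log format in which each entry lists the type ("original" or "synthetic") of every retrieved item, and it treats any retrieved synthetic item as relevant, which a real evaluation would need to verify.

```python
from typing import Dict, List

# Hypothetical log entries; the queries and results are illustrative only.
retrieval_log: List[Dict] = [
    {"query": "How do I connect to Slack?", "retrieved": ["original", "synthetic"]},
    {"query": "What auth do integrations use?", "retrieved": ["original"]},
    {"query": "Where do I set up integrations?", "retrieved": ["synthetic", "synthetic"]},
    {"query": "Can I export reports?", "retrieved": []},
]


def coverage(log: List[Dict]) -> float:
    """Share of queries that retrieved at least one synthetic item."""
    if not log:
        return 0.0
    hits = sum(1 for entry in log if "synthetic" in entry["retrieved"])
    return hits / len(log)


def usage_rate(log: List[Dict]) -> float:
    """Share of all retrieved items that were synthetic."""
    retrieved = [kind for entry in log for kind in entry["retrieved"]]
    if not retrieved:
        return 0.0
    return retrieved.count("synthetic") / len(retrieved)


print(f"coverage:   {coverage(retrieval_log):.0%}")
print(f"usage rate: {usage_rate(retrieval_log):.0%}")
```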
A/B Testing
Run experiments to validate synthetic data value:
- Control Group: Users served from a knowledge base without synthetic data
- Test Group: Users served from a knowledge base that includes synthetic data
- Compare: Response quality, user satisfaction, retrieval accuracy
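For the comparison to be meaningful, each user should stay in the same group for the whole experiment. A common way to do this is to hash the user ID into a bucket. The sketch below assumes that approach; the function names are hypothetical and the satisfaction scores are invented purely to show the comparison.

```python
import hashlib
from statistics import mean
from typing import Dict, List


def assign_group(user_id: str, test_percent: int = 50) -> str:
    """Deterministically place a user in 'test' or 'control'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0-99
    return "test" if bucket < test_percent else "control"


def score_difference(scores: Dict[str, List[float]]) -> float:
    """Mean score of the test group minus mean score of the control group."""
    return mean(scores["test"]) - mean(scores["control"])


# The same user always lands in the same group.
print(assign_group("user-42"), assign_group("user-42"))

# Invented satisfaction ratings, for illustration only.
scores = {"control": [3.0, 4.0, 5.0], "test": [4.0, 5.0, 5.0]}
print(round(score_difference(scores), 2))
```

A difference in means is only a starting point; with real traffic you would also want enough samples and a significance test before drawing conclusions.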
Quality Metrics
- Accuracy Rate: Percentage of accurate synthetic data
- Relevance Score: How relevant synthetic data is to queries
- Freshness: Age of synthetic data vs. source documents
Common Pitfalls
Over-Generation
- Problem: Too much synthetic data dilutes quality
- Solution: Set limits and focus on high-value additions
Inaccuracy
- Problem: Generated content contradicts source material
- Solution: Implement validation and review processes
Staleness
- Problem: Synthetic data becomes outdated
- Solution: Regular regeneration tied to source updates
Loss of Context
- Problem: Generated content lacks necessary context
- Solution: Include surrounding information and metadata
Hallucination
- Problem: LLMs generate plausible but false information
- Solution: Strict validation against source material
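A cheap first pass at validation against source material is a lexical grounding check: flag any word in a generated answer that never appears in the source. The sketch below is a hypothetical pre-filter only. It cannot catch false statements built from source vocabulary, and it will flag harmless rewordings, so it complements rather than replaces LLM-based or human review.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_terms(answer: str, source: str) -> List[str]:
    """Words in the answer that do not occur anywhere in the source."""
    return sorted(tokens(answer) - tokens(source))


source = (
    "Twig AI supports integration with Slack, Microsoft Teams, and Zendesk. "
    "Each integration requires OAuth authentication and can be configured "
    "from the Integrations page."
)

grounded = "Twig AI supports Slack and Zendesk."
suspicious = "Twig AI supports Discord and Jira."

print(unsupported_terms(grounded, source))
print(unsupported_terms(suspicious, source))
```

Answers with a non-empty result can be routed to the review queue first, since they mention something the source does not.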
Tools and Techniques
LLM Prompts for Question Generation
Given the following document excerpt, generate 3 relevant questions
that users might ask about this content. Ensure questions are:
- Specific and answerable from the excerpt
- Varied in type (what, how, why)
- Natural and conversational
Document excerpt:
[Your content here]
Format your response as:
1. [Question 1]
2. [Question 2]
3. [Question 3]
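A minimal sketch of using this prompt programmatically: fill in the excerpt, then parse the numbered list that comes back. The template text follows the prompt above; `build_prompt`, `parse_questions`, and the sample response are hypothetical, and the actual model call is left out because it depends on your provider.

```python
import re
from typing import List

PROMPT_TEMPLATE = """Given the following document excerpt, generate {n} relevant questions
that users might ask about this content. Ensure questions are:
- Specific and answerable from the excerpt
- Varied in type (what, how, why)
- Natural and conversational
Document excerpt:
{excerpt}
Format your response as a numbered list, one question per line."""


def build_prompt(excerpt: str, n: int = 3) -> str:
    """Insert the excerpt and question count into the template."""
    return PROMPT_TEMPLATE.format(n=n, excerpt=excerpt.strip())


def parse_questions(response: str) -> List[str]:
    """Extract questions from lines shaped like '1. ...' or '2) ...'."""
    questions = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            questions.append(match.group(1))
    return questions


prompt = build_prompt("Each integration requires OAuth authentication.")

# Sample response text, written by hand to show the parsing step.
sample_response = """Here are the questions:
1. What authentication method is used for integrations?
2. How do I authenticate an integration?
3) Why is OAuth required?
"""
parsed = parse_questions(sample_response)
print(parsed)
```

Parsing defensively matters because models do not always follow the requested format exactly; lines that are not numbered items are simply ignored here.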
Validation Prompts
Review the following generated question and answer pair.
Check if the answer is:
- Accurate according to the source
- Complete and helpful
- Free from hallucinations
Source: [Original content]
Question: [Generated question]
Answer: [Generated answer]
Is this Q&A pair accurate? (Yes/No)
If No, explain the issue:
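The Yes/No format makes the validator's reply easy to act on automatically. The sketch below shows one way to parse it, assuming the reply starts with the verdict; `parse_verdict` is a hypothetical helper, and anything it cannot interpret is returned as `None` so it can be sent to human review rather than silently accepted.

```python
import re
from typing import Optional, Tuple


def parse_verdict(response: str) -> Tuple[Optional[bool], str]:
    """Return (is_accurate, explanation); is_accurate is None if unparseable."""
    text = response.strip()
    # The verdict must be a whole word, so "Not sure" is not read as "No".
    match = re.match(r"(yes|no)\b[\s.,:;-]*(.*)", text, re.IGNORECASE | re.DOTALL)
    if not match:
        return None, text
    is_accurate = match.group(1).lower() == "yes"
    explanation = " ".join(match.group(2).split())  # collapse whitespace
    return is_accurate, explanation


print(parse_verdict("Yes"))
print(parse_verdict("No\nThe answer mentions channel monitoring, which is not in the source."))
print(parse_verdict("Not sure, the source is ambiguous."))
```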
Advanced Techniques
Multi-Document Synthesis
Generate synthetic data that combines information from multiple sources:
- Cross-reference related concepts
- Create comprehensive Q&A from scattered info
- Build workflows combining multiple docs
Adaptive Generation
Automatically generate synthetic data based on query patterns:
- Monitor failed queries
- Identify coverage gaps
- Generate targeted synthetic content
- Close knowledge base gaps
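As a rough illustration of the first two steps, the sketch below counts which terms recur across failed queries and reports those that appear often enough to suggest a coverage gap. The stopword list, threshold, and sample queries are assumptions; a production system would more likely cluster queries by embedding than by keyword.

```python
import re
from collections import Counter
from typing import List, Tuple

# Small illustrative stopword list; extend it for real use.
STOPWORDS = {"how", "do", "i", "to", "can", "not", "the", "a", "is", "my", "what"}


def coverage_gaps(failed_queries: List[str], min_count: int = 2) -> List[Tuple[str, int]]:
    """Terms that appear in at least min_count failed queries, most frequent first."""
    counts: Counter = Counter()
    for query in failed_queries:
        # Count each term once per query so one long query cannot dominate.
        terms = set(re.findall(r"[a-z0-9]+", query.lower())) - STOPWORDS
        counts.update(terms)
    gaps = [(term, n) for term, n in counts.items() if n >= min_count]
    return sorted(gaps, key=lambda item: (-item[1], item[0]))


failed = [
    "How do I export reports?",
    "Export to CSV not working",
    "Can I export data?",
    "Reset my password",
]

gaps = coverage_gaps(failed)
print(gaps)
```

Each reported term is a candidate topic for targeted question generation, which closes the loop described in the list above.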
Persona-Based Generation
Create variations for different user types:
{
  "personas": [
    {
      "type": "technical",
      "tone": "formal",
      "detail": "high",
      "terminology": "technical"
    },
    {
      "type": "business",
      "tone": "professional",
      "detail": "medium",
      "terminology": "layman"
    }
  ]
}
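One way to apply persona definitions like these is to turn each one into a style instruction that is prepended to a generation prompt. The sketch below uses the four fields shown above; the `persona_instruction` helper and the wording of the instruction are hypothetical.

```python
from typing import Dict, List

personas: List[Dict[str, str]] = [
    {"type": "technical", "tone": "formal", "detail": "high", "terminology": "technical"},
    {"type": "business", "tone": "professional", "detail": "medium", "terminology": "layman"},
]


def persona_instruction(persona: Dict[str, str]) -> str:
    """Build a style instruction from one persona definition."""
    return (
        f"Rewrite the content for a {persona['type']} audience. "
        f"Use a {persona['tone']} tone, a {persona['detail']} level of detail, "
        f"and {persona['terminology']} terminology."
    )


instructions = [persona_instruction(p) for p in personas]
for text in instructions:
    print(text)
```

Generating one variation per persona from the same source chunk keeps the variations comparable and makes it easy to trace each one back to both its source and its target audience.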
Next Steps
- Chunking Strategies - Optimize how your source data is split
- Data Manipulations - Transform and enrich your data further
Agent Instructions: Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.
Perform an HTTP GET request on the current page URL with the ask query parameter:
GET /dev/product/data-prep/synthetic-data.md?ask=<question>
The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
Last updated January 26, 2026


