Data Manipulations
Data manipulation encompasses the various transformations, enrichments, and processing techniques applied to your data to improve its quality, structure, and usability for AI agents
TL;DR
Data manipulation encompasses the various transformations, enrichments, and processing techniques applied to your data to improve its quality, structure, and usability for AI agents. This includes cleaning, formatting, enriching with metadata, and optimizing for retrieval.
Overview
Raw data rarely arrives in a format suited to AI consumption. Data manipulation techniques transform your content into an optimized form that enables:
- Better retrieval accuracy
- Improved response quality
- Enhanced filtering and categorization
- More efficient processing
Core Manipulation Techniques
Data Cleaning
Remove noise and inconsistencies from your data.
Common Cleaning Operations:
Whitespace Normalization
- Remove extra spaces, tabs, and newlines
- Standardize line endings
- Clean up formatting artifacts
Before: "This is a sentence.\n\n\n\nNext paragraph."
After: "This is a sentence.\n\nNext paragraph."
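The whitespace rules above can be sketched with the standard library. This is a minimal illustration, not a definitive implementation; the function name `normalize_whitespace` and the choice to allow at most one blank line between paragraphs are assumptions.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse extra spaces and tabs and allow at most one blank line."""
    # Standardize line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly next to a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more consecutive newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Applied to the "Before" string above, this returns the "After" string.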
Character Encoding
- Fix encoding issues (UTF-8, ASCII)
- Handle special characters
- Normalize unicode variations
Before: "cafÃ©" (UTF-8 bytes decoded as Latin-1) or "cafe\u0301" (decomposed form)
After: "café"
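Two common encoding problems are decomposed Unicode characters and UTF-8 bytes that were misread as Latin-1. The sketch below handles both with the standard library only. The helper names are hypothetical, and the Latin-1 round trip is a narrow heuristic; a dedicated library such as ftfy covers far more cases.

```python
import unicodedata


def normalize_unicode(text: str) -> str:
    """Compose characters such as 'e' + combining accent into one code point (NFC)."""
    return unicodedata.normalize("NFC", text)


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was decoded as Latin-1; return the input if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind, so leave it unchanged.
        return text
```

Both helpers return text that is already clean unchanged, so they are safe to run on every document.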
HTML/XML Cleanup
- Strip HTML tags
- Decode HTML entities
- Remove CSS and JavaScript
- Extract meaningful text
Before: "<p>Hello <strong>World</strong>!</p>"
After: "Hello World!"
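The HTML cleanup steps can be approximated with Python's built-in `html.parser`. This is a sketch under simplifying assumptions: the class and function names are hypothetical, and only `script` and `style` content is dropped. For messy real-world pages, a parser such as Beautiful Soup is usually a better fit.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; automatically.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Join the fragments and collapse any leftover whitespace.
        return " ".join("".join(self._parts).split())


def strip_html(markup: str) -> str:
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Note that adjacent block elements with no whitespace between them are joined without a separator, so a production version would likely insert a space or newline at block boundaries.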
Noise Removal
- Remove boilerplate text (headers, footers)
- Strip navigation elements
- Delete advertising content
- Remove redundant copyright notices
Data Formatting
Standardize structure and format across your content.
Markdown Normalization
- Standardize heading styles
- Consistent list formatting
- Proper code block formatting
- Table standardization
Before:
# Heading
** Bold text **
- item 1
* item 2
After:
# Heading
**Bold text**
- item 1
- item 2
Date and Time Standardization
- Convert to ISO 8601 format
- Handle time zones consistently
- Parse various date formats
Before: "Jan 15, 2024", "15/01/2024", "2024-1-15"
After: "2024-01-15T00:00:00Z" (date-only inputs are assumed to be midnight UTC; "15/01/2024" is read as day-first)
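One way to parse several date formats is to try a list of known patterns in order. The sketch below is illustrative only: the format list, the day-first reading of slash-separated dates, and the midnight-UTC default are all assumptions you would adapt to your sources.

```python
from datetime import datetime, timezone

# Assumed input formats, tried in order. Slash dates are read as day/month/year.
KNOWN_FORMATS = ("%b %d, %Y", "%d/%m/%Y", "%Y-%m-%d")


def to_iso8601(raw: str) -> str:
    """Convert a date string in one of KNOWN_FORMATS to an ISO 8601 UTC timestamp."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # Not this format; try the next one.
        # Date-only input carries no time zone, so midnight UTC is assumed.
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"Unrecognized date format: {raw!r}")
```

Raising on unknown formats, rather than guessing, keeps ambiguous dates visible during validation.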
URL Normalization
- Standardize URL formats
- Remove tracking parameters
- Handle relative URLs
- Extract meaningful link text
Before: "https://example.com/page?utm_source=email&sessionid=123"
After: "https://example.com/page"
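Tracking-parameter removal can be sketched with `urllib.parse`. The set of parameter names below is an assumption chosen to match the example; real deployments usually maintain a longer, source-specific list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; anything starting with "utm_" is also dropped.
TRACKING_PARAMS = {"sessionid", "gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Lowercase the scheme and host and remove tracking query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), parts.fragment)
    )
```

Parameters that are not in the tracking list are preserved in their original order.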
Text Transformations
Modify text content to improve processing.
Case Normalization
- Lowercase for case-insensitive matching
- Title case for headings
- Proper case for names
Punctuation Handling
- Standardize quotation marks
- Handle apostrophes consistently
- Remove or standardize special punctuation
Language Processing
- Stemming: Reduce words to root form
- Lemmatization: Convert to dictionary form
- Tokenization: Split into words/tokens
Before: "running", "ran", "runs"
After (stemmed): "run", "ran", "run" (rule-based stemmers leave irregular forms such as "ran" unchanged)
After (lemmatized): "run", "run", "run"
Abbreviation Expansion
- Expand common abbreviations
- Handle acronyms consistently
- Add full forms as metadata
Before: "API", "e.g.", "i.e."
After: "API (Application Programming Interface)", "for example", "that is"
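A dictionary-driven pass is one simple way to expand abbreviations and to record acronym full forms as metadata. The two lookup tables and both function names below are hypothetical; a real pipeline would load a domain glossary instead.

```python
import re

# Hypothetical lookup tables for illustration.
EXPANSIONS = {"e.g.": "for example", "i.e.": "that is"}
ACRONYMS = {"API": "Application Programming Interface"}

# Match an abbreviation only when it does not continue a preceding word.
_ABBREVIATION = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(key) for key in EXPANSIONS) + r")"
)


def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their full wording."""
    return _ABBREVIATION.sub(lambda match: EXPANSIONS[match.group(1)], text)


def acronym_metadata(text: str) -> dict:
    """Return the full forms of known acronyms that appear as whole words."""
    return {
        short: full
        for short, full in ACRONYMS.items()
        if re.search(rf"\b{re.escape(short)}\b", text)
    }
```

Storing acronym expansions as metadata, instead of rewriting the text inline, keeps the pass safe to run more than once.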
Metadata Enrichment
Add contextual information to improve retrieval and filtering.
Source Metadata
Track where content originated:
{
"source_type": "confluence",
"source_url": "https://wiki.company.com/page/123",
"source_title": "API Documentation",
"author": "John Doe",
"last_modified": "2024-01-15T10:30:00Z",
"version": "2.1"
}
Content Classification
Automatically categorize content:
{
"category": "technical-documentation",
"subcategory": "api-reference",
"topics": ["authentication", "REST API", "OAuth"],
"complexity": "intermediate",
"content_type": "how-to"
}
Semantic Metadata
Add meaning and context:
{
"key_concepts": ["rate limiting", "API keys", "authentication"],
"related_topics": ["security", "developer-tools"],
"prerequisites": ["account setup", "API key generation"],
"target_audience": "developers"
}
Structural Metadata
Capture document structure:
{
"heading_hierarchy": ["Getting Started", "Authentication", "API Keys"],
"section_type": "setup-guide",
"reading_time_minutes": 5,
"code_blocks": 3,
"external_links": 2
}
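Some structural fields can be derived directly from Markdown source. The sketch below is a rough approximation: the function name, the 200 words-per-minute reading speed, and the regular expression for external links are all assumptions.

```python
import math
import re

FENCE = "`" * 3  # Markdown code fence marker


def structural_metadata(markdown: str, words_per_minute: int = 200) -> dict:
    """Estimate reading time and count fenced code blocks and external links."""
    # Each fenced block contributes an opening and a closing fence.
    code_blocks = markdown.count(FENCE) // 2
    # Count inline links of the form [text](http...) as external links.
    external_links = len(re.findall(r"\]\(https?://[^)\s]+\)", markdown))
    words = len(markdown.split())
    return {
        "reading_time_minutes": max(1, math.ceil(words / words_per_minute)),
        "code_blocks": code_blocks,
        "external_links": external_links,
    }
```

Fields such as `heading_hierarchy` and `section_type` need a real Markdown parser or a classifier and are outside the scope of this sketch.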
Temporal Metadata
Track time-related information:
{
"created_at": "2023-06-01T00:00:00Z",
"updated_at": "2024-01-15T10:30:00Z",
"valid_from": "2024-01-01T00:00:00Z",
"expires_at": "2025-01-01T00:00:00Z",
"freshness_score": 0.95
}
Advanced Manipulations
Entity Extraction
Identify and extract key entities:
Types of Entities:
- People: Names, roles, contacts
- Organizations: Companies, departments, teams
- Products: Software, services, tools
- Locations: Offices, regions, data centers
- Technical Terms: APIs, protocols, technologies
Example:
{
"text": "Contact Jane Smith at jane@company.com for API access to our DataSync service.",
"entities": {
"people": [{"name": "Jane Smith", "email": "jane@company.com"}],
"products": ["DataSync"],
"topics": ["API access"]
}
}
Relationship Mapping
Identify connections between content pieces:
{
"document_id": "doc_123",
"relationships": [
{
"type": "prerequisite",
"target": "doc_045",
"description": "Setup guide required first"
},
{
"type": "related",
"target": "doc_234",
"description": "Advanced configuration options"
}
]
}
Intent Classification
Determine the purpose of content:
{
"primary_intent": "instructional",
"secondary_intents": ["troubleshooting", "reference"],
"action_items": ["setup", "configure", "test"],
"question_types_addressed": ["how-to", "what-is"]
}
Sentiment and Tone
Analyze content characteristics:
{
"tone": "formal",
"sentiment": "neutral",
"reading_level": "college",
"technical_density": "high"
}
Language Detection and Translation
Handle multilingual content:
{
"detected_language": "en",
"confidence": 0.99,
"has_translations": true,
"available_languages": ["en", "es", "fr"],
"translation_status": "complete"
}
Content Enhancement
Summary Generation
Create concise summaries for quick understanding:
{
"content": "... [full content] ...",
"summary": "This guide explains how to authenticate with the API using OAuth 2.0. It covers setup, token generation, and refresh workflows.",
"key_points": [
"OAuth 2.0 is the primary authentication method",
"Tokens expire after 1 hour",
"Refresh tokens are valid for 30 days"
]
}
Title and Heading Extraction
Identify and standardize titles:
{
"original_title": "api-auth-guide.md",
"extracted_title": "API Authentication Guide",
"main_heading": "Authenticating with the API",
"subheadings": [
"OAuth 2.0 Setup",
"Token Management",
"Best Practices"
]
}
Code Extraction and Annotation
Handle code snippets specially:
{
"code_blocks": [
{
"language": "python",
"code": "import requests\n...",
"purpose": "Example API authentication",
"line_numbers": [45, 52]
}
]
}
Link Processing
Extract and enrich hyperlinks:
{
"links": [
{
"url": "https://api.example.com/docs",
"text": "API Documentation",
"type": "external",
"status": "active",
"description": "Official API reference"
}
]
}
Deduplication
Remove duplicate or highly similar content.
Exact Duplicates
Remove identical content:
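# Example logic: skip chunks whose hash has already been seen
chunk_hash = content_hash(new_chunk)
if chunk_hash in existing_hashes:
    skip_chunk()
else:
    existing_hashes.add(chunk_hash)
    add_chunk()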
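A runnable version of the exact-duplicate check might look like the following. It is a sketch, not the product's implementation: SHA-256 as the hash function and the whitespace and case normalization applied before hashing are assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing (assumed normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(chunks: list) -> list:
    """Keep the first occurrence of each distinct chunk, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest in seen:
            continue  # Already indexed; skip this chunk.
        seen.add(digest)
        unique.append(chunk)
    return unique
```

Normalizing before hashing means chunks that differ only in spacing or letter case are treated as identical; hash the raw text instead if those differences matter.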
Near Duplicates
Identify and merge similar content:
Techniques:
- Cosine similarity on embeddings
- Fuzzy string matching
- MinHash/LSH algorithms
# 0.95 is an example threshold; tune it for your embedding model
similarity = cosine_similarity(embedding1, embedding2)
if similarity > 0.95:
    merge_or_skip()
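For reference, cosine similarity can be computed without any third-party library. The sketch below assumes embeddings are plain lists of floats of equal length; the function names and the 0.95 default mirror the example above and are not a recommendation.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(a: list, b: list, threshold: float = 0.95) -> bool:
    """Flag two embeddings as near duplicates when similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold
```

Comparing every pair of chunks scales quadratically, which is why MinHash/LSH or an approximate nearest-neighbor index is typically used for large collections.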
Version Consolidation
Handle multiple versions of the same document:
{
"consolidation_strategy": "latest",
"versions": [
{"id": "v1", "date": "2023-01-01", "action": "archive"},
{"id": "v2", "date": "2024-01-01", "action": "keep"}
]
}
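The "latest" strategy shown above could be applied as follows. This is a sketch with a hypothetical function name; it assumes every version carries an ISO 8601 `date` string, which sorts correctly as plain text.

```python
def consolidate_latest(versions: list) -> list:
    """Mark the most recent version 'keep' and every other version 'archive'."""
    # ISO 8601 dates compare correctly as strings, so no parsing is needed.
    latest = max(versions, key=lambda version: version["date"])
    return [
        {**version, "action": "keep" if version is latest else "archive"}
        for version in versions
    ]
```

If two versions share the same date, the first one in the list is kept.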
Data Validation
Ensure data quality through validation.
Schema Validation
Verify data structure:
{
"required_fields": ["content", "source", "timestamp"],
"optional_fields": ["metadata", "tags"],
"validation_rules": {
"content": {"min_length": 10, "max_length": 10000},
"timestamp": {"format": "ISO8601"}
}
}
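A validator for the schema above might be sketched like this. The function name and error messages are hypothetical, and `datetime.fromisoformat` accepts only a subset of ISO 8601, so treat the timestamp check as an approximation.

```python
from datetime import datetime

SCHEMA = {
    "required_fields": ["content", "source", "timestamp"],
    "optional_fields": ["metadata", "tags"],
    "validation_rules": {
        "content": {"min_length": 10, "max_length": 10000},
        "timestamp": {"format": "ISO8601"},
    },
}


def validate_record(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means the record is valid."""
    errors = []
    for field in schema["required_fields"]:
        if field not in record:
            errors.append(f"missing required field: {field}")
    content = record.get("content")
    if isinstance(content, str):
        limits = schema["validation_rules"]["content"]
        if not limits["min_length"] <= len(content) <= limits["max_length"]:
            errors.append("content length out of range")
    timestamp = record.get("timestamp")
    if isinstance(timestamp, str):
        try:
            # Older Python versions reject a trailing "Z", so map it to an offset.
            datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        except ValueError:
            errors.append("timestamp is not ISO 8601")
    return errors
```

Collecting all problems, rather than stopping at the first, makes batch reports more useful.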
Content Validation
Check content quality:
- Minimum Length: Ensure chunks aren't too short
- Maximum Length: Prevent oversized chunks
- Language Check: Verify expected language
- Encoding Validation: Ensure proper encoding
Metadata Validation
Verify metadata completeness:
{
"metadata_completeness": 0.85,
"missing_fields": ["author", "category"],
"validation_status": "warning"
}
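A completeness report like the one above can be computed as the share of expected fields that are present and non-empty. The function name, the definition of "empty", and the rule that any missing field yields a "warning" status are assumptions.

```python
def metadata_report(metadata: dict, expected_fields: list) -> dict:
    """Summarize which expected metadata fields are missing or empty."""
    missing = [
        field for field in expected_fields
        if metadata.get(field) in (None, "", [])
    ]
    completeness = 1 - len(missing) / len(expected_fields)
    return {
        "metadata_completeness": round(completeness, 2),
        "missing_fields": missing,
        # Assumed policy: any gap is a warning, not a hard failure.
        "validation_status": "warning" if missing else "ok",
    }
```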
Filtering and Exclusion
Remove unwanted content systematically.
Content-Based Filtering
Exclude based on content characteristics:
{
"exclusion_rules": [
{"type": "length", "min": 50, "max": 5000},
{"type": "language", "allowed": ["en", "es"]},
{"type": "contains", "patterns": ["deprecated", "obsolete"]}
]
}
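The rules above leave some room for interpretation. The sketch below assumes that a chunk is excluded when its length falls outside the `min`/`max` range, when its language is not in `allowed`, or when it contains any listed pattern. The function name and the `language` field on the chunk are hypothetical.

```python
EXCLUSION_RULES = [
    {"type": "length", "min": 50, "max": 5000},
    {"type": "language", "allowed": ["en", "es"]},
    {"type": "contains", "patterns": ["deprecated", "obsolete"]},
]


def should_exclude(chunk: dict, rules: list = EXCLUSION_RULES) -> bool:
    """Return True if any rule says the chunk should be dropped."""
    text = chunk["content"]
    for rule in rules:
        if rule["type"] == "length" and not rule["min"] <= len(text) <= rule["max"]:
            return True
        if rule["type"] == "language" and chunk.get("language") not in rule["allowed"]:
            return True
        if rule["type"] == "contains" and any(
            pattern in text.lower() for pattern in rule["patterns"]
        ):
            return True
    return False
```

Plain substring matching will also exclude chunks that merely mention a term such as "deprecated", so a stricter pattern may be needed in practice.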
Source-Based Filtering
Filter by origin:
{
"excluded_sources": [
"internal-only-wiki",
"draft-documents"
],
"included_sources": [
"public-documentation",
"kb-articles"
]
}
Time-Based Filtering
Filter by freshness:
{
"age_limit_days": 365,
"exclude_before": "2023-01-01",
"only_updated_after": "2024-01-01"
}
Optimization Techniques
Embedding Optimization
Prepare content for optimal embeddings:
- Chunk Size: Size chunks to suit the embedding model's input limit
- Context Addition: Add titles/headings to chunks
- Metadata Inclusion: Include key metadata in embedded text
Original chunk: "Click the Save button to save your changes."
Optimized for embedding:
"[Configuration Settings > Saving Changes]
Click the Save button to save your changes to your account configuration."
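Adding heading context to a chunk before embedding can be as simple as prefixing the heading path. The sketch below uses a hypothetical function name and the bracketed `A > B` format from the example above; it does not rewrite the chunk text itself.

```python
def contextualize_chunk(chunk: str, heading_path: list) -> str:
    """Prefix a chunk with its heading path so the embedding captures its context."""
    if not heading_path:
        return chunk
    return f"[{' > '.join(heading_path)}]\n{chunk}"
```

Storing the original chunk alongside the contextualized text lets you embed one and display the other.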
Query Matching Optimization
Enhance content for better query matching:
- Question Format: Add question-style text
- Keyword Enrichment: Include relevant keywords
- Synonym Addition: Add alternative terms
Original: "Authentication requires an API key."
Optimized: "Authentication requires an API key. How to authenticate:
You need an API key to authenticate with the service.
Also known as: login, authorization, access token."
Hierarchical Structuring
Create parent-child relationships:
{
"parent_chunk": {
"id": "chunk_parent_123",
"content": "... [entire section] ...",
"type": "context"
},
"child_chunks": [
{
"id": "chunk_child_124",
"content": "... [specific subsection] ...",
"type": "retrievable"
}
]
}
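One way to build this structure is to keep the whole section as the parent and split it into paragraphs for the children. In the sketch below, the function name, the blank-line split, the ID scheme, and the extra `parent_id` field are assumptions added for illustration.

```python
def build_hierarchy(section_id: str, section_text: str) -> dict:
    """Create one context-only parent chunk and one retrievable child per paragraph."""
    parent = {
        "id": f"chunk_parent_{section_id}",
        "content": section_text,
        "type": "context",
    }
    paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
    children = [
        {
            "id": f"chunk_child_{section_id}_{index}",
            "content": paragraph,
            "type": "retrievable",
            # Assumed field: lets retrieval fetch the parent for extra context.
            "parent_id": parent["id"],
        }
        for index, paragraph in enumerate(paragraphs, start=1)
    ]
    return {"parent_chunk": parent, "child_chunks": children}
```

At query time, the small child chunks are matched and the parent is supplied to the model when more surrounding context is needed.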
Implementation Workflow
1. Data Ingestion
Raw Data → Parse → Extract Text → Initial Validation
2. Cleaning Pipeline
Raw Text → Remove Noise → Normalize → Fix Encoding → Clean HTML
3. Enrichment Pipeline
Clean Text → Extract Entities → Classify → Add Metadata → Generate Embeddings
4. Optimization Pipeline
Enriched Data → Deduplicate → Validate → Optimize → Index
5. Quality Assurance
Indexed Data → Sample Testing → Quality Metrics → Manual Review → Deployment
Best Practices
Processing Order
- Clean First: Remove noise before analysis
- Extract Then Enrich: Get base data before adding metadata
- Validate Throughout: Check quality at each stage
- Optimize Last: Final tuning after core processing
Idempotency
Ensure operations can be safely repeated:
- Same input → Same output
- Track processing versions
- Enable reprocessing
- Maintain audit trails
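Idempotent reprocessing is often implemented by keying results on the input plus a pipeline version. The sketch below is one possible approach; `PIPELINE_VERSION`, both function names, and the in-memory dictionary standing in for a persistent store are assumptions.

```python
import hashlib

PIPELINE_VERSION = "1"  # Hypothetical; bump this when transformation rules change.


def processing_key(raw_text: str, version: str = PIPELINE_VERSION) -> str:
    """Derive a stable key from the pipeline version and the raw input."""
    return hashlib.sha256(f"{version}:{raw_text}".encode("utf-8")).hexdigest()


def process_once(raw_text: str, transform, cache: dict) -> str:
    """Run transform only if this input has not been processed at this version."""
    key = processing_key(raw_text)
    if key not in cache:
        cache[key] = transform(raw_text)
    return cache[key]
```

Because the version is part of the key, changing the rules triggers reprocessing automatically while repeated runs at the same version reuse stored results.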
Scalability
Design for large-scale processing:
- Batch processing for efficiency
- Parallel processing where possible
- Incremental updates
- Efficient storage formats
Monitoring
Track manipulation effectiveness:
{
"processing_metrics": {
"documents_processed": 1000,
"success_rate": 0.98,
"avg_processing_time_ms": 150,
"errors": 20,
"warnings": 45
}
}
Common Pitfalls
Over-Processing
- Problem: Stacking too many transformations can distort or strip the original meaning
- Solution: Keep transformations minimal and reversible
Metadata Bloat
- Problem: Excessive metadata slows retrieval
- Solution: Focus on useful, frequently-filtered metadata
Loss of Context
- Problem: Aggressive cleaning removes important information
- Solution: Preserve key structural and contextual elements
Inconsistent Processing
- Problem: Different rules for different sources
- Solution: Standardize processing pipelines
Tools and Libraries
Text Processing
- NLTK: Natural language processing
- spaCy: Industrial-strength NLP
- Beautiful Soup: HTML parsing
- Pandas: Data manipulation
Data Cleaning
- ftfy: Fix text encoding
- unidecode: ASCII transliteration
- langdetect: Language detection
Metadata Extraction
- pdfplumber: PDF extraction
- python-docx: Word document parsing
- python-magic: File type detection
Next Steps
- Chunking Strategies - Optimize how data is split
- Synthetic Data - Enhance with generated content
- Data Sources - Learn about data ingestion
Last updated January 26, 2026


