How to Set Up Airbyte Agents for Multi-Source Data Orchestration in 2025
Data engineers and backend developers increasingly face a complex problem: orchestrating data flows across multiple disconnected sources without context loss. Traditional ETL pipelines treat each source independently, creating gaps when you need intelligent decisions about which data to sync, when to sync it, and how to handle conflicts.
Airbyte Agents solve this by providing context-aware orchestration—agents that understand your entire data landscape and make intelligent routing decisions. This guide walks you through setting up Airbyte Agents for real multi-source scenarios.
Why Airbyte Agents Matter for Multi-Source Pipelines
Unlike standard Airbyte connectors that follow static configuration, Airbyte Agents are intelligent components that:
- Maintain cross-source context: Agents track state and relationships across all connected sources
- Make adaptive decisions: They determine optimal sync timing, batching, and retry strategies based on source health
- Handle dependencies: Automatically order syncs when downstream sources depend on upstream data completion
- Reduce manual intervention: Context awareness handles many edge cases that would otherwise require manual pipeline management
This is particularly valuable when integrating SaaS APIs (Salesforce, HubSpot), databases (PostgreSQL, MongoDB), and warehouses (Snowflake, BigQuery) simultaneously.
Prerequisites
Before setting up Airbyte Agents:
- Airbyte Cloud or self-hosted instance (v0.50+)
- At least 2 configured data sources (connectors already set up)
- Basic familiarity with JSON-based configuration
- Access to your data warehouse or destination
- API credentials for your sources (ready to authenticate)
Step 1: Enable Agent Mode in Your Airbyte Workspace
Agent functionality isn't enabled by default. Access your Airbyte workspace settings:
- Navigate to Settings → Advanced
- Toggle "Enable Experimental Agents" (available in Airbyte v0.50+)
- Confirm workspace reload
Once enabled, you'll see a new "Agents" tab in the left sidebar alongside Connections and Sources.
Step 2: Define Agent Context Schema
Agents need to understand relationships between your sources. Create a context schema that maps:
{
"sources": [
{
"id": "salesforce-crm",
"type": "salesforce",
"priority": "high",
"refresh_interval_hours": 4,
"depends_on": []
},
{
"id": "postgres-transactions",
"type": "postgres",
"priority": "critical",
"refresh_interval_hours": 1,
"depends_on": []
},
{
"id": "hubspot-contacts",
"type": "hubspot",
"priority": "medium",
"refresh_interval_hours": 6,
"depends_on": ["salesforce-crm"]
}
],
"join_conditions": [
{
"left_source": "salesforce-crm",
"right_source": "hubspot-contacts",
"left_key": "email",
"right_key": "email"
}
],
"conflict_resolution": "timestamp"
}
This configuration tells Airbyte Agents:
- HubSpot syncs only after Salesforce completes
- Records are joined on email fields
- When conflicts occur, the most recent timestamp wins
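The ordering the agent derives from `depends_on` amounts to a topological sort of the source graph. As a mental model (not Airbyte internals — the `sync_order` helper is purely illustrative), here is a minimal Python sketch using Kahn's algorithm on the three sources from the schema above:

```python
from collections import deque

def sync_order(sources):
    """Return source ids in an order that respects depends_on (Kahn's algorithm)."""
    deps = {s["id"]: set(s["depends_on"]) for s in sources}
    # dependents[x] = sources that must wait for x to finish
    dependents = {sid: [] for sid in deps}
    for sid, ds in deps.items():
        for d in ds:
            dependents[d].append(sid)
    ready = deque(sid for sid, ds in deps.items() if not ds)
    order = []
    while ready:
        sid = ready.popleft()
        order.append(sid)
        for dep in dependents[sid]:
            deps[dep].discard(sid)
            if not deps[dep]:
                ready.append(dep)
    if len(order) != len(deps):
        raise ValueError("Cycle detected in depends_on graph")
    return order

sources = [
    {"id": "salesforce-crm", "depends_on": []},
    {"id": "postgres-transactions", "depends_on": []},
    {"id": "hubspot-contacts", "depends_on": ["salesforce-crm"]},
]
print(sync_order(sources))
```

Note that a cycle (two sources depending on each other) makes ordering impossible, which is why the sketch raises rather than returning a partial order.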
Step 3: Create Agent Configurations
Navigate to Agents → Create New Agent:
- Name: `multi-source-daily-sync`
- Mode: Select "Multi-Source Orchestrator"
- Paste your context schema from Step 2
- Assign connectors: Select your Salesforce, PostgreSQL, and HubSpot connections
- Set failure behavior: Choose "pause-dependent-sources" (recommended for production)
The agent now understands your source topology and can optimize sync order.
Step 4: Configure Intelligence Rules
Intelligence rules allow agents to make decisions beyond static scheduling:
| Rule Type | Use Case | Example |
|-----------|----------|---------|
| Source Health Gating | Skip downstream syncs if upstream fails | If Salesforce sync <90% successful in last 24h, pause HubSpot |
| Volume-Based Throttling | Adjust sync timing based on data size | Trigger HubSpot sync only if Salesforce extracted >1000 records |
| Time-Window Optimization | Sync only during low-traffic periods | PostgreSQL syncs only between 2-4 AM UTC |
| Dedupe Detection | Prevent redundant syncs | Skip sync if previous one completed <30 min ago |
Add these rules in the Intelligence tab:
{
"rules": [
{
"name": "salesforce-health-gate",
"condition": "source_success_rate < 0.9",
"action": "skip_dependent_sources",
"lookback_hours": 24
},
{
"name": "postgres-volume-trigger",
"condition": "postgres_extracted_records > 1000",
"action": "trigger_hubspot_sync",
"cooldown_minutes": 30
}
]
}
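Conceptually, each rule is a predicate over recent sync metrics plus an action. A hedged sketch of how the health gate above might evaluate (the metric shape and the evaluator function are assumptions for illustration, not Airbyte internals):

```python
def evaluate_health_gate(metrics, threshold=0.9):
    """Mirror the salesforce-health-gate rule: skip dependent sources when
    the upstream success rate over the lookback window drops below threshold."""
    attempts = metrics["succeeded"] + metrics["failed"]
    rate = metrics["succeeded"] / attempts if attempts else 0.0
    return "skip_dependent_sources" if rate < threshold else "proceed"

# 24h lookback: 8 successful syncs, 2 failures -> 80% success rate, below 90%
print(evaluate_health_gate({"succeeded": 8, "failed": 2}))  # skip_dependent_sources
```

The zero-attempts case deliberately fails closed: with no recent evidence the upstream is healthy, dependents stay paused.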
Step 5: Set Up Monitoring and Error Handling
Agents generate detailed execution logs. Configure alerts:
- Go to Agents → Monitoring
- Enable "Context Loss Detection" - alerts when agent loses source state
- Set "Retry Policy" - exponential backoff starting at 30 seconds
- Configure "Webhook Notifications" for Slack integration (paste your Slack webhook URL)
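To make the retry policy concrete, exponential backoff starting at 30 seconds produces a delay schedule like the one below (the doubling factor and cap are illustrative assumptions; tune them to your sources):

```python
def backoff_schedule(base_seconds=30, retries=5, factor=2, cap=600):
    """Exponential backoff delays starting at base_seconds, capped at cap."""
    return [min(base_seconds * factor**i, cap) for i in range(retries)]

print(backoff_schedule())  # [30, 60, 120, 240, 480]
```

The cap keeps a long outage from pushing individual waits past ten minutes while still spacing retries out quickly.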
For production, integrate with your observability stack:
# Example: Ship agent logs to DataDog
export AIRBYTE_AGENT_LOG_DESTINATION="datadog"
export DATADOG_API_KEY="your-key-here"
export DATADOG_SITE="us3.datadoghq.com"
Step 6: Test Multi-Source Sync Behavior
Before running in production, validate agent behavior:
- Trigger manual sync: Click "Test Run" in agent details
- Verify sync order: Check that HubSpot waits for Salesforce completion
- Inspect logs: Review Execution Timeline to confirm context was maintained
- Check data quality: Validate joined records in your warehouse
Look for this pattern in logs:
[Agent] Starting sync for source: salesforce-crm (priority: high)
[Agent] Waiting for completion...
[Agent] Salesforce sync completed: 5,234 records extracted
[Agent] Dependency satisfied: hubspot-contacts cleared to sync
[Agent] Starting sync for source: hubspot-contacts (priority: medium)
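For the data-quality step, one simple signal is the share of HubSpot contacts whose join key matched a Salesforce record. A standalone sketch you could adapt to the results of a warehouse query (the function and sample inputs are illustrative, not part of Airbyte):

```python
def join_match_rate(left_keys, right_keys):
    """Fraction of right-side join keys (e.g. HubSpot emails) that found a
    matching left-side key (e.g. Salesforce emails), case-insensitively."""
    left = {k.strip().lower() for k in left_keys if k}
    right = [k.strip().lower() for k in right_keys if k]
    if not right:
        return 0.0
    return sum(1 for k in right if k in left) / len(right)

sf = ["a@x.com", "b@x.com", "c@x.com"]
hs = ["A@x.com", "d@x.com"]
print(join_match_rate(sf, hs))  # 0.5
```

A match rate well below expectations usually points at inconsistent key formatting (casing, whitespace) rather than genuinely missing records.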
Common Pitfalls and Solutions
Problem: Agent syncs all sources simultaneously instead of respecting dependencies.
- Solution: Verify the `depends_on` array in your context schema is properly formatted. Agents don't infer dependencies automatically.
Problem: "Context Loss" alerts trigger frequently.
- Solution: Increase agent timeout values in Advanced Settings. Default 5-minute timeout may be insufficient for slow sources like Salesforce.
Problem: Join conditions fail silently.
- Solution: Validate that join keys (email, ID, etc.) actually exist in both source schemas. Use `describe_table` queries to confirm field names.
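That check can also be done programmatically once you have column lists for each source (how you fetch them varies by connector; the helper and sample schemas below are illustrative):

```python
def missing_join_keys(join_conditions, schemas):
    """Report (source, key) pairs where a join key is absent from that
    source's column list."""
    problems = []
    for jc in join_conditions:
        if jc["left_key"] not in schemas.get(jc["left_source"], []):
            problems.append((jc["left_source"], jc["left_key"]))
        if jc["right_key"] not in schemas.get(jc["right_source"], []):
            problems.append((jc["right_source"], jc["right_key"]))
    return problems

schemas = {
    "salesforce-crm": ["id", "email", "updated_at"],
    "hubspot-contacts": ["contact_id", "email_address"],
}
joins = [{"left_source": "salesforce-crm", "right_source": "hubspot-contacts",
          "left_key": "email", "right_key": "email"}]
print(missing_join_keys(joins, schemas))  # [('hubspot-contacts', 'email')]
```

Here the check surfaces exactly the silent-failure case above: HubSpot exposes `email_address`, not `email`, so the join condition needs updating.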
Production Checklist
- [ ] Agent schema tested with at least 3 sources
- [ ] Webhook notifications configured and tested
- [ ] Failure scenarios (source down, network timeout) tested manually
- [ ] Rollback plan documented (how to revert to individual connections)
- [ ] Monitoring dashboards created for agent performance metrics
- [ ] Data quality validation queries written for joined datasets
- [ ] Team training completed on agent logs and troubleshooting
Next Steps
Once stable, expand your agent setup:
- Add more sources: Scale to 5+ sources; agents maintain performance
- Custom transformation rules: Add dbt-based transformations between sources
- Predictive scheduling: Use historical sync patterns for ML-optimized timing
- Cost optimization: Agent insights can identify which syncs could run less frequently
Airbyte Agents transform multi-source orchestration from a manual scripting problem into an intelligent, context-aware system. Start small with 2-3 sources, validate the dependency model, then expand confidently.