Reverse ETL
The warehouse has your best data — joined, modeled, scored — but it's trapped in dashboards. Reverse ETL syncs that modeled data back OUT to the operational tools where work happens: Salesforce, Slack, ad platforms. Operational analytics, idempotent upserts, and where it fits vs ETL and CDC.
What you'll learn
- Why the warehouse's best data is useless if it's stuck in a BI dashboard
- What reverse ETL is — syncing modeled tables back out to operational SaaS tools
- Why every sync is an idempotent, incremental upsert keyed by an external id
- How it differs from ETL/ELT and CDC — direction, target, and source of truth
- The hard parts — API rate limits, identity mapping, and sync observability
Before you start
Everything so far has pointed one direction: ETL, ELT, and CDC all move data into the warehouse, where you model it, version it, and store it columnar. And there it sits — joined across every source, enriched with computed scores and segments, more complete than any single app’s database. The problem is who can act on it. A sales rep lives in Salesforce. A support agent lives in Zendesk. A marketer lives in the ad platforms. None of them work in your BI dashboard. Reverse ETL is the last mile: it syncs modeled data back out of the warehouse into the operational tools where people actually do their jobs.
The “last mile” problem
You build a beautiful churn-risk model. It joins product usage, support
tickets, and billing into a single health_score per account — exactly
what a Customer Success Manager needs. But it’s a column in a warehouse
table. The CSM never opens the warehouse; they open Salesforce. So the
score sits unused, and the company keeps churning customers it could have
saved. This gap — great data on one side, the people who’d act on it on the
other — is what reverse ETL (also called operational analytics or
data activation) closes.
How a sync works
The shape is the mirror image of ETL. The warehouse is the source; a SaaS tool’s API is the destination. You:
- Write a
SELECTagainst a modeled table (often a “gold” table) — the rows and columns you want to push. - Map warehouse columns to the destination’s fields
(
health→Health_Score__cin Salesforce). - Sync — but as an idempotent upsert keyed by an external id, so re-running never duplicates; it overwrites the matching record in place.
- Run incrementally — push only rows that changed since the last sync, to respect API rate limits.
Pick a destination and watch the same modeled rows take different operational shapes:
Sync modeled data back out to the tools teams use
The warehouse is the source now. Pick a destination — the field mapping changes — and notice every sync is an idempotent upsert, keyed by an external id, so re-running it is safe.
| name | health | segment | mrr |
|---|---|---|---|
| Acme● | 38 | at-risk | 900 |
| Globex | 82 | healthy | 4000 |
| Initech● | 55 | neutral | 700 |
| Umbrella | 91 | healthy | 1200 |
Syncing 2 records (only rows that changed since last run) to Salesforce. Each row upserts by External_Id__c, so a re-run overwrites in place rather than duplicating.
Notice three things in that widget. The mapping changes per destination
(Salesforce wants Account fields; Slack wants a templated message). The
sync is an upsert, not a dump — keyed by External_Id__c, so it’s safe
to re-run. And incremental mode pushes only the changed rows, which is
how you stay under an API’s request budget.
How it differs from ETL and CDC
| ETL / ELT / CDC | Reverse ETL | |
|---|---|---|
| Direction | sources → warehouse | warehouse → SaaS tools |
| Destination | a database/warehouse you control | a third-party API with rate limits |
| Source of truth | the operational apps | the warehouse model |
| Unit of work | load/transform tables | upsert records into an object |
That “destination is an API” row is the one that makes reverse ETL its own
discipline. You can’t just INSERT into Salesforce — you call its REST API,
which has rate limits, required fields, validation rules, and its own
notion of identity. The dedicated tools (Hightouch, Census, and
increasingly warehouse-native features) exist to handle exactly that:
batching, retries, field mapping, and identity resolution against dozens of
finicky APIs.
The hard parts
Reverse ETL’s difficulties come from the destination, not the SQL. Rate limits force batching and incremental syncs. Identity mapping is fiddly — which warehouse row is this Salesforce account? (hence the external-id key). Idempotency is non-negotiable because syncs retry. And observability matters more than in normal ETL: when a sync silently fails, a human downstream is now acting on stale data, so you need per-record success/failure reporting, not just “the job ran.”
Quick check
Quick check
Next
You now have a full data platform: sources stream in via ETL and CDC, get modeled and stored columnar, and flow back out via reverse ETL. But all these jobs have to run — in the right order, on a schedule, recovering from failures. The system that coordinates them is an orchestrator. That’s the final piece, next.
Questions about this lesson
What is reverse ETL?
Reverse ETL syncs modeled data from the warehouse back out to operational SaaS tools — Salesforce, Slack, ad platforms — where people actually work. It is the mirror image of ETL: the warehouse is the source and a third-party API is the destination. It closes the last-mile gap so that scores and segments built in the warehouse reach the point of action instead of staying trapped in dashboards.
How is reverse ETL different from ETL?
ETL and CDC move data from operational sources into the warehouse; reverse ETL moves modeled data from the warehouse out to operational tools. The destination is a rate-limited API rather than a database you control, the warehouse is the source of truth, and the unit of work is an idempotent upsert into a SaaS object keyed by an external id.
What tools are used for reverse ETL?
The dedicated platforms are Hightouch and Census, plus a growing set of warehouse-native features. They exist to handle the hard part — batching, retries, field mapping, and identity resolution against dozens of finicky destination APIs — so a sync is reliable and idempotent rather than a brittle one-off integration.
Practice this in an interview
All questionsETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.
CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.
OLTP (Online Transaction Processing) systems handle high-throughput, low-latency reads and writes for individual records — think order placement, user authentication. OLAP (Online Analytical Processing) systems handle complex aggregations over millions of rows for business intelligence. Running heavy analytics directly on an OLTP database locks rows, competes for I/O, and slows application queries that customers feel.
A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.