datarekha

Reverse ETL

The warehouse has your best data — joined, modeled, scored — but it's trapped in dashboards. Reverse ETL syncs that modeled data back OUT to the operational tools where work happens: Salesforce, Slack, ad platforms. Operational analytics, idempotent upserts, and where it fits vs ETL and CDC.

7 min read Beginner SQL Lesson 26 of 27

What you'll learn

  • Why the warehouse's best data is useless if it's stuck in a BI dashboard
  • What reverse ETL is — syncing modeled tables back out to operational SaaS tools
  • Why every sync is an idempotent, incremental upsert keyed by an external id
  • How it differs from ETL/ELT and CDC — direction, target, and source of truth
  • The hard parts — API rate limits, identity mapping, and sync observability

Before you start

Everything so far has pointed one direction: ETL, ELT, and CDC all move data into the warehouse, where you model it, version it, and store it columnar. And there it sits — joined across every source, enriched with computed scores and segments, more complete than any single app’s database. The problem is who can act on it. A sales rep lives in Salesforce. A support agent lives in Zendesk. A marketer lives in the ad platforms. None of them work in your BI dashboard. Reverse ETL is the last mile: it syncs modeled data back out of the warehouse into the operational tools where people actually do their jobs.

The “last mile” problem

You build a beautiful churn-risk model. It joins product usage, support tickets, and billing into a single health_score per account — exactly what a Customer Success Manager needs. But it’s a column in a warehouse table. The CSM never opens the warehouse; they open Salesforce. So the score sits unused, and the company keeps churning customers it could have saved. This gap — great data on one side, the people who’d act on it on the other — is what reverse ETL (also called operational analytics or data activation) closes.

How a sync works

The shape is the mirror image of ETL. The warehouse is the source; a SaaS tool’s API is the destination. You:

  1. Write a SELECT against a modeled table (often a “gold” table) — the rows and columns you want to push.
  2. Map warehouse columns to the destination’s fields (healthHealth_Score__c in Salesforce).
  3. Sync — but as an idempotent upsert keyed by an external id, so re-running never duplicates; it overwrites the matching record in place.
  4. Run incrementally — push only rows that changed since the last sync, to respect API rate limits.

Pick a destination and watch the same modeled rows take different operational shapes:

TryReverse ETL

Sync modeled data back out to the tools teams use

The warehouse is the source now. Pick a destination — the field mapping changes — and notice every sync is an idempotent upsert, keyed by an external id, so re-running it is safe.

warehouse · gold.customerssource
namehealthsegmentmrr
Acme38at-risk900
Globex82healthy4000
Initech55neutral700
Umbrella91healthy1200
Salesforce · Accountupsert External_Id__c
NameAcmeHealth_Score__c38Plan__cProMRR__c900
NameInitechHealth_Score__c55Plan__cProMRR__c700

Syncing 2 records (only rows that changed since last run) to Salesforce. Each row upserts by External_Id__c, so a re-run overwrites in place rather than duplicating.

Notice three things in that widget. The mapping changes per destination (Salesforce wants Account fields; Slack wants a templated message). The sync is an upsert, not a dump — keyed by External_Id__c, so it’s safe to re-run. And incremental mode pushes only the changed rows, which is how you stay under an API’s request budget.

How it differs from ETL and CDC

ETL / ELT / CDCReverse ETL
Directionsources → warehousewarehouse → SaaS tools
Destinationa database/warehouse you controla third-party API with rate limits
Source of truththe operational appsthe warehouse model
Unit of workload/transform tablesupsert records into an object

That “destination is an API” row is the one that makes reverse ETL its own discipline. You can’t just INSERT into Salesforce — you call its REST API, which has rate limits, required fields, validation rules, and its own notion of identity. The dedicated tools (Hightouch, Census, and increasingly warehouse-native features) exist to handle exactly that: batching, retries, field mapping, and identity resolution against dozens of finicky APIs.

The hard parts

Reverse ETL’s difficulties come from the destination, not the SQL. Rate limits force batching and incremental syncs. Identity mapping is fiddly — which warehouse row is this Salesforce account? (hence the external-id key). Idempotency is non-negotiable because syncs retry. And observability matters more than in normal ETL: when a sync silently fails, a human downstream is now acting on stale data, so you need per-record success/failure reporting, not just “the job ran.”

Quick check

Quick check

0/3
Q1What problem does reverse ETL specifically solve?
Q2Why is every reverse-ETL sync designed as an idempotent upsert keyed by an external id, run incrementally?
Q3TRANSFER: A team complains their expensive 'propensity to buy' model has had zero business impact for six months. The model is accurate and runs nightly into a warehouse table. What's the most likely fix?

Next

You now have a full data platform: sources stream in via ETL and CDC, get modeled and stored columnar, and flow back out via reverse ETL. But all these jobs have to run — in the right order, on a schedule, recovering from failures. The system that coordinates them is an orchestrator. That’s the final piece, next.

FAQCommon questions

Questions about this lesson

What is reverse ETL?

Reverse ETL syncs modeled data from the warehouse back out to operational SaaS tools — Salesforce, Slack, ad platforms — where people actually work. It is the mirror image of ETL: the warehouse is the source and a third-party API is the destination. It closes the last-mile gap so that scores and segments built in the warehouse reach the point of action instead of staying trapped in dashboards.

How is reverse ETL different from ETL?

ETL and CDC move data from operational sources into the warehouse; reverse ETL moves modeled data from the warehouse out to operational tools. The destination is a rate-limited API rather than a database you control, the warehouse is the source of truth, and the unit of work is an idempotent upsert into a SaaS object keyed by an external id.

What tools are used for reverse ETL?

The dedicated platforms are Hightouch and Census, plus a growing set of warehouse-native features. They exist to handle the hard part — batching, retries, field mapping, and identity resolution against dozens of finicky destination APIs — so a sync is reliable and idempotent rather than a brittle one-off integration.

Practice this in an interview

All questions
What is the difference between ETL and ELT, and when should you choose each?

ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.

What is Change Data Capture (CDC) and how is it implemented?

CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream — enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's write-ahead log (WAL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.

What is the difference between OLTP and OLAP systems, and why can't you run analytics on your production database?

OLTP (Online Transaction Processing) systems handle high-throughput, low-latency reads and writes for individual records — think order placement, user authentication. OLAP (Online Analytical Processing) systems handle complex aggregations over millions of rows for business intelligence. Running heavy analytics directly on an OLTP database locks rows, competes for I/O, and slows application queries that customers feel.

What are the differences between a data warehouse, a data lake, and a data lakehouse?

A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content