ETL vs ELT
Why the modern data stack moved the Transform step AFTER the Load — out of a separate engine and into cheap, elastic cloud-warehouse compute that analysts can own with SQL.
What you'll learn
- Why ETL transforms BEFORE loading and ELT transforms AFTER, in the warehouse
- The one architectural change — decoupling storage from compute — that made ELT viable
- Who does what: Fivetran/Airbyte do EL, dbt does the T, Snowflake/BigQuery/Redshift are the warehouse
- Why ELT won analytics: elastic compute, retained raw data, SQL transforms analysts can version
- When ETL is still correct: PII masked before landing, heavy non-SQL transforms, low-latency streaming
Before you start
Three letters, two orderings. ETL and ELT use the same three steps — Extract, Load, Transform — and differ only in where the T sits. It sounds like a trivia question. It is actually the single biggest architectural shift in analytics of the last decade: the move from transform-before-load on a dedicated engine to load-then-transform inside the warehouse. Get this one idea and the whole “modern data stack” — Fivetran, dbt, Snowflake — snaps into focus.
The letters, and what actually moves
Both patterns extract data from sources (a Postgres database, a Stripe API, a pile of CSVs) and both end with transformed, query-ready tables. The difference is the order of the last two steps — and that reorder quietly changes where compute runs, whether raw data survives, and who owns the transform.
A common trap is to read ELT as “ETL with the letters shuffled, basically the same thing.” It is not. Moving the T after the L changes three things at once, and the next section is where it clicks — so play with the widget before you read on.
See it: the same pipeline, two orderings
Why ETL came first
ETL was not a mistake — it was the right answer for its era. On-premise data warehouses had fixed, scarce compute and expensive storage. The warehouse was tuned to serve queries, not to grind through heavy data cleaning. So you bought a separate box — an ETL server running Informatica PowerCenter, Microsoft SSIS, Talend, or IBM DataStage — and did the heavy lifting there, before loading.
That ordering forces schema-on-write: data is validated, cleaned, and conformed to the target schema before it lands. A nice property falls out of this — a breaking schema change fails loudly at the transform step, so garbage never reaches the warehouse. The cost is steep: the tools were proprietary and GUI-driven, only specialized ETL developers could change a pipeline, and the raw data was thrown away after transforming. Want to re-derive a metric a new way next quarter? You re-extract from the source. The warehouse only ever held the one shape someone decided on up front.
The one change that flipped everything
ELT did not win because someone preferred a different acronym. It won because of a single architectural shift in cloud warehouses: storage and compute were decoupled.
In the old world, storage and compute were welded together — more data meant a bigger, pricier box whether or not you needed the horsepower. Cloud warehouses separated them, so you can park cheap raw data in object storage and spin up elastic compute on demand to transform it in place, then spin that compute back down. Once warehouse compute is cheap and elastic, the reason ETL existed — “the warehouse can’t handle heavy transforms” — simply evaporates. So you stop pre-processing on a separate engine and just transform inside the warehouse.
This is real architecture, not marketing. Snowflake stores data compressed in micro-partitions on cloud object storage (S3, Azure Blob, or GCS) and runs queries on independent virtual warehouses — compute clusters you can resize, pause, and run concurrently against the same data; pausing one does not touch data availability. Google BigQuery is serverless: storage (Colossus) and compute (the Dremel query engine) scale independently, connected by Google’s Jupiter network, with no cluster to manage. One honest nuance: Amazon Redshift did not always work this way — its original DC/DS nodes tightly coupled compute and storage, and decoupling arrived only later with RA3 nodes backed by Redshift Managed Storage (local SSD as a hot cache, S3 underneath). So “cloud warehouses always separated storage and compute” is a half-truth — Snowflake and BigQuery did by design; Redshift caught up.
Who does what in the modern stack
Here is where people misattribute tools, so be precise. The modern ELT stack splits the three letters across three categories of tool:
| Layer | Tools | What they actually do |
|---|---|---|
| EL (Extract + Load) | Fivetran, Airbyte | Replicate data from sources into the warehouse. They do not do business-logic transforms. |
| T (Transform) | dbt | Transforms data already in the warehouse with SQL. It does not move data or connect to sources. |
| Warehouse | Snowflake, BigQuery, Redshift, Databricks | The elastic engine where the load lands and the transform runs. |
Fivetran and Airbyte are connectors. Fivetran is the managed option — hundreds of pre-built connectors, most using Change Data Capture for low-impact incremental syncs, auto-maintained as source APIs drift. Airbyte is the open-source (open-core) option — hundreds of replication connectors, a no-code Connector Builder, and a low-code CDK. Either way, their job ends the moment raw data lands in the warehouse.
dbt is the Transform layer, and it is the reason analysts took
ownership of the T. dbt transforms are just SQL SELECT statements,
version-controlled in Git, organized as a DAG (commonly
staging → intermediate → marts), with built-in tests, documentation,
and lineage. It executes that SQL directly on the warehouse engine — no
intermediate processing layer — so performance equals hand-written SQL.
A subtle but important misconception to kill: dbt does not extract or
load anything, and ELT does not skip the transform. The T did not
disappear; it moved — out of a proprietary engine, into the warehouse,
into SQL an analyst can own.
So the slogan: Fivetran/Airbyte do the EL, dbt does the T, the warehouse runs it all.
Why ELT won analytics
Pull the threads together:
- Elastic, cheap warehouse compute makes in-warehouse transforms practical — the original reason for a separate engine is gone.
- Raw data is retained. When requirements change, you re-transform from the raw layer instead of re-extracting from source. The warehouse remembers everything; you can always re-derive a metric.
- SQL transforms can be owned by analysts. Version-controlled, tested, documented (dbt) — software-engineering discipline applied to analytics, by the people closest to the business questions.
- Simpler pipelines. EL is commoditized down to connectors; the T is decoupled and lives in one place.
ELT is the prevailing default for cloud analytics — and the stack is consolidating. Fivetran and dbt Labs completed an all-stock merger (agreement signed October 2025), folding EL ingestion and the Transform layer into a single vendor that the companies position as “the data infrastructure layer for trusted AI agents” (they report serving 100,000+ data teams — a company figure, not an audited one). On the same day, dbt Core v2.0 shipped as an early alpha under Apache 2.0, open-sourcing much of the Rust-based Fusion engine and unifying dbt Core and Fusion on one runtime. Separately, Fivetran acquired the reverse-ETL vendor Census around May 2025. The vendors are merging, but note the mechanism underneath is unchanged: load raw, transform in-warehouse with SQL.
When ETL is still the right answer
ELT is not a law of nature, and ETL is not obsolete. Reach for the old ordering — transform before load — in three cases:
- PII that must be masked or tokenized BEFORE it lands. Under GDPR, HIPAA, PCI-DSS, or CCPA you often cannot let raw sensitive data sit in the warehouse’s raw layer at all. Masking it during ETL keeps it out entirely, and static pre-masking is far cheaper at query time than masking dynamically on every read.
- Heavy non-SQL transformations that warehouse SQL is poorly suited to express.
- Low-latency / real-time streaming, where you transform in flight rather than waiting for data to batch-land. Apache Flink consuming Apache Kafka topics — filter, enrich, join, window with state, output in milliseconds-to-seconds — is the classic shape here.
And be honest about ELT’s one real downside: because raw loads before validation, data-quality problems surface downstream, in the marts or in a dashboard, rather than being filtered at ingestion. That is precisely why dbt embeds tests in the transform layer. In practice, many real teams run a hybrid — pre-mask PII on the way in, then do everything else in-warehouse.
One more distinct pattern worth a name: reverse ETL runs the other direction — it copies modeled data from the warehouse back into operational SaaS tools (Salesforce, HubSpot, Zendesk, ad platforms) so business teams can act on it, a move often called “data activation.” It is not ETL backwards with the same tools; it is its own category (pioneered by Census and Hightouch) and it requires a warehouse as the source.
Quick check
Quick check
The one-line takeaway
ELT did not just reorder three letters. Moving Transform after Load — out of a separate engine and into cheap, elastic warehouse compute — kept the raw data around and handed the transform to analysts writing SQL. That reorder is the modern data stack.
Practice this in an interview
All questionsETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.
A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.
Spark does not execute any computation when you call a transformation — it builds a DAG of logical steps. Only when you call an action does Spark compile that DAG into physical tasks and execute them. This design lets Catalyst optimize the full query plan before touching any data.
OLTP (Online Transaction Processing) systems handle high-throughput, low-latency reads and writes for individual records — think order placement, user authentication. OLAP (Online Analytical Processing) systems handle complex aggregations over millions of rows for business intelligence. Running heavy analytics directly on an OLTP database locks rows, competes for I/O, and slows application queries that customers feel.