datarekha

ETL vs ELT

Why the modern data stack moved the Transform step AFTER the Load — out of a separate engine and into cheap, elastic cloud-warehouse compute that analysts can own with SQL.

8 min read Beginner SQL Lesson 24 of 27

What you'll learn

  • Why ETL transforms BEFORE loading and ELT transforms AFTER, in the warehouse
  • The one architectural change — decoupling storage from compute — that made ELT viable
  • Who does what: Fivetran/Airbyte do EL, dbt does the T, Snowflake/BigQuery/Redshift are the warehouse
  • Why ELT won analytics: elastic compute, retained raw data, SQL transforms analysts can version
  • When ETL is still correct: PII masked before landing, heavy non-SQL transforms, low-latency streaming

Before you start

Three letters, two orderings. ETL and ELT use the same three steps — Extract, Load, Transform — and differ only in where the T sits. It sounds like a trivia question. It is actually the single biggest architectural shift in analytics of the last decade: the move from transform-before-load on a dedicated engine to load-then-transform inside the warehouse. Get this one idea and the whole “modern data stack” — Fivetran, dbt, Snowflake — snaps into focus.

The letters, and what actually moves

Both patterns extract data from sources (a Postgres database, a Stripe API, a pile of CSVs) and both end with transformed, query-ready tables. The difference is the order of the last two steps — and that reorder quietly changes where compute runs, whether raw data survives, and who owns the transform.

ETL — transform before it landsSourceDB • API • filesTransformseparate engineLoadclean rowsWarehouseconformed onlyELT — load raw, transform in placeSourceDB • API • filesLoadraw rowsWarehouse (elastic compute)Raw layerretainedTransformSQL, in place
Same three steps. ETL transforms on a separate engine and loads only clean rows; ELT loads raw into the warehouse and transforms there.

A common trap is to read ELT as “ETL with the letters shuffled, basically the same thing.” It is not. Moving the T after the L changes three things at once, and the next section is where it clicks — so play with the widget before you read on.

See it: the same pipeline, two orderings

Why ETL came first

ETL was not a mistake — it was the right answer for its era. On-premise data warehouses had fixed, scarce compute and expensive storage. The warehouse was tuned to serve queries, not to grind through heavy data cleaning. So you bought a separate box — an ETL server running Informatica PowerCenter, Microsoft SSIS, Talend, or IBM DataStage — and did the heavy lifting there, before loading.

That ordering forces schema-on-write: data is validated, cleaned, and conformed to the target schema before it lands. A nice property falls out of this — a breaking schema change fails loudly at the transform step, so garbage never reaches the warehouse. The cost is steep: the tools were proprietary and GUI-driven, only specialized ETL developers could change a pipeline, and the raw data was thrown away after transforming. Want to re-derive a metric a new way next quarter? You re-extract from the source. The warehouse only ever held the one shape someone decided on up front.

The one change that flipped everything

ELT did not win because someone preferred a different acronym. It won because of a single architectural shift in cloud warehouses: storage and compute were decoupled.

In the old world, storage and compute were welded together — more data meant a bigger, pricier box whether or not you needed the horsepower. Cloud warehouses separated them, so you can park cheap raw data in object storage and spin up elastic compute on demand to transform it in place, then spin that compute back down. Once warehouse compute is cheap and elastic, the reason ETL existed — “the warehouse can’t handle heavy transforms” — simply evaporates. So you stop pre-processing on a separate engine and just transform inside the warehouse.

Coupled (classic)scale storage → you also pay computeStorage + Compute (one box)heavy transforms must runon a separate ETL engine→ ETLDecoupled (cloud)scale each one independentlyStoragecheap object storeComputeelastic, on demandspin up to transform in place,spin down when done→ ELT
Decoupling storage from compute is the load-bearing enabler. Cheap elastic compute makes transforming inside the warehouse practical.

This is real architecture, not marketing. Snowflake stores data compressed in micro-partitions on cloud object storage (S3, Azure Blob, or GCS) and runs queries on independent virtual warehouses — compute clusters you can resize, pause, and run concurrently against the same data; pausing one does not touch data availability. Google BigQuery is serverless: storage (Colossus) and compute (the Dremel query engine) scale independently, connected by Google’s Jupiter network, with no cluster to manage. One honest nuance: Amazon Redshift did not always work this way — its original DC/DS nodes tightly coupled compute and storage, and decoupling arrived only later with RA3 nodes backed by Redshift Managed Storage (local SSD as a hot cache, S3 underneath). So “cloud warehouses always separated storage and compute” is a half-truth — Snowflake and BigQuery did by design; Redshift caught up.

Who does what in the modern stack

Here is where people misattribute tools, so be precise. The modern ELT stack splits the three letters across three categories of tool:

LayerToolsWhat they actually do
EL (Extract + Load)Fivetran, AirbyteReplicate data from sources into the warehouse. They do not do business-logic transforms.
T (Transform)dbtTransforms data already in the warehouse with SQL. It does not move data or connect to sources.
WarehouseSnowflake, BigQuery, Redshift, DatabricksThe elastic engine where the load lands and the transform runs.

Fivetran and Airbyte are connectors. Fivetran is the managed option — hundreds of pre-built connectors, most using Change Data Capture for low-impact incremental syncs, auto-maintained as source APIs drift. Airbyte is the open-source (open-core) option — hundreds of replication connectors, a no-code Connector Builder, and a low-code CDK. Either way, their job ends the moment raw data lands in the warehouse.

dbt is the Transform layer, and it is the reason analysts took ownership of the T. dbt transforms are just SQL SELECT statements, version-controlled in Git, organized as a DAG (commonly staging → intermediate → marts), with built-in tests, documentation, and lineage. It executes that SQL directly on the warehouse engine — no intermediate processing layer — so performance equals hand-written SQL. A subtle but important misconception to kill: dbt does not extract or load anything, and ELT does not skip the transform. The T did not disappear; it moved — out of a proprietary engine, into the warehouse, into SQL an analyst can own.

So the slogan: Fivetran/Airbyte do the EL, dbt does the T, the warehouse runs it all.

Why ELT won analytics

Pull the threads together:

  • Elastic, cheap warehouse compute makes in-warehouse transforms practical — the original reason for a separate engine is gone.
  • Raw data is retained. When requirements change, you re-transform from the raw layer instead of re-extracting from source. The warehouse remembers everything; you can always re-derive a metric.
  • SQL transforms can be owned by analysts. Version-controlled, tested, documented (dbt) — software-engineering discipline applied to analytics, by the people closest to the business questions.
  • Simpler pipelines. EL is commoditized down to connectors; the T is decoupled and lives in one place.

ELT is the prevailing default for cloud analytics — and the stack is consolidating. Fivetran and dbt Labs completed an all-stock merger (agreement signed October 2025), folding EL ingestion and the Transform layer into a single vendor that the companies position as “the data infrastructure layer for trusted AI agents” (they report serving 100,000+ data teams — a company figure, not an audited one). On the same day, dbt Core v2.0 shipped as an early alpha under Apache 2.0, open-sourcing much of the Rust-based Fusion engine and unifying dbt Core and Fusion on one runtime. Separately, Fivetran acquired the reverse-ETL vendor Census around May 2025. The vendors are merging, but note the mechanism underneath is unchanged: load raw, transform in-warehouse with SQL.

When ETL is still the right answer

ELT is not a law of nature, and ETL is not obsolete. Reach for the old ordering — transform before load — in three cases:

  1. PII that must be masked or tokenized BEFORE it lands. Under GDPR, HIPAA, PCI-DSS, or CCPA you often cannot let raw sensitive data sit in the warehouse’s raw layer at all. Masking it during ETL keeps it out entirely, and static pre-masking is far cheaper at query time than masking dynamically on every read.
  2. Heavy non-SQL transformations that warehouse SQL is poorly suited to express.
  3. Low-latency / real-time streaming, where you transform in flight rather than waiting for data to batch-land. Apache Flink consuming Apache Kafka topics — filter, enrich, join, window with state, output in milliseconds-to-seconds — is the classic shape here.

And be honest about ELT’s one real downside: because raw loads before validation, data-quality problems surface downstream, in the marts or in a dashboard, rather than being filtered at ingestion. That is precisely why dbt embeds tests in the transform layer. In practice, many real teams run a hybrid — pre-mask PII on the way in, then do everything else in-warehouse.

One more distinct pattern worth a name: reverse ETL runs the other direction — it copies modeled data from the warehouse back into operational SaaS tools (Salesforce, HubSpot, Zendesk, ad platforms) so business teams can act on it, a move often called “data activation.” It is not ETL backwards with the same tools; it is its own category (pioneered by Census and Hightouch) and it requires a warehouse as the source.

Quick check

Quick check

0/3
Q1In ELT, where does the Transform step run, and what makes that practical?
Q2A teammate says 'Fivetran transforms our data and dbt loads it into Snowflake.' What's wrong?
Q3A healthcare startup must tokenize patient identifiers so raw PII NEVER sits in the warehouse, for HIPAA. Pure ELT or pre-load ETL — and why?

The one-line takeaway

ELT did not just reorder three letters. Moving Transform after Load — out of a separate engine and into cheap, elastic warehouse compute — kept the raw data around and handed the transform to analysts writing SQL. That reorder is the modern data stack.

Practice this in an interview

All questions
What is the difference between ETL and ELT, and when should you choose each?

ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.

What are the differences between a data warehouse, a data lake, and a data lakehouse?

A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but lacks ACID transactions and query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs — it is the dominant architecture in 2026.

What is lazy evaluation in Spark, and how does it differ from transformations vs actions?

Spark does not execute any computation when you call a transformation — it builds a DAG of logical steps. Only when you call an action does Spark compile that DAG into physical tasks and execute them. This design lets Catalyst optimize the full query plan before touching any data.

What is the difference between OLTP and OLAP systems, and why can't you run analytics on your production database?

OLTP (Online Transaction Processing) systems handle high-throughput, low-latency reads and writes for individual records — think order placement, user authentication. OLAP (Online Analytical Processing) systems handle complex aggregations over millions of rows for business intelligence. Running heavy analytics directly on an OLTP database locks rows, competes for I/O, and slows application queries that customers feel.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content