How does DVC differ from a feature store, and when would you reach for each?

DVC and lakeFS version raw datasets and model artifacts as immutable snapshots, giving reproducibility and rollback: DVC ties each snapshot to a Git commit through a small pointer file, while lakeFS keeps its own Git-like commits over object storage. A feature store manages computed features for training and serving; its main job is to keep offline and online feature definitions in sync to prevent training-serving skew. They are complementary: DVC answers "what data made this model?", while a feature store answers "how do I serve the same features consistently?"

Why isn't a git commit enough to reproduce an ML training run?

A git commit captures code, but an ML run also depends on the exact training data, hyperparameters, environment, and randomness, and a commit pins none of these unless you deliberately record them. Datasets are too large for Git and change independently of code, so you need a data-versioning tool like DVC or lakeFS to pin a content hash of the data to the commit. Full reproducibility means versioning code, data, config, environment, and seeds together and linking them.

How do you achieve reproducibility in ML training pipelines — covering seeds, environment, and data versioning?

Full ML reproducibility requires locking three layers: the random seed across all frameworks, the software environment via pinned dependency manifests or container images, and the training data via content-addressed versioning. Missing any one layer means the same code can produce different models on different runs or machines.
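
As a minimal sketch of the first layer, the helper below seeds Python's built-in random module and, only if they happen to be installed, NumPy and PyTorch. The names seed_everything and sample_split are hypothetical helpers, not part of any library, and seeding alone does not guarantee bit-identical results on every GPU or library version.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the random number generators this process is likely to use.

    Hypothetical helper for illustration; NumPy and PyTorch are optional.
    """
    # PYTHONHASHSEED is read at interpreter startup, so setting it here
    # only affects subprocesses launched after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def sample_split(seed: int, n_rows: int = 10, n_val: int = 3) -> list[int]:
    """Pick validation row indices; the same seed gives the same split."""
    seed_everything(seed)
    return sorted(random.sample(range(n_rows), n_val))


print(sample_split(42) == sample_split(42))
```

Calling the helper once at the top of a training script, and logging the seed alongside the other run parameters, is usually enough to make data splits and weight initialisation repeatable on the same machine.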

What is a model registry, and how does model versioning work in production ML systems?

A model registry is a centralised store that tracks every trained model artifact alongside its metadata — hyperparameters, training data version, evaluation metrics, and lineage. Versioning assigns unique identifiers to each artifact and manages lifecycle stages so teams can promote, roll back, and audit models without manual file management.
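
The lifecycle idea can be sketched with a toy in-memory registry. This is not the API of MLflow or any other registry product; the class, its methods, the stage names, and the example URIs and metrics are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metadata: dict
    stage: str = "none"  # "none", "production", or "archived"


@dataclass
class ToyRegistry:
    versions: list = field(default_factory=list)
    history: list = field(default_factory=list)  # previously live versions

    def register(self, artifact_uri: str, metadata: dict) -> ModelVersion:
        """Assign the next version number and record lineage metadata."""
        mv = ModelVersion(len(self.versions) + 1, artifact_uri, metadata)
        self.versions.append(mv)
        return mv

    def production(self):
        """Return the version currently marked live, or None."""
        return next((v for v in self.versions if v.stage == "production"), None)

    def promote(self, version: int) -> None:
        """Make `version` live and archive whichever version was live."""
        current = self.production()
        if current is not None and current.version != version:
            current.stage = "archived"
            self.history.append(current.version)
        self.versions[version - 1].stage = "production"

    def rollback(self) -> None:
        """Swap the live version for the one that was live before it."""
        if not self.history:
            raise RuntimeError("no earlier production version to roll back to")
        previous = self.history.pop()
        current = self.production()
        if current is not None:
            current.stage = "archived"
        self.versions[previous - 1].stage = "production"


# Made-up artifacts and metrics, purely for illustration.
registry = ToyRegistry()
registry.register("s3://models/churn/1", {"data_version": "data-v1", "auc": 0.81})
registry.register("s3://models/churn/2", {"data_version": "data-v2", "auc": 0.84})
registry.promote(1)
registry.promote(2)
registry.rollback()
print(registry.production().version)
```

The point of the sketch is that promotion and rollback only change a stage label; the artifacts and their metadata stay immutable, which is what makes the audit trail usable.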

Data & model versioning — MLOps

The last lesson left us holding a fingerprint without the thing it points to. MLflow dutifully recorded that a model trained on data with hash a3f9…, and when the model misbehaves months later, comparing that string against a hash of today’s data can prove the data has changed, but it tells you nothing about how to get the original bytes back. A hash records; it does not store. This lesson closes that gap.

A teammate reports a model in prod is misbehaving. You check out the exact git commit that trained it, re-run, and get a different model. Why? Because the code was versioned but the data wasn’t — the training table has had three months of rows added since. A git SHA pins your code; it says nothing about the data that actually shaped the model.

Why git alone fails

Git is built for text and chokes on data: a 5 GB Parquet file or a 2 GB model checkpoint doesn’t belong in a repo (it bloats history, and diffs are meaningless). So teams just… don’t version data — and lose reproducibility. The fix is to version data the way git versions code, but store the bytes elsewhere:

Git tracks a small pointer file with the data’s hash; the actual bytes live in object storage. Checkout restores both.

DVC (Data Version Control) does exactly this: in git, each big file is represented by a tiny .dvc pointer file containing a content hash, while the real bytes are kept out of the repo in a DVC cache and, after dvc push, in remote storage such as S3, GCS, or Azure. Now git checkout <commit> plus dvc checkout restores the exact code and the exact data of that run (use dvc pull if the bytes are not yet in the local cache).

# Version a dataset with DVC
dvc add data/train.parquet      # writes data/train.parquet.dvc (a hash pointer) and gitignores the data file
git add data/train.parquet.dvc data/.gitignore
git commit -m "dataset v3: +Q1 2026 rows"
dvc push                        # uploads the bytes to remote object storage

# Months later, reproduce the exact run:
git checkout <commit>           # restores code + the .dvc pointer
dvc checkout                    # restores the matching 5 GB dataset from the local DVC cache
# On a fresh clone with an empty cache, run `dvc pull` instead: it fetches from the remote, then checks out.
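
To see why the pointer trick works, here is a toy sketch of content-addressed storage in plain Python. It is not DVC's implementation: the function names, the .ptr extension, the JSON pointer format, and the choice of SHA-256 are assumptions made for illustration, and DVC has its own pointer format and cache layout.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a file, read in chunks so it need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add(data_file: Path, store: Path) -> Path:
    """Copy the bytes into the store under their hash; write a tiny pointer."""
    h = file_hash(data_file)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, store / h)
    pointer = data_file.with_name(data_file.name + ".ptr")
    pointer.write_text(json.dumps({"sha256": h, "path": data_file.name}))
    return pointer  # this small file is what you would commit to git


def checkout(pointer: Path, store: Path) -> Path:
    """Restore the exact bytes that the pointer names."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(store / meta["sha256"], target)
    return target


# Demo in a throwaway directory standing in for a repo plus a remote.
workdir = Path(tempfile.mkdtemp())
store = workdir / "object-store"
data = workdir / "train.csv"

data.write_text("id,label\n1,0\n2,1\n")
pointer = add(data, store)
old_pointer_text = pointer.read_text()        # "committed" months ago

data.write_text("id,label\n1,0\n2,1\n3,1\n")  # new rows arrive later
add(data, store)                              # new version, new hash

pointer.write_text(old_pointer_text)          # stands in for git checkout <commit>
restored = checkout(pointer, store)           # stands in for dvc checkout
print(restored.read_text())
```

Because the store is keyed by content hash, both versions of the file coexist there, and the pointer that git tracks stays a few dozen bytes no matter how large the dataset grows.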

Reproducible = code + data + params + environment

A run is only reproducible if you can pin all of its inputs. The checklist:

Code: a git commit (you have this).
Data: a data version identifier, such as a DVC content hash, a lakeFS commit, or a Delta table version.
Params: hyperparameters, config, and random seeds, logged with MLflow.
Environment: pinned dependencies, ideally a Docker image digest.

Log all four together — that’s what makes the model registry’s lineage actually trustworthy.
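
One way to log all four together is a single manifest written next to the model. The sketch below uses only the standard library; the run_manifest helper, its field names, and the image_digest argument are hypothetical, and in practice you would log the same values to MLflow as params or tags and take the data hash from the .dvc pointer rather than rehashing the bytes.

```python
import hashlib
import json
import platform
import subprocess


def git_commit() -> str:
    """Current commit SHA, or 'unknown' when git or a repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def run_manifest(data_bytes: bytes, params: dict,
                 image_digest: str = "unknown") -> dict:
    """Collect the four inputs of a run into one JSON-serialisable record."""
    return {
        "code": {"git_commit": git_commit()},
        "data": {"sha256": hashlib.sha256(data_bytes).hexdigest()},
        "params": params,
        "environment": {
            "python": platform.python_version(),
            "image_digest": image_digest,  # e.g. passed in by the CI job
        },
    }


# Made-up data and params, purely for illustration.
manifest = run_manifest(b"id,label\n1,0\n2,1\n", {"lr": 0.01, "seed": 42})
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Keeping the four entries in one record means a missing input shows up as an obvious "unknown" rather than as a silent gap in the lineage.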

In one breath

A git commit pins your code but says nothing about the data, so re-running an old commit on a since-changed table yields a different model — and you can’t just commit a 5 GB file to git; tools like DVC and lakeFS solve this by committing a tiny content-hash pointer to git while the real bytes live in object storage, so git checkout + dvc checkout restores the exact code and data together — and a run is only truly reproducible when all four inputs are pinned: code, data, params, and environment.

Practice

Before the quiz, reason about the pointer trick. DVC commits a tiny .dvc file holding a hash like a3f9… to git, and pushes the 5 GB of bytes to S3. Why does this give you git’s full time-travel over data without bloating the repo — and what single command pair restores a months-old run’s exact dataset? Then connect back to the last lesson: MLflow logged that hash as a param, but couldn’t store the bytes. Who stores them now, and how do the two tools divide the labor?

Quick check


Q1. You check out the exact git commit that trained a production model and re-run, but get a different model. What is the most likely cause?

Q2. How does DVC version a 5 GB dataset without bloating the git repo?

Q3. What does a fully reproducible ML run require pinning?

A question to carry forward

Take stock of what the chapter has assembled so far. MLflow records every run; DVC versions the data behind it. So for any model you have ever trained, you can now pin all four inputs — code, data, params, environment — and prove exactly how it came to be. The provenance is airtight.

But provenance is not the same as control. You have forty perfectly-tracked runs; one of them belongs in production, the next should be waiting as a challenger, the previous champion should sit on the bench ready to roll back. Knowing each run’s lineage doesn’t tell anyone which run is live, which is being tested, or how a new version safely takes the throne from the old one. So the question to carry forward is the governance question: once every model is reproducible, what system decides which version is champion, promotes a challenger, and keeps the old one one command away? That is the model registry, and it is the next lesson.
