datarekha

Data & model versioning

Version your datasets and model artifacts the way you version code, so any run is reproducible. Why a git commit alone isn't enough, and how DVC/lakeFS track data without bloating the repo.

6 min read Beginner MLOps Lesson 6 of 28

What you'll learn

  • Why a code commit alone can't reproduce an ML run
  • How DVC/lakeFS version large data via pointers + object storage
  • What 'reproducible' actually requires: code + data + params + environment

Before you start

A teammate reports a model in prod is misbehaving. You check out the exact git commit that trained it, re-run, and get a different model. Why? Because the code was versioned but the data wasn’t — the training table has had three months of rows added since. A git SHA pins your code; it says nothing about the data that actually shaped the model.

Why git alone fails

Git is built for text and chokes on data: a 5 GB Parquet file or a 2 GB model checkpoint doesn’t belong in a repo (it bloats history, and diffs are meaningless). So teams just… don’t version data — and lose reproducibility. The fix is to version data the way git versions code, but store the bytes elsewhere:

[Figure: a Git repo holds train.py (code) and data.csv.dvc (pointer, md5: a3f9…). Object storage (S3) maps a3f9… → 5 GB data and b1c2… → model.pt. The hash points to the bytes; git checkout <commit> restores the matching code and data version.]
Git tracks a small pointer file with the data’s hash; the actual bytes live in object storage. Checkout restores both.

DVC (Data Version Control) does exactly this: in git, your big files are represented by tiny .dvc pointer files (containing a content hash), while the real bytes go to S3/GCS/Azure and the data files themselves are git-ignored. Now git checkout <commit> plus dvc checkout restores the exact code and the exact data of that run. If the bytes are not already in your local DVC cache, use dvc pull, which downloads them from the remote first.

# Version a dataset with DVC (assumes a remote was set up with `dvc remote add`)
dvc add data/train.parquet      # creates data/train.parquet.dvc (a hash pointer) and git-ignores the data file
git add data/train.parquet.dvc data/.gitignore
git commit -m "dataset v3: +Q1 2026 rows"
dvc push                        # uploads the bytes to remote object storage

# Months later, reproduce the exact run:
git checkout <commit>           # restores code + the .dvc pointer
dvc pull                        # downloads the matching 5 GB dataset (dvc checkout is enough if it is already cached)
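
To make the pointer idea concrete, here is a minimal Python sketch of content-addressed storage. It is not DVC's implementation: the `.ptr` extension, the JSON pointer layout, and the local `store` directory standing in for an S3 bucket are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add(data_file: Path, store: Path) -> Path:
    """Copy the bytes into the store under their hash and write a small pointer file."""
    content_hash = file_md5(data_file)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, store / content_hash)
    pointer = data_file.with_name(data_file.name + ".ptr")  # hypothetical extension
    pointer.write_text(json.dumps({"md5": content_hash, "path": data_file.name}))
    return pointer


def checkout(pointer: Path, store: Path) -> Path:
    """Restore the data file named by a pointer from the store."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copyfile(store / meta["md5"], target)
    return target


def demo() -> tuple[bytes, bytes]:
    """Add v1, overwrite it with v2, then restore v1 from its saved pointer."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        store = root / "store"                      # stand-in for an object-storage bucket
        data = root / "train.csv"
        data.write_bytes(b"id,label\n1,0\n")
        pointer_v1 = add(data, store).read_text()   # the small text git would commit
        data.write_bytes(b"id,label\n1,0\n2,1\n")   # new rows arrive
        add(data, store)
        after_update = data.read_bytes()
        # Checking out the old commit puts the old pointer back; then restore the bytes.
        (root / "train.csv.ptr").write_text(pointer_v1)
        restored = checkout(root / "train.csv.ptr", store).read_bytes()
        return after_update, restored
```

Because the store is keyed by content hash, both dataset versions coexist there, and the pointer committed with each code revision is enough to pick the right one.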

Reproducible = code + data + params + environment

A run is only reproducible if you can pin all of its inputs. The checklist:

  • Code — git commit (you have this).
  • Data: a data version identifier (a DVC content hash, a lakeFS commit, or a Delta Lake table version).
  • Params — hyperparameters and config, logged with MLflow.
  • Environment — pinned dependencies, ideally a Docker image digest.

Log all four together — that’s what makes the model registry’s lineage actually trustworthy.
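
One way to log all four together is a single run manifest. The sketch below uses only the standard library; the field names, the `abc1234` commit, and the image digest placeholder are hypothetical, and in practice you would log the same values to your tracker (for example MLflow) rather than keep them in a bare dict.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Content hash of the training data, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(code_commit: str, data_path: Path, params: dict, image_digest: str) -> dict:
    """Collect the four inputs from the checklist into one record."""
    return {
        "code": {"git_commit": code_commit},
        "data": {"path": data_path.name, "sha256": sha256_of(data_path)},
        "params": params,
        "environment": {
            "image_digest": image_digest,
            "python": platform.python_version(),
        },
    }


def manifest_id(manifest: dict) -> str:
    """Short fingerprint of the manifest; canonical JSON keeps it order-independent."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def demo() -> tuple[str, str, str]:
    """Return ids for a run, an identical rerun, and a rerun after the data grew."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_bytes(b"id,label\n1,0\n")
        params = {"lr": 0.01, "epochs": 5}
        commit = "abc1234"                # hypothetical; in practice: git rev-parse HEAD
        image = "sha256:<image digest>"   # hypothetical placeholder
        first = manifest_id(build_manifest(commit, data, params, image))
        rerun = manifest_id(build_manifest(commit, data, params, image))
        data.write_bytes(b"id,label\n1,0\n2,1\n")  # same code, three months of new rows
        drifted = manifest_id(build_manifest(commit, data, params, image))
        return first, rerun, drifted
```

An identical rerun yields the same fingerprint, while the same commit trained on grown data yields a different one, which is exactly the drift a git SHA alone would hide.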

Quick check

Q1. You check out the exact git commit that trained a production model and re-run, but get a different model. Most likely cause?
Q2. How does DVC version a 5 GB dataset without bloating the git repo?
Q3. What does a fully reproducible ML run require pinning?

Next

Versioned data feeds an honest model registry and makes ML tests reproducible. Next: proving a new model is actually better with A/B testing.


Practice this in an interview

Why isn't a git commit enough to reproduce an ML training run?

A git commit captures code, but an ML run also depends on the exact training data, hyperparameters, environment, and randomness, which a commit does not pin unless you deliberately record them. Datasets are too large for Git and change independently of code, so you need a data-versioning tool like DVC or lakeFS to pin a content hash of the data to the commit. Full reproducibility means versioning code, data, config, environment, and seeds together and linking them.

How does DVC differ from a feature store, and when would you reach for each?

DVC (and lakeFS) version raw datasets and model artifacts as immutable snapshots tied to Git commits, giving reproducibility and rollback. A feature store manages computed features for training and serving, its main job being to keep offline and online feature definitions in sync to prevent training-serving skew. They are complementary: DVC answers what data made this model, while a feature store answers how do I serve the same features consistently.

How do you achieve reproducibility in ML training pipelines — covering seeds, environment, and data versioning?

Full ML reproducibility requires locking three layers: the random seed across all frameworks, the software environment via pinned dependency manifests or container images, and the training data via content-addressed versioning. Missing any one layer means the same code can produce different models on different runs or machines.
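
The seed layer can be sketched as a single helper. The function name `seed_everything` is an assumption for this example; NumPy and PyTorch are seeded only if they happen to be installed, and seeding alone does not guarantee bit-identical results on hardware with non-deterministic kernels.

```python
import os
import random


def seed_everything(seed: int) -> list[str]:
    """Seed every RNG we can find and return the names of the libraries seeded."""
    seeded = ["random"]
    random.seed(seed)
    # Affects only child processes started after this point; the running
    # interpreter fixed its own hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        seeded.append("torch")
    except ImportError:
        pass
    return seeded
```

Call it once at the top of the training script and log the seed value alongside the other run parameters.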

What is a model registry, and how does model versioning work in production ML systems?

A model registry is a centralised store that tracks every trained model artifact alongside its metadata — hyperparameters, training data version, evaluation metrics, and lineage. Versioning assigns unique identifiers to each artifact and manages lifecycle stages so teams can promote, roll back, and audit models without manual file management.
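
As a rough illustration of versioning plus lifecycle management, here is a tiny in-memory registry. It is a sketch only, not the API of MLflow or any real registry; the class, method, and field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelVersion:
    version: int
    artifact_hash: str   # content hash of the model file
    data_version: str    # e.g. the dataset hash the model was trained on
    params: dict
    metrics: dict


@dataclass
class ModelRegistry:
    versions: dict = field(default_factory=dict)    # name -> list of ModelVersion
    production: dict = field(default_factory=dict)  # name -> history of promoted version numbers

    def register(self, name, artifact_hash, data_version, params, metrics) -> ModelVersion:
        """Record a new artifact with its lineage; versions count up from 1."""
        history = self.versions.setdefault(name, [])
        entry = ModelVersion(len(history) + 1, artifact_hash, data_version, params, metrics)
        history.append(entry)
        return entry

    def promote(self, name: str, version: int) -> None:
        """Mark a registered version as the one serving production."""
        if not any(v.version == version for v in self.versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self.production.setdefault(name, []).append(version)

    def rollback(self, name: str) -> int:
        """Return production to the previously promoted version."""
        promoted = self.production.get(name, [])
        if len(promoted) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        promoted.pop()
        return promoted[-1]

    def current(self, name: str) -> ModelVersion:
        """Look up the version currently in production, with its lineage."""
        number = self.production[name][-1]
        return self.versions[name][number - 1]
```

Because every entry carries its data version and parameters, asking which data made the model currently in production becomes a lookup rather than an investigation.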
