Data & model versioning
Version your datasets and model artifacts the way you version code, so any run is reproducible. Why a git commit alone isn't enough, and how DVC/lakeFS track data without bloating the repo.
What you'll learn
- Why a code commit alone can't reproduce an ML run
- How DVC/lakeFS version large data via pointers + object storage
- What 'reproducible' actually requires: code + data + params + environment
Before you start
A teammate reports a model in prod is misbehaving. You check out the exact git commit that trained it, re-run, and get a different model. Why? Because the code was versioned but the data wasn’t — the training table has had three months of rows added since. A git SHA pins your code; it says nothing about the data that actually shaped the model.
Why git alone fails
Git is built for text and chokes on data: a 5 GB Parquet file or a 2 GB model checkpoint doesn’t belong in a repo (it bloats history, and diffs are meaningless). So teams just… don’t version data — and lose reproducibility. The fix is to version data the way git versions code, but store the bytes elsewhere:
DVC (Data Version Control) does exactly this: it replaces your big files with tiny .dvc pointer files (containing a content hash) that are committed to git, while the real bytes go to remote object storage such as S3, GCS, or Azure. Now git checkout <commit> plus dvc pull restores the exact code and the exact data of that run (dvc checkout alone is enough when the data is already in your local DVC cache).
# Version a dataset with DVC
dvc add data/train.parquet # creates data/train.parquet.dvc (a hash pointer)
git add data/train.parquet.dvc data/.gitignore # dvc add writes the .gitignore next to the data file
git commit -m "dataset v3: +Q1 2026 rows"
dvc push # uploads the bytes to remote object storage
# Months later, reproduce the exact run:
git checkout <commit> # restores code + the .dvc pointer
dvc pull # downloads the matching 5 GB dataset from the remote and checks it out
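To make the pointer idea concrete, here is a minimal Python sketch of content-addressed storage. It is not DVC's implementation: the function names, the `.pointer.json` suffix, the use of SHA-256, and the flat store layout are all assumptions chosen for illustration. A local directory stands in for the remote object store.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-GB file never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, store: Path) -> Path:
    """Copy the bytes into the store under their hash and write a tiny pointer file.

    The pointer (hypothetical format) is what you would commit to git.
    """
    digest = file_hash(data_file)
    store.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, store / digest)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(
        json.dumps({"sha256": digest, "size": data_file.stat().st_size})
    )
    return pointer


def restore(pointer: Path, store: Path, target: Path) -> None:
    """Rebuild the data file using only the hash recorded in the pointer."""
    digest = json.loads(pointer.read_text())["sha256"]
    shutil.copyfile(store / digest, target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        data.write_bytes(b"id,label\n1,0\n")           # the "dataset v1" bytes
        pointer = track(data, root / "store")
        data.write_bytes(b"id,label\n1,0\n2,1\n")      # new rows arrive later
        restore(pointer, root / "store", data)         # back to the pinned version
        print(data.read_text())
```

The pointer stays a few dozen bytes no matter how large the dataset is, which is why it can live in git while the data cannot.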
Reproducible = code + data + params + environment
A run is only reproducible if you can pin all of its inputs. The checklist:
- Code — git commit (you have this).
- Data: a data version identifier (a DVC content hash, a lakeFS commit ID, or a Delta table version).
- Params — hyperparameters and config, logged with MLflow.
- Environment — pinned dependencies, ideally a Docker image digest.
Log all four together — that’s what makes the model registry’s lineage actually trustworthy.
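One way to keep the four inputs together is a single record per run. The sketch below is an illustration only: `RunManifest`, its field names, and the placeholder values are hypothetical, and it uses just the standard library rather than any tracker's API. The same dictionary could be attached to a run in whatever tracking tool you use.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class RunManifest:
    """The four inputs from the checklist, recorded as one unit."""

    code_commit: str          # git SHA of the training code
    data_version: str         # data version identifier, e.g. a content hash
    params: Dict[str, Any]    # hyperparameters and config
    environment: str          # e.g. a Docker image digest

    def missing(self) -> List[str]:
        """Names of inputs left empty; a non-empty list means the run is not pinned."""
        return [name for name, value in asdict(self).items() if not value]

    def fingerprint(self) -> str:
        """Stable hash of all four inputs; sort_keys makes dict ordering irrelevant."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # All values below are made-up placeholders, not real identifiers.
    manifest = RunManifest(
        code_commit="abc1234",
        data_version="example-data-hash",
        params={"learning_rate": 0.1, "epochs": 3},
        environment="sha256:example-image-digest",
    )
    print(manifest.missing())       # empty list when all four inputs are pinned
    print(manifest.fingerprint())   # changes if any one input changes
```

Two runs with the same fingerprint had the same pinned inputs; a differing fingerprint tells you at least one of the four changed, which is the starting point for debugging a model that will not reproduce.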
Next
Versioned data feeds an honest model registry and makes ML tests reproducible. Next: proving a new model is actually better with A/B testing.
Practice this in an interview
A git commit captures code, but an ML run also depends on the exact training data, hyperparameters, environment, and randomness, none of which live in Git. Datasets are too large for Git and change independently of code, so you need a data-versioning tool like DVC or lakeFS to pin a content hash of the data to the commit. Full reproducibility means versioning code, data, config, environment, and seeds together and linking them.
DVC (and lakeFS) version raw datasets and model artifacts as immutable snapshots tied to Git commits, giving reproducibility and rollback. A feature store manages computed features for training and serving, its main job being to keep offline and online feature definitions in sync to prevent training-serving skew. They are complementary: DVC answers what data made this model, while a feature store answers how do I serve the same features consistently.
Full ML reproducibility requires locking three layers: the random seed across all frameworks, the software environment via pinned dependency manifests or container images, and the training data via content-addressed versioning. Missing any one layer means the same code can produce different models on different runs or machines.
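The random-seed layer is the easiest of the three to demonstrate. The sketch below uses only Python's standard library; `train_val_split` and its parameters are hypothetical names. Real training code would also need to seed each framework it uses (for example NumPy or PyTorch), which this example does not cover.

```python
import random
from typing import List, Sequence, Tuple


def train_val_split(
    rows: Sequence[int], seed: int, val_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Shuffle with a local, explicitly seeded generator, then split.

    Using random.Random(seed) instead of the global generator keeps the
    result independent of any other code that consumes random numbers.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    rows = list(range(10))
    first = train_val_split(rows, seed=42)
    second = train_val_split(rows, seed=42)
    print(first == second)  # same seed and same data give the same split
```

An unseeded shuffle would give a different validation set on each run, so metrics would move even with identical code and data.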
A model registry is a centralised store that tracks every trained model artifact alongside its metadata — hyperparameters, training data version, evaluation metrics, and lineage. Versioning assigns unique identifiers to each artifact and manages lifecycle stages so teams can promote, roll back, and audit models without manual file management.