Why isn't a git commit enough to reproduce an ML training run?

For: MLOps Engineer, ML Engineer, Data Scientist

The short answer

A git commit captures code, but an ML run also depends on the exact training data, hyperparameters, environment, and randomness, and a commit alone does not fully capture any of them. Datasets are usually too large for Git and change independently of code, so you need a data-versioning tool such as DVC or lakeFS to tie the exact data version to the commit. Full reproducibility means versioning code, data, config, environment, and seeds together and linking them.

How to think about it

The short answer

A git commit reproduces your code, but a training run is a function of code plus data, hyperparameters, environment, and randomness. None of the last four are fully captured by a commit, so checking out the same SHA can still give you a different model.

Why

The biggest gap is data. Datasets are usually too large to live in Git, and they change on their own schedule (new labels, backfills, upstream fixes) independently of code. If you only have the commit, you have no record of which snapshot of the data produced the model. DVC solves this by storing a small pointer file containing a content hash in Git while the actual bytes live in remote storage. lakeFS takes a related approach: it versions the object store itself with Git-like commits, whose IDs you can record alongside the code commit. Checking out a commit and then syncing the data (with DVC, by running dvc checkout) restores the exact data version that was used.
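The pointer idea can be sketched in a few lines of Python. This is only an illustration of the principle, not DVC's or lakeFS's actual file format or API: the function names content_hash and write_pointer and the JSON layout are assumptions made up for this example.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Chunked reads keep memory flat even for multi-gigabyte datasets.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path, pointer_path: Path) -> dict:
    """Write a small JSON pointer that is cheap to commit to Git.

    Hypothetical layout: the dataset itself stays in object storage,
    and only this pointer file is committed next to the code.
    """
    pointer = {
        "path": data_path.name,
        "sha256": content_hash(data_path),
        "size_bytes": data_path.stat().st_size,
    }
    pointer_path.write_text(json.dumps(pointer, indent=2, sort_keys=True))
    return pointer
```

Committing the pointer file alongside the training code means every commit names exactly one version of the data, even though the bytes themselves never enter the repository.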

The other gaps:

Config / hyperparameters must be tracked as code or in an experiment tracker (e.g. MLflow), not passed ad hoc on the CLI.
Environment (library versions, CUDA, OS) needs pinning via a lockfile or a container image digest.
Randomness needs fixed seeds, and even then GPU non-determinism can cause results to vary between runs.
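As a rough sketch of what recording these three gaps could look like, the snippet below uses only the Python standard library. The names capture_environment, seeded_rng, and run_record are hypothetical. A real project would more likely rely on a lockfile or image digest for the environment and would also seed its ML framework (for example with np.random.seed or torch.manual_seed), which this sketch does not do.

```python
import platform
import random
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Snapshot interpreter, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def seeded_rng(seed: int) -> random.Random:
    """Return a private RNG so the seed is explicit, not global state."""
    return random.Random(seed)


def run_record(hyperparams: dict, seed: int) -> dict:
    """Bundle config, seed, and environment into one loggable record."""
    return {
        "hyperparams": dict(hyperparams),
        "seed": seed,
        "environment": capture_environment(),
    }
```

Logging the result of run_record at the start of training gives a later rerun something concrete to compare against. It does not remove GPU non-determinism; it only makes the known inputs explicit.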

Concrete example

You train a fraud model on Monday and get 0.91 AUC. On Friday you git checkout the same commit and retrain, but the labels table was backfilled mid-week. You now get 0.88 and waste a day chasing a “regression” that is actually a silent data change. If the labels had been tracked with DVC, the data hash pinned to that commit would no longer have matched the backfilled table, flagging the change immediately.
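The check that would have caught this can be sketched as a guard that runs before training. It is a minimal illustration, assuming the labels fit in memory as (id, label) tuples; fingerprint, assert_same_data, and DataMismatchError are hypothetical names, and a tool like DVC does the equivalent at the file level.

```python
import hashlib


class DataMismatchError(RuntimeError):
    """Raised when the data no longer matches the recorded fingerprint."""


def fingerprint(rows: list) -> str:
    """Hash rows in sorted order so row ordering does not matter."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def assert_same_data(recorded: str, rows: list) -> None:
    """Fail fast if the data changed since the fingerprint was recorded."""
    current = fingerprint(rows)
    if current != recorded:
        raise DataMismatchError(
            f"data changed: recorded {recorded[:12]}, current {current[:12]}"
        )


# Illustrative (id, label) rows; the values are made up for this example.
monday_labels = [(1, 0), (2, 1), (3, 0)]
friday_labels = [(1, 0), (2, 1), (3, 1)]  # row 3 was backfilled mid-week

# Recorded on Monday and stored with the commit.
recorded_on_monday = fingerprint(monday_labels)
```

Calling assert_same_data(recorded_on_monday, friday_labels) before Friday's retrain raises DataMismatchError, so the 0.88 result is never mistaken for a code regression.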

Common follow-up / trap

Interviewers often ask: “How do you link a deployed model back to everything that made it?” The strong answer is to stamp the model’s metadata with the git SHA, the data version hash, and the run ID. The trap is saying “I’d just commit the data to Git”: that bloats the repo, breaks on large files, and still misses environment and seed reproducibility.
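One way to stamp a model, sketched under assumptions: the lineage is written as a JSON sidecar file next to the model artifact. The function names and the sidecar naming scheme are hypothetical; an experiment tracker or model registry would normally hold the same fields as tags or metadata.

```python
import json
import subprocess
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def current_git_sha() -> str:
    """Return HEAD's SHA, or 'unknown' when Git or a repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def build_lineage(git_sha: str, data_hash: str,
                  run_id: Optional[str] = None) -> dict:
    """Collect the identifiers that tie a model back to its inputs."""
    return {
        "git_sha": git_sha,
        "data_sha256": data_hash,
        "run_id": run_id or uuid.uuid4().hex,  # generated if not supplied
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def write_lineage(model_path: Path, lineage: dict) -> Path:
    """Write the lineage as '<model file>.lineage.json' beside the model."""
    sidecar = model_path.with_name(model_path.name + ".lineage.json")
    sidecar.write_text(json.dumps(lineage, indent=2, sort_keys=True))
    return sidecar
```

A training script might call write_lineage(model_path, build_lineage(current_git_sha(), data_hash)) right after saving the model, so whoever debugs the deployed artifact can read off the commit, data version, and run in one place.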
