How do you achieve reproducibility in ML training pipelines — covering seeds, environment, and data versioning?
Full ML reproducibility requires locking three layers: the random seed across all frameworks, the software environment via pinned dependency manifests or container images, and the training data via content-addressed versioning. Missing any one layer means the same code can produce different models on different runs or machines.
How to think about it
Reproducibility is non-negotiable for debugging regressions, satisfying model audits, and retraining from a known checkpoint. It requires discipline across three independent layers.
Layer 1 — Random seeds
Set seeds in every framework that performs stochastic operations. Missing one source of randomness (e.g. calling torch.manual_seed but forgetting numpy.random.seed or Python's random.seed) breaks reproducibility silently: nothing fails, the numbers simply differ between runs.
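import os
import random

import numpy as np
import torch

def set_seeds(seed: int = 42) -> None:
    # Only affects subprocesses started later; to fix hashing in this
    # process, export PYTHONHASHSEED before launching Python.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # explicit; safe to call without a GPU
    torch.backends.cudnn.deterministic = True  # use deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False  # no autotuner, so kernel choice is stable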
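Note: cudnn.deterministic = True can reduce GPU throughput (roughly 5–20 %, depending on model and hardware), which is acceptable for audit runs but usually not worth paying on every training job. It also covers only cuDNN; other CUDA operations can stay non-deterministic unless torch.use_deterministic_algorithms(True) is enabled as well.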
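One way to catch a missed source of randomness is to run the same step twice under one seed and compare a fingerprint of the outputs. The sketch below is stdlib-only and illustrative: toy_training_step, fingerprint, and run_with_seed are hypothetical names, and the "training step" is just a stand-in that draws random numbers.

```python
import hashlib
import random


def toy_training_step(n: int = 5) -> list[float]:
    """Hypothetical stand-in for a training step: returns pseudo-random 'weights'."""
    return [random.random() for _ in range(n)]


def fingerprint(values: list[float]) -> str:
    """Hash the exact float representations so any drift changes the digest."""
    payload = ",".join(v.hex() for v in values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_seed(seed: int) -> str:
    """Seed the RNG, run the step, and return a fingerprint of its output."""
    random.seed(seed)
    return fingerprint(toy_training_step())


if __name__ == "__main__":
    first = run_with_seed(42)
    second = run_with_seed(42)
    other = run_with_seed(43)
    print("same seed matches:", first == second)
    print("different seed differs:", first != other)
```

In a real pipeline the same idea would apply to a short training run: seed every framework, hash the resulting weights or loss values, and fail the check if two identically seeded runs disagree.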
Layer 2 — Environment
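Pin every dependency, including transitive ones, to an exact patch version in requirements.txt or in a lock file generated from pyproject.toml. To reproduce the full runtime, build and push a Docker image with the exact CUDA toolkit, cuDNN, and Python runtime; never rely on latest tags or >= constraints. The GPU driver is installed on the host rather than in the image, so record its version separately. Store the image digest (SHA256) alongside every training run's metadata.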
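A pinning rule is easier to enforce when it is checked automatically. The following is a minimal sketch, assuming a plain requirements file with one requirement per line; unpinned_requirements and the PINNED pattern are hypothetical, and the pattern is deliberately strict (it does not handle environment markers, URLs, or --hash options).

```python
import re

# Matches "name==1.2.3" with an optional extras block, e.g. "pkg[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,\s-]+\])?==\d+\.\d+\.\d+\S*$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact patch version."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


if __name__ == "__main__":
    sample = """
    numpy==1.26.4
    torch>=2.0        # floating lower bound
    pandas
    """
    for line in unpinned_requirements(sample):
        print("not pinned:", line)
```

Running a check like this in CI would flag a floating constraint before it reaches a training job.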
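# Capture the exact environment used for the run
pip freeze > requirements-lock.txt
# RepoDigests is populated only after the image has been pushed or pulled
docker inspect myimage:v3 --format '{{index .RepoDigests 0}}'
# <repository>@sha256:abc123... <- store this in run metadata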
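The same information can also be captured from inside the training process, which helps when a run happens outside a container. This is a sketch using only the standard library; environment_snapshot and environment_fingerprint are hypothetical helper names, and the fingerprint covers Python packages only, not CUDA or system libraries.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect interpreter, OS, and installed-package versions."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that report no name
            packages.add(f"{name}=={dist.version}")
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": sorted(packages, key=str.lower),
    }


def environment_fingerprint(snapshot: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the digest."""
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    snapshot = environment_snapshot()
    print("python:", snapshot["python"])
    print("packages:", len(snapshot["packages"]))
    print("fingerprint:", environment_fingerprint(snapshot))
```

Two runs with the same fingerprint had the same Python package set; a differing fingerprint is a cheap first signal when results drift between machines.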
Layer 3 — Data versioning
Data changes silently — rows are added, deleted, or corrected. Use a content-addressed versioning tool to snapshot the dataset:
# DVC: version datasets in Git without storing blobs in Git
dvc add data/training_2026-06-06.parquet
git add data/training_2026-06-06.parquet.dvc data/.gitignore
git commit -m "Add training snapshot 2026-06-06"
# The .dvc file records the file's MD5 hash; the bytes themselves live in the DVC remote
dvc push # upload the data to the S3/GCS remote so the hash can be resolved later
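Content addressing means the identifier is derived from the bytes, so any silent change yields a new identifier. The sketch below illustrates the idea with a chunked file hash; file_digest is a hypothetical helper, SHA-256 is an arbitrary choice here, and the result is not claimed to match the hash DVC stores.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large datasets never load fully into memory."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.csv"  # hypothetical stand-in for a dataset file
        sample.write_bytes(b"id,label\n1,cat\n2,dog\n")
        before = file_digest(sample)
        sample.write_bytes(b"id,label\n1,cat\n2,cow\n")  # one silent correction
        after = file_digest(sample)
        print("digest changed:", before != after)
```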
Log the Git commit hash that contains the .dvc file, or the dataset URI and content hash, in your MLflow / W&B run so any future run can reconstruct the exact dataset.
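The three layers come together in a single record per run. The following is a minimal sketch of such a record, not a prescribed schema: RunManifest, its field names, and the placeholder values are all hypothetical, and the same fields could be logged as run parameters in a tracker such as MLflow (for example via mlflow.log_params).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the inputs needed to rebuild one training run."""
    seed: int
    git_commit: str      # commit that contains the code and the .dvc file
    image_digest: str    # container image digest
    dataset_uri: str     # where the dataset bytes are stored
    dataset_hash: str    # content hash of the dataset


def manifest_json(manifest: RunManifest) -> str:
    """Serialize with sorted keys so the output is stable across runs."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders, not real identifiers.
    manifest = RunManifest(
        seed=42,
        git_commit="<git commit hash>",
        image_digest="<repository>@sha256:<digest>",
        dataset_uri="s3://example-bucket/training.parquet",
        dataset_hash="<dataset content hash>",
    )
    print(manifest_json(manifest))
```

Because the dataclass is frozen and compares by value, two manifests are equal exactly when every recorded input matches, which gives a quick test for whether two runs were meant to be identical.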