datarekha
MLOps · Medium · Asked at DeepMind, OpenAI, Meta, Google, Weights and Biases

How do you achieve reproducibility in ML training pipelines — covering seeds, environment, and data versioning?

The short answer

Full ML reproducibility requires locking three layers: the random seed across all frameworks, the software environment via pinned dependency manifests or container images, and the training data via content-addressed versioning. Missing any one layer means the same code can produce different models on different runs or machines.

How to think about it

Reproducibility is non-negotiable for debugging regressions, satisfying model audits, and retraining from a known checkpoint. It requires discipline across three independent layers.

Layer 1 — Random seeds

Set seeds in every framework that uses stochastic operations. Missing one source of randomness (e.g. setting torch.manual_seed but forgetting torch.cuda.manual_seed_all or numpy.random.seed) breaks reproducibility silently.

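import os
import random

import numpy as np
import torch

def set_seeds(seed: int = 42) -> None:
    """Seed every random number generator the training process touches."""
    # Read at interpreter start-up: this only affects child processes.
    # Export PYTHONHASHSEED before launching Python to pin this process too.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)                 # Python standard-library RNG
    np.random.seed(seed)              # NumPy legacy global RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # PyTorch RNGs on all GPUs
    torch.backends.cudnn.deterministic = True  # choose deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # no autotuner, so kernel choice is stable across runs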

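Note: cudnn.deterministic = True can reduce GPU throughput, by roughly 5–20 % depending on the model and hardware. That cost is acceptable for audit runs but may not be worth paying on every training job. Some CUDA operations stay non-deterministic even with these flags; torch.use_deterministic_algorithms(True) makes PyTorch raise an error when it reaches one.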
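One detail of hash seeding is easy to miss: Python reads PYTHONHASHSEED once at interpreter start-up, so assigning it inside a running process only changes what child processes see. The sketch below launches fresh interpreters to show that string hashes are stable when the variable is fixed before launch. The helper name hash_in_child is hypothetical and exists only for this illustration.

```python
import os
import subprocess
import sys


def hash_in_child(text: str, hash_seed: str) -> int:
    """Return hash(text) as computed by a freshly started interpreter.

    hash_in_child is a hypothetical helper used only for illustration.
    """
    # Copy the current environment and pin the hash seed for the child process.
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    first = hash_in_child("train_split", "42")
    second = hash_in_child("train_split", "42")
    # Two separate interpreters agree because the seed was set before launch.
    print(first == second)
```

In practice this suggests exporting the variable in the launch command or the container environment rather than in the training script, at least when set iteration order can influence the run.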

Layer 2 — Environment

Pin every dependency, including transitive ones, at the patch level in requirements.txt or pyproject.toml. For reproducibility down to the bit level, build and push a Docker image with the exact CUDA toolkit, cuDNN, and Python runtime, and never rely on latest tags or >= constraints. The NVIDIA driver belongs to the host rather than the image, so record its version separately. Store the image digest (SHA256) alongside every training run's metadata.

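# Capture the exact environment used for the run
pip freeze > requirements-lock.txt
# RepoDigests is populated only once the image has been pushed to or pulled from a registry
docker inspect myimage:v3 --format '{{index .RepoDigests 0}}'
# myimage@sha256:abc123...  <- store this in run metadata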
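Alongside the image digest, it can help to record a compact fingerprint of the Python environment in the run metadata. The following is a minimal sketch using only the standard library. The function environment_fingerprint and its field names are illustrative assumptions, not part of any tracking tool's API.

```python
import hashlib
import platform
from importlib import metadata


def environment_fingerprint() -> dict:
    """Summarise the interpreter and installed packages (hypothetical helper)."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        if not meta:
            continue  # skip installs whose metadata is missing or empty
        name, version = meta.get("Name"), meta.get("Version")
        if name and version:
            pins.add(f"{name}=={version}")
    # Sort so the digest does not depend on discovery order.
    ordered = sorted(pins)
    digest = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "package_count": len(ordered),
        "packages_sha256": digest,
    }


if __name__ == "__main__":
    print(environment_fingerprint())
```

Two runs whose packages_sha256 values differ were not executed against the same set of installed packages, which is a quick first check when a result fails to reproduce.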

Layer 3 — Data versioning

Data changes silently — rows are added, deleted, or corrected. Use a content-addressed versioning tool to snapshot the dataset:

# DVC: version datasets in Git without storing blobs in Git
dvc add data/training_2026-06-06.parquet
git add data/training_2026-06-06.parquet.dvc data/.gitignore
git commit -m "Add training snapshot 2026-06-06"
# The .dvc file stores the file's MD5 hash, so the exact bytes can be fetched later
dvc push   # upload the bytes to the S3/GCS remote; retrieval depends on them staying there

Log the Git commit hash that contains the .dvc file, or the dataset URI, in your MLflow / W&B run so any future run can reconstruct the exact dataset.
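To make the idea of content addressing concrete, the sketch below hashes a file in chunks and bundles the result with the other identifiers a run should carry. The names file_md5 and build_run_manifest, and the manifest keys, are hypothetical and chosen for illustration. MD5 is used only because the DVC example mentions it, and the digest is not guaranteed to match what any particular DVC version records.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_run_manifest(
    seed: int, image_digest: str, data_path: Path, git_commit: str
) -> dict:
    """Collect the identifiers for all three layers (hypothetical schema)."""
    return {
        "seed": seed,                    # Layer 1
        "image_digest": image_digest,    # Layer 2
        "git_commit": git_commit,        # commit holding the .dvc file
        "data_path": str(data_path),
        "data_md5": file_md5(data_path), # Layer 3
    }
```

The resulting dictionary could be logged as run parameters in whichever tracker is in use. If a single byte of the dataset changes, data_md5 changes with it, which is what makes a silent data edit visible.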
