datarekha

MLflow on Databricks — tracking to serving

Databricks invented MLflow, then built the rest of an ML platform around it. Tracking, Unity Catalog model registry, and serving endpoints — what's actually production-ready.

9 min read Advanced PySpark Lesson 22 of 22

What you'll learn

  • How managed MLflow auto-logs from any Databricks notebook or job
  • Registering models to Unity Catalog (the new path) vs the deprecated Workspace registry
  • Serving endpoints — real-time inference with auto-scaling and A/B
  • Feature Engineering in Unity Catalog — Databricks' native feature store

Before you start

Databricks built MLflow. They also built the rest of an ML platform around it — and that integration is the strongest argument for doing ML on Databricks rather than wiring up the OSS pieces yourself. Every notebook has a tracking server attached. Models register to Unity Catalog with lineage. Serving endpoints are one CLI command. Feature tables live alongside your data tables.

This lesson is what changes when you take a model from “works in a notebook” to “serves traffic in production” on Databricks.

What you get for free

The instant you import mlflow in a Databricks notebook, several things are already wired up:

  • Tracking server — calling mlflow.<flavor>.autolog() (or enabling workspace-level autologging in settings) captures metrics, parameters, and the trained model artifact for sklearn, XGBoost, LightGBM, PyTorch, TensorFlow, and Spark MLlib with no extra code.
  • Experiment — each notebook has an auto-created experiment at the notebook’s path. Runs from that notebook appear there.
  • Artifact store — backed by your workspace’s DBFS / Unity Catalog volume. No S3 setup, no auth config.
  • UI — the “Experiments” tab in the workspace, filterable, diff-able, with per-run plots.

A complete training script that takes advantage of all of this:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Read features from a UC table — full lineage tracking
df = spark.read.table("main.ml.churn_features").toPandas()

X = df.drop(columns=["churned"])
y = df["churned"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Autolog: every metric, param, and the model itself, captured automatically
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestClassifier(n_estimators=200, max_depth=8)
    model.fit(X_tr, y_tr)

    # Manual logging on top of autolog still works
    test_acc = model.score(X_te, y_te)
    mlflow.log_metric("test_accuracy", test_acc)

No tracking URI to configure. No S3 bucket to create. The run appears in the experiment UI before the cell finishes.

The shift to Unity Catalog model registry

Up to 2023, Databricks shipped a Workspace Model Registry — a flat namespace of registered models, one per workspace. It worked but didn’t fit the Unity Catalog world. Since 2024, the recommended path is to register models in Unity Catalog, using the three-level namespace.

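import mlflow

mlflow.set_registry_uri("databricks-uc")

# ID of the training run to register; here, the most recent run in this session
run_id = mlflow.last_active_run().info.run_id

# Register from a finished run ("model" is the artifact path autolog uses)
result = mlflow.register_model(
    model_uri = f"runs:/{run_id}/model",
    name = "main.ml.churn_model",
)
print(result.version)   # the newly created version number, e.g. 3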

A UC-registered model gets you:

  • Permissions — same GRANT/REVOKE as tables. An analyst can SELECT features but not EXECUTE the model unless granted.
  • Lineage — UC tracks which feature tables fed the model, which notebook trained it, which endpoints serve it.
  • Cross-workspace access — a model registered in prod can be read by a staging workspace pointed at the same UC.
  • Aliases (such as @champion, @challenger, @production) are movable pointers to versions:

from mlflow.tracking import MlflowClient
client = MlflowClient(registry_uri="databricks-uc")

# Promote version 3 to "champion"
client.set_registered_model_alias(
    "main.ml.churn_model", "champion", 3,
)

# Load the current champion — your serving code doesn't care about version
model = mlflow.pyfunc.load_model("models:/main.ml.churn_model@champion")

Aliases are the replacement for the old Staging / Production stages. They’re movable pointers; the underlying versions stay immutable, which is what you want for audit.

Model serving — real-time inference

Loading a model into a notebook is fine for batch scoring (you write predictions back to a Delta table). For real-time use cases — fraud checks, recommendations, anything that needs sub-second response — you want a serving endpoint.

A Databricks serving endpoint is an auto-scaling HTTPS service backed by a UC-registered model version. Create one from the CLI:

# The endpoint name goes in the JSON body alongside the config
databricks serving-endpoints create \
  --json '{
    "name": "churn-endpoint",
    "config": {
      "served_entities": [{
        "entity_name": "main.ml.churn_model",
        "entity_version": "3",
        "workload_size": "Small",
        "scale_to_zero_enabled": true
      }]
    }
  }'

scale_to_zero_enabled: true is the cost-saver — the endpoint spins down idle replicas. First request after idle takes a few seconds (cold start); steady traffic stays warm.
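
If you create endpoints from a script rather than the CLI, the same body can be assembled in Python first. The sketch below only builds and checks the payload; nothing is sent. The helper name `build_endpoint_payload` is hypothetical, and placing the endpoint name under a top-level `name` key is an assumption about the request shape.

```python
import json


def build_endpoint_payload(endpoint_name, model_name, version,
                           workload_size="Small", scale_to_zero=True):
    """Build the JSON body for creating a serving endpoint.

    Hypothetical helper: it mirrors the fields of the CLI example and
    does not call any Databricks API.
    """
    # UC models use a three-level name: catalog.schema.model
    if model_name.count(".") != 2:
        raise ValueError("expected a three-level UC name: catalog.schema.model")
    return {
        "name": endpoint_name,  # assumption: the name travels in the body
        "config": {
            "served_entities": [{
                "entity_name": model_name,
                "entity_version": str(version),
                "workload_size": workload_size,
                "scale_to_zero_enabled": scale_to_zero,
            }]
        },
    }


payload = build_endpoint_payload("churn-endpoint", "main.ml.churn_model", 3)
print(json.dumps(payload, indent=2))
```

Keeping the payload in a function makes it easy to review in a pull request and to reuse across environments with different model versions.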

For A/B testing, an endpoint can serve multiple model versions behind a traffic split:

{
  "config": {
    "served_entities": [
      {"entity_name": "main.ml.churn_model", "entity_version": "3", "name": "v3"},
      {"entity_name": "main.ml.churn_model", "entity_version": "4", "name": "v4"}
    ],
    "traffic_config": {
      "routes": [
        {"served_model_name": "v3", "traffic_percentage": 90},
        {"served_model_name": "v4", "traffic_percentage": 10}
      ]
    }
  }
}

10% of requests hit v4, 90% hit v3. You compare metrics in the endpoint’s monitoring tab; if v4 wins, you promote it to 100% by updating the config.
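
To get a feel for what a 90/10 split means in practice, here is a small simulation of weighted routing. It is a toy model of the behaviour described above, not how Databricks implements its router; `validate_routes` and `pick_served_model` are hypothetical helper names.

```python
import random


def validate_routes(routes):
    """Raise ValueError unless traffic percentages sum to exactly 100."""
    total = sum(r["traffic_percentage"] for r in routes)
    if total != 100:
        raise ValueError(f"traffic percentages sum to {total}, expected 100")


def pick_served_model(routes, rng):
    """Pick one served model name with probability proportional to its weight."""
    names = [r["served_model_name"] for r in routes]
    weights = [r["traffic_percentage"] for r in routes]
    return rng.choices(names, weights=weights, k=1)[0]


routes = [
    {"served_model_name": "v3", "traffic_percentage": 90},
    {"served_model_name": "v4", "traffic_percentage": 10},
]
validate_routes(routes)

rng = random.Random(0)  # fixed seed so the simulation is repeatable
counts = {"v3": 0, "v4": 0}
for _ in range(10_000):
    counts[pick_served_model(routes, rng)] += 1

share_v4 = counts["v4"] / 10_000
print(f"v4 received {share_v4:.1%} of simulated requests")
```

The split is per request, so with low traffic the observed share for v4 can drift well away from 10% over short windows. Compare the two versions only after each has seen enough requests.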

Querying the endpoint is a plain HTTPS POST:

curl -X POST https://<workspace>.cloud.databricks.com/serving-endpoints/churn-endpoint/invocations \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dataframe_records": [{"tenure_months": 12, "monthly_charges": 70.5}]}'

The endpoint handles request batching, auto-scaling, and metrics collection for free. You don’t run a Flask server.
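
The same call from Python needs nothing beyond the standard library. This sketch builds the request without sending it; the host and token are placeholders, and `build_invocation_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, records):
    """Build (but do not send) a POST request for a serving endpoint."""
    url = f"https://{host}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_invocation_request(
    host="example-workspace.cloud.databricks.com",  # placeholder host
    endpoint_name="churn-endpoint",
    token="<token>",  # placeholder; read it from a secret store in practice
    records=[{"tenure_months": 12, "monthly_charges": 70.5}],
)
print(req.full_url)

# To actually call the endpoint (requires a real host and token):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(json.load(resp))
```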

Feature Engineering in Unity Catalog

Training and serving need the same features computed the same way. That’s the “training-serving skew” problem. The OSS solution is a feature store like Feast. Databricks ships Feature Engineering in Unity Catalog (formerly Databricks Feature Store) — a native feature store backed by UC tables.

A feature table is just a Delta table with a primary key, registered to UC. You compute it once (a scheduled job), then both training code and serving code read from the same table:

from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

fe = FeatureEngineeringClient()

# Define a feature table
fe.create_table(
    name = "main.ml.customer_features",
    primary_keys = ["customer_id"],
    schema = customer_features_df.schema,
    description = "Per-customer features for churn model",
)
fe.write_table(name="main.ml.customer_features", df=customer_features_df, mode="merge")

# At training time, join features to a labels dataframe
training_set = fe.create_training_set(
    df = labels_df,
    feature_lookups = [
        FeatureLookup(
            table_name = "main.ml.customer_features",
            feature_names = ["tenure_months", "monthly_charges", "avg_session_minutes"],
            lookup_key = "customer_id",
        ),
    ],
    label = "churned",
)

training_df = training_set.load_df().toPandas()
# ... train your model on training_df ...

# Log model WITH feature spec — serving endpoint will auto-fetch features
fe.log_model(
    model = trained_model,
    artifact_path = "model",
    flavor = mlflow.sklearn,
    training_set = training_set,
    registered_model_name = "main.ml.churn_model",
)

The killer detail is in the last call: fe.log_model records which features were used. At inference time, the serving endpoint can look up the features by primary key automatically: you POST {"customer_id": 42} and the endpoint fetches the latest feature values before scoring. For real-time endpoints this lookup typically goes through an online copy of the feature table, so the table has to be published for online serving first.

That eliminates the “the feature value at train time differs from serve time” class of bug, and it means your serving client doesn’t need to know what features the model uses.
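
The idea is easier to see stripped of the platform. The sketch below is a pure-Python stand-in, not a Databricks API: an in-memory dict plays the feature table, a hand-written rule plays the model, and all names are hypothetical.

```python
# Toy stand-ins: an in-memory "feature table" keyed by customer_id and a
# hand-written scoring rule. Neither is a Databricks API.
FEATURE_TABLE = {
    42: {"tenure_months": 12, "monthly_charges": 70.5},
    43: {"tenure_months": 48, "monthly_charges": 20.0},
}

# The feature list is recorded once, when the model is logged
FEATURE_NAMES = ["tenure_months", "monthly_charges"]


def toy_model(features):
    """Return a made-up churn score: short tenure and high charges score higher."""
    return round(features["monthly_charges"] / (features["tenure_months"] + 1), 3)


def score_by_key(request):
    """Look up features by primary key, then score; the caller sends only the key."""
    row = FEATURE_TABLE[request["customer_id"]]
    features = {name: row[name] for name in FEATURE_NAMES}
    return toy_model(features)


print(score_by_key({"customer_id": 42}))
print(score_by_key({"customer_id": 43}))
```

Because the lookup and the feature list live next to the model, a client that only knows `customer_id` cannot send stale or differently computed feature values.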

Lineage in one query

Once you’ve done a UC-registered training run, the lineage is queryable. Databricks surfaces model-to-table lineage in system.access.table_lineage and exposes endpoint config in system.serving.served_entities (enable system tables in the account console first):

-- What tables did the churn model training run read?
SELECT DISTINCT source_table_full_name
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.ml.churn_model';

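-- What endpoints are serving entities from the main catalog?
-- (column names as documented for system.serving.served_entities; verify against your workspace)
SELECT endpoint_name, served_entity_name, entity_name, entity_version
FROM system.serving.served_entities
WHERE entity_name LIKE 'main.ml.%';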

The exact system table schema evolves as Databricks adds capabilities — check the system tables docs for the current column names before querying. The principle is the same: this is the kind of audit trail that’s tedious in a hand-rolled MLOps stack and free in Databricks because everything goes through UC.

A pyfunc model in pure Python

The MLflow pyfunc abstraction — “any model is a callable that takes a dataframe and returns predictions” — is small enough to demo:
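
What follows is a minimal sketch in plain Python, with no MLflow import. It is not MLflow's implementation: `ThresholdModel` and `ToyRegistry` are hypothetical classes, and rows are passed as a list of dicts instead of a dataframe to keep the example dependency-free.

```python
class ThresholdModel:
    """Pyfunc-style model: predict() takes rows (list of dicts), returns a list."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        # Flag a customer as likely to churn when charges exceed the threshold
        return [int(row["monthly_charges"] > self.threshold) for row in rows]


class ToyRegistry:
    """Toy registry: append-only versions plus movable aliases."""

    def __init__(self):
        self._versions = {}   # version number -> model, never overwritten
        self._aliases = {}    # alias name -> version number

    def register(self, model):
        version = len(self._versions) + 1
        self._versions[version] = model
        return version

    def set_alias(self, alias, version):
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self._aliases[alias] = version

    def load(self, alias):
        return self._versions[self._aliases[alias]]


registry = ToyRegistry()
v1 = registry.register(ThresholdModel(threshold=80))
v2 = registry.register(ThresholdModel(threshold=60))

rows = [{"monthly_charges": 70.5}]
registry.set_alias("champion", v1)
before = registry.load("champion").predict(rows)   # served by version 1

registry.set_alias("champion", v2)                 # promotion is one alias swap
after = registry.load("champion").predict(rows)    # served by version 2

print(before, after)
```

The calling code never changes between the two loads; only the alias moves.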

Aliases as movable pointers to immutable versions is the whole trick. Production deployments load @champion; promoting a new version is a single alias-swap that takes effect on the next load.

Honest take

The Databricks ML stack is excellent when you’re already on Databricks. Tracking, registry, serving, feature store — all integrated, all governed by UC, all queryable.

The catch: it’s a Databricks ecosystem. Endpoints don’t serve outside the platform. The feature store doesn’t work without UC. If your training is on Databricks but serving is on a Kubernetes cluster elsewhere, you’re either lifting models to a generic registry (OSS MLflow + S3 + Ray Serve or BentoML), or doubling your feature pipeline. Pick the stack based on where all of training, batch scoring, and online serving will live.

If that’s all Databricks: use the native pieces. If not: OSS MLflow plus your own serving is still a perfectly good answer.

Quick check

Q1. Why does Databricks recommend Unity Catalog model registry over the legacy Workspace registry?
Q2. What's the practical advantage of `fe.log_model(...)` with a `training_set` over plain `mlflow.sklearn.log_model`?
Q3. You want to do A/B testing between model v3 and v4 with 90/10 traffic split. How does a Databricks serving endpoint handle this?

Next

You now have the full Databricks loop: store data in Delta, transform with PySpark, schedule with Workflows, train and serve with MLflow. The next layer up is the production discipline — testing, CI/CD, observability — that makes the loop reliable enough to ship without losing sleep.


Practice this in an interview

What does experiment tracking solve, and how do MLflow and Weights and Biases differ in practice?

Experiment tracking captures the full reproducibility context of a training run — code version, hyperparameters, dataset hash, environment, and metrics — so any result can be reproduced and compared. MLflow is an open-source, self-hosted lifecycle platform; Weights and Biases is a hosted, collaboration-first product with richer real-time visualisation.

Why use a pipeline orchestrator like Airflow or Kubeflow instead of cron scripts for ML workflows?

ML workflows are multi-step DAGs with dependencies, and an orchestrator gives you dependency management, retries, backfills, caching, observability, and lineage that chained cron jobs cannot. Airflow is a general-purpose task orchestrator defining DAGs in Python, while Kubeflow Pipelines is ML-native, passing typed artifacts between containerized steps on Kubernetes with conditional logic like deploy only if accuracy exceeds a threshold. Choosing depends on whether you need generic scheduling or ML-specific, container-based pipelines.

How do you attribute and control ML spend across teams and models (FinOps for ML)?

Apply FinOps to ML by tagging every workload (training jobs, endpoints, GPU pools) by team, model, and environment so cost is attributable, then track unit-economics metrics like cost per prediction or per training run rather than just total spend. Set budgets and alerts, identify idle GPUs and overprovisioned endpoints, and enforce guardrails like autoscaling and instance-type policies. The goal is continuous visibility and accountability so teams optimize cost without killing experimentation.

What is a feature store and why is it critical for production ML systems?

A feature store is a shared data platform that computes, stores, and serves ML features consistently for both training and serving. It eliminates training-serving skew by ensuring the same transformation code runs in both contexts, and it reduces duplicated work by letting teams share and discover features across models.
