Feature stores — when you need one, when you don't
The online/offline skew problem in one sentence, then a runnable Feast-shaped feature pipeline. Plus the honest answer to 'do we need a feature store?' (usually: no — until you do).
What you'll learn
- The training/serving skew problem — why a SQL query that worked offline silently breaks online
- Offline store vs online store — the two halves and why both exist
- Defining a feature view with a Feast-shaped API
- `get_historical_features` (training) vs `get_online_features` (serving) — same definition, two reads
- When a feature store is the right answer, and when it's premature infrastructure
Before you start
A senior ML engineer once described the problem like this. The data scientist wrote a SQL query that joined three tables, computed a 30-day rolling average of user activity, and trained a churn model that hit 0.87 AUC. Six months later, the same model in production is getting 0.71 AUC, and nobody can figure out why.
It turns out the production feature service computes the same “30-day rolling average” using a slightly different time window (`now() - INTERVAL 30 DAY` instead of the training query’s `event_date BETWEEN snapshot_date - 30 AND snapshot_date`). The features look the same. They aren’t.
That’s training/serving skew. It’s the single most common failure mode in production ML, and it’s what feature stores were invented to fix.
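To make the mismatch concrete, here is a minimal sketch in plain Python. The session timestamps and both helper functions are hypothetical stand-ins for the two queries in the story, not code from any real system. Both functions are evaluated on the same day and still disagree, because one works on inclusive calendar dates and the other on an exact 30-day span back from the current time.

```python
from datetime import date, datetime, timedelta

# Hypothetical session log for one user.
SESSIONS = [
    datetime(2024, 1, 31, 9, 0),
    datetime(2024, 2, 10, 12, 0),
    datetime(2024, 2, 28, 18, 0),
]

def training_session_count(sessions, snapshot_date):
    """Mimics: event_date BETWEEN snapshot_date - 30 AND snapshot_date (dates, inclusive)."""
    start = snapshot_date - timedelta(days=30)
    return sum(start <= s.date() <= snapshot_date for s in sessions)

def serving_session_count(sessions, now):
    """Mimics: event_ts >= now() - INTERVAL 30 DAY (exact timestamps)."""
    cutoff = now - timedelta(days=30)
    return sum(cutoff <= s <= now for s in sessions)

train_value = training_session_count(SESSIONS, date(2024, 3, 1))
serve_value = serving_session_count(SESSIONS, datetime(2024, 3, 1, 15, 0))
print(train_value, serve_value)  # 3 2
```

The session on the morning of 31 January falls inside the date-based window but outside the timestamp-based one, so the “same” feature takes two different values for the same user on the same day.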
The mental model
A feature store has two halves, and both halves answer the same question — “what was feature X for entity Y at time T?” — using very different infrastructure.
| Half | What it stores | Read latency | Used for | Typical backing |
|---|---|---|---|---|
| Offline store | Historical feature values, point-in-time | Minutes | Training, backfills | Parquet on S3, BigQuery, Snowflake, Delta |
| Online store | Latest feature values per entity | < 10 ms | Real-time inference | Redis, DynamoDB, Bigtable, RocksDB |
The whole point: one feature definition, two reads. You define `user_thirty_day_activity` once. Training reads it from the offline store with full history. Serving reads it from the online store at inference time. The store guarantees they’re computed the same way.
Skew goes away because it can’t appear — by construction, both reads use the same upstream computation.
A feature view, Feast-shaped
Feast is the open-source default. The shape of its API has become the industry’s lingua franca even when teams use other stores.
# feature_repo/features.py: what your feature definitions actually look like
from datetime import timedelta

from feast import (
    Entity, FeatureView, Field, FileSource, ValueType, FeatureService,
)
from feast.types import Float32, Int64, String

# 1) Entities: the "thing" features are attached to.
user = Entity(name="user_id", value_type=ValueType.INT64, description="User primary key")

# 2) Source: where the offline data lives (here a Parquet file; in prod, BQ/Snowflake/S3).
user_activity_source = FileSource(
    path="s3://yourorg-features/user_activity.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_ts",
)

# 3) Feature view: the unit of definition. Names, types, freshness, TTL.
user_activity_fv = FeatureView(
    name="user_activity_features",
    entities=[user],
    ttl=timedelta(days=2),  # online value is stale after 2 days of no update
    schema=[
        Field(name="thirty_day_session_count", dtype=Int64),
        Field(name="seven_day_avg_session_minutes", dtype=Float32),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="lifetime_purchases", dtype=Int64),
    ],
    source=user_activity_source,
    online=True,
)

# 4) Feature service: what your model actually consumes (a "view of views")
churn_risk_v1 = FeatureService(
    name="churn_risk_v1",
    features=[user_activity_fv],
)
That’s the entire contract. The same `user_activity_fv` definition backs both training reads and online lookups.
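One field in that definition is easy to skim past: `ttl`. The sketch below illustrates the idea behind it under simplifying assumptions. The dictionary-backed store and the `read_with_ttl` helper are hypothetical and are not how Feast implements the check internally; they only show what “stale after 2 days of no update” means for a reader.

```python
from datetime import datetime, timedelta

TTL = timedelta(days=2)  # mirrors ttl=timedelta(days=2) in the feature view

# Hypothetical online store: user_id -> (feature value, time of last write)
online_store = {
    4242: (27, datetime(2024, 3, 1, 8, 0)),
    7: (3, datetime(2024, 2, 20, 8, 0)),
}

def read_with_ttl(store, user_id, now, ttl=TTL):
    """Return the stored value, or None when it is missing or older than ttl."""
    entry = store.get(user_id)
    if entry is None:
        return None
    value, written_at = entry
    if now - written_at > ttl:
        return None  # too old to trust: treat as missing
    return value

now = datetime(2024, 3, 2, 8, 0)
fresh = read_with_ttl(online_store, 4242, now)   # written 1 day ago
stale = read_with_ttl(online_store, 7, now)      # written 11 days ago
missing = read_with_ttl(online_store, 999, now)  # never written
print(fresh, stale, missing)  # 27 None None
```

Treating a stale value as missing is a deliberate choice: the model then sees an explicit gap it can handle, rather than an old number that looks current.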
Training — get_historical_features
For training, you have a list of (entity, label, timestamp) rows. You ask the store: “for each of these rows, give me the feature values as they were at that exact timestamp.”
This is point-in-time correctness — the killer feature. Without it, you accidentally use future information to predict the past, and your offline AUC ends up wildly higher than production performance.
# Training-time read: features as of each row's event_timestamp
from feast import FeatureStore
import pandas as pd

fs = FeatureStore(repo_path="feature_repo/")

# Your training spine: (user_id, label, when-the-prediction-was-made)
entity_df = pd.read_parquet("training_labels.parquet")
# columns: user_id, churned_within_30_days, event_timestamp

training_df = fs.get_historical_features(
    entity_df=entity_df,
    features=fs.get_feature_service("churn_risk_v1"),
).to_df()

# training_df now has the original columns + the feature columns,
# each feature value taken at-or-before its row's event_timestamp.
Two non-obvious wins:
- You can’t accidentally leak future state, because Feast enforces “no feature value with `event_timestamp` > the row’s timestamp.”
- You can re-train on any historical date by changing `entity_df`’s timestamps. Time-travel is built in.
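The point-in-time rule can be sketched in a few lines of plain Python. This is an illustration of the “latest value at or before the row’s timestamp” lookup, not Feast’s implementation; the `HISTORY` data and the `value_as_of` helper are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical offline history for one feature:
# user_id -> list of (event_timestamp, value), sorted by timestamp.
HISTORY = {
    4242: [
        (datetime(2024, 1, 1), 10),
        (datetime(2024, 2, 1), 18),
        (datetime(2024, 3, 1), 27),
    ],
}

def value_as_of(history, user_id, as_of):
    """Return the latest value whose timestamp is <= as_of, or None if there is none."""
    rows = history.get(user_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, as_of)  # number of rows at or before as_of
    return rows[idx - 1][1] if idx > 0 else None

# Training spine: (user_id, event_timestamp) pairs.
spine = [
    (4242, datetime(2023, 12, 15)),  # before any feature row exists
    (4242, datetime(2024, 2, 1)),    # exactly on a feature row
    (4242, datetime(2024, 2, 20)),   # between two feature rows
]
values = [value_as_of(HISTORY, uid, ts) for uid, ts in spine]
print(values)  # [None, 18, 18]
```

The March value (27) never appears, even though it is the most recent one in the table. A naive “join on user_id and take the latest row” would have attached it to all three training rows.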
Serving — get_online_features
For real-time inference, you have a single entity (a `user_id`) and need the latest values, fast.
# Serving-time read: latest values, low latency
# `fs` is the FeatureStore created in the training example;
# `model` is your trained classifier, loaded elsewhere.
features = fs.get_online_features(
    features=fs.get_feature_service("churn_risk_v1"),
    entity_rows=[{"user_id": 4242}],
).to_dict()

# features = {
#     "user_id": [4242],
#     "thirty_day_session_count": [27],
#     "seven_day_avg_session_minutes": [13.4],
#     "days_since_last_purchase": [9],
#     "lifetime_purchases": [4],
# }

model_input = [
    features["thirty_day_session_count"][0],
    features["seven_day_avg_session_minutes"][0],
    features["days_since_last_purchase"][0],
    features["lifetime_purchases"][0],
]
score = model.predict_proba([model_input])[0][1]
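One practical detail the snippet above skips is what to do when the online store has no row for an entity. The sketch below assumes the lookup then returns `None` entries in the same dict-of-lists layout shown in the comment. The `build_model_input` helper and the `DEFAULTS` values are hypothetical; whatever fallbacks you pick should match how missing values were handled at training time.

```python
# Feature order must match the order the model was trained on.
FEATURE_ORDER = [
    "thirty_day_session_count",
    "seven_day_avg_session_minutes",
    "days_since_last_purchase",
    "lifetime_purchases",
]

# Hypothetical fallback values, for illustration only.
DEFAULTS = {
    "thirty_day_session_count": 0,
    "seven_day_avg_session_minutes": 0.0,
    "days_since_last_purchase": 9999,
    "lifetime_purchases": 0,
}

def build_model_input(features, row=0):
    """Turn a dict of lists (one list per feature) into an ordered vector, filling gaps."""
    vector = []
    for name in FEATURE_ORDER:
        value = features.get(name, [None])[row]
        vector.append(DEFAULTS[name] if value is None else value)
    return vector

known_user = {
    "user_id": [4242],
    "thirty_day_session_count": [27],
    "seven_day_avg_session_minutes": [13.4],
    "days_since_last_purchase": [9],
    "lifetime_purchases": [4],
}
unknown_user = {"user_id": [999], **{name: [None] for name in FEATURE_ORDER}}

known_vector = build_model_input(known_user)
unknown_vector = build_model_input(unknown_user)
print(known_vector)    # [27, 13.4, 9, 4]
print(unknown_vector)  # [0, 0.0, 9999, 0]
```

Keeping the feature order in one named list, rather than repeating the keys inline, also removes one more place where training and serving can drift apart.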
A separate process, `feast materialize`, copies the latest feature values from the offline store into the online store (hence “materialize”: turning a logical definition into stored, ready-to-read values). It does not run on its own timer; you schedule it yourself with cron, Airflow, or similar. Besides the online store itself, that job is the main piece of infrastructure the feature store asks you to operate.
A runnable worked example
You can’t run the full Feast stack in the browser (it needs an online store and a materialize job), but you can run a faithful shape-of-it that demonstrates the duality. Same feature definition, two reads, both giving you the right answer.
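A minimal sketch of such a simulation in plain Python follows. It is a stand-in written for this text, not Feast itself: the event log, `materialize()`, and the two read functions are hypothetical, and the feature is computed on the fly instead of being stored.

```python
from datetime import datetime, timedelta

# Hypothetical raw event log: (user_id, session start time)
EVENTS = [
    (4242, datetime(2024, 1, 5)),
    (4242, datetime(2024, 1, 25)),
    (4242, datetime(2024, 2, 10)),
    (4242, datetime(2024, 2, 20)),
    (4242, datetime(2024, 2, 28)),
]

def materialize(user_id, as_of):
    """The single feature definition: sessions in the 30 days up to and including as_of."""
    start = as_of - timedelta(days=30)
    count = sum(1 for uid, ts in EVENTS if uid == user_id and start < ts <= as_of)
    return {"thirty_day_session_count": count}

def get_historical_features(entity_rows):
    """Training read: one feature row per (user_id, event_timestamp)."""
    return [
        {**row, **materialize(row["user_id"], row["event_timestamp"])}
        for row in entity_rows
    ]

def get_online_features(user_id, now):
    """Serving read: the same definition, evaluated at request time."""
    return {"user_id": user_id, **materialize(user_id, now)}

training_rows = get_historical_features([
    {"user_id": 4242, "event_timestamp": datetime(2024, 2, 1)},
    {"user_id": 4242, "event_timestamp": datetime(2024, 3, 1)},
])
online_row = get_online_features(4242, now=datetime(2024, 3, 1))

print([r["thirty_day_session_count"] for r in training_rows])  # [2, 3]
print(online_row["thirty_day_session_count"])                  # 3
```

The online value matches the training row with the same timestamp because there is only one place where the window is defined.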
The thing to internalise from that simulation: the same `materialize()` function powers both reads. Real Feast doesn’t compute on the fly; it materializes into the online store on a schedule. But the contract is identical.
When a feature store is the right answer
A feature store earns its weight when several of these are true:
- You have many models reusing the same features. The “definition once, reads everywhere” win compounds with reuse. With one model and four features, the wins are theoretical.
- You serve in real time (single-digit ms latency budgets). You need an online store.
- You’ve already shipped a model that drifted because training features and serving features diverged. You know the pain.
- Multiple teams contribute features and you need governance: ownership, lineage, freshness SLAs.
- You have point-in-time correctness needs — temporal features where using future data leaks.
When it’s premature infrastructure
Equally true:
- You have under five production models. The shared-definition payoff is small.
- You don’t serve in real time. If your inference is batch (nightly scoring), the offline store is your feature store — Parquet on S3 with snapshot dates does the job, and adding Feast on top is ceremony.
- Your features are simple aggregates from a single source. A view in Snowflake or BigQuery, materialized nightly, is cheaper to operate than a feature store.
- Nobody on the team will own the feature store. Like Kubeflow, Feast/Tecton needs an owner. An unmaintained feature store decays fast.
The managed vs. open-source landscape
| Option | What it is | Sweet spot |
|---|---|---|
| Feast | Open-source, BYO infra (Redis + S3 + your own materialize) | Teams with existing cloud infra; flexibility over hand-holding |
| Tecton | Managed feature platform with streaming, transformations, SLA monitoring | Real-time, multi-team, complex transformations |
| Databricks Feature Engineering | Feature store inside Databricks workspace | Already on Databricks; want one less integration |
| SageMaker Feature Store | AWS-native feature store with offline + online stores | Heavy SageMaker shops |
| Vertex AI Feature Store | GCP-native equivalent | Heavy Vertex shops |
| Just a warehouse view | Materialized SQL, no special framework | Batch scoring, fewer than 5 models, no real-time needs |
The pattern: the more models, teams, and real-time needs you have, the more justification for a managed (Tecton, Databricks, SageMaker) option. The more you’re a one-team shop with batch needs, the more “a table in Snowflake” is the right answer.
What never changes — the feature contract
Whatever you choose, the idea is the part you internalise:
- A feature is a named, typed, owned quantity attached to an entity at a timestamp.
- The definition of that feature is code — versioned, reviewed, tested.
- Training and serving read the same definition, never reimplement it.
You can do all three without Feast. You can fail at all three with Feast. The infrastructure is downstream of the discipline.
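As a rough illustration of doing all three without a feature store, the sketch below keeps a feature’s name, type, owner, entity, and computation in one reviewed piece of code that both paths import. The `FeatureDefinition` class, the `evaluate` helper, and the owner name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Sequence

@dataclass(frozen=True)
class FeatureDefinition:
    name: str      # named
    dtype: type    # typed
    owner: str     # owned
    entity: str    # attached to an entity
    compute: Callable[[Sequence[datetime], datetime], int]  # evaluated at a timestamp

def sessions_in_last_30_days(session_times, as_of):
    """Count sessions in the 30 days up to and including as_of."""
    start = as_of - timedelta(days=30)
    return sum(1 for ts in session_times if start < ts <= as_of)

THIRTY_DAY_SESSION_COUNT = FeatureDefinition(
    name="thirty_day_session_count",
    dtype=int,
    owner="growth-ml",  # hypothetical team name
    entity="user_id",
    compute=sessions_in_last_30_days,
)

def evaluate(feature, session_times, as_of):
    """Single entry point used by both the training job and the serving path."""
    value = feature.compute(session_times, as_of)
    if not isinstance(value, feature.dtype):
        raise TypeError(
            f"{feature.name}: expected {feature.dtype.__name__}, got {type(value).__name__}"
        )
    return value

sessions = [datetime(2024, 1, 5), datetime(2024, 2, 10), datetime(2024, 2, 20)]
as_of = datetime(2024, 3, 1)
training_value = evaluate(THIRTY_DAY_SESSION_COUNT, sessions, as_of)  # called from the training job
serving_value = evaluate(THIRTY_DAY_SESSION_COUNT, sessions, as_of)   # called from the API handler
print(training_value, serving_value)  # 2 2
```

Nothing here is infrastructure. The discipline is that neither the training job nor the API handler is allowed to write its own version of the window.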
Next
You’ve now got the data side (features) and the serving side (FastAPI, K8s) wired up. The next thing that bites a real production system is drift — when the world quietly changes underneath your trained model.
Practice this in an interview
A feature store is a shared data platform that computes, stores, and serves ML features consistently for both training and serving. It eliminates training-serving skew by ensuring the same transformation code runs in both contexts, and it reduces duplicated work by letting teams share and discover features across models.
The most common cause is training-serving skew: the distribution of features at serving time differs from the training data. The fix requires instrumenting the pipeline to log serving inputs, compare their distribution to training data, and identify whether the gap is due to data drift, feature engineering bugs, label leakage, or infrastructure inconsistencies.
Train/serve skew occurs when the feature values a model sees at training time differ from those it sees at inference time, even for the same raw input — caused by divergent preprocessing code paths, different data sources, or temporal leakage. It silently degrades performance without raising obvious errors.
Feature leakage occurs when information from the test set or from the future leaks into training features, making a model appear more accurate than it will be in production. It arises from fitting preprocessing steps on the full dataset, using post-event information as a predictor, or computing aggregates across train-test boundaries. Prevention requires strict pipeline discipline: all stateful transformations must be fit only on training data.