How does DVC differ from a feature store, and when would you reach for each?
DVC (and lakeFS) version raw datasets and model artifacts as immutable snapshots tied to Git commits, giving reproducibility and rollback. A feature store manages computed features for training and serving, its main job being to keep offline and online feature definitions in sync to prevent training-serving skew. They are complementary: DVC answers what data made this model, while a feature store answers how do I serve the same features consistently.
How to think about it
The short answer
They solve different problems. DVC/lakeFS version datasets and artifacts — immutable snapshots tied to a Git commit for reproducibility and rollback. A feature store manages computed features and its headline job is keeping training (offline) and serving (online) feature logic identical to prevent training-serving skew.
Why the distinction matters
Data versioning is about lineage and reproducibility: “which exact bytes produced this model, and can I roll back?” A feature store is about consistency and reuse at serving time: the same user_avg_purchase_30d must be computed the same way in your batch training job and your low-latency online endpoint. As a Databricks overview notes, feature stores also add discovery, documentation, and governance so features get reused instead of rebuilt.
Concrete example
- You retrain a churn model and accuracy drops. DVC lets you diff the dataset snapshot against last week’s to find a corrupted partition, and roll back.
- Your model scores 0.92 offline but 0.84 in production. That smells like feature store territory: the online service computed a rolling average over a different window than the training pipeline did.
When to reach for each
- Reach for DVC/lakeFS whenever you need reproducible experiments, dataset rollback, or audit trails — every serious project should have this.
- Reach for a feature store (Feast, Tecton, Databricks FS) when you have real-time inference and shared features across teams. For a pure batch project with one team, it’s often overkill.
Common follow-up / trap
The trap is claiming a feature store “fixes” reproducibility or replaces data versioning — it doesn’t. A feature store still needs point-in-time-correct snapshots to avoid label leakage, and you still need artifact versioning for the model itself. The mature answer is “both, layered”: version raw data with DVC, serve features through a store, and stamp the model with both versions.