Why XGBoost is still winning in 2026

In May 2026, somewhere at a Fortune 500 in the middle of the United States, a data scientist with seven years of experience is closing the third browser tab of a TabPFN benchmark and writing import xgboost as xgb in a new Jupyter cell. Their reasons are not theoretical. The pipeline they’re building has to fit in a four-hour CI run, train on a dataset their cloud budget cares about, hand off feature importances that a compliance officer can read, and be debuggable by the colleague who will inherit it next quarter. XGBoost does all of that. The transformer-based contender they just benchmarked is 0.4 AUC points better and 8x slower.

The contender loses. Again.

This story has played out at roughly every enterprise data team I’ve talked to in the last eighteen months. The promise that “transformers will eat tabular ML the way they ate vision and language” has been renewed yearly since 2020. The receipts, in mid-2026, say something subtler: the transformer-tabular models are good, in some niches they are better, and they have not displaced gradient boosting because the cost of switching exceeds the marginal gain for the median project. This post is about why.

What the Kaggle and benchmark data actually say

The cleanest external benchmark for “what wins on tabular data” is still Kaggle. The blunt summary in 2025: gradient boosting algorithms — XGBoost, LightGBM, CatBoost — appear in over 80% of winning solutions for structured-data competitions. The April 2025 Playground winner stacked three levels of models, and the final layer was still a gradient-boosted tree.

Two more recent benchmarks add nuance.

TabArena, which controls hardware and hyperparameter budgets, found CatBoost was the top GBDT under conventional tuning, particularly on high-cardinality categoricals and mixed-type features. Across 176 datasets in the underlying NeurIPS 2023 study, the mean ranks landed at CatBoost 5.06, XGBoost 6.38, LightGBM 7.80 — close enough that the choice usually comes down to dataset shape and team familiarity, not raw accuracy.
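For readers who have not worked with rank-based benchmark summaries, here is a minimal sketch of how a mean rank is computed: rank the models on each dataset, then average each model's rank across datasets. The function name `mean_ranks` and the toy scores are made up for illustration; they are not taken from either benchmark.

```python
from statistics import mean


def mean_ranks(scores):
    """scores: {dataset: {model: metric}}, where a higher metric is better.

    Returns {model: mean rank across datasets}. Rank 1 is best, and tied
    models share the average of the ranks they span.
    """
    per_model = {}
    for results in scores.values():
        ordered = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j over the run of models tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared_rank = ((i + 1) + (j + 1)) / 2
            for model, _ in ordered[i:j + 1]:
                per_model.setdefault(model, []).append(shared_rank)
            i = j + 1
    return {model: mean(ranks) for model, ranks in per_model.items()}


# Hypothetical AUC values, purely illustrative.
toy_scores = {
    "dataset_a": {"catboost": 0.91, "xgboost": 0.90, "lightgbm": 0.89},
    "dataset_b": {"catboost": 0.80, "xgboost": 0.82, "lightgbm": 0.81},
    "dataset_c": {"catboost": 0.75, "xgboost": 0.75, "lightgbm": 0.70},
}
ranks = mean_ranks(toy_scores)
```

A mean rank compresses a lot: a model can lead the table without winning most individual datasets, which is one reason close mean ranks rarely settle the library choice.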

TabPFN-2.5, released November 2025, made bigger claims. The paper reports a 100% win rate against default XGBoost on classification datasets up to 10,000 rows and 500 features, and an 87% win rate up to roughly 100K rows and 2K features, with no training or hyperparameter tuning. That’s a remarkable result. It is also why TabPFN deserves its own section later. But the right reading of the result is narrow: a transformer foundation model can beat default XGBoost on small-to-medium datasets when XGBoost is left at defaults. The moment you tune XGBoost, as any production team does, the margin compresses, and for datasets above the size threshold it inverts.

Why XGBoost won in the first place, and still does

XGBoost was released in 2014. The 2016 Chen and Guestrin paper contributed three ideas that the field hadn’t combined before: a weighted quantile sketch for fast approximate split-finding, sparsity-aware learning that routes missing values down a learned default branch instead of requiring imputation, and an out-of-core block structure that made it trainable on data larger than RAM. The combination was a 10x speedup over the gradient-boosting libraries that existed at the time, with state-of-the-art accuracy as a bonus.

Twelve years later, those engineering choices have aged better than the academic gains. The reasons a production team picks XGBoost in 2026 are mostly the same reasons they picked it in 2016:

There are six structural advantages, each individually mild, that compose into “the default wins again.” Any challenger needs to win all six to displace XGBoost, and none has.

Training speed. XGBoost trains a competitive tabular model on a few hundred thousand rows in minutes on a single CPU. Transformer-based contenders need GPU time. For a data team that retrains weekly, the difference between “5 minutes on the analyst’s laptop” and “an hour on a paid GPU instance” is real-world friction.

Feature importance is debuggable. SHAP values, gain importance, split importance — every senior data scientist can read a tree-based importance plot, and so can the compliance officer who’s reviewing the model. Tabular neural-network methods provide importance signals, but they’re harder to explain to a regulator.

Hyperparameter resilience. The 2017-era folk wisdom about XGBoost — “set max_depth to 6, learning_rate to 0.05, n_estimators until early stopping kicks in” — still produces near-optimal results on 80% of tabular problems. TabPFN doesn’t need tuning at all, which is genuinely impressive, but XGBoost’s tuned performance is the bar TabPFN is measured against in production, not its default.

No GPU required. XGBoost runs on the laptop, the on-prem cluster, the Spark executors, the regulated bank’s air-gapped environment. The transformer-based methods need GPU time, and “we don’t have GPU capacity for tabular ML” is a sentence that gets said in real procurement meetings more than the contender vendors would like to admit.

Native handling of missing values. The sparsity-aware split in XGBoost means you don’t have to impute. Production tabular data has missing values in nearly every column, and methods that require imputation add a preprocessing layer that’s another point of failure.
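To make the sparsity-aware idea concrete, here is a simplified pure-Python sketch of split finding with a learned default direction: for each candidate threshold, the rows with a missing value are tried on the left and on the right, and the better option is kept. The function name and the toy data are hypothetical, and the real implementation is considerably more optimised; this only shows the principle.

```python
def best_split_with_default(values, grads, hess, reg_lambda=1.0):
    """Return (gain, threshold, default_direction) for one feature.

    values may contain None for missing entries. grads and hess are the
    per-row first and second derivatives of the loss.
    """
    present = sorted(
        (v, g, h) for v, g, h in zip(values, grads, hess) if v is not None
    )
    g_miss = sum(g for v, g in zip(values, grads) if v is None)
    h_miss = sum(h for v, h in zip(values, hess) if v is None)
    g_total, h_total = sum(grads), sum(hess)

    def score(g, h):
        return g * g / (h + reg_lambda)

    best = (0.0, None, None)
    g_left = h_left = 0.0
    for i in range(len(present) - 1):
        v, g, h = present[i]
        g_left += g
        h_left += h
        next_v = present[i + 1][0]
        if next_v == v:
            continue  # no threshold fits between two equal values
        threshold = (v + next_v) / 2
        # Try the missing rows on each side and keep whichever gains more.
        for direction, gl, hl in (
            ("left", g_left + g_miss, h_left + h_miss),
            ("right", g_left, h_left),
        ):
            gain = 0.5 * (
                score(gl, hl)
                + score(g_total - gl, h_total - hl)
                - score(g_total, h_total)
            )
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best


# Toy data: squared-error gradients (prediction 0.5 minus label), hessian 1.
values = [1.0, 2.0, None, 8.0, 9.0, None]
labels = [0, 0, 1, 1, 1, 1]
grads = [0.5 - y for y in labels]
hess = [1.0] * len(labels)
gain, threshold, direction = best_split_with_default(values, grads, hess)
```

In the toy data the rows with a missing value have the same label as the high-valued rows, so the best split sends missing values right. No imputation step was needed to learn that.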

Twelve years of tooling. XGBoost has first-class Spark integration, a Dask integration, ONNX export, integrations into every major MLOps platform, explainability tooling that’s been battle-tested, and a community of millions of practitioners. Replacing it means replacing not just the model but a ring of infrastructure around it. The cost of that swap, multiplied by the number of pipelines, is what kills challengers.

The challengers that actually matter — LightGBM, CatBoost, TabPFN

The honest version of the gradient-boosting landscape is that XGBoost is the default but not the best on every shape of dataset. Two of the challengers are gradient-boosting libraries themselves; one is a transformer.

LightGBM, from Microsoft, is faster on large datasets thanks to histogram-based split finding and leaf-wise growth. For training jobs in the millions-of-rows-and-up regime, LightGBM is typically 2-5x faster than XGBoost at comparable accuracy. Many teams in 2026 default to LightGBM for training and convert to a unified inference layer.
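A minimal sketch of histogram-based split finding, assuming equal-width bins for simplicity (production libraries choose bin boundaries more carefully). Gradient statistics are accumulated per bin in a single pass, and only the bin boundaries are scanned as candidate thresholds. The function name and toy data are made up for illustration.

```python
def histogram_split(values, grads, hess, n_bins=4, reg_lambda=1.0):
    """Return (gain, threshold) for the best split on bin boundaries."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return 0.0, None  # constant feature: nothing to split on

    # One pass over the rows: accumulate gradient statistics per bin.
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h

    def score(g, h):
        return g * g / (h + reg_lambda)

    g_total, h_total = sum(g_hist), sum(h_hist)
    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    # Scan n_bins - 1 boundaries instead of every distinct feature value.
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        gain = 0.5 * (
            score(g_left, h_left)
            + score(g_total - g_left, h_total - h_left)
            - score(g_total, h_total)
        )
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_gain, best_threshold


# Toy data: squared-error gradients (prediction 0.5 minus label), hessian 1.
values = [0.1, 0.4, 0.9, 1.2, 3.1, 3.5, 3.8, 4.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
grads = [0.5 - y for y in labels]
hess = [1.0] * len(labels)
gain, threshold = histogram_split(values, grads, hess)
```

The speedup comes from the scan: after the one-pass histogram build, the number of candidate splits depends on the bin count rather than on the number of rows.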

CatBoost, from Yandex, dominates the high-cardinality-categorical benchmarks. Its ordered boosting and native categorical encoding mean it often outperforms XGBoost on datasets with many string features (user IDs, product IDs, geo strings) without requiring one-hot or target encoding preprocessing. The 2024 banking churn study across multiple splits gave XGBoost the highest accuracy in some splits and CatBoost the win in others — the lesson being that the right answer is “try both,” not “switch to CatBoost.”
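The categorical-encoding half of that claim rests on ordered target statistics. In this simplified sketch, each row's category is encoded using only the targets of rows that came before it, smoothed toward a prior, so a row never sees its own label. The function name and toy data are hypothetical, and the given row order stands in for the random permutations the real library uses.

```python
def ordered_target_stats(categories, targets, prior, prior_weight=1.0):
    """Encode each category from the targets of earlier rows only."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded.append((s + prior_weight * prior) / (n + prior_weight))
        # Only now does the current row's label enter the running totals.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


# Toy data: two user IDs and a binary target.
categories = ["u1", "u2", "u1", "u1", "u2"]
targets = [1, 0, 1, 0, 1]
prior = sum(targets) / len(targets)  # global target mean, 0.6
encoded = ordered_target_stats(categories, targets, prior)
```

Plain target encoding would leak each row's label into its own feature; the ordering is what avoids that, and it is why no separate one-hot or target-encoding step is needed.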

TabPFN is the more interesting case because it’s a foundation model. TabPFN-2.5 (November 2025) handles datasets up to 50,000 rows and 2,000 features with no training step — you feed the labeled training data as context, then the unlabeled rows as queries, and the model predicts. The zero-training story is genuinely a paradigm shift for one specific use case: small datasets where you want a strong baseline in seconds without hyperparameter tuning.

The places TabPFN is winning real production work in 2026:

Prototype and rapid baseline. Data scientists use TabPFN to get a strong baseline number in seconds, before deciding whether to spend a week tuning XGBoost.
Small-data scenarios. Clinical trials, A/B tests with limited conversions, niche enterprise data products where the dataset is fundamentally small.
AutoML pipelines. TabPFN is increasingly the default model inside AutoML platforms that need a strong out-of-the-box result without a human in the tuning loop.

Where TabPFN is not winning: anything above the dataset-size ceiling (TabPFN-2.5 is built for up to 50,000 rows, with benchmark results reported up to roughly 100K), anything where inference latency matters (TabPFN’s inference is a forward pass through a transformer, not a tree walk), and anything where GPU availability is constrained.

What Uber, Airbnb, and the fintech world actually run

The production stories are unsexy and consistent.

Uber’s Michelangelo platform spent its first phase (2016-2019) running tree-based models — XGBoost specifically — at the centre of every tabular use case: ETA prediction, demand forecasting, pricing, fraud, ad targeting. By 2025, Michelangelo had expanded to include LLMs and embeddings, but the tabular backbone is still gradient-boosted trees. Uber published an engineering post specifically about productionising distributed XGBoost — Spark-based training on hundreds of millions of rows — and the resulting models power roughly a million predictions per second.

Airbnb’s pricing and ranking models are gradient boosting under the hood. There are public-facing references to XGBoost ranking systems at Airbnb and a long tail of academic and practitioner work using XGBoost for Airbnb price modelling. The pattern is the same: the headline ML system at the front of the homepage is a deep-learned ranker, but the tabular models behind pricing, fraud, and trust-and-safety are gradient-boosted.

Fintech fraud detection across the industry — Stripe, Klarna, the major card networks, the regional banks — runs gradient-boosted trees as the core model. The combination of class imbalance, mixed numerical and categorical features, the need for fast retraining, and the regulator’s appetite for explainable models is exactly the shape gradient boosting solves. Some fintechs have layered embedding-based representation learning on top of the gradient boosting (as features), but the final classifier is still a tree ensemble.

The pattern is consistent: in enterprise tabular ML, gradient boosting is the model. Other techniques get layered into the feature engineering, or into special-case re-rankers, or into AutoML baselines. The production prediction, the one that decides whether the credit card transaction clears, is a tree walk.

The honest critique — where XGBoost loses

Being fair: XGBoost is not optimal everywhere.

It loses on truly massive datasets (billions of rows) where LightGBM’s training speed and Spark integration eclipse XGBoost’s.

It loses on datasets with thousands of high-cardinality categoricals without good encoding, where CatBoost’s ordered boosting is meaningfully better.

It loses on truly small datasets (under 1,000 rows) where TabPFN’s foundation-model prior beats anything trained from scratch.

It loses on time-series with rich structure where N-BEATS, Temporal Fusion Transformer, and the newer foundation models for time series genuinely win.

It loses on tabular data that is fundamentally text-like or image-like, which is to say, data that isn’t really tabular. If your “tabular” data is a wide table of token frequencies or pixel intensities, you should be using a language model or a vision model.

Those edge cases account for, optimistically, 20% of enterprise tabular work. The other 80% is the median problem — a few hundred thousand to a few million rows, a mix of numeric and low-cardinality categorical features, a deadline, a CPU budget, a compliance reviewer — and on that problem, XGBoost is still the answer.

What changes by 2028?

The honest forecast: probably less than the contender camps would like. The trajectories most likely to actually move the centre of gravity:

Foundation models for tabular data continue to improve, but their dataset-size ceiling is the constraint to watch. If TabPFN-3 (or a successor) handles a million rows in seconds on a single GPU, the default starts to wobble. Until then, it doesn’t.
LightGBM quietly eats more of the “big data” XGBoost workloads. The shift here is gradual and culturally invisible — teams retrain into LightGBM during platform migrations, and nobody writes a blog post.
CatBoost keeps growing in domains where high-cardinality categoricals dominate — recommendations, fraud, programmatic ad bidding.

But the dominant story in 2028 is probably the same as the dominant story in 2026: gradient boosting is the tabular default. The specific library changes; the paradigm doesn’t.

What to take away

Gradient boosting is the answer to ~80% of enterprise tabular ML problems. XGBoost specifically is the default within that 80%.
The challengers are real, but specialised. CatBoost for high-cardinality categoricals. LightGBM for very large datasets. TabPFN for small datasets and rapid prototyping. Use them where they win; don’t replace the default.
The reason XGBoost endures is not any single technical edge; it’s the switching cost. Every advantage individually is modest. The composite of “fast, debuggable, tuning-resilient, no GPU, missing-value tolerant, mature tooling” is the moat that the contenders haven’t crossed.
If a transformer-tabular paper announces the death of XGBoost again next year, read the dataset-size column in the benchmark table. The answer is usually there.

Twelve years on, the most accurate framing for tabular ML is what one Kaggle grandmaster told me last year: “I try XGBoost first, then I try CatBoost or LightGBM if I have a reason. I try TabPFN if the dataset is small or I’m in a hurry. Then I usually go back to XGBoost.” That’s the 2026 state of the art. It’s not glamorous. It works.

Further reading: NVIDIA’s Kaggle Grandmasters Playbook is the best practitioner write-up of how the top of the field actually works. The TabPFN-2.5 paper is worth reading carefully — the benchmark tables tell a more nuanced story than the headline. Uber’s productionising distributed XGBoost post is the cleanest open description of what tabular ML looks like at real scale.