The agent observability stack: LangSmith, Langfuse, Helicone, Arize

There is a moment, somewhere between the first demo and the first production incident, where every agent team has the same realisation: we have no idea what this thing is actually doing. The trace from the last user complaint is in CloudWatch, the prompt template is in someone’s branch, the eval set is in a Jupyter notebook on a laptop, and the cost dashboard is a Stripe invoice that arrived two weeks late.

That moment is when teams discover agent observability. And in 2026, unlike in 2024, there’s a real stack to discover. The market has consolidated into five names you keep hearing — LangSmith, Langfuse, Helicone, Arize (with its Phoenix open-source layer), and Braintrust — plus the major APM vendors (Datadog, New Relic, Honeycomb) who have all shipped LLM observability features. They are not all the same product. Pretending they are is the single most expensive procurement mistake an agent team makes.

This post is about how the stack actually splits, what each layer is for, and who wins which layer.

The three layers of agent observability

A useful frame, because vendors all claim to do “everything”: agent observability is three distinct workloads stacked on top of each other.

Traces are debugging. Evals are CI. Drift is production monitoring. Three different jobs, three different consumers.

The teams that get this right approach the stack one layer at a time. The teams that get it wrong sign a year-long enterprise contract with whichever vendor’s demo was loudest at NeurIPS and then realise six months later that they bought a trace viewer when what they needed was an eval harness.

Layer 1 — Traces, where the market actually competes

The trace is the atomic unit. One user request becomes a tree of spans: the orchestrator LLM call, each tool invocation, the retrieval lookups, the follow-up LLM calls, the final completion. For every span you want at minimum the model, the prompt, the completion, the token counts, the latency, the cost, and the parent span.

This used to be a bespoke per-vendor schema. The good news of 2026 is that OpenTelemetry’s GenAI semantic conventions, shepherded since April 2024 by the OTel GenAI SIG, have become the lingua franca. The conventions standardise the attribute names — gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens — so a trace emitted from one SDK lands cleanly in any backend that speaks OTel. Most of these are still in “experimental” status as of March 2026, but every major vendor on this list now ingests them.

That standardisation is the single biggest reason the trace layer is now boring in a good way. You instrument once and you can swap vendors without rewriting the agent.

Five players matter here, and they win for different reasons:

LangSmith is the deepest LangChain/LangGraph integration on the market. If your agent is written in LangChain, LangGraph, or even just uses the LangChain expression language for tool definitions, LangSmith captures everything — prompt templates, chain inputs, retrieval scores — without extra instrumentation. The traces are the polished part of the product; the trace viewer alone is worth the price of admission for LangChain shops.

Langfuse is what you reach for when LangSmith’s hosted-only model clashes with your security review. Open source, MIT-licensed, self-hostable on Postgres + ClickHouse, with full feature parity in the OSS build. Langfuse has aggressively adopted OpenTelemetry, so any framework that emits OTel- shaped spans (and most do now) shows up correctly. Most teams that picked Langfuse picked it because their compliance team would not approve a third- party SaaS for raw prompt and completion data.

Helicone wins on install simplicity. It is a proxy: you point your OpenAI/Anthropic SDK at oai.helicone.ai, add an API key header, and you have traces, costs, and caching working in under ten minutes. The trade-off is that the proxy model has shallower per-framework integration — you see the LLM calls, not the orchestrator structure around them. For teams who are mostly worried about cost visibility and don’t need to inspect tool-call arguments, this is exactly the right shape.

Arize Phoenix brings ML-observability heritage. Arize was an ML-monitoring company before the LLM wave, and Phoenix (their open-source project) is a serious trace + eval workbench that runs locally as a notebook companion or as a deployment. The integrations skew toward LlamaIndex and the OpenAI Agents SDK. For teams who came from traditional ML and want their LLM traces to live in the same mental model as their feature-drift dashboards, Phoenix is the obvious pick.

Braintrust is the late entrant that grew up around evals (more on that in the next section) and added tracing later. The tracing is fine. The reason to be on Braintrust is rarely the trace viewer.

Layer 2 — Evals, where the market actually differentiates

If traces are a debugging artifact, evals are a CI artifact. The eval layer answers a different question: when I change the system prompt, the model, the tool definitions, or the retrieval setup, did the agent get better or worse on the things I care about?

The shape of an eval system is roughly the same across vendors: a curated dataset of inputs with expected behaviour, a scorer (code metric, LLM-as- judge, or human), and a regression dashboard that compares candidate runs to a baseline. The differentiation is in how seriously the vendor takes the scorer layer — particularly the LLM-as-judge.

A crude but useful heatmap of where each vendor concentrates investment. Circle area is rough strength on the layer.

Braintrust grew up around this layer. Its primitive is the eval suite — a versioned dataset, a set of scorers, and a leaderboard of experiments. The trace view is built on top of the eval view, not the other way around. Teams that want eval-first workflows (think: a senior engineer who runs twenty prompt experiments before lunch and needs to diff them) gravitate here. Braintrust has also published one of the more thoughtful guides to LLM observability that treats traces as derivative of evals, not the inverse.

LangSmith’s eval product has caught up significantly since 2024. Datasets, LLM-as-judge templates, regression dashboards — all there, all polished, all slightly easier to use if you’re in the LangChain ecosystem.

Langfuse ships a credible eval layer with the bonus that the datasets and scoring code are in the open-source repo. For teams that want their evals to be inspectable and reproducible by a regulator, this matters.

Arize Phoenix is the one with serious ML-evaluation heritage. The eval primitives — particularly around hallucination detection, retrieval quality, and groundedness — are deeper than the LangChain-native tools because Arize’s team built them in the ML-observability era when “is this model output drifting from training distribution?” was already a well-understood question.

Helicone has evals, but they’re not the centre of the product. Cost guardrails and prompt experimentation are.

Layer 3 — Drift, where ML heritage matters

Drift is the layer most teams underinvest in. The question it answers: your agent passed evals on Monday. Is it still passing the equivalent of those evals against the real distribution of user requests on Friday?

This is a different question than traces (which show one run) or evals (which show one dataset). Drift requires sampling live traffic, scoring it inline, and comparing the score distribution against a baseline. The infrastructure looks more like classical ML monitoring than like prompt-engineering tooling.

Arize wins this layer outright, because they built the underlying infrastructure for vector-distribution drift, embedding drift, and feature drift before LLMs were a product category. Their commercial product (Arize AX) and the Phoenix open-source layer share that lineage. Datadog has built a credible version on top of their existing APM, and for shops that already pay Datadog, that integration is hard to argue with.

Langfuse and LangSmith have drift features, but they’re newer and shallower. For a team that genuinely needs to detect a slow regression in production quality two weeks before it hits a customer complaint, those features today are not yet where the dedicated ML-monitoring tools are.

Picking your vendor — the actual decision tree

The vendor pages all read the same. The real questions are:

Is your stack LangChain/LangGraph? Default to LangSmith. The depth of integration is unrivalled. The only reason to pick something else is the next question.
Do you need to self-host raw prompts and completions for compliance? Langfuse. Phoenix is the also-ran. LangSmith and Braintrust have enterprise self-host tiers but they cost noticeably more.
Is “cost visibility in an afternoon” the actual goal? Helicone. The proxy is genuinely a ten-minute install. You can layer something else on top later.
Do you have a regulated ML practice and care about drift more than about trace polish? Arize, and specifically the AX commercial tier. Phoenix for the prototype, AX when production scale arrives.
Are you eval-first and the team is going to write a hundred evals a week? Braintrust. The dataset-and-experiment primitive is the cleanest on the market.
Already paying Datadog or New Relic? Use their LLM observability features for layer 1, and revisit layers 2 and 3 as separate purchases. The trace-only need can be served by your existing APM contract.

The mistake almost everyone makes: shopping for a single vendor when the three layers don’t need to live in one place. Plenty of teams run Helicone or Datadog for cost-and-trace, Braintrust or LangSmith for evals, and Arize for production drift. OpenTelemetry’s GenAI conventions make the seams livable. The teams who insist on one-vendor-for-everything end up either with a great trace viewer and a mediocre eval layer (LangSmith if you don’t live in LangChain) or a great eval suite and a thin drift story (Braintrust).

What about the major APM vendors

Datadog, New Relic, and Honeycomb have all shipped LLM observability features. They are real products, particularly for trace ingestion at scale, and they integrate neatly with the rest of your APM. The honest assessment in 2026: they win for layer 1 in shops where the APM contract already exists, and they struggle on layer 2 because eval workflows are a different shape from infrastructure monitoring. They are a perfectly defensible choice for the trace layer; pair them with a dedicated eval tool.

Anti-patterns worth naming

A handful of things I’ve watched teams do, repeatedly, that I’d skip:

Buying the enterprise tier of one vendor for everything. See above. The market is genuinely specialised; one-vendor-for-everything almost always means giving up depth on two of three layers.
Picking based on “free tier is generous.” The vendors have all converged on similar free tiers. Pick on fit; the free tier matters for three months, the fit matters for three years.
Treating prompts as logs. Prompts and completions can contain PII, customer-confidential text, and sometimes secrets pasted by users. Whichever vendor you pick, set retention deliberately and turn on field-level redaction. The same vendor will happily store your data forever if you don’t tell it not to.
Not instrumenting tool calls. Plenty of teams instrument the LLM calls and skip the tool calls because the LLM calls are where the cost is. The tool calls are where the bugs are. Instrument both, always.

What to take away

Three layers, not one product. Traces, evals, and drift are three distinct workloads. Vendors who claim to do all three usually have one strong layer and two thin ones — recognise which one is theirs.
OpenTelemetry has made the trace layer portable. Pick a vendor for fit, not for fear of lock-in. The GenAI semantic conventions mean you can swap.
Evals are where the real differentiation is. This is the layer where the eval-first vendors (Braintrust) and the ML-heritage vendors (Arize) pull ahead of the trace-first vendors. If your bottleneck is “do I know whether this PR helps or hurts,” buy for evals first.
Drift is the next layer most teams will discover the hard way. Have a plan before you need one. Production behaviour will drift; the question is whether you’ll see it before your users do.

Two years from now, agent observability will look as uncontroversial as APM looks today — a checkbox on every infrastructure list, a line item on every team’s budget. We’re not there yet. In 2026, picking the right layer-and-vendor combination is still a competitive advantage. Treat it like one.

Further reading: OpenTelemetry’s GenAI observability post is the cleanest write-up of the semantic conventions. Hamel Husain’s evals essay remains the best practitioner intro to the layer-2 workflow. Langfuse maintains an honest comparison page of its own product against Arize and Phoenix; the comparison is more candid than most vendor pages.