Operator, Claude Computer Use, and Project Mariner: the browser agent shootout

In the twelve months between October 2024 and October 2025, three browser agents shipped from the three frontier labs. Anthropic went first with Claude Computer Use. OpenAI followed in January 2025 with Operator, later absorbed into ChatGPT agent. Google’s Project Mariner had shipped a research preview in December 2024 and became GA through Gemini in 2025.

Each one took a different bet on the fundamental abstraction: what does the model actually see when it looks at a webpage? Pixels? The DOM? Accessibility nodes? The answer to that question turns out to determine almost everything else — latency, cost, accuracy, deploy story, and which business problems the agent can plausibly solve.

This is the receipts post. A year and a half in, here is which approach won where, the numbers behind the marketing, and why “browser agent” is still not a synonym for “trusted assistant.”

The three abstractions

Three labs, three abstractions. Operator puts the model in the same role as a human user. Computer Use exposes the OS as a tool. Mariner lives inside Chrome and inherits your session.

The differences sound small in slides and are large in practice.

Operator (and its successor ChatGPT agent) runs in a remote virtual machine OpenAI hosts. The agent observes a screenshot every step, decides on an action like “click at (812, 343),” and OpenAI’s infrastructure executes it. The model is a vision-tuned GPT variant trained with reinforcement learning on UI tasks — the Computer-Using Agent (CUA) paper describes the reward shaping. The defining property is that the agent sees what a human sees and does what a human does.

Claude Computer Use is structurally the same — screenshot in, coordinates out — but framed differently in the SDK. Claude calls computer, bash, and text_editor as ordinary tools. There’s no hosted VM by default; you bring your own Linux box (Anthropic publishes a Docker image). The model is not a separate “agent model” — it is ordinary Sonnet or Opus with computer-use tools wired in. This means Computer Use ships every time Sonnet ships, and Sonnet has improved faster than any equivalent specialist agent model.
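The shared "screenshot in, coordinates out" loop can be sketched in a few lines. This is a minimal illustration, not either lab's implementation: `FakeDesktop`, `scripted_policy`, and `run_agent` are hypothetical names, and the policy here is a hard-coded script standing in for a model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str                  # "click", "type", or "done"
    x: Optional[int] = None    # pixel coordinates for clicks
    y: Optional[int] = None
    text: Optional[str] = None


Screenshot = bytes
# Hypothetical policy signature: (goal, latest screenshot, history) -> next action.
Policy = Callable[[str, Screenshot, List[Action]], Action]


class FakeDesktop:
    """Stand-in environment that records actions instead of executing them."""

    def __init__(self) -> None:
        self.executed: List[Action] = []

    def screenshot(self) -> Screenshot:
        # A real environment would return PNG bytes from a VM or container.
        return f"frame-{len(self.executed)}".encode()

    def execute(self, action: Action) -> None:
        self.executed.append(action)


def run_agent(goal: str, env: FakeDesktop, policy: Policy, max_steps: int = 30) -> List[Action]:
    """Observe, decide, act, and repeat until the policy says it is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        shot = env.screenshot()
        action = policy(goal, shot, history)
        history.append(action)
        if action.kind == "done":
            break
        env.execute(action)
    return history


def scripted_policy(goal: str, shot: Screenshot, history: List[Action]) -> Action:
    # Placeholder for the model: replays a fixed script one step at a time.
    script = [Action("click", x=812, y=343), Action("type", text="Tokyo"), Action("done")]
    return script[min(len(history), len(script) - 1)]


env = FakeDesktop()
trace = run_agent("find a flight", env, scripted_policy)
print([a.kind for a in trace])  # ['click', 'type', 'done']
```

The only thing that changes between a hosted VM and a bring-your-own Linux box is what sits behind `screenshot()` and `execute()`; the loop itself is the same.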

Project Mariner lives inside Chrome as a browser extension. It does not need screenshots; it reads your tab’s DOM and accessibility tree directly. It emits structured “click this element with this selector” actions that Chrome executes. Crucially, it shares your authenticated session — your Gmail, your Amazon, your bank — because it is your browser.
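To make the contrast with pixel coordinates concrete, here is a rough sketch of the structural approach: flatten a tree of nodes into a short indexed list the model can read as text, then turn the model's choice back into a selector-based action. The node format, `flatten`, `to_prompt`, and `click_action` are assumptions made for illustration; they are not Mariner's actual data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox"}


@dataclass
class Element:
    role: str
    name: str
    selector: str


def flatten(node: Dict, out: Optional[List[Element]] = None) -> List[Element]:
    """Depth-first walk that keeps only the elements an agent can act on."""
    if out is None:
        out = []
    if node.get("role") in INTERACTIVE_ROLES:
        out.append(Element(node["role"], node.get("name", ""), node["selector"]))
    for child in node.get("children", []):
        flatten(child, out)
    return out


def to_prompt(elements: List[Element]) -> str:
    """Render the elements as numbered text lines instead of an image."""
    return "\n".join(f'[{i}] {e.role} "{e.name}"' for i, e in enumerate(elements))


def click_action(elements: List[Element], index: int) -> Dict[str, str]:
    """Map the model's chosen index back to a selector-based action."""
    return {"action": "click", "selector": elements[index].selector}


# Hypothetical page tree; a real one would come from the browser.
page = {
    "role": "main",
    "children": [
        {"role": "paragraph", "name": "Find cheap flights"},
        {"role": "textbox", "name": "Destination", "selector": "#dest"},
        {"role": "button", "name": "Search", "selector": "button[type=submit]"},
    ],
}

elements = flatten(page)
print(to_prompt(elements))
print(click_action(elements, 1))
```

The prompt for this page is two short lines of text, which is the root of the cost and latency differences discussed later.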

What the benchmarks actually say

The two benchmarks that matter are WebVoyager (real live websites such as Amazon, Apple, GitHub, and Google Flights) and OSWorld (a sandboxed desktop with browser, terminal, files). Numbers as of mid-2026:

System	WebVoyager	OSWorld	Notes
OpenAI CUA / Operator	87.0%	38.1% (initial); higher in ChatGPT agent	Jan 2025 paper
Claude Computer Use (3.5 Sonnet, Oct 2024)	—	14.9% screenshot-only, 22.0% with more steps	First-ever release
Claude Sonnet 4.5	—	61.4% on OSWorld-Verified	Sep 2025
Claude Sonnet 4.6	—	72.5% on OSWorld-Verified
Claude Opus 4.7	—	78.0% on OSWorld-Verified	May 2026, current SOTA
Project Mariner (Gemini 2.0)	83.5%	—	Dec 2024 announcement
Browserable	90.4%	—	Open-source, screenshot-based
Magnitude	94.0%	—	Open-source, accessibility-tree

A few honest observations.

WebVoyager has been saturated — the gap between Operator’s 87% and the best open-source agent at 94% is mostly engineering, not model capability. The remaining errors are dominated by site-specific quirks (cookie banners that drift, captchas, A/B-tested layouts) that no agent generalises perfectly to.

OSWorld is the harder benchmark and the more honest one. The progression of Claude’s score from 15% to 78% over 19 months is the most credible “agents are getting better” datapoint in the field. The OSWorld score crossing 75% is what makes 2026 the year a serious enterprise can pilot a desktop agent for a narrow workflow — though not yet trust it unattended.

Project Mariner’s 83.5% is a strong number, but the abstraction it benchmarks against (CSS selectors on cooperative sites) is also the easiest one. Where Mariner’s approach really pays off is in the long tail: it can do things Operator structurally cannot, because it has your cookies.

Cost and latency: the gap that no benchmark captures

The pricing models, as of May 2026:

ChatGPT agent (formerly Operator): subscription-based, $200/month for ChatGPT Pro. The compute is bundled.
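Claude Computer Use: API-priced per token, so the bill scales with task length. Opus 4.7 lists at $5 per million input tokens and $25 per million output tokens, and a typical 30-step task is dominated by input, because every step sends a fresh screenshot on top of the accumulated history. Sonnet 4.6 is cheaper per token.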
Project Mariner: bundled in Gemini Pro / Google AI Ultra tiers; standalone access reported around $250/month.
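For the per-token option, a back-of-envelope estimator shows why task length matters so much. The only figures taken from this post are the Opus 4.7 list prices ($5 in, $25 out per million tokens) and the 30-step task length. The tokens per screenshot, output tokens per step, and prompt overhead are made-up placeholders, and the sketch assumes the full history is resent every step with no prompt caching or trimming.

```python
def task_cost_usd(
    steps: int,
    tokens_per_screenshot: int,
    output_tokens_per_step: int,
    price_in_per_million: float,
    price_out_per_million: float,
    overhead_tokens: int = 2_000,
) -> float:
    """Estimate the cost of one task when every step resends the full history."""
    input_tokens = 0
    for step in range(1, steps + 1):
        # Context at this step: fixed prompt, every screenshot so far,
        # and every earlier action the model emitted.
        context = (
            overhead_tokens
            + step * tokens_per_screenshot
            + (step - 1) * output_tokens_per_step
        )
        input_tokens += context
    output_tokens = steps * output_tokens_per_step
    return (
        input_tokens * price_in_per_million / 1_000_000
        + output_tokens * price_out_per_million / 1_000_000
    )


# Placeholder workload: 1,500 tokens per screenshot, 100 output tokens per step.
cost = task_cost_usd(
    steps=30,
    tokens_per_screenshot=1_500,
    output_tokens_per_step=100,
    price_in_per_million=5.0,
    price_out_per_million=25.0,
)
print(f"${cost:.2f}")  # $4.08
```

Under these placeholder inputs almost all of the cost is input tokens, and it grows roughly with the square of the step count. That is why trimming old screenshots or caching the prefix moves the bill far more than shortening the model's replies.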

The per-task cost numbers vary wildly by task length. A simple “find me the cheapest flight to Tokyo next week” costs tens of cents on Operator or Computer Use; a “research and book a five-day itinerary” can run into single-digit dollars. The cost differential between Mariner and the others is real and structural: Mariner doesn’t need to send screenshots through a vision model on every step. Reading the DOM is dramatically cheaper than encoding a 1080p screenshot.

The latency story is similar:

Operator and Computer Use are bottlenecked by vision encoding and screenshot transport. A typical step is 8-15 seconds wall time. A 30-step task is 4-8 minutes.
Mariner’s per-step latency is closer to 2-4 seconds because it skips the screenshot. A 30-step task is closer to 90 seconds.
All three are vastly slower than a human doing the same task. Browser agents are not yet fast; they are unattended. That is the value proposition.

[Figure: wall-clock latency per agent step on a representative task. The screenshot-based agents pay a 3-4x tax for vision. Humans remain dramatically faster than all three.]
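The arithmetic behind those ranges is simple enough to check directly. The sketch below uses only the per-step figures quoted above (8-15 seconds for the screenshot agents, 2-4 seconds for Mariner) and assumes steps run strictly one after another with no extra overhead between them.

```python
from typing import Tuple


def task_seconds(steps: int, step_low_s: float, step_high_s: float) -> Tuple[float, float]:
    """Total wall time range for a task whose steps run sequentially."""
    return steps * step_low_s, steps * step_high_s


screenshot_task = task_seconds(30, 8, 15)  # screenshot-based agents
dom_task = task_seconds(30, 2, 4)          # DOM-based agent

# Compare like with like: fast step against fast step, slow against slow.
vision_tax = (8 / 2, 15 / 4)

print(screenshot_task)  # (240, 450) seconds, i.e. 4 to 7.5 minutes
print(dom_task)         # (60, 120) seconds
print(vision_tax)       # (4.0, 3.75)
```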

Where each one actually wins

Operator / ChatGPT agent wins when the task is on a site you can’t or don’t want to instrument. The “go to this random vendor portal, fill in the form, download the PDF, summarise it” workflow is its sweet spot. Because it runs in a hosted VM, your credentials are not in your local browser, which is a security argument for and against the approach. ChatGPT agent’s integration with the rest of ChatGPT — research, code execution, file output — is the real moat.

Claude Computer Use wins when you want the agent embedded in your infrastructure. Build a customer-support agent that opens Salesforce, checks an internal admin tool, and triggers a refund? Computer Use is the only one of the three that you can ship inside your own VPC. Anthropic publishes the Docker reference image and the agent is “just” Sonnet with tools — so the same model powering your chat surface can also pilot a browser.

Project Mariner wins when the task is “act as me, with my session.” Buying things you already have an account for, navigating your own Gmail, triaging your own inbox, comparing prices across sites you’re logged into. The cost and latency advantages are real. The privacy story is the opposite of Operator’s — Mariner sees everything in your browser by design.

Concrete deployment examples published in 2025-26:

DoorDash and Uber Eats integrations featured in Operator’s launch — booking orders end-to-end through a hosted browser session.
Anthropic’s enterprise customer references for Computer Use in legal-doc review and back-office form filling (the agent reads a screenshot of a vendor portal, fills out an expense claim, submits).
Mariner’s headline use cases in Google’s announcements: shopping price comparison, multi-tab research, sending emails based on calendar context.

The trust gap

Here is the part the benchmarks don’t measure. WebVoyager success rates around 85% mean roughly one in seven attempts fails in some interesting way: the agent clicks the wrong button, confirms a purchase you didn’t want, loses track of the task halfway through.
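One way to see why an 85% task success rate is hard to improve is to ask what it implies per step. The sketch below assumes every step succeeds or fails independently and that a single bad step sinks the task. That is a simplification made for illustration (real agents recover from some mistakes), and the 30-step task length is borrowed from the latency discussion.

```python
def required_step_reliability(task_success: float, steps: int) -> float:
    """Per-step success rate needed to hit a task-level success rate."""
    return task_success ** (1 / steps)


def task_success_rate(step_reliability: float, steps: int) -> float:
    """Task-level success rate if each step succeeds independently."""
    return step_reliability ** steps


per_step = required_step_reliability(0.85, 30)
print(f"{per_step:.4f}")                     # 0.9946
print(f"{task_success_rate(0.99, 30):.4f}")  # 0.7397
```

Under that assumption, an agent that gets 99% of individual steps right still finishes only about 74% of 30-step tasks, so small per-step gains show up as large task-level ones.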

The way each lab handles that gap is the most honest signal of how “agentic” the products really are:

Operator inserts a human-confirmation step before any irreversible action — checkout, payment, message-send. This is great for trust and miserable for throughput; the user spends meaningful time clicking “yes, confirm” through long tasks.
Claude Computer Use leaves the policy to the developer. You can have it auto-confirm or human-confirm; Anthropic ships strong prompt injection warnings and recommends VM isolation.
Mariner deliberately stops short of purchases. Google’s documented stance is that Mariner can fill the cart but not press buy. (This is changing in 2026 with Mariner’s “Active Tab” purchase tests, but the default position remains conservative.)
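The three stances differ only in what happens when an action is irreversible, which makes them easy to express as one gate with a policy switch. This is an illustrative sketch: the action kinds, the `gated_execute` helper, and the policy labels ("human", "auto", "block") are invented here and are not taken from any of the three products.

```python
from typing import Callable, Dict, List

# Hypothetical action kinds treated as irreversible.
IRREVERSIBLE_KINDS = {"checkout", "payment", "send_message"}


def gated_execute(
    action: Dict[str, str],
    execute: Callable[[Dict[str, str]], None],
    confirm: Callable[[Dict[str, str]], bool],
    policy: str = "human",
) -> str:
    """Run an action, applying the chosen policy to irreversible ones.

    policy="human": ask a person before irreversible actions.
    policy="auto":  run everything without asking.
    policy="block": never run irreversible actions.
    """
    if action["kind"] in IRREVERSIBLE_KINDS:
        if policy == "block":
            return "blocked"
        if policy == "human" and not confirm(action):
            return "declined"
    execute(action)
    return "executed"


performed: List[Dict[str, str]] = []
buy = {"kind": "checkout", "target": "cart"}
scroll = {"kind": "scroll", "target": "page"}


def always_decline(action: Dict[str, str]) -> bool:
    # Stand-in for a person who says no to every prompt.
    return False


results = [
    gated_execute(scroll, performed.append, always_decline, policy="human"),
    gated_execute(buy, performed.append, always_decline, policy="human"),
    gated_execute(buy, performed.append, always_decline, policy="block"),
    gated_execute(buy, performed.append, always_decline, policy="auto"),
]
print(results)  # ['executed', 'declined', 'blocked', 'executed']
```

The hard part in practice is the classifier hiding behind `IRREVERSIBLE_KINDS`: deciding from a screenshot or a DOM node that a given button is a point of no return.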

All three labs are signalling the same thing: the technology is past the “can it do the task” question and into the “do we trust it unattended for this class of task” question. For information gathering — research, price comparison, form pre-fill, calendar wrangling — the answer is becoming yes. For irreversible commitments — purchases, signed contracts, sent emails to important people — the answer is still no, and any launching CEO who claims otherwise is either marketing or has chosen to absorb the liability.

Three failure modes from the field

A year and a half of deploying these in earnest has surfaced a recurring failure catalogue that doesn’t show up in the WebVoyager scores.

1. Cookie banner roulette. All three agents are visibly worse on sites with aggressive cookie consent dialogs, particularly the GDPR-era ones that A/B-test layouts. The agent clicks “accept” when it should have clicked “reject,” or — more embarrassingly — clicks the dialog itself when the intended target is behind it. Mariner’s accessibility-tree approach is least affected here because the consent overlays are usually structurally distinguishable in the DOM. The screenshot agents have to see the overlay and reason about it.

2. Authentication walls. Operator and Computer Use start every session without your cookies. That’s a security feature; it’s also why they fail on “show me my last three Amazon orders” without you logging in inside the sandbox. The handoff for credentials is awkward — Operator asks you to take over, log in, then resume — and burns 30-60 seconds of wall time. Mariner inherits your session by definition and skips this entirely. The asymmetry shapes which workflows each one wins.

3. Visual ambiguity in modern UIs. Card-based dashboards with five “View” buttons, four “Edit” links, and three different “Submit” CTAs break the screenshot-and-coordinates agents in ways that are not captured by WebVoyager’s curated task set. The accessibility-tree agents tolerate this better because the DOM usually disambiguates the elements. The screenshot agents are improving (Sonnet 4.6’s vision is materially better than 3.5 Sonnet’s at parsing dense UIs), but the modality gap remains real.
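The third failure mode is easier to see with a toy tree. In the sketch below, three identical "View" buttons become distinguishable once each is qualified with the name of the card that contains it. The node format, the `CONTAINER_ROLES` set, and `qualified_buttons` are assumptions for illustration, not any agent's real representation.

```python
from typing import Dict, List, Tuple

# Roles whose accessible name is useful context for the buttons inside them.
CONTAINER_ROLES = {"article", "region", "listitem"}


def qualified_buttons(node: Dict, path: Tuple[str, ...] = ()) -> List[Tuple[str, str]]:
    """Return (qualified label, selector) pairs for every button in the tree."""
    role = node.get("role")
    name = node.get("name", "")
    if role == "button":
        return [(" > ".join(path + (name,)), node["selector"])]
    # Named containers add their name to the path of everything inside them.
    child_path = path + (name,) if name and role in CONTAINER_ROLES else path
    found: List[Tuple[str, str]] = []
    for child in node.get("children", []):
        found.extend(qualified_buttons(child, child_path))
    return found


def card(title: str, n: int) -> Dict:
    """Build a hypothetical dashboard card holding one generic 'View' button."""
    return {
        "role": "article",
        "name": title,
        "children": [{"role": "button", "name": "View", "selector": f"#card-{n} .view"}],
    }


dashboard = {
    "role": "main",
    "children": [card("Invoice 1041", 1), card("Invoice 1042", 2), card("Invoice 1043", 3)],
}

for label, selector in qualified_buttons(dashboard):
    print(label, "->", selector)
```

A screenshot agent has to infer the same grouping from spacing and borders, which is where dense dashboards trip it up.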

The labs are converging on the same answer: layered approaches. Anthropic’s Claude 4.6 release notes hint at internal experiments combining screenshots and a parsed DOM representation. OpenAI’s ChatGPT agent already does element-level introspection alongside vision. Mariner is the closest to a “pure” DOM approach but is reportedly experimenting with vision fallback for canvas-heavy sites. Within twelve months expect the three approaches to look more similar — multimodal, with both pixel and structural signals.

Where the open-source field sits

The frontier labs are the headline acts; the open-source ecosystem is where the deployable templates live. A non-exhaustive map of what’s moved into production:

Browser Use wraps Playwright around any LLM with a clever DOM-flattening prompt. It is the most-deployed open-source browser agent of 2026 and powers a long tail of internal automation projects. Performance lags the frontier labs by 10-15 points on WebVoyager, but you control everything.
Browserbase offers hosted Playwright-as-a-service for agent backends — popular as the substrate underneath custom agents that want Operator-style remote execution without OpenAI’s pricing.
Stagehand (Browserbase’s higher-level abstraction) is the closest thing the field has to “Tailwind for browser agents” — a small set of primitives like act(), extract(), observe() that compose into real workflows.
Magnitude and Browserable are the open-source projects topping WebVoyager leaderboards. Both wrap their page observations in a re-planning loop.

The pattern across all of them: the open-source community has outperformed the frontier labs on WebVoyager while the labs hold the lead on production polish, safety guardrails, and ecosystem reach. If you are building a product, the labs are still the better default. If you are building infrastructure, the open-source kit is now genuinely good.

The contrarian opinion

2026 is the year browser agents stop being demos. The benchmark numbers crossed the threshold; the cost per task fell by an order of magnitude; the wall-clock latency is finally compatible with “unattended background task.” For the right workloads — research, reconciliation, form filling, status-checking — these tools are ready to be a part of your stack.

2026 is not the year browser agents become trusted purchase agents. The last 5-10% of WebVoyager accuracy maps to “the times your AI almost bought the wrong flight,” and the labs know it. The gap between “completes the task” and “completes the task reliably enough that you’d let it spend your money” is wider than any benchmark captures.

If I were building a product on top of these today, the rule I’d apply is this: bind the agent to read-only tasks until the eval set has tens of thousands of clean traces; then gate each write surface individually with explicit human confirmation, and treat the confirmation rate as a first-class metric to drive down over time. The labs are doing exactly this in their own products. The teams burning through trust fastest are the ones that aren’t.
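As a rough sketch of that rule: refuse all writes until enough clean traces have accumulated, then confirm each write surface with a human unless it has been explicitly promoted, and track how often a human had to be asked. The `WriteGate` class, its default threshold of 20,000 traces, and the surface names are hypothetical choices for illustration.

```python
from typing import Callable, Set


class WriteGate:
    """Read-only until the eval set is large enough, then confirm writes per surface."""

    def __init__(self, clean_traces_required: int = 20_000) -> None:
        self.clean_traces_required = clean_traces_required
        self.clean_traces = 0
        self.auto_approved: Set[str] = set()  # surfaces promoted past confirmation
        self.writes = 0
        self.confirmations = 0

    def record_clean_traces(self, count: int) -> None:
        self.clean_traces += count

    def request_write(self, surface: str, confirm: Callable[[str], bool]) -> bool:
        """Return True if the write may proceed."""
        if self.clean_traces < self.clean_traces_required:
            return False  # still in the read-only phase
        self.writes += 1
        if surface in self.auto_approved:
            return True
        self.confirmations += 1
        return confirm(surface)

    def confirmation_rate(self) -> float:
        """Share of permitted-phase writes that needed a human; the metric to drive down."""
        return self.confirmations / self.writes if self.writes else 0.0


# Tiny threshold so the example runs instantly; a real one would be far larger.
gate = WriteGate(clean_traces_required=3)
early = gate.request_write("refund", lambda s: True)        # refused: read-only phase
gate.record_clean_traces(3)
first = gate.request_write("refund", lambda s: True)        # human confirms
gate.auto_approved.add("draft_email")
second = gate.request_write("draft_email", lambda s: False)  # no human asked
print(early, first, second, gate.confirmation_rate())  # False True True 0.5
```

Promoting a surface into `auto_approved` is the moment trust is actually extended, so it is worth making that a reviewed change and not a config tweak.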

Further reading: OpenAI’s Computer-Using Agent post is the canonical technical reference for the screenshot-and-RL approach. Anthropic’s computer use documentation is the most practical implementation guide. Google DeepMind’s Project Mariner announcement makes the case for the extension-based design. For the broader landscape including open-source alternatives like Browser Use, the aimultiple comparison is the most thorough running survey.