datarekha
Infrastructure May 14, 2026

Computer-use latency engineering: getting browser agents under a second

A naive browser-use loop is 4-8 seconds per step. Production systems run at 600ms. The gap is closed by half a dozen techniques — differential screenshots, prompt caching, batched actions, vision-model routing — each of which sounds boring until you measure the difference.

13 min read · by datarekha · latency · computer-use · browser-agents · performance

If you’ve used Anthropic’s Computer Use or OpenAI’s Operator in mid-2026, you’ve noticed something specific: Claude clicks through a flight booking in 90 seconds; Operator takes 4 minutes for the same task. The published OSWorld benchmark numbers tell part of the story — Claude Opus 4.7 at 78.0%, OpenAI’s Operator reportedly at 38.1% — but accuracy isn’t latency. The latency gap is its own story, and the techniques that close it are the most interesting infrastructure work happening in agents right now.

This post is the engineering breakdown: where the 4–8 seconds of a naive browser-agent loop go, the six techniques production teams use to collapse it under 1 second, and the architectural choices that distinguish Anthropic’s approach from OpenAI’s.

The naive loop, decomposed

A browser-use agent’s inner loop has five steps:

BROWSER-USE LOOP: STEP BUDGET
  1. SCREENSHOT: capture page, ~80ms
  2. ENCODE: tokenize image, ~800ms
  3. LLM CALL: decide action, ~1.5s–4s
  4. PARSE: extract action JSON, ~30ms
  5. EXECUTE: click / type, ~400ms (page settle), then loop back to step 1
Naive serial total per step: ~3–5 seconds. A 50-step task takes 2.5–4 minutes wall clock.
Production target: ~600ms per step. Same 50-step task: 30 seconds.
The five steps of a browser-use agent. The LLM call and the image encode dominate; everything else is in the noise.

Two observations from the breakdown:

  1. The LLM call dominates. 60–80% of the per-step time is the model. Vision models with screenshot input are slower than text-only models — each screenshot adds about 800ms of encoding latency per Browser Use’s published numbers.
  2. Image encoding is the second-biggest cost. Roughly one screenshot per step, each ~700–1024 vision tokens, each adding inference latency on top of the input tokens for the conversation history.
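
The budget is easy to sanity-check with a few lines of arithmetic. This is a rough sketch using the figure's approximate numbers; the names `STEP_BUDGET_MS`, `step_total_ms`, and `task_minutes` are illustrative and not taken from any library.

```python
# Approximate per-step budget from the figure, in milliseconds.
# Each entry is a (low, high) range; only the LLM call varies.
STEP_BUDGET_MS = {
    "screenshot": (80, 80),
    "encode": (800, 800),
    "llm_call": (1500, 4000),
    "parse": (30, 30),
    "execute": (400, 400),
}

def step_total_ms(budget):
    """Sum the low and high ends of every stage in a serial loop."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

def task_minutes(step_ms, steps=50):
    """Wall-clock minutes for a task of `steps` serial steps."""
    return step_ms * steps / 1000 / 60

low, high = step_total_ms(STEP_BUDGET_MS)
print(f"per step: {low / 1000:.1f}s to {high / 1000:.1f}s")
print(f"50 steps: {task_minutes(low):.1f} to {task_minutes(high):.1f} min")
print(f"50 steps at 600ms: {task_minutes(600) * 60:.0f}s")
```

Because the loop is serial, every millisecond saved per step is multiplied by the step count, which is why per-step work is the thing to attack.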

Closing the gap means attacking both. Here are the six techniques that matter.

Technique 1: prompt caching the conversation history

This is the biggest single win and the cheapest to implement. Anthropic shipped extended-TTL prompt caching in 2025: a 1-hour cache that reduces costs by up to 90% and latency by up to 85% on cached prefixes. The relevant detail for browser agents is the prompt ordering.

The naive ordering is:

[system prompt] [history of N actions] [latest screenshot] [user query]

The cache-aware ordering Browser Use documented is:

[system prompt] [history of N actions] [user query] [latest screenshot]

Why it matters: the system prompt, history, and query are stable across turns, while the screenshot is fresh on every step. Putting the screenshot last lets the entire conversation history hit the prompt cache. On a 10K-token system + history prefix, the difference is a 5–10x cost reduction and up to 85% lower TTFT for the cached portion.

The interesting nuance: in turns 2–N of an agent loop, the conversation history grows by one action per turn. If you’re not careful, the cache breaks every turn because the suffix-before-screenshot keeps changing. The fix is appending to the cached prefix, which both Anthropic’s SDK and the OpenAI Responses API now support.
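
The append-only requirement can be illustrated with a toy model. This is a minimal sketch, assuming the cache matches on identical leading blocks; real provider caches match on token prefixes, and `build_prompt` and `reused_prefix_len` are hypothetical helpers, not SDK calls.

```python
def build_prompt(system, history, query, screenshot):
    """Cache-aware ordering: stable blocks first, the fresh screenshot last."""
    return [system, *history, query, screenshot]

def reused_prefix_len(prev, curr):
    """Count leading blocks that are identical between two consecutive turns."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n

turn1 = build_prompt("SYSTEM", ["click search box"], "book a flight", "<shot-1>")

# Append-only history: turn 1's system + history prefix survives intact.
turn2 = build_prompt(
    "SYSTEM", ["click search box", "type 'SFO'"], "book a flight", "<shot-2>"
)

# Rewriting earlier history (here, a hypothetical summarization) changes the
# prefix, so only the system prompt is still reusable.
turn2_rewritten = build_prompt(
    "SYSTEM", ["summary: searched for SFO"], "book a flight", "<shot-2>"
)

print(reused_prefix_len(turn1, turn2))            # 2
print(reused_prefix_len(turn1, turn2_rewritten))  # 1
```

The takeaway is that anything that edits, reorders, or re-summarizes earlier turns invalidates everything after the edit point, so history should only ever grow at the end.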

Technique 2: differential screenshots

The second biggest win, and the one that requires more engineering. A typical browser-use loop captures a full screenshot every step. If the page didn’t change much — and on 60–70% of steps it didn’t — you’re encoding a near-identical image.

The differential technique:

  1. After every action, compute a perceptual hash (or bounding-box diff) of the new screenshot vs. the previous.
  2. If the diff is below a threshold, don’t send the full screenshot. Send a small textual delta (“the dropdown at coords (340, 220) opened”) plus the previous screenshot reference, which the model already has in its cached context.
  3. If the diff is above threshold, send the full screenshot.

Done right, this cuts the average vision-token cost per step by 40–50%. Anthropic’s Computer Use reference implementation ships a version of this in their Docker container; Browserbase has published variations. The trade-off: there’s a small accuracy hit because the model occasionally guesses wrong about what changed. In practice, the eval sets show under 2% degradation.
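
The three steps can be sketched with an average hash, one common perceptual hash. The choice of hash, the 4-bit threshold, and the function names are all assumptions made for illustration. Screenshots are modeled as 2D lists of grayscale values (at least 8x8) to keep the example dependency-free.

```python
def average_hash(pixels, size=8):
    """64-bit average hash: downsample to size x size block means, then
    set one bit per block that is brighter than the overall mean."""
    h, w = len(pixels), len(pixels[0])
    cells = []
    for gy in range(size):
        y0, y1 = gy * h // size, (gy + 1) * h // size
        for gx in range(size):
            x0, x1 = gx * w // size, (gx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for c in cells:
        bits = (bits << 1) | int(c > mean)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def choose_observation(prev, new, threshold=4):
    """Send a text delta when the page barely changed, else the full image."""
    distance = hamming(average_hash(prev), average_hash(new))
    return "text_delta" if distance <= threshold else "full_screenshot"

# Synthetic 64x64 "screenshots": dark left half, bright right half.
base = [[0] * 32 + [255] * 32 for _ in range(64)]
speck = [row[:] for row in base]
speck[0][0] = 10                                    # one pixel changed
inverted = [[255 - p for p in row] for row in base]  # whole page changed

print(choose_observation(base, speck))     # text_delta
print(choose_observation(base, inverted))  # full_screenshot
```

A coarse hash like this is cheap but blind to small, meaningful changes such as a single checkbox, which is one way the accuracy hit described above can arise. A bounding-box pixel diff is the more precise alternative.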

Technique 3: small vision models for the “did anything change” gate

A close cousin of differential screenshots. Instead of computing the hash deterministically, you can run a small, fast vision-capable model (GPT-4o mini, Gemini Flash) over the new screenshot and ask “did anything material change since the last action?” The small model answers in 200ms; if it says no, you skip the big model entirely and execute the next pre-planned action.

This is the routing pattern applied to vision. The economics are striking — the small vision model costs ~5% of the big model and skips 30–40% of big-model calls on workflows like form-filling and multi-step shopping carts. The risk is that the small model misses real changes; in practice the production fix is to occasionally (every 5 steps) run the big model anyway as a sanity check.
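
The routing decision, including the periodic sanity check, fits in a few lines. This is a sketch under assumptions: `needs_big_model` is a hypothetical helper, and the gate is a stand-in callable where a real system would call the small vision model.

```python
from typing import Callable

def needs_big_model(
    step: int, gate_changed: Callable[[], bool], check_every: int = 5
) -> bool:
    """Return True when the large vision model should see this step."""
    if step % check_every == 0:
        return True        # periodic sanity check, regardless of the gate
    return gate_changed()  # otherwise trust the small model's verdict

# Simulated run: the stand-in gate reports a change only on steps 3 and 7.
changed_steps = {3, 7}
big_calls = [
    s for s in range(1, 11)
    if needs_big_model(s, lambda s=s: s in changed_steps)
]
print(big_calls)  # [3, 5, 7, 10]
```

In this made-up run the big model sees 4 of 10 steps. On sanity-check steps the gate is not consulted at all, which also saves the small-model call.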

Technique 4: batched actions

The naive loop assumes one LLM call per action. But many real browser workflows have predictable multi-action sequences: fill name, fill email, fill address, click submit. If the agent can predict the next 3–4 actions from a single screenshot, it can execute them as a batch without re-prompting the model.

Anthropic’s Computer Use exposes a type action that takes a multi-character string in a single call. OpenAI’s Operator takes individual keystrokes by default, which is part of why it’s slower. The Anthropic team’s engineering posts have been increasingly explicit about this — exposing batch primitives in the tool surface is one of the single largest latency wins, because each saved LLM call is 1.5–4 seconds of wall clock.

The complication is rollback. If action 3 of a 4-action batch fails, the agent needs to recover, which means observing the failure state and re-planning. This is doable but adds engineering complexity that naive implementations skip.
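
One way to make mid-batch failure recoverable is to stop at the first failed action and report exactly what ran and what did not. This is a minimal sketch; `BatchResult`, `run_batch`, and the `execute` callback are hypothetical names, and a real executor would drive the browser rather than return a canned boolean.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class BatchResult:
    completed: list[str]    # actions that ran successfully, in order
    failed: Optional[str]   # the action that failed, if any
    remaining: list[str]    # actions that were never attempted

def run_batch(actions: list[str], execute: Callable[[str], bool]) -> BatchResult:
    """Execute actions in order; stop at the first failure so the agent
    can observe the page and re-plan from a known point."""
    completed: list[str] = []
    for i, action in enumerate(actions):
        if not execute(action):
            return BatchResult(completed, action, actions[i + 1:])
        completed.append(action)
    return BatchResult(completed, None, [])

batch = ["fill name", "fill email", "fill address", "click submit"]
# Stand-in executor: pretend the address field is missing from the page.
result = run_batch(batch, lambda action: action != "fill address")
print(result.completed)  # ['fill name', 'fill email']
print(result.failed)     # fill address
print(result.remaining)  # ['click submit']
```

On failure, the agent takes a fresh screenshot and sends it with `completed` and `failed` so the model can re-plan instead of blindly retrying the whole batch.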

Technique 5: predictive prefetch

The most aggressive optimization. While the agent is executing the current action (~400ms of page-settle time), the next LLM call can start in parallel — predicting what the next action will likely be based on the current trajectory. When the page actually settles, you either accept the prediction (if it matches reality) or discard it (if the page is different than expected).

This is harder than it sounds because LLM calls aren’t free — every prefetched call you discard is wasted compute. In practice production teams use it selectively, for high-confidence paths like “after typing in a search box, the next action is almost certainly clicking the search button.” On those flows, prefetch cuts another 30% off perceived latency.

Browser Use’s blog has the cleanest public write-up. The technique is mature enough that both Anthropic and OpenAI offer something resembling it in their hosted agent offerings, though neither has published exact details.
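
The accept-or-discard logic can be sketched with `asyncio`. The page and the model are replaced by short sleeps here, and `settle_page`, `plan_next`, and `step_with_prefetch` are hypothetical names. A real system would compare richer page state than a single string.

```python
import asyncio

async def settle_page() -> str:
    """Stand-in for executing the action and waiting for the page to settle."""
    await asyncio.sleep(0.04)
    return "results_page"

async def plan_next(state: str) -> str:
    """Stand-in for the LLM call that picks the next action for a page state."""
    await asyncio.sleep(0.03)
    return f"action_for:{state}"

async def step_with_prefetch(predicted_state: str) -> tuple[str, bool]:
    # Start the speculative LLM call before the page has settled.
    speculative = asyncio.create_task(plan_next(predicted_state))
    actual_state = await settle_page()
    if actual_state == predicted_state:
        return await speculative, True   # prediction held: the call was free
    speculative.cancel()                 # prediction missed: wasted compute
    return await plan_next(actual_state), False

print(asyncio.run(step_with_prefetch("results_page")))
print(asyncio.run(step_with_prefetch("login_wall")))
```

On a hit, the LLM call overlaps with the page settle and adds almost nothing to wall-clock time. On a miss, the step costs the same as the serial loop plus one discarded call, which is why this pays off only on high-confidence paths.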

Technique 6: speculative execution with verification

The most sophisticated variant. The agent maintains two parallel LLM streams: one running the current step, one running a “verifier” pass that double-checks the action right before execute. If the verifier disagrees, the execute is aborted.

This sounds slower (two LLM calls) but isn’t, because the verifier runs on a small fast model and starts in parallel with the page settle. Net effect: occasional safety wins at near-zero latency cost. Few production systems use this — the engineering bar is high — but the ones that do (Anthropic’s internal systems, parts of Cognition’s Devin) report meaningfully better task success rates.
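
A rough sketch of the idea, assuming the verifier is a fast async check that runs while the page settles. The verifier rule below is a made-up stand-in for a small model, and `verified_execute` is a hypothetical helper.

```python
import asyncio

async def verified_execute(action, verify, settle, execute) -> str:
    """Run the verifier while the page settles; execute only if it approves."""
    approved, _ = await asyncio.gather(verify(action), settle())
    if not approved:
        return "aborted"
    await execute(action)
    return "executed"

executed: list[str] = []

async def settle() -> None:
    await asyncio.sleep(0.02)   # stand-in for page-settle time

async def small_verifier(action: str) -> bool:
    await asyncio.sleep(0.01)   # stand-in for a small, fast model
    return not action.startswith("delete")

async def execute(action: str) -> None:
    executed.append(action)

print(asyncio.run(verified_execute("click submit", small_verifier, settle, execute)))
print(asyncio.run(verified_execute("delete account", small_verifier, settle, execute)))
print(executed)
```

Because the verifier finishes inside the settle window, the approved path costs no extra wall-clock time. The verifier adds latency only if it is slower than the page.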

The Anthropic vs OpenAI architectural bet

The published latency numbers tell two different stories:

| | Anthropic Computer Use | OpenAI Operator |
|---|---|---|
| Median step latency | 600ms | 2.1s |
| OSWorld score | 78.0% (Opus 4.7) | 38.1% |
| Surface | Generic computer | Browser-anchored |
| Batched actions | Yes (multi-char type) | Limited |
| Prompt caching | Yes (1-hour TTL) | Yes (limited) |
| Vision model | Single tier (Opus/Sonnet) | Single tier |

(Numbers compiled from Coasty’s 2026 benchmarks, BenchLM agentic leaderboard, and Anthropic’s own computer use documentation.)

The architectural bets differ:

Anthropic went generic. Computer Use is positioned as “Claude can use any computer interface” — keyboard, mouse, arbitrary apps. The reference implementation is a Docker container with a virtual X11 display. The downside: a generic interface is slower than a purpose-built one. The upside: same model architecture works on browser, desktop, and (theoretically) mobile, and they get to amortize improvements across all surfaces.

OpenAI went browser-specific. Operator is anchored to a managed browser environment with structured DOM observations and screenshots. The downside (in 2026): Operator’s accuracy on non-browser tasks is essentially zero, and the published OSWorld scores reflect that. The upside: when the browser is the entire world (booking, shopping, form-filling), the structured DOM gives more reliable element targeting than vision alone.

The latency numbers favor Anthropic because its design does less work per step: batched actions, longer prompt caches, and smaller per-step overhead. OpenAI’s architecture asks the model to do more work per turn, and the wall clock shows it.

PER-STEP LATENCY: NAIVE vs OPTIMIZED
  • NAIVE (screenshot, encode 800ms, LLM call 3000ms): ~4.0s
  • + CACHE: ~2.0s
  • + DIFF: ~1.4s
  • + BATCH: ~600ms
Each row composes the optimizations above it. The cumulative effect is ~6.5x: from 4 seconds per step to 600ms.
Cumulative latency reductions. Numbers are illustrative; real workloads vary, but the shape of the curve is consistent across published benchmarks.

The trap: optimizing latency at the cost of reliability

Every one of these techniques trades a small amount of reliability for a large amount of speed. Differential screenshots occasionally miss a real change. Predictive prefetch occasionally discards a useful call. Batched actions occasionally fail mid-batch in messy ways.

The teams that win this don’t blindly stack optimizations — they measure each one against an eval set and accept the ones where the reliability hit is under 2% and the latency win is over 20%. Less sophisticated teams stack until something obviously breaks in production. The order matters: prompt caching is essentially free reliability-wise, differential screenshots cost almost nothing, batched actions and predictive prefetch are the riskier ones.
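
The acceptance rule is simple enough to encode directly in an eval harness. This is a sketch; the candidate numbers below are made up purely for illustration and are not measurements, and `EvalDelta` and `accept` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class EvalDelta:
    name: str
    reliability_drop_pct: float  # task-success loss vs. baseline, in percent
    latency_win_pct: float       # per-step latency reduction vs. baseline, in percent

def accept(d: EvalDelta, max_drop: float = 2.0, min_win: float = 20.0) -> bool:
    """Keep an optimization only if it is both cheap in reliability and
    meaningfully faster, using the thresholds from the text."""
    return d.reliability_drop_pct < max_drop and d.latency_win_pct > min_win

# Made-up eval results, for illustration only.
candidates = [
    EvalDelta("prompt caching", 0.0, 50.0),
    EvalDelta("differential screenshots", 1.5, 30.0),
    EvalDelta("aggressive prefetch", 3.5, 30.0),
    EvalDelta("tiny tweak", 0.1, 5.0),
]

kept = [d.name for d in candidates if accept(d)]
print(kept)  # ['prompt caching', 'differential screenshots']
```

Each candidate should be measured on top of the already-accepted stack, not against the naive baseline, because the wins do not add up independently.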

Anthropic’s Building Effective Agents post makes a point worth repeating here: add complexity only when it demonstrably improves outcomes. The same rule applies in latency engineering. Run the eval. Keep what wins. Drop what doesn’t.

What to take away

  • The naive loop is 4–8 seconds per step. The production target is 600ms. Closing the gap is engineering, not magic.
  • Prompt caching is the biggest single win. Get the prompt ordering right (history before screenshot) and let the cache do its job.
  • Differential screenshots + small-model gating cut the second-biggest cost — image encoding — by 40–50% on average.
  • Batched actions are the secret weapon of Anthropic’s Computer Use. Most of the per-step latency gap with Operator is here.
  • Every optimization trades a little reliability for a lot of speed. Measure both. Stack only the trades you can defend.

The boring infrastructure work on latency is what separates a demo-quality browser agent from a product-quality one. Anthropic’s engineering team has been more aggressive about publishing these techniques than OpenAI’s, which is partly why their Computer Use feels faster — and partly why, in mid-2026, “computer use” almost always means Claude’s flavor of it.

The deeper read on the latency gap: it’s not really about model inference. It’s about what the agent’s tool surface lets it skip. Batched actions, differential screenshots, and cache-aware prompt ordering all share the same shape — they let the agent do less work per LLM call. The teams that win the latency race are the ones who design the tool surface around what the model doesn’t need to see, not just what it does.


Further reading: Browser Use’s speed engineering post is the most detailed public write-up. Anthropic’s Computer Use blog has the architectural framing. Coasty’s 2026 OSWorld breakdown is the cleanest independent benchmark report.
