Browser agents in production: Manus, Browserbase, and Stagehand

The first time you ask Manus to “find me the cheapest flight from SFO to Berlin next Tuesday and book it,” and ten minutes later it comes back with a confirmation number, it feels like the future arrived early. The second time, it spends fifteen minutes clicking through a CAPTCHA wall on the airline’s site and burns through 900 of your credits before timing out. The third time, it books the right flight on the wrong date because it misread “Tuesday” as “Thursday” on a date picker that opens sideways.

This is, more or less, the state of browser agents in mid-2026. The capability has crossed the threshold from “demo” to “sometimes works in production” — but the variance is enormous, the cost per task is opaque, and the failure modes are weirder than anything the model providers prepared you for. The headline benchmarks are now in the 60% range on WebArena, up from 14% two years ago; the practical pass rate on a real-world task you actually care about is much, much lower.

What’s interesting is that the market has not converged on a single “browser agent.” It’s separated into three distinct layers, and the layer that’s winning is not the one that gets headlines.

Layer 1: Manus and the “general agent” pitch

Manus, the Chinese-origin general agent that launched in invitation-only beta in March 2025, became the highest-profile example of the consumer-facing pitch: type a goal, walk away, come back to a result. The launch demo — autonomous resume screening, autonomous stock analysis — racked up a million views in twenty hours. On the GAIA benchmark, Manus reported state-of-the-art performance across all three difficulty tiers. Humans score 92% on GAIA. GPT-4 with plugins scored 15%.

By mid-2026 the picture is more complicated. Manus’s pricing structure — 1,000 starter credits free, 4,000 for $20/month Pro, up to 40,000 for $200/month — sounds generous until you discover the credit model is opaque. The same task can cost 1 credit or 5 credits depending on how many retries the agent does internally. Users on Reddit and Hacker News routinely report tasks that burn 500-900 credits before finishing, and there are no refunds for tasks that fail partway through. You paid for the failure.

Third-party analyses of Manus’s failure modes from late 2025 found three recurring patterns: hallucinated clicks (~28% of failures), where the agent thinks it clicked one element but actually clicked another; browser timeouts (~22%), where a page never finishes loading and the agent doesn’t know whether to wait or move on; and anti-bot blocks (~18%), where the site detects automation and either CAPTCHAs the agent out or quietly serves it junk content. Manus’s own documentation acknowledges that tasks with more than five dependent steps have a sharply higher failure rate — the agent loses track of where it is in the plan and either repeats steps or skips them.

The pattern is familiar: a capable model on a benchmark, a brittle agent in the wild, and a pricing model that mostly hides the brittleness from the user until they look at their bill.

Layer 2: Browserbase as infrastructure

While Manus was selling the dream, Browserbase was selling the picks and shovels. Founded in 2024, Browserbase is a straightforward infrastructure play: managed headless Chrome instances in the cloud, with the operational annoyances of running browsers at scale abstracted away. Stealth fingerprints, residential proxies, CAPTCHA solving, session replay, file upload/download — all the things you discover you need three weeks after you start running Playwright in a container.

The reason Browserbase matters strategically is that everyone above them in the stack needs what they sell. The agent vendor selling you autonomy needs somewhere to run the browser. The startup building a niche scraper needs somewhere to run the browser. Anthropic’s Computer Use demos and OpenAI’s Operator both run in sandboxed browser environments with much the same shape as Browserbase’s product. The infrastructure layer is the layer that captures value when the layer above it is fragmented and unprofitable, which is exactly what the consumer browser-agent market looks like.

That makes three layers, with most of the customer-visible product at the top and most of the durable value at the bottom. Stagehand sits in the middle and is increasingly where the developer mindshare goes.

Layer 3: Stagehand and the harness that won

The interesting layer is the middle one. Browserbase’s open-source SDK, Stagehand, is a deliberate counter-proposal to both the “raw Playwright” and “general agent” extremes. It exposes four primitives — act, extract, observe, agent — and asks developers to choose, at each step of their automation, whether they want deterministic code or LLM-mediated natural language.

The framing is the load-bearing thing. Most existing browser automation tools force you into one of two regimes: write low-level Playwright code that breaks the moment a CSS selector changes, or hand the entire task to a black-box agent and pray. Stagehand says: write code for the parts you understand, write English for the parts that are fiddly or fragile, and let the harness route between the two. page.act("click the submit button") is resolved at runtime by an LLM looking at the current DOM, so when the site redesigns and the submit button gets a different class, your script still works. page.extract({ schema: PriceSchema }) returns typed data without you writing parsing code.

The Stagehand bet is that the durable interface to a browser agent is not “a goal in English” but “a structured automation with English at the fragile joints.” v3 of Stagehand, released earlier in 2026, reports completing actions 44% faster than v2, with the speedup attributed partly to better caching of resolved selectors and partly to network-level optimizations when you run on Browserbase itself. That last detail matters: Stagehand runs anywhere Playwright runs, but it runs better on Browserbase, which is the Browserbase business model in a sentence.
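To make the selector-caching idea concrete, here is a minimal sketch of one way such a cache could behave. It is not Stagehand's actual implementation: `SelectorCache`, `resolveWithLlm`, and `selectorExists` are hypothetical names, with the two callbacks standing in for the model call and a DOM lookup.

```typescript
// Hypothetical sketch of caching LLM-resolved selectors; not Stagehand's real internals.
type Resolver = (instruction: string) => Promise<string>; // stands in for the model call
type Exists = (selector: string) => Promise<boolean>; // stands in for a DOM lookup

class SelectorCache {
  private cache = new Map<string, string>();
  private resolveWithLlm: Resolver;
  private selectorExists: Exists;
  llmCalls = 0; // exposed so callers can see how often the slow path ran

  constructor(resolveWithLlm: Resolver, selectorExists: Exists) {
    this.resolveWithLlm = resolveWithLlm;
    this.selectorExists = selectorExists;
  }

  // Return a selector for the instruction, reusing a cached one only if it
  // still matches something on the current page.
  async resolve(pageUrl: string, instruction: string): Promise<string> {
    const key = `${pageUrl}::${instruction}`;
    const cached = this.cache.get(key);
    if (cached !== undefined && (await this.selectorExists(cached))) {
      return cached; // fast path: no model call
    }
    // Cache miss or stale selector (for example after a redesign): ask the model again.
    this.llmCalls += 1;
    const fresh = await this.resolveWithLlm(instruction);
    this.cache.set(key, fresh);
    return fresh;
  }
}
```

The design choice worth noticing is the existence check before reuse: a cached selector is only a hint, so a site redesign costs one extra model call rather than a silent wrong click.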

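// Stagehand, used the way it actually ships in production
// (page-level API as in Stagehand v2; newer releases may expose these calls differently)
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

// Illustrative offer shape (an assumption; adjust the fields to the target site)
const FlightSchema = z.object({ airline: z.string(), priceUsd: z.number() });

const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();
const page = stagehand.page;

await page.goto("https://example-airline.com");
// English where a hand-written selector would be brittle
await page.act("search for flights from SFO to BER on May 30");
const offers = await page.extract({
  instruction: "list all flight options visible on the page",
  schema: z.object({ flights: z.array(FlightSchema) }),
});
// from here, plain TypeScript decides which to book
await stagehand.close();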

Notice what this code is not: it is not a single English prompt (“book me a flight”) handed to an autonomous agent. It is a Playwright script with two English sentences embedded at the points where deterministic selectors would have been brittle. That’s the design.
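The "plain TypeScript decides which to book" step might look something like the sketch below. The `Flight` shape, the `pickCheapest` helper, and the sample offers are all hypothetical, since the article does not show the real schema; the point is only that the choice is made by deterministic code, not by the model.

```typescript
// Hypothetical shape for one extracted offer; the article's FlightSchema is not shown.
interface Flight {
  flightNumber: string;
  departureDate: string; // ISO date, e.g. "2026-05-30"
  priceUsd: number;
}

// Deterministic selection: keep only sane offers on the requested date, then take the cheapest.
function pickCheapest(offers: Flight[], requestedDate: string): Flight | undefined {
  const onDate = offers.filter(
    (f) => f.departureDate === requestedDate && Number.isFinite(f.priceUsd) && f.priceUsd > 0,
  );
  if (onDate.length === 0) return undefined; // nothing safe to book: stop and escalate
  return onDate.reduce((best, f) => (f.priceUsd < best.priceUsd ? f : best));
}

const sampleOffers: Flight[] = [
  { flightNumber: "XX100", departureDate: "2026-05-30", priceUsd: 812 },
  { flightNumber: "XX200", departureDate: "2026-06-01", priceUsd: 540 }, // cheaper, but wrong date
  { flightNumber: "XX300", departureDate: "2026-05-30", priceUsd: 655 },
];
const choice = pickCheapest(sampleOffers, "2026-05-30");
console.log(choice?.flightNumber); // XX300
```

Filtering on the requested date before comparing prices is what keeps a cheaper flight on the wrong day from ever being a candidate.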

What “production” actually costs

The honest numbers on browser-agent cost per task, pulled from public pricing pages and a handful of third-party benchmarks as of May 2026:

Manus charges by opaque credits. A simple 10-minute task is typically 1–5 credits; a complex multi-step task can run 50–500 credits or more if retries happen. At $20 for 4,000 credits, that’s somewhere between $0.005 and $2.50 per task depending on luck. The variance is the story.
Browserbase charges for browser-minutes plus add-ons (proxies, captcha solving). A typical agent run is 30–120 seconds of browser time, putting the infrastructure cost at well under a cent. The model costs on top — usually $0.02 to $0.20 in tokens depending on how much page text the agent has to reason over — dominate the bill.
Self-hosted Playwright + LLM is “free” until you count engineer time, anti-bot maintenance, captcha solving (typically $1–$3 per thousand solves), residential proxy bandwidth, and the on-call burden of running Chrome at scale. Most teams who started here have migrated to Browserbase or a competitor within a year.
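The arithmetic behind those ranges is simple enough to write down. In this sketch the Manus figures come from the pricing quoted above ($20 for 4,000 credits), while `ASSUMED_USD_PER_BROWSER_HOUR` is a placeholder chosen for illustration, not a quoted Browserbase price.

```typescript
// Figures from the article: $20 buys 4,000 Manus credits on the Pro plan.
const USD_PER_CREDIT = 20 / 4000; // 0.005

function manusTaskCostUsd(credits: number): number {
  return credits * USD_PER_CREDIT;
}

// ASSUMPTION: placeholder hourly rate for a managed browser, not a quoted price.
const ASSUMED_USD_PER_BROWSER_HOUR = 0.1;

// Metered model: browser time plus whatever the LLM tokens cost for the run.
function meteredTaskCostUsd(browserSeconds: number, tokenCostUsd: number): number {
  const infra = (browserSeconds / 3600) * ASSUMED_USD_PER_BROWSER_HOUR;
  return infra + tokenCostUsd;
}

console.log(manusTaskCostUsd(1).toFixed(3)); // 0.005
console.log(manusTaskCostUsd(500).toFixed(2)); // 2.50
console.log(manusTaskCostUsd(900).toFixed(2)); // 4.50
```

Under the assumed rate, a 120-second run costs a fraction of a cent in browser time, so `meteredTaskCostUsd(120, 0.2)` is dominated almost entirely by the token term. The spread on the credit side, a factor of 500 or more between the cheapest and most expensive task, is the variance the section describes.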

The cost story matters because it explains why the consumer-facing agent products struggle to find unit economics. Manus’s free tier burns money on tasks that fail; its paid tiers ship users a credit balance and a prayer. Browserbase has clean per-minute pricing and is, by all reports, profitable on a per-customer basis.

Failure modes the model doesn’t fix

The thing that took the field a while to internalize is that browser agents fail in ways that have very little to do with the model’s reasoning ability. The Anthropic Computer Use harness docs categorize the failures into planning errors (model misreads the task), action errors (model picks the right action but executes it wrong), and critic errors (model misjudges whether the action succeeded). In production, the bulk of the cost goes to action errors and critic errors, neither of which gets fixed by a smarter model.

Concrete examples that show up in every team’s incident logs:

The DOM moved between the screenshot and the click. The agent took a screenshot, reasoned about it, decided to click coordinate (412, 318), and by the time the click fired, an ad slot loaded and pushed the target button down 40 pixels. The agent clicked the ad instead. Stagehand’s selector-resolved act is partly a response to this: it resolves the selector at click time, not at planning time.
Login walls and 2FA. A surprising amount of “useful” web automation hits an account boundary fast, and the model has no good story for “the human needs to read a 2FA code off their phone and enter it right now.” The best practice that has emerged is to put a human-in-the-loop checkpoint at any auth boundary and accept the latency cost.
Anti-bot detection. Most major sites (Amazon, LinkedIn, airline booking engines, Google itself) have moved beyond IP rate-limiting to full browser fingerprinting — they look at canvas rendering, font enumeration, timezone, mouse-movement entropy, the works. A vanilla headless Chrome is detected in seconds. Browserbase’s value-add here is the maintained stealth fingerprints; this is a moat that compounds because every detection bypass becomes a new detection signal for the next round.
CAPTCHAs. Roughly 70% of high-value sites gate at least some flows behind CAPTCHAs. Anti-CAPTCHA services exist, but using them is legally fraught for some categories of customers (anything regulated) and operationally expensive for the rest.
Dynamic content that the agent doesn’t recognize as “the same thing in a new shape.” A product listing that adds a “sale” badge is the same listing; the agent doesn’t always agree. This is the source of the most expensive bug class — the agent does the wrong thing confidently, and there’s no test framework to catch it before money changes hands.
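Several of the mitigations above (resolving the target at click time, checking that the action really worked, and handing auth boundaries to a person) can be combined in one small recovery loop. The following is a sketch under stated assumptions: the `Step` interface and `runStep` function are hypothetical, and a real harness would back the callbacks with an actual browser driver.

```typescript
// Hypothetical interfaces: a real harness would back these with a browser driver.
interface Step {
  description: string;
  isAuthBoundary: boolean; // login or 2FA screens are flagged up front
  resolveTarget: () => Promise<string>; // resolve the selector at click time
  click: (selector: string) => Promise<void>;
  succeeded: () => Promise<boolean>; // deterministic postcondition, not the model's opinion
}

type StepResult = "ok" | "needs_human" | "failed";

async function runStep(step: Step, maxAttempts = 3): Promise<StepResult> {
  // Auth boundaries go to a person instead of being retried blindly.
  if (step.isAuthBoundary) return "needs_human";

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Re-resolve on every attempt so a layout shift cannot leave a stale target.
      const selector = await step.resolveTarget();
      await step.click(selector);
      if (await step.succeeded()) return "ok";
    } catch {
      // Timeouts and missing elements fall through to the next attempt.
    }
  }
  return "failed";
}
```

The loop caps retries rather than letting the agent spin, which also bounds the cost of a failed task. The postcondition is a plain check (a URL change, a confirmation element) instead of asking the model whether it thinks the click worked, which is the part aimed at critic errors.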

Where browser agents actually work today

After all of that, browser agents do work in production, in narrowly scoped places. The pattern that ships:

Internal automations on known sites. A startup uses a browser agent to scrape its own admin dashboards because the official API is incomplete. The sites are stable, the tasks are repetitive, the failure cost is low, and the team can iterate on the prompt every time the site changes. Stagehand is the dominant choice here because the script is checked into git and reviewed like code.

Lead enrichment and prospecting. An SDR tool needs to look up 2,000 companies on LinkedIn and pull headcount, funding stage, and tech stack. The volume justifies the engineering work; the cost per lookup needs to be a few cents; failures are recoverable (retry later). Browserbase is the dominant infrastructure choice.

Form-filling and one-off booking. A consumer agent (Manus, ChatGPT Agent) gets asked to book a haircut or fill out a passport renewal form. The cost per task is high relative to value, but the user values the absence of work more than the marginal cost. This is the consumer-agent thesis, and it’s the area with the most variance — some people swear by it, others have a story about a $30 burn on a failed booking.

Vertical agents with deep domain knowledge. Some of the most reliable browser-agent deployments are vertical products — a tax-form-filing agent, a regulatory-compliance scraper, a competitive-pricing monitor — where the team has wrapped a general agent in a thick layer of domain-specific verification logic. The browser agent is one component in a larger pipeline that checks its work.
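As an illustration of what a “layer of domain-specific verification logic” can mean, here is a minimal sketch for the competitive-pricing monitor case. Everything in it is hypothetical: the `PriceObservation` shape, the individual checks, and in particular the 50% threshold are assumptions chosen for the example.

```typescript
// Hypothetical record produced by a browser agent for a pricing monitor.
interface PriceObservation {
  sku: string;
  currency: string;
  price: number;
}

// Each check returns null when it passes, or a human-readable reason when it fails.
type Check = (obs: PriceObservation, previous?: PriceObservation) => string | null;

const checks: Check[] = [
  (obs) => (obs.price > 0 && Number.isFinite(obs.price) ? null : "price is not a positive number"),
  (obs) => (/^[A-Z]{3}$/.test(obs.currency) ? null : "currency is not a 3-letter code"),
  (obs, prev) =>
    prev && prev.currency !== obs.currency ? "currency changed since last run" : null,
  // ASSUMED threshold: a jump of more than 50% is held for review, not published.
  (obs, prev) =>
    prev && Math.abs(obs.price - prev.price) / prev.price > 0.5
      ? "price moved more than 50% since last run"
      : null,
];

// Run every check and collect the failures; an empty list means the record may proceed.
function verify(obs: PriceObservation, previous?: PriceObservation): string[] {
  return checks
    .map((check) => check(obs, previous))
    .filter((reason): reason is string => reason !== null);
}
```

A record with any failure reason would be routed to a retry or a human queue rather than written downstream, so the agent's confident mistakes are caught by code that knows what a plausible answer looks like.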

The patterns that ship are the narrow, scoped ones. The pattern that gets headlines is the open-ended consumer agent.

What to take away

The model is not the bottleneck — the harness is. Most failures are action errors and critic errors, neither of which a smarter model meaningfully fixes. The teams shipping reliable browser agents have invested in selector resolution, schema-validated extraction, and recovery loops, not in chasing the latest frontier model.
The infrastructure layer captures the value when the agent layer is fragmented. Browserbase wins by being the boring layer everyone needs. Manus wins or loses on consumer adoption math that hasn’t resolved yet. Pick which game you’re playing.
English at the fragile joints, code everywhere else. The Stagehand bet — that the durable interface is structured code with English at the parts that change — is the one converging with how production teams actually build. The “describe your goal and walk away” pitch is great for demos and bad for SLAs.
Cost transparency is the next competitive battleground. Opaque credit systems lose to per-minute billing the moment a CTO has to explain the line item to a CFO. The vendors who price predictably are pulling ahead in enterprise deals.

The “general agent that does anything in your browser” is real, sort of, sometimes. The “reliable automation that handles the part of your workflow that touches the web” is also real, and it’s the version that’s quietly becoming infrastructure.

Further reading: Browserbase’s Stagehand v3 launch post is the best technical summary of the harness layer. The WebArena paper is still the foundational benchmark and worth a fresh read with 2026 eyes. Anthropic’s harness design notes are the deepest published analysis of where computer-use agents fail and why. For the consumer side, MIT Technology Review’s Manus review captures the hype-vs-reality gap better than most.