datarekha
Agents May 9, 2026

Deep research agents explained: Perplexity, GPT Deep Research, Gemini Deep Research

Deep Research was OpenAI's first agent product to actually ship, and it landed in a market where Google had already been live for months and Perplexity had built the search-citation pipeline the others were chasing. The three converged on similar capabilities but radically different architectures.

13 min read · by datarekha · deep-research, perplexity, gemini, openai

In February 2025, OpenAI shipped its first product that anyone would unambiguously call “an agent”: Deep Research, a feature that takes a natural-language research question and returns, ten to thirty minutes later, a multi-thousand-word report with inline citations from across the open web. It was OpenAI’s “ChatGPT moment” for autonomous agents: the first time the public touched a system that planned, browsed, synthesized, and produced finished work without intermediate human steering.

The product was not, however, the first of its kind. Google’s Gemini Deep Research had shipped two months earlier, in December 2024, with the same basic loop. Perplexity’s Deep Research followed shortly after. Eighteen months later, all three are live, all three have evolved substantially, and the architectural divergences between them have become more interesting than the surface similarities.

This post compares the three on what actually distinguishes them: how they plan a research task, how they handle citations, and where each one wins in practice.

A bit of history

The “agent that does research” framing isn’t new. Microsoft had a research-assistant prototype in late 2023; AutoGPT and BabyAGI in 2023 were early experimental versions of the loop (with very different results). What changed was the underlying model quality plus the engineering investment to make these reliable enough to ship as commercial products. December 2024 (Gemini Deep Research’s launch) marks the first time anything in this space crossed into “consumers actually use it daily” territory.

OpenAI’s launch in February 2025 was the second crossing — the moment when the broader public realized agentic AI products could ship as polished consumer experiences. The interesting historical detail is that OpenAI was second to market, not first. Google’s December 2024 launch with Gemini 1.5 Pro arrived earlier and worked similarly well; OpenAI’s brand and distribution made it the product everyone heard about. Deep Research as a category became “the OpenAI thing” in the popular imagination even though Google’s product predated it.

What “deep research” actually is

The product category is now well-defined. A deep research agent:

  1. Takes an open-ended research question.
  2. Plans a multi-step investigation — typically decomposing the question into sub-queries.
  3. Browses the web (and, increasingly, private data sources) for sources matching each sub-query.
  4. Synthesizes findings across dozens to hundreds of sources.
  5. Returns a long-form report with inline citations.

The whole thing takes 5 to 30 minutes. The output is in the same shape as a junior research analyst’s first draft. The competition is on report quality, citation reliability, and what data sources can be reached.
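The five steps above can be sketched as a small pipeline. This is a minimal illustration, not any vendor's implementation: `plan`, `search`, and `synthesize` are hypothetical stubs standing in for an LLM planner, a retrieval backend, and an LLM writer, and the `example.com` URLs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    url: str
    text: str


@dataclass
class Report:
    question: str
    sections: list[str] = field(default_factory=list)
    citations: list[str] = field(default_factory=list)


def plan(question: str) -> list[str]:
    # Hypothetical planner: a real agent would ask an LLM to decompose the question.
    return [f"{question} background", f"{question} recent developments"]


def search(sub_query: str) -> list[Source]:
    # Hypothetical retrieval stub standing in for web or private-data search.
    slug = sub_query.lower().replace(" ", "-")
    return [Source(url=f"https://example.com/{slug}", text=f"Notes on {sub_query}")]


def synthesize(question: str, found: dict[str, list[Source]]) -> Report:
    # Stub synthesis: one sentence per source, each tagged with a numbered citation.
    report = Report(question=question)
    for sources in found.values():
        for src in sources:
            if src.url not in report.citations:
                report.citations.append(src.url)
            marker = report.citations.index(src.url) + 1
            report.sections.append(f"{src.text} [{marker}]")
    return report


def deep_research(question: str) -> Report:
    sub_queries = plan(question)                  # step 2: decompose
    found = {q: search(q) for q in sub_queries}   # step 3: browse
    return synthesize(question, found)            # steps 4 and 5: synthesize, cite


report = deep_research("GPU supply")
```

The real products differ mainly in which of these stubs is a dedicated subsystem and which is left to the model.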

[Figure: Deep research, the three architectures. Perplexity (retrieval-first): query plan, hybrid retrieval (BM25 + vector), cite-at-retrieval plus LLM synthesis. OpenAI (tool-calling agent): GPT-5.2 orchestrator, browser + MCP connectors, iterative refinement, interruptible. Gemini (tiered + async): Standard (low latency) vs. Max (deep), Interactions API with background=true, MCP private data.]
All three return long-form cited reports, but the architectures reflect very different bets. Perplexity is a search company that added LLMs; OpenAI is an LLM company that added browsing; Google is splitting interactive and async into two products.

Why the architecture matters

These three architectural choices have real consequences for users. A retrieval-first system like Perplexity’s can show you exactly which documents contributed to which claims — the citation lineage is explicit because citations were attached before generation. An agent-with-browsing system like OpenAI’s hides this; the model decided what to cite as part of writing the prose, and you have to trust that the cited URL actually says what the model claims.

The tiered system Google chose is a different bet — that the same architecture should be configurable for “fast and shallow” versus “slow and thorough” use cases. The implication is that customers want both modes from the same product, which is empirically true for many enterprise workflows.

No architecture is strictly better. Perplexity’s transparency comes at the cost of less narrative coherence. OpenAI’s polish comes at the cost of opacity. Google’s flexibility comes at the cost of complexity in the API. For builders, the architectural fit to the use case matters more than the headline benchmark.

Perplexity — retrieval was the moat

Perplexity arrived at deep research from a different angle than OpenAI and Google. They had already spent years building a tightly-coupled retrieval-ranking-generation pipeline for their core search product. Deep Research was a natural extension: instead of doing one round of search-rank-generate, do dozens, build a multi-step plan, and produce a long-form output.

The architectural distinction that matters: Perplexity embeds citations at retrieval time, not at generation time. Their orchestration engine assembles the prompt with citation markers, source metadata (URLs, publication dates), and ranked document excerpts already inline before the LLM ever runs. The LLM then writes against that pre-structured context, and the citations are extracted from the markers rather than generated by the model. The model cannot invent a URL because it never has to write one.

This is the opposite of how the early ChatGPT browsing tool worked, where the LLM was asked to remember which fact came from which page and produce citations as part of the output. That approach generates hallucinated URLs at a non-trivial rate; Perplexity’s approach makes hallucinated URLs structurally impossible.
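A minimal sketch of the cite-at-retrieval idea, assuming a simple `[S1]`-style marker format. The marker syntax, the `Excerpt` type, and the function names are illustrative assumptions, not Perplexity's actual format. The point it shows: the model only ever emits markers, and URLs are looked up from the retrieval results afterwards.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Excerpt:
    url: str
    published: str
    text: str


def build_context(excerpts: list[Excerpt]) -> str:
    # Each excerpt gets a stable marker before the LLM sees anything.
    return "\n".join(
        f"[S{i}] ({ex.url}, {ex.published}) {ex.text}"
        for i, ex in enumerate(excerpts, start=1)
    )


MARKER = re.compile(r"\[S(\d+)\]")


def resolve_citations(draft: str, excerpts: list[Excerpt]) -> list[str]:
    # Map markers in the model's draft back to retrieved URLs.
    urls: list[str] = []
    for match in MARKER.finditer(draft):
        idx = int(match.group(1))
        if 1 <= idx <= len(excerpts):  # unknown markers are dropped, never guessed
            url = excerpts[idx - 1].url
            if url not in urls:
                urls.append(url)
    return urls


excerpts = [
    Excerpt("https://example.com/a", "2025-11-02", "Revenue grew 12%."),
    Excerpt("https://example.com/b", "2025-12-14", "Margins fell two points."),
]
context = build_context(excerpts)
draft = "Revenue grew [S1]. Margins fell [S2][S9]."  # [S9] was never retrieved
cited = resolve_citations(draft, excerpts)
```

A marker that points at nothing (`[S9]` here) resolves to no URL at all, which is the structural guarantee described above. It does not guarantee that the sentence next to the marker faithfully reflects the source.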

For their Sonar Deep Research product:

  • A standard Pro Search retrieves 60+ sources per query.
  • Deep Research formulates a multi-turn search plan, breaking complex queries into separate searches, and reads hundreds of sources.
  • Retrieval is hybrid — BM25 lexical search combined with semantic vector embeddings.
  • The orchestration engine ranks documents per sub-query, embeds the highest-ranked excerpts directly into the LLM context with explicit citation tags, and asks the LLM to synthesize.
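How the lexical and vector rankings are combined is not specified here. One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below as an assumption about what "hybrid" could look like, not as Perplexity's method. The document IDs are made up, and `k = 60` is the conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one list using RRF scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["doc_a", "doc_b", "doc_c"]    # e.g. a BM25 ranking
semantic = ["doc_b", "doc_d", "doc_a"]   # e.g. a vector-similarity ranking
fused = reciprocal_rank_fusion([lexical, semantic])
```

Documents that rank well in both lists (`doc_b`, `doc_a`) rise above documents that appear in only one, without needing the two scoring scales to be comparable.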

Where Perplexity wins: when source attribution and reliability matter more than narrative polish. The citation accuracy is consistently better than the alternatives in independent reviews. For journalism, fact-checking, and any domain where readers will actually click the citations, this matters a lot.

Where it loses: the output style is closer to a bibliography with prose than a cohesive narrative report. If you want a report that reads like a McKinsey deliverable, Perplexity isn’t it.

OpenAI — tool-calling on a frontier model

OpenAI’s Deep Research is architecturally simpler in concept: take GPT-5.2 (the current research-tuned variant), give it a web browser, a code interpreter, and an expanding set of connectors, and let it run an orchestration loop for ten to thirty minutes. The model itself plans, browses, reads, and synthesizes. There’s no separate retrieval-ranking system in front of it.

The trick is that the LLM is good enough to handle the whole loop. GPT-5.2 maintains coherent multi-step plans, recovers from dead-end searches, and produces narrative output that reads like a human-authored report. The trade is that citations are model-generated and therefore subject to the usual hallucination risks — though OpenAI has invested heavily in reducing these via tool-result grounding.

The February 2026 release added two important capabilities:

  • MCP connectors — Deep Research can now reach Google Drive, SharePoint, financial databases, and any MCP-exposed source. This is where the enterprise pitch lives: equity research analysts can point Deep Research at their internal model library plus the open web in a single query.
  • Interruption and refinement — users can intervene during execution to redirect the research. Mid-run, “actually focus on the European market” is now a supported action. This is the OpenAI answer to the “long-horizon goal drift” problem: let the human re-ground when the model strays.

The product was, by OpenAI’s own positioning, the on-ramp to their broader ChatGPT Agent product. ChatGPT Agent (launched July 2025) generalized the Deep Research loop into a more general autonomous workflow runner with a virtual computer. Deep Research lives within ChatGPT Agent as one of its capabilities now.

Where OpenAI wins: report quality and narrative coherence. For a one-off briefing memo, an OpenAI Deep Research output usually reads the most like a polished human deliverable.

Where it loses: opacity into the research process. Perplexity shows you its search trail; OpenAI shows a heavily-edited highlight reel. For users who want to inspect why the model concluded X, the transparency gap is real.

Why OpenAI Deep Research feels different

A user who’s used both Perplexity and OpenAI Deep Research notices the difference immediately. OpenAI’s output reads more like an essay; Perplexity’s reads more like an annotated source list. The reason is partly stylistic but largely architectural: GPT-5.2 doing the synthesis is operating on summarized web content it browsed, with the model controlling the narrative flow. Perplexity’s synthesis is structured around the pre-ranked source list, with citations as load-bearing elements.

This matters for downstream use. If the user is going to paste the output into a deliverable, OpenAI’s narrative-heavy style needs less editing. If the user is going to fact-check before quoting, Perplexity’s structured citation-first style is easier to verify. Different products for different next-step actions.

The interruption feature OpenAI added in February 2026 is a nice ergonomic touch — users can refine the research direction mid-flight without restarting. Perplexity and Gemini don’t yet have an equivalent. For longer research runs, this matters: realizing 10 minutes in that you wanted a slightly different angle is more common than you’d think, and being able to redirect saves the cost of starting over.
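As a rough illustration of mid-run redirection, the sketch below keeps completed work and rewrites only the remaining plan when a user message arrives. It is a toy model under stated assumptions: the `interrupts` dictionary stands in for a live UI, and nothing here reflects how OpenAI implements the feature.

```python
from collections import deque


def run_with_interrupts(plan: list[str], interrupts: dict[int, str]) -> list[str]:
    """Run plan steps in order. `interrupts` maps a step index to a user
    message that arrives just before that step would run."""
    pending = deque(plan)
    log: list[str] = []
    step = 0
    while pending:
        if step in interrupts:
            focus = interrupts[step]
            # Re-plan: finished steps stay in the log, remaining steps are redirected.
            pending = deque(f"{task} [focus: {focus}]" for task in pending)
            log.append(f"replanned: {focus}")
        log.append(f"done: {pending.popleft()}")
        step += 1
    return log


log = run_with_interrupts(
    ["market size", "competitors", "pricing"],
    {1: "European market"},
)
```

The first step's result is preserved and only the two remaining steps are refocused, which is where the saving over a full restart comes from.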

Gemini — splitting Standard from Max

Google’s Gemini Deep Research was first to market in December 2024, initially powered by Gemini 1.5 Pro. The product evolved through 2025 alongside the Gemini model line. The defining architectural decision came in late 2025 with Deep Research Max — splitting the product into two tiers explicitly trading latency for thoroughness.

  • Deep Research (Standard) — Optimized for low-latency, interactive use. Sub-five-minute typical runs. Powered by Gemini 3.1 Pro.
  • Deep Research Max — Uses extended test-time compute, iteratively reasoning and searching for longer. Designed for asynchronous background workflows where the user is willing to wait 20-30+ minutes for a higher-fidelity output.

This tiering, Google argues, is a “step change for autonomous research agents” — the recognition that the same product needs to serve “I want a quick brief now” and “I want the best possible deliverable in an hour” with different model deployments.

The December 2025 release of the Gemini Deep Research Agent API made the agent programmable for the first time. The Interactions API uses a polling-based interface with a required background=true parameter for long-running calls. This is the first time a frontier lab has shipped a deep-research capability as a developer primitive rather than a consumer feature — enterprises can now build their own UIs and workflows on top.

MCP support is first-class. Like OpenAI, Gemini Deep Research can connect to private databases, internal document repositories, and third-party services through MCP servers — but with Google’s emphasis on enterprise data sovereignty (data doesn’t leave the customer environment).

Where Gemini wins: the Standard/Max tier split is genuinely useful — fast briefs in minutes, deep dives when needed, same product. And the developer API is the most mature for building research-powered features into other products.

Where it loses: report style is the most “AI-flavored” of the three — readers familiar with Gemini’s prose voice will recognize it immediately. Some find it overly hedged.

Why Gemini’s API matters

Worth highlighting that Gemini was first to make Deep Research a developer primitive via the Interactions API. Until that release, deep research was a feature of consumer products (Perplexity Pro, ChatGPT Plus, Gemini app) but not something you could programmatically embed in your own application. The Interactions API changed that.

For builders, this is significant. You can now offer a “deep research” capability in your product without building the research pipeline yourself — call the Gemini API, pass a question, poll for completion, get a structured report back. The Interactions API specifically introduces a background=true parameter for long-running calls, with polling for progress. This is a cleaner shape than trying to fit a 30-minute research run into a normal request-response cycle.
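The submit-then-poll shape can be sketched without tying it to any real client library. Everything below is hypothetical: `FakeResearchJob` is an in-memory stand-in, and the state names (`in_progress`, `completed`, `failed`) are assumptions rather than the Interactions API's actual response fields. Check the official docs for the real ones.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal state or times out."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status["state"] in ("completed", "failed"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still {status['state']} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


class FakeResearchJob:
    """In-memory stand-in for a background research job; not a real client."""

    def __init__(self, polls_until_done: int):
        self._remaining = polls_until_done

    def status(self) -> dict:
        if self._remaining > 0:
            self._remaining -= 1
            return {"state": "in_progress"}
        return {"state": "completed", "report": "(report text)"}


job = FakeResearchJob(polls_until_done=3)
result = poll_until_done(job.status, interval_s=1.0, sleep=lambda _s: None)
```

Injecting `sleep` keeps the helper testable. In production you would likely add jitter or backoff and persist the job ID so a restarted worker can resume polling.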

The implication is that “deep research as a feature in other products” becomes plausible. Expect to see deep research show up as an embedded capability in CRM tools, market intelligence platforms, legal research products, and enterprise wikis through 2026 and 2027. The API made the previously product-shaped capability into an ingredient.

The connector race

The most consequential 2025-2026 development across all three products is MCP support for private data sources. Until the Model Context Protocol became a de facto standard in mid-2025, deep research was effectively limited to the open web. Some products had bespoke connectors (Perplexity to specific finance feeds, Gemini to Google Workspace) but each was a custom integration.

MCP changed the economics. An enterprise that exposes its document repository, internal wiki, finance database, and CRM through MCP servers gets all three deep-research products at once. The customer’s data doesn’t leave their environment; the agents query through the MCP layer with the customer’s permissions enforced server-side.
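A toy sketch of what "permissions enforced server-side" means in practice. MCP messages are JSON-RPC 2.0 and tools are invoked with the `tools/call` method. The tool name `search_documents`, the document store, and the group-based access check are all hypothetical, and a real server would take the caller's identity from its own authentication layer.

```python
import json

# Hypothetical in-memory document store with per-document access groups.
DOCUMENTS = [
    {"id": "note-1", "title": "Q3 pricing model", "allowed_groups": {"equity-research"}},
    {"id": "note-2", "title": "Board minutes", "allowed_groups": {"executives"}},
    {"id": "note-3", "title": "Pricing FAQ", "allowed_groups": {"equity-research", "sales"}},
]


def make_tool_call(request_id: int, query: str) -> str:
    # What the agent sends: a JSON-RPC 2.0 request invoking a (hypothetical) tool.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search_documents", "arguments": {"query": query}},
    })


def handle_tool_call(raw_request: str, caller_groups: set[str]) -> list[str]:
    # Toy server-side handler. caller_groups comes from the server's own auth,
    # never from the request body, so the agent cannot widen its own access.
    request = json.loads(raw_request)
    query = request["params"]["arguments"]["query"].lower()
    return [
        doc["id"]
        for doc in DOCUMENTS
        if query in doc["title"].lower() and doc["allowed_groups"] & caller_groups
    ]


request = make_tool_call(1, "pricing")
sales_view = handle_tool_call(request, {"sales"})
research_view = handle_tool_call(request, {"equity-research"})
```

The same request returns different results for different callers, and whichever deep-research product sits on the other end never sees documents the caller could not open directly.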

This is the unlock for enterprise deep research. Equity research at a hedge fund doesn’t only need the open web — it needs the firm’s proprietary models, prior research notes, and earnings transcript archive. Legal research at a law firm needs the firm’s prior briefs and document repository. With MCP, deep research products become genuinely useful for these workflows for the first time.

OpenAI’s February 2026 release explicitly mentions Google Drive, SharePoint, and financial databases as supported connectors. Gemini’s Deep Research Max emphasizes MCP as the integration story. Perplexity’s enterprise tier ships with similar capabilities. The product space has effectively standardized on MCP as the enterprise integration protocol — a notable counterpoint to the usual lab competition.

The same question to all three

A useful test: ask all three the same equity research question. “What are the key competitive risks to NVIDIA’s data center business over the next 3 years, with quantitative support?”

In our testing across these products in early 2026, the patterns hold consistently:

  • Perplexity returns 80-120 citations spanning analyst reports, SEC filings, news articles, and technical blogs. The output reads like an annotated literature review — heavy on quotes, careful attribution, modest narrative integration. Best for the user who will read the cited sources themselves.

  • OpenAI returns 40-60 citations, weighted toward the highest-quality sources, in a tightly-narrated report that reads like a research note from a sell-side analyst. Two or three of the cited URLs will be subtly mis-described — not hallucinated but with a paraphrased title that doesn’t quite match the linked page. Best for an executive briefing.

  • Gemini Max returns 100-150 citations with the most extensive use of charts and visualizations (the API supports inline chart generation). The narrative is more cautious than OpenAI’s, with more “however” and “on the other hand” hedges. Best for a long-form analytical document.

None of the three is dominant on all axes. The choice depends on what the deliverable is for.

Where each product is heading

A brief look at the trajectory. Perplexity continues to invest in citation accuracy and source quality — their bet is that as the products converge on capability, trustworthiness becomes the differentiator. They’ve been adding source-quality signals (peer-reviewed-versus-blog, primary-versus-aggregated, recency weighting) to their ranking pipeline.

OpenAI is positioning Deep Research as one capability inside the broader ChatGPT Agent product. The strategic move is to make research one of many things the agent does, rather than a standalone feature. Long-term, expect the boundaries between “deep research” and “general autonomous task” to blur on OpenAI’s side.

Google’s bet is on the API and enterprise integrations. Deep Research Max plus the Interactions API plus MCP connectors plus Workspace integration suggests Google is positioning for the enterprise research-platform use case — the place where businesses run hundreds of research queries a day against a combination of public and private data.

These three trajectories are diverging more than converging. By 2027, expect the products to be quite different from each other rather than slowly merging into a single shape.

The latency-thoroughness tradeoff

The split that Google formalized between Deep Research Standard and Max captures a tension that all three products are wrestling with. Researchers want both “fast brief in 5 minutes” and “thorough deliverable in 30 minutes” — and these are very different products under the hood.

The fast variant is essentially a single-pass agent: plan, execute 10-20 search queries in parallel, synthesize. The thorough variant is iterative: plan, execute, re-plan based on findings, execute more, refine. Each iteration adds depth but multiplies wall-clock time.
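A back-of-the-envelope model makes the wall-clock difference concrete. The durations below are invented for illustration, and the model assumes searches within a round run fully in parallel, so a round costs roughly one search latency.

```python
def single_pass_seconds(plan_s: float, search_s: float, synth_s: float) -> float:
    # One plan, one parallel batch of searches, one synthesis.
    return plan_s + search_s + synth_s


def iterative_seconds(
    plan_s: float, search_s: float, replan_s: float, synth_s: float, rounds: int
) -> float:
    # Each round is a parallel search batch; a re-plan sits between consecutive rounds.
    return plan_s + rounds * search_s + (rounds - 1) * replan_s + synth_s


fast = single_pass_seconds(plan_s=20, search_s=90, synth_s=120)
thorough = iterative_seconds(plan_s=20, search_s=90, replan_s=30, synth_s=120, rounds=6)
```

With these made-up numbers the fast variant finishes in under four minutes and the six-round variant takes about fourteen. Later rounds cannot start until earlier ones are read and re-planned, so the extra depth cannot be bought back with parallelism.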

OpenAI’s product handles this through user choice — the user can interrupt and refine mid-run, effectively bending the agent toward more or less depth on the fly. Perplexity handles it by exposing different products (Pro Search for fast, Deep Research for thorough). Google’s tiered Standard/Max is the most explicit, with the choice made up-front and the model deployment changed accordingly.

The architectural lesson is that deep research isn’t one product, it’s at least two. The same backend running for 5 minutes vs. 30 minutes is doing qualitatively different work — not just more of the same. The successful products acknowledge this rather than pretending one knob fits all uses.

Time-to-first-token vs. time-to-final-report

A useful framing for thinking about user experience in deep research products. Most user-facing AI products optimize for TTFT (time to first token) — how quickly the response starts. Deep research products optimize for TTFR (time to final report) — how quickly the complete deliverable is ready.

These metrics pull in different directions. TTFT optimization favors quick, surface-level answers. TTFR optimization favors thorough multi-step investigation. The products handle the tension differently:

  • Perplexity streams partial results as they’re synthesized, giving the user something to read at intervals even while the full report is still being produced. TTFT is fast; TTFR is slower.
  • OpenAI’s Deep Research shows progress (a list of sub-questions being investigated) but the full report drops at the end. TTFT-as-real-content is slow; the user sees activity but not content during the wait.
  • Gemini Deep Research Max is the most async — it’s explicitly designed for “fire and forget” with a notification when complete. TTFT isn’t optimized at all; users are expected to come back later.

Different optimizations for different use cases. For interactive research where the user wants to refine, Perplexity’s streaming is ideal. For overnight batch research, Gemini Max is ideal. For producing a polished briefing document for tomorrow’s meeting, OpenAI’s silent-wait-then-finished-report is fine.

The business story — equity research

The most economically consequential use case for deep research agents is not the consumer browsing experience — it’s professional research workflows. Sell-side equity research analysts, management consultants, and law firm associates are the population whose entry-level work most directly resembles what these tools produce.

A senior research analyst at a mid-sized hedge fund described their workflow shift in early 2026: “I used to staff three junior analysts to produce the kind of preliminary research a deep research agent now produces in 30 minutes. I still need the juniors — but they spend their time fact-checking and extending the agent’s output, not generating the first draft. The headcount math is changing.”

The displacement isn’t total — the agents make mistakes, miss context, and lack the firm-specific judgment that experienced analysts bring. But the first draft is real work, and that’s now mostly automated. Consulting firms have noticed; internal McKinsey memos leaked in late 2025 (the same playbook is being run at BCG, Bain, and the Big Four) described systematic experiments to integrate Deep Research-class tools into the standard associate workflow.

The implications, in 2026:

  • Junior research roles are being displaced first, not through layoffs but through non-replacement. New analyst classes are smaller, and the work has shifted to fact-checking and extension.
  • The market value of “domain plus AI fluency” has risen sharply. Researchers who use these tools well produce 3-5x the throughput of those who don’t.
  • The output expectations have shifted upward. What was a Monday morning briefing memo in 2023 is now a Sunday night Deep Research run, with the human’s value-add being the analysis on top.

What goes wrong with deep research

The products are real but not infallible. The failure modes are consistent across the three:

Citation drift in the body. All three models will sometimes write a sentence that’s almost what the cited source says, but with the nuance shifted. The model paraphrases, and the paraphrase introduces a subtle error not present in the source. Perplexity’s retrieval-first architecture reduces this but doesn’t eliminate it; OpenAI’s narrative-heavy style is most prone to it.

Stale sources presented as current. A 2023 article describing the state of an industry may be cited alongside 2025 sources without the temporal context made obvious. The model treats all retrieved text as equally valid. For market intelligence work, this can produce conclusions that were true a year ago but no longer are.
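One cheap mitigation on the consuming side is to flag old citations before a human reads the report. A minimal sketch, assuming each citation carries a publication date. The one-year threshold and the field names are arbitrary choices for illustration.

```python
from datetime import date


def flag_stale(sources: list[dict], as_of: date, max_age_days: int = 365) -> list[str]:
    """Return URLs of sources published more than max_age_days before as_of."""
    return [
        s["url"]
        for s in sources
        if (as_of - s["published"]).days > max_age_days
    ]


sources = [
    {"url": "https://example.com/industry-2023", "published": date(2023, 6, 1)},
    {"url": "https://example.com/update-2025", "published": date(2025, 11, 1)},
]
stale = flag_stale(sources, as_of=date(2026, 1, 15))
```

This only catches sources whose dates were extracted correctly. Undated pages need a separate policy.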

Phantom consensus. When the model finds three sources saying X and one saying Y, it tends to report “consensus around X” — even when the dissenting source is more credible. This is a known weakness in RAG-style synthesis and shows up in deep research outputs regularly. Quality reviewers note that the agent’s notion of “consensus” is essentially counting articles, not weighting them.
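The difference between counting and weighting can be shown in a few lines. The credibility scores below are invented for illustration. How to assign them in practice is the hard part, and this sketch does not address it.

```python
from collections import defaultdict


def count_vote(claims: list[tuple[str, float]]) -> str:
    # Naive consensus: every source counts once, regardless of quality.
    tally: dict[str, int] = defaultdict(int)
    for position, _credibility in claims:
        tally[position] += 1
    return max(tally, key=tally.get)


def weighted_vote(claims: list[tuple[str, float]]) -> str:
    # Weighted consensus: each source contributes its credibility score.
    tally: dict[str, float] = defaultdict(float)
    for position, credibility in claims:
        tally[position] += credibility
    return max(tally, key=tally.get)


# Three low-credibility sources say X; one high-credibility source says Y.
claims = [("X", 0.2), ("X", 0.2), ("X", 0.2), ("Y", 0.9)]
by_count = count_vote(claims)
by_weight = weighted_vote(claims)
```

Counting reports X; weighting reports Y. A report that says "consensus around X" here is describing the first function's output.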

The 80% answer. All three products produce an answer that’s 80% correct, 95% of the time. The remaining 5% — wholesale wrong, plausibly written — is the dangerous output. Users who accept the output uncritically eventually ship something that’s wrong in a way that matters. The teams using these tools well treat the output as a draft, not a deliverable.
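Taking the 5% figure at face value, and assuming (a simplification) that runs fail independently, the chance of hitting at least one wholesale-wrong report grows quickly with volume:

```python
def p_at_least_one_bad(runs: int, bad_rate: float = 0.05) -> float:
    """Probability that at least one of `runs` independent reports is bad."""
    return 1.0 - (1.0 - bad_rate) ** runs


two_weeks_daily = p_at_least_one_bad(14)   # one run per day for two weeks
hundred_runs = p_at_least_one_bad(100)
```

At one run a day, the odds of having seen a bad report pass 50% within two weeks (about 0.51), and at a hundred runs it is a near certainty. That is the arithmetic behind treating every output as a draft.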

The implication is that deep research is a productivity tool for someone who already knows the domain, not a substitute for domain expertise. The expert can scan the output, spot the mistakes, and integrate the rest. The novice cannot, and is the more likely to ship the 5% wrong answer with confidence.

Pricing as a signal

A note on pricing as a window into how each company views the product. Perplexity bundles Deep Research into their $20/month Pro subscription with a per-month usage limit. OpenAI bundles it similarly into ChatGPT Plus and Pro tiers. Google’s API pricing for Deep Research is more transparent — per-token billing on the underlying Gemini calls, plus per-search costs for the search tool calls.

The strategic positioning differs accordingly. Perplexity and OpenAI are treating Deep Research as a consumer subscription feature; Google is treating it as a developer platform primitive. Both can be right — they’re different products targeting different markets — but the implications are different for builders.

If you’re building a deep-research-powered feature into a B2B product, Google’s API-first approach is the cleanest match. If you’re trying to give your users a deep-research experience via your app’s existing chat UI, Perplexity’s developer API (newer and more limited) or OpenAI’s Responses API are alternatives. The price per research run varies, but it generally falls in the $1-$5 range at the depth a typical research job reaches.

Why this category emerged when it did

A historical observation: deep research products needed three things to be possible. First, base models capable of multi-step planning without immediately drifting (broadly, mid-2024 and later). Second, reasonably-priced tool-calling overhead (the cost of executing dozens of search-and-read calls per research run). Third, the engineering investment to build the orchestration loops and citation infrastructure.

All three converged in late 2024, which is why Google shipped first that December and the rest followed within months. The convergence wasn’t coincidental — every lab was watching every other lab’s model and infrastructure progress, and the moment the components were in place, the product became obvious.

Caching and deep research

A specific operational angle: deep research workflows benefit substantially from prompt caching (covered in our caching post). The system prompt and tool definitions are stable across runs; only the user’s research question varies. Whether the provider uses explicit, Anthropic-style cache breakpoints or automatic caching of the kind OpenAI applies, the per-run input costs drop sharply once the cache is warm.

For high-volume Deep Research deployments — a market intelligence platform running hundreds of queries a day, an equity research firm running a research run per ticker per week — caching is the difference between viable unit economics and uneconomic operations. The labs offering these products have all built caching into their stacks; third-party builders should make sure they’re getting the savings too.
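A rough cost model shows where caching helps and where it does not. Every number below is a hypothetical placeholder, not any provider's price list, and the model ignores cache-write surcharges and the uncached first call.

```python
def run_cost_usd(
    prefix_tokens: int,         # stable system prompt + tool definitions
    model_calls: int,           # the prefix is re-sent on every call in a run
    fresh_input_tokens: int,    # retrieved pages and intermediate notes
    output_tokens: int,
    searches: int,
    in_price_per_m: float,      # USD per million input tokens (hypothetical)
    out_price_per_m: float,     # USD per million output tokens (hypothetical)
    search_price: float,        # USD per search call (hypothetical)
    cached_discount: float = 0.0,  # fraction taken off cached input tokens
) -> float:
    prefix_cost = prefix_tokens * model_calls / 1e6 * in_price_per_m * (1.0 - cached_discount)
    fresh_cost = fresh_input_tokens / 1e6 * in_price_per_m
    output_cost = output_tokens / 1e6 * out_price_per_m
    return prefix_cost + fresh_cost + output_cost + searches * search_price


params = dict(
    prefix_tokens=8_000, model_calls=40, fresh_input_tokens=200_000,
    output_tokens=30_000, searches=50,
    in_price_per_m=2.0, out_price_per_m=10.0, search_price=0.01,
)
uncached = run_cost_usd(**params)
cached = run_cost_usd(**params, cached_discount=0.9)
```

With these placeholder rates the run costs about $1.84 uncached and about $1.26 with a 90% discount on the repeated prefix. The saving is real, but it applies only to the stable prefix: fresh page content, output tokens, and search calls are untouched, so the benefit scales with how many model calls each run makes.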

What deep research won’t do

A few things that deep research products in 2026 are conspicuously not good at:

  • Highly current events. Search indexes lag the live web by hours; the products work best for questions where the answer doesn’t depend on the last 30 minutes of news. Live financial markets, breaking news, and rapidly-evolving controversies are weak spots.
  • Niche specialist knowledge. Anything that doesn’t have a reasonable presence on the open web — proprietary research, internal company knowledge without MCP exposure, ancient or obscure sources — is essentially invisible to deep research.
  • Genuinely novel synthesis. The products synthesize sources well; they don’t generate genuinely new analysis. If the answer requires combining ideas in a way no source has, the model will gesture at the synthesis but won’t fully execute it.
  • Quantitative depth. Asking deep research for “produce a financial model” or “calculate the projected ROI” returns surface-level numerical reasoning. The numerical analysis tools needed for quantitative depth aren’t built into the research pipelines.

These limitations are durable, not transitional. They flow from the nature of the products — search-and-synthesize over public corpora — and aren’t going to be solved by larger base models. For workflows that need these capabilities, deep research is an input to the analyst’s work, not the deliverable.

Deep research as a compute story

Worth noting that deep research is the first AI product category where the inference-compute cost per output is genuinely high. A typical Deep Research run consumes the equivalent of dozens of normal chat sessions: hundreds of thousands of tokens, plus the search and tool-use overhead. The per-run economics are very different from “user asks question, model answers.”

This has knock-on effects. The frontier labs can offer Deep Research as a feature of their consumer products only because they can subsidize the compute cost out of subscription revenue. Standalone third-party Deep Research products would struggle with the unit economics unless they charge more — which is partly why most consumer-facing deep research products are bundled into broader subscriptions rather than sold per-run.

For builders, the implication is that running thousands of Deep Research calls is expensive — not just in API spend but in the underlying compute. The labs are unlikely to make this dramatically cheaper short-term; the cost is the cost. Products built on Deep Research APIs need to account for this in their own pricing.

The depth-vs-breadth tradeoff

A useful framing for thinking about what makes a deep research run “deep.” There are two axes:

  • Breadth: how many distinct sources are consulted. Perplexity Deep Research can read hundreds; OpenAI typically dozens; Gemini Max varies based on the run.
  • Depth: how thoroughly each source is read. Some products skim; others read carefully; the difference matters for nuanced questions.

A bibliographic survey wants breadth. An expert opinion synthesis wants depth. No single product is optimal for both. Perplexity tilts heavily toward breadth (citation-rich, source-heavy); OpenAI tilts toward depth on the sources it does read; Gemini Max is most configurable but at the cost of higher latency and complexity.

For builders building on top of these APIs, the breadth-depth dimension is the next major design choice after picking a base product. The Gemini API exposes this directly through different agent variants; OpenAI’s Deep Research has it implicitly via the model’s choices; Perplexity’s API gives you less control. Match the choice to the use case: legal research and journalism want depth, market intelligence and literature review want breadth.

How the products handle conflicting sources

A research question often surfaces sources that disagree. How each product handles this is one of the more telling architectural differences.

Perplexity typically flags the disagreement explicitly, with citations on both sides and language like “sources differ on…” or “X argues, while Y contends…” The retrieval-first architecture surfaces the disagreement at the document level, and the model is structurally encouraged to acknowledge it.

OpenAI tends to synthesize toward a consensus position even when sources are split. The narrative-first style pushes the model toward a coherent storyline; dissenting sources sometimes get a brief mention but rarely full equal-weight treatment.

Gemini sits between the two. The output often acknowledges disagreement but tends to take a position. The Max variant is more thorough about presenting multiple sides; the Standard variant simplifies.

For users doing research where the disagreement itself is the question (legal contests, scientific debates, policy controversies), this matters a lot. Perplexity’s structural acknowledgment of disagreement is probably more honest; OpenAI’s narrative coherence is more readable. Neither is wrong — they’re different products for different uses.

Failure stories from the field

A few specifics to make the failure modes concrete. An equity research team running OpenAI Deep Research on Q1 earnings analysis received a report citing a 2024 article describing Tesla’s Q3 results as the “most recent quarter.” The article was 18 months old at the time; the model presented it as current. Caught by the analyst, but a near-miss.

A consulting firm’s Gemini Deep Research run on European telecom market dynamics returned a report that confidently cited a regulatory document. The URL was real and the document existed, but the document said roughly the opposite of what the report claimed. The model had paraphrased aggressively enough to invert the meaning.

A journalist’s Perplexity Deep Research on a contested historical event returned 90+ citations with careful attribution — and three of those citations were to a single biased source repeated across syndicated republications. The model treated each republication as independent confirmation; the actual independence of the underlying sources was much lower.
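A first-pass check for this failure is to group citations by a fingerprint of their text, so that verbatim republications collapse into one underlying source. This is a minimal sketch: exact-match hashing after normalization only catches near-verbatim copies, and lightly edited syndication would need a fuzzier technique such as shingling. The field names and URLs are illustrative.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    # Normalize case, punctuation, and whitespace, then hash.
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_by_content(citations: list[dict]) -> list[list[str]]:
    """Group citation URLs whose text is the same after normalization."""
    groups: dict[str, list[str]] = {}
    for c in citations:
        groups.setdefault(fingerprint(c["text"]), []).append(c["url"])
    return list(groups.values())


citations = [
    {"url": "https://example.com/wire", "text": "The treaty was signed in haste."},
    {"url": "https://example.org/copy", "text": "The treaty was signed in haste"},
    {"url": "https://example.net/feed", "text": "  THE TREATY was signed,  in haste. "},
    {"url": "https://example.edu/paper", "text": "Archival records show a long negotiation."},
]
groups = group_by_content(citations)
```

Four citations reduce to two independent groups, which is the number a reader should weigh.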

These are not theoretical failures. They’re the daily texture of working with deep research tools. Users who treat them as drafts catch the errors; users who treat them as deliverables ship the errors. The products are not yet at the “trust without verification” stage and won’t be for some time.

What this means for builders

Three takeaways for anyone building research-adjacent products:

  • Don’t roll your own deep research from scratch unless you have a deep moat. The frontier APIs (especially Gemini’s Interactions API and OpenAI’s GPT-5.2 with tool calling) are now cheap enough and good enough that you should build on top of them rather than rebuild the layer beneath.
  • Citation reliability is the differentiator at the high end. If your users are professional researchers, the Perplexity-style retrieval-first architecture has cleaner accountability than tool-calling agents that generate citations as prose. Build for the audit trail.
  • The tier-split model is the future. Standard for interactive, Max for thorough — most research products will end up with at least this dimensionality. Google was first to ship it; expect imitators.

The most striking thing about deep research as a product category is how quickly the surprise has worn off. Eighteen months ago, “the AI reads hundreds of sources and writes a report” was a magic-feeling demo. Today it’s an expected feature in any AI search product. The pace of capability normalization is part of what makes the broader agent story interesting — what looks like a moat in March is table stakes by November.


Further reading: OpenAI’s Deep Research launch post, Perplexity’s Deep Research announcement, Google’s Deep Research Max blog, and the Gemini Deep Research API docs. For an architectural deep-dive on Perplexity’s pipeline, see this writeup. For our broader thinking on the agent patterns that underpin these products, see the five patterns post.
