Cloud Run is the most underrated platform for AI agents

The agent infrastructure conversation in 2024 was loud about two things: LangGraph versus LangChain, and “which model is best.” It was almost silent about where the thing actually runs. The default answer that emerged — “throw it on GKE” or “stick it behind a Fargate task” — was driven by habit, not by the shape of agent workloads.

If you sketch out what an agent actually does, it looks nothing like a web service. It receives a request. It calls a model — sometimes for seconds, sometimes for minutes. It calls tools. It calls the model again. Then it goes quiet for hours until the next user shows up. Burn-then-idle. Spiky. Long per-request. Mostly waiting on someone else’s network.

That is exactly the workload Cloud Run was built for, and after the 2024 platform updates — startup CPU boost going GA, the memory ceiling lifting to 32 GiB, request timeout extending to 60 minutes, GPU support landing — it quietly became the cheapest, simplest way to put a real agent in front of users. The teams I see shipping the fastest agents in 2026 are largely on this stack. They just don’t talk about it, because “we deploy to Cloud Run” doesn’t sell conference talks.

Why “stateless service” is the right shape for an agent

The first thing to internalise is that a well-designed agent has no in-process state worth preserving. The plan, the conversation history, the intermediate tool results — all of it should already live in an external store, because otherwise a single instance restart wipes the user’s session. The moment you accept that, the deployment question collapses. You no longer need pods that “stay warm” or sticky load-balancing. You need a function-shaped runtime that wakes up, reads state, does work, writes state, and goes back to sleep.

Figure: the Cloud Run agent topology. The service itself holds nothing; every piece of durable context lives in Firestore. Replace the instance at any time and the user’s session is unaffected.

This shape is the whole reason the rest of the post works. If your agent insists on multi-hour in-memory orchestration, none of this applies; you want a long-running pod somewhere. If your agent is a loop that reads state, advances the plan one step, and writes state, you want Cloud Run.
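To make the read, advance, write shape concrete, here is a minimal sketch. Everything in it is illustrative: Session, SessionStore, InMemoryStore, run_step and advance are hypothetical names, and the in-memory store only stands in for Firestore so the example runs anywhere.

```python
import copy
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Session:
    """Everything the agent needs to resume; nothing lives in the process."""
    session_id: str
    pending_steps: list[str]
    results: list[str] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return not self.pending_steps


class SessionStore(Protocol):
    def load(self, session_id: str) -> Session: ...
    def save(self, session: Session) -> None: ...


class InMemoryStore:
    """Stand-in for Firestore (assumption: a real store persists across instances)."""

    def __init__(self) -> None:
        self._docs: dict[str, Session] = {}

    def load(self, session_id: str) -> Session:
        # Deep copies mimic an external store: callers never share live objects.
        return copy.deepcopy(self._docs[session_id])

    def save(self, session: Session) -> None:
        self._docs[session.session_id] = copy.deepcopy(session)


def run_step(step: str) -> str:
    """Placeholder for a model call or tool call."""
    return f"result of {step}"


def advance(store: SessionStore, session_id: str) -> Session:
    """Read state, advance the plan by one step, write state back."""
    session = store.load(session_id)
    if not session.done:
        step = session.pending_steps.pop(0)
        session.results.append(run_step(step))
        store.save(session)
    return session
```

Because advance keeps nothing between calls, any instance can pick up the next step of any session, which is the property the rest of the post leans on.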

What changed in 2024 that made this viable

Cloud Run circa 2022 was a perfectly fine place to run a small Flask app and a hostile place to run an LLM workload. Four product changes flipped that.

Startup CPU boost went GA. Before this, a cold Cloud Run instance got its allocated CPU and that was it; loading a Python process with google-cloud-aiplatform imported took 3–6 seconds on a 1-vCPU service. With boost enabled, the platform temporarily gives the instance extra CPU during startup and for a short window afterwards (on a 1-vCPU service, roughly double), which can cut the CPU-bound part of a cold start by up to about half. The gain depends on how much of your startup is import and initialisation work, so measure it on your own container. Google’s own documentation makes this a one-flag change; almost nobody flips it because almost nobody knows.

The memory ceiling moved to 32 GiB. Long agent loops that build up context windows over many turns, plus moderate-sized embedding caches, plus a few PDFs in flight — easily 8–12 GiB. The old 8 GiB cap was a hard wall. The new ceiling means most real agents fit, including ones that hold a vector index in process for a single user’s session.

Request timeout went to 60 minutes. This sounds boring. It is not. A 60-minute timeout means a deep research agent, a long codegen run, or a multi-step data analysis can complete on a single Cloud Run invocation without writing your own retry / resume logic. (You still want to checkpoint, but you don’t have to.)

Concurrency-per-instance is yours to set. Cloud Run lets you say “one instance handles up to N concurrent requests.” For a synchronous web app, you crank N high; for an agent that pegs CPU for its whole lifetime, you set N to 1 and let the platform spin up more instances. Both modes work. Under the default request-based billing, you pay for the vCPU and memory allocated to an instance only while it is processing at least one request (minimum instances aside), so neither mode charges you for the quiet hours.
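The settings above map onto flags of gcloud run deploy. The sketch below only assembles the command line so the choices are visible in one place; the function name, the defaults and the image path are assumptions for illustration, not recommendations from Google.

```python
import shlex
from typing import Optional


def cloud_run_deploy_args(
    service: str,
    image: str,
    region: str,
    *,
    memory: str = "4Gi",
    cpu: int = 2,
    concurrency: int = 1,         # one request per instance for CPU-bound agents
    timeout_seconds: int = 3600,  # the 60-minute request timeout
    min_instances: int = 0,       # scale to zero when idle
    max_instances: int = 20,      # assumed ceiling; tune to your traffic
    service_account: Optional[str] = None,
) -> list[str]:
    """Build the argument list for `gcloud run deploy` (does not run it)."""
    args = [
        "gcloud", "run", "deploy", service,
        f"--image={image}",
        f"--region={region}",
        "--cpu-boost",  # startup CPU boost
        f"--memory={memory}",
        f"--cpu={cpu}",
        f"--concurrency={concurrency}",
        f"--timeout={timeout_seconds}",
        f"--min-instances={min_instances}",
        f"--max-instances={max_instances}",
    ]
    if service_account:
        args.append(f"--service-account={service_account}")
    return args


if __name__ == "__main__":
    # Hypothetical service name and image path.
    print(shlex.join(cloud_run_deploy_args(
        "agent", "us-docker.pkg.dev/my-project/agents/agent:latest", "us-central1")))
```

Printing the command rather than executing it keeps the sketch safe to run locally; pass the list to subprocess.run if you want it to deploy.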

Identity is the part that should have been the headline

The least-discussed advantage of the GCP stack is that the agent service running on Cloud Run can call Vertex AI without ever holding an API key. The service runs as a service account; that service account is granted the roles/aiplatform.user role; the Google client libraries automatically pick up the credentials from the metadata server. No GOOGLE_API_KEY in environment. No Secret Manager round-trip on every call. No “we leaked our key on GitHub” incident.

The same identity flows to Firestore (roles/datastore.user), BigQuery (roles/bigquery.user), and any other GCP-native tool the agent calls. The blast radius of a compromised container is exactly what the service account can do, and that’s auditable in Cloud Audit Logs.
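For the curious, this is roughly what the client libraries do on your behalf: ask the metadata server for a short-lived access token for the service account the instance runs as. You would not normally write this yourself; the sketch is only here to show that no secret is involved. The function names are hypothetical, and fetch_access_token only works on a GCP runtime.

```python
import json
import urllib.request

# Metadata server endpoint for the default service account's access token.
METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def build_token_request() -> urllib.request.Request:
    """Build the token request; the Metadata-Flavor header is required."""
    return urllib.request.Request(
        METADATA_TOKEN_URL, headers={"Metadata-Flavor": "Google"}
    )


def fetch_access_token(timeout: float = 2.0) -> str:
    """Return a short-lived OAuth access token.

    Only works where the metadata server is reachable (Cloud Run, GCE, GKE);
    anywhere else the request fails.
    """
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        return json.load(resp)["access_token"]
```

The token expires on its own and is re-fetched as needed, which is why there is nothing to rotate and nothing to leak from a repository.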

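Compare that to a pattern that is still common in AWS agent deployments: a model API key stored in Secrets Manager, retrieved on cold start, cached in memory, refreshed on a timer, and, depending on the team, sometimes just shoved into an environment variable because the rotation logic broke and nobody noticed. AWS can avoid this too, since a Fargate task role gives Bedrock calls short-lived IAM credentials, but on Cloud Run the keyless path is the one the client libraries take by default. That makes the Cloud Run + Vertex story structurally better, not just superficially more convenient.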

State, in detail: why Firestore is the right partner

The session-state question for agents has three real options on GCP: Firestore, Memorystore (Redis), and Cloud SQL. The case for Firestore in this stack is specific.

Document shape matches plan state. An agent’s plan is a tree of tasks with results and timestamps. That’s a JSON document. Putting it in a relational table is gymnastics; putting it in Redis loses durability; putting it in Firestore is one set(planDoc) call.
Real-time listeners come built in. The web UI showing the user “the agent is now doing step 3 of 7” can subscribe to the Firestore document and get pushed updates. No websocket server in your Cloud Run container. The browser talks to Firestore directly with a scoped token; Cloud Run just writes the plan document. (Each pushed update is billed as a document read, so the cost tracks the number of plan updates.)
Single-document atomicity is enough. Agent state mutations are single-session and single-document. You do not need multi-row transactions across users. Firestore’s per-document atomicity is exactly the contract you want.
The pricing is honest for spiky workloads. Reads and writes are billed per operation, not per provisioned capacity. A user who hasn’t logged in for two weeks generates no read or write charges during those two weeks; you pay only for the storage their session documents occupy.
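As a sketch of what “the plan is a JSON document” means in practice, the example below models a plan as a small task tree, turns it into a plain dict, and derives the “step 3 of 7” style progress a UI would show. The Task fields and helper names are assumptions for illustration, not a schema from any library.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Task:
    title: str
    status: str = "pending"           # "pending", "running" or "done"
    result: Optional[str] = None
    updated_at: Optional[str] = None  # ISO 8601 timestamp string
    children: list["Task"] = field(default_factory=list)


def leaves(task: Task) -> list[Task]:
    """Return the executable steps, i.e. tasks with no sub-tasks."""
    if not task.children:
        return [task]
    found: list[Task] = []
    for child in task.children:
        found.extend(leaves(child))
    return found


def progress(plan: Task) -> tuple[int, int]:
    """Return (completed steps, total steps) for the UI."""
    steps = leaves(plan)
    return sum(1 for t in steps if t.status == "done"), len(steps)


def to_document(plan: Task) -> dict:
    """Nested dict of plain types, ready to store as one document."""
    return asdict(plan)
```

With the google-cloud-firestore client, the resulting dict can be written in one call along the lines of db.collection("sessions").document(session_id).set(to_document(plan)), where the collection name is a placeholder.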

The trap to avoid: do not store the full message history in Firestore as a single growing document. Firestore documents have a 1 MiB limit, and conversation histories blow past that fast. Either chunk turns into a subcollection or push the long-tail to Cloud Storage with a manifest in Firestore.
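A minimal sketch of the chunking approach, assuming each chunk becomes one document in a subcollection and the session document keeps a small manifest. The byte budget, the chunk ID format and the function names are all hypothetical, and the JSON-encoded size is only an approximation of how Firestore measures a document, so leave generous headroom below 1 MiB.

```python
import json

MAX_CHUNK_BYTES = 100_000  # assumed budget, far below the 1 MiB document limit


def encoded_size(turn: dict) -> int:
    """Approximate a turn's stored size by its UTF-8 JSON encoding."""
    return len(json.dumps(turn, ensure_ascii=False).encode("utf-8"))


def chunk_turns(turns: list[dict], max_bytes: int = MAX_CHUNK_BYTES) -> list[list[dict]]:
    """Group turns, in order, into chunks that stay within max_bytes.

    A single turn larger than max_bytes still gets a chunk of its own;
    content that big belongs in Cloud Storage instead.
    """
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_size = 0
    for turn in turns:
        size = encoded_size(turn)
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(turn)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def build_manifest(session_id: str, chunks: list[list[dict]]) -> dict:
    """Small index kept on the session document."""
    return {
        "session_id": session_id,
        "chunk_count": len(chunks),
        "turn_count": sum(len(c) for c in chunks),
        "chunk_ids": [f"chunk-{i:04d}" for i in range(len(chunks))],
    }
```

Zero-padded chunk IDs sort lexicographically in conversation order, so a reader can page through the history with a plain ordered query.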

The honest comparison vs container-on-GKE

Cloud Run is not universally better than GKE. There is a clean line where each wins.

A reductive but useful split: the default in 2026 should be Cloud Run; reach for GKE only when a specific requirement applies, such as a multi-hour workload, a custom GPU need, or a sidecar pattern.

The mistake teams make is choosing GKE prophylactically — “we might need it later, so let’s start there.” The cost of that decision is a permanent tax on every deploy, every secret rotation, every node upgrade. If you graduate from Cloud Run to GKE later, the agent itself ports trivially because it was already stateless. You lose nothing by starting simple.

What this looks like in practice

The reference deployment, stripped to the structural moves:

One Cloud Run service per agent. Memory: 4–8 GiB. CPU: 2. Concurrency: 1 (long per-request CPU work). Min instances: 0. Max instances: tuned to your traffic ceiling.
Startup CPU boost on. Health-check endpoint returns 200 as soon as the import phase finishes, not after some heavy warmup.
Service account scoped to exactly what the agent needs. aiplatform.user, datastore.user, the specific BigQuery dataset. Nothing more.
Firestore in Native mode, one document per session, subcollection for the message turns once the session crosses ~100 KB.
Cloud Tasks for any “kick off a long agent run and tell the user later” pattern. A user request creates a Cloud Task; the task worker is the same Cloud Run service on a different endpoint; that endpoint runs the agent and writes the result back to Firestore. The user UI subscribes to the document and sees the result land.
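Cloud Logging + Cloud Trace wired up with very little code. Anything the container writes to stdout as structured JSON lands in Cloud Logging, and each incoming request gets a trace automatically. Have every agent step emit a structured log and wrap each model call and tool call in a span (for example with OpenTelemetry), and the agent’s plan becomes replayable in the Cloud Trace UI without running any observability infrastructure of your own.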
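For the logging item in the list above, here is a minimal sketch of a structured log line written to stdout. The severity, message and logging.googleapis.com/trace fields are the ones Cloud Logging treats specially; the helper names, the step field and the project ID are assumptions for illustration.

```python
import json
import sys
from typing import Optional


def trace_resource(project_id: str, trace_header: Optional[str]) -> Optional[str]:
    """Turn an X-Cloud-Trace-Context header (TRACE_ID/SPAN_ID;o=1) into
    the trace resource name that Cloud Logging correlates on."""
    if not trace_header:
        return None
    trace_id = trace_header.split("/", 1)[0]
    return f"projects/{project_id}/traces/{trace_id}"


def log_step(project_id: str, trace_header: Optional[str], step: str, **fields) -> str:
    """Emit one JSON log line for an agent step and return it."""
    entry = {
        "severity": "INFO",
        "message": f"agent step: {step}",
        "step": step,
        **fields,
    }
    trace = trace_resource(project_id, trace_header)
    if trace:
        # Lets the log viewer group this entry with the request's trace.
        entry["logging.googleapis.com/trace"] = trace
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line
```

Called as log_step("my-project", request_header_value, "call_model", latency_ms=1200), this produces one line per step that shows up alongside the request’s trace.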

That stack handles thousands of concurrent users on a configuration whose idle cost is far below that of a single always-on n2-standard-4 VM, because it scales to zero overnight. It rolls out new versions with traffic splitting built into the platform. It does not require a Kubernetes engineer.

What to take away

Three lines, the same shape as last time:

The default agent deployment in 2026 should be Cloud Run + Vertex AI + Firestore. Treat anything else as needing justification.
The “stateless agent service” is the right abstraction. Push every piece of durable context into Firestore (or, for big blobs, Cloud Storage with a Firestore manifest). The compute layer should be replaceable at any time.
The right time to escalate to GKE is when a measurement says you must — a multi-hour workload, a custom GPU need, a sidecar pattern. Not “we might need it.” The migration path is trivial because the agent is already stateless.

The Cloud Run + Gemini stack does not get the airtime that Bedrock or the open-source agent frameworks get. That is partly because Google’s marketing has been mediocre and partly because “we deploy a container” is not a thrilling headline. The engineers who notice the difference are the ones whose AWS bills doubled when they tried to run the same agent on a permanently-warm Fargate task — and then halved when they moved it to Cloud Run.

Further reading: Google’s Cloud Run for AI inference overview covers the GPU support that landed in 2024. The Vertex AI agent builder docs are the official path for agents that want managed orchestration on top of this stack. For session-state patterns, see Firestore’s data modeling guide.