datarekha
Frameworks May 8, 2026

LangGraph is what you reach for when the agent has to remember

By mid-2026 LangGraph is the default orchestration layer for stateful, long-running, human-in-the-loop agents. Not because of the API — because of the checkpoints, the interrupts, and the fact that the state outlives the request.

11 min read · by datarekha · langgraph, agents, orchestration, state, production

There is a moment in every serious agent project where the team realises they’ve outgrown a chat loop. The realisation usually arrives the same way: an agent is halfway through a 40-step refactor, the worker process dies, and the user — who had been watching a progress spinner for nine minutes — opens the support ticket. The agent had no idea where it was. The state lived in RAM. RAM is now gone.

By the time you’ve debugged that incident, you have ten new requirements, and they all rhyme with the word durable. The agent must persist its plan. It must resume where it left off. It must pause for a human to approve the SQL it’s about to run on production. It must let an operator edit that SQL and continue. It must do all of this across services, languages, and process restarts.

That set of requirements is what LangGraph solves. It is, in mid-2026, the de facto orchestration layer for any agent that lives longer than a single HTTP request — which is to say, any agent that’s actually useful.

The thing LangGraph really is

Strip away the LangChain branding and LangGraph is a small, opinionated state-machine library. You declare a typed State object. You declare nodes (functions that take state and return updates). You declare edges (static or conditional). You run the graph. When the graph is compiled with a checkpointer, the state is persisted after every step.

[Figure: the graph runtime (start → plan, an LLM node → act, a tool node → end, with a conditional edge labelled "re-plan?") writing checkpoints t=1, t=2 and t=3 to a checkpointer (Postgres / SQLite).]
After every node, the state object is serialised and written to a checkpointer. Any process can resume the graph from any checkpoint — that’s the whole game.

That last sentence is the whole game. A LangGraph program is not a function that runs; it is a stream of state transitions that get persisted as they happen. You can crash, redeploy, switch hosts, take a human approval over a weekend, and pick up exactly where you stopped.
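To make "a stream of state transitions that get persisted" concrete, here is a toy runner in plain Python. It is a sketch of the idea only, not LangGraph's implementation: the table layout, the `run` helper and the `crash_after` switch are all invented for illustration, and an in-memory SQLite database stands in for the shared store.

```python
import json
import sqlite3

# Toy nodes: each takes the state dict and returns a partial update.
def plan(state):
    return {"steps": ["a", "b"], "done": []}

def act(state):
    nxt = state["steps"][len(state["done"])]
    return {"done": state["done"] + [nxt]}

NODES = {"plan": plan, "act": act}

def next_node(current, state):
    """Edges: plan -> act, then loop on act until every step is done."""
    if current == "plan":
        return "act"
    return "act" if len(state["done"]) < len(state["steps"]) else None

def run(db, thread_id, crash_after=None):
    """Run (or resume) a thread, writing a checkpoint after every node."""
    db.execute("CREATE TABLE IF NOT EXISTS checkpoints "
               "(thread_id TEXT, step INTEGER, next_node TEXT, state TEXT)")
    row = db.execute("SELECT step, next_node, state FROM checkpoints "
                     "WHERE thread_id = ? ORDER BY step DESC LIMIT 1",
                     (thread_id,)).fetchone()
    if row:
        # Resume: restore the last persisted state and node cursor.
        step, cursor, state = row[0], row[1], json.loads(row[2])
    else:
        step, cursor, state = 0, "plan", {}
    while cursor is not None:
        state = {**state, **NODES[cursor](state)}
        cursor = next_node(cursor, state)
        step += 1
        db.execute("INSERT INTO checkpoints VALUES (?, ?, ?, ?)",
                   (thread_id, step, cursor, json.dumps(state)))
        db.commit()
        if crash_after == step:
            raise RuntimeError("simulated worker crash")
    return state

db = sqlite3.connect(":memory:")
try:
    run(db, "ticket-1", crash_after=2)   # the worker dies mid-run
except RuntimeError:
    pass
final = run(db, "ticket-1")              # a later call picks up at step 3
```

The second `run` call never re-executes `plan`: it reads the latest row for the thread and continues from the stored cursor, which is the property the rest of this post leans on.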

Why this matters more than the API

Raw LangChain — the AgentExecutor, the runnables, the “give me a chain” syntax — was a perfectly fine prototyping toolkit. It assumed the agent ran inside a single process for the duration of one user turn. That assumption breaks the moment an agent has to:

  • Wait on an external system. A code review agent that has to wait for CI to run cannot block a worker for ten minutes. Either it gives up the worker and resumes later, or you pay for ten minutes of idle process per concurrent user.
  • Ask a human for permission. A SQL-writing agent should not run DROP TABLE without asking. “Ask a human” means the agent pauses, the request goes to a UI somewhere, hours might pass, and then the agent continues. That is not a function call.
  • Run for hours. Devin, Replit’s agent, and Cursor’s Composer all run long-horizon tasks measured in tens of minutes to hours. None of those workloads survive a single-process model.

LangGraph treats these as the normal case, not the exception. The interrupt() primitive is the cleanest example: anywhere inside a node you can call interrupt({"question": "approve this SQL?"}), the graph stops, the state is persisted (this requires a checkpointer), and your application receives the interrupt payload to render in the UI. When the human answers, you call graph.invoke(Command(resume=answer), config) with the same thread_id. The node then runs again from its first line, and this time interrupt() returns the answer instead of pausing. Local variables are not restored; any code that sits before the interrupt() call executes a second time, so keep it free of side effects or move it into an earlier node. That’s not a feature you bolt on; it’s a property of the state machine.
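One detail of resuming is easy to miss: when a paused node is resumed, the node function is called again from the top, and the interrupt call returns the human's answer rather than pausing. The toy model below imitates that behaviour in plain Python. It is a sketch, not LangGraph code: the `Interrupt` exception, `make_interrupt` and the `calls` log are hypothetical names used only to make the re-execution visible.

```python
class Interrupt(Exception):
    """Raised to pause a run; carries the payload for the UI."""
    def __init__(self, payload):
        super().__init__(payload)
        self.payload = payload

calls = []  # records side effects so the re-execution is visible

def make_interrupt(resume_value=None):
    """Return an interrupt() that pauses unless a resume value was supplied."""
    def interrupt(payload):
        if resume_value is None:
            raise Interrupt(payload)
        return resume_value
    return interrupt

def human_approval(state, interrupt):
    calls.append("before interrupt")   # runs on both attempts
    decision = interrupt({"sql": state["sql"]})
    calls.append("after interrupt")    # runs only once resumed
    return {"approved": decision == "approve"}

state = {"sql": "SELECT 1"}

# First attempt: no answer yet, so the node pauses and surfaces its payload.
try:
    human_approval(state, make_interrupt())
    paused_payload = None
except Interrupt as exc:
    paused_payload = exc.payload

# Later: the node is called again from the top, now with the human's answer.
update = human_approval(state, make_interrupt("approve"))
```

After both attempts `calls` contains "before interrupt" twice, which is why anything expensive or non-idempotent is better placed in a separate node ahead of the approval step.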

A small but real example

from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.types import interrupt, Command

class State(TypedDict):
    question: str
    sql: str
    approved: bool
    rows: list[dict]

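def write_sql(state: State) -> dict:
    # Assumes `llm` is a LangChain chat model defined elsewhere; invoke() then
    # returns a message object, so only its text goes into the state.
    sql = llm.invoke(f"Write SQL for: {state['question']}").content
    return {"sql": sql}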

def human_approval(state: State) -> dict:
    decision = interrupt({"sql": state["sql"]})
    return {"approved": decision == "approve"}

def run_sql(state: State) -> dict:
    if not state["approved"]:
        return {"rows": []}
    return {"rows": warehouse.query(state["sql"])}

g = StateGraph(State)
g.add_node("write_sql", write_sql)
g.add_node("human_approval", human_approval)
g.add_node("run_sql", run_sql)
g.add_edge(START, "write_sql")
g.add_edge("write_sql", "human_approval")
g.add_edge("human_approval", "run_sql")
g.add_edge("run_sql", END)

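# In current langgraph-checkpoint-postgres releases from_conn_string() is a
# context manager; setup() creates the checkpoint tables on first use.
with PostgresSaver.from_conn_string(POSTGRES_URL) as checkpointer:
    checkpointer.setup()
    app = g.compile(checkpointer=checkpointer)
    # Start the run; it pauses inside human_approval. `question` and
    # `ticket_id` are supplied by the caller.
    app.invoke(
        {"question": question},
        config={"configurable": {"thread_id": ticket_id}},
    )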

When the graph hits human_approval, it stops cold. The state is in Postgres. Your FastAPI process can shut down or redeploy, and a different worker, compiling the same graph against the same database, can resume the run two hours later when the approval comes through:

app.invoke(
    Command(resume="approve"),
    config={"configurable": {"thread_id": ticket_id}},
)

There is nothing magic happening. The checkpointer table holds the serialised State and the node cursor. The worker that resumes reads the row, restores the state, and runs human_approval again from the top, where interrupt() now returns the resume value instead of pausing. This is the kind of thing that takes a week to build correctly on top of Celery and a real database. LangGraph treats it as a one-liner because it’s the central problem.

Who’s actually running this in production

The list is now long enough that LangChain’s team publishes it on the LangGraph case studies page. Highlights worth knowing:

  • LinkedIn built their AI recruiter agent on LangGraph for the long-running planning loop. The agent decomposes a hiring brief into searches, candidate analyses, and outreach drafts that span multiple sessions.
  • Klarna uses LangGraph for their customer service assistant, which handled two-thirds of their customer service chats in its first month.
  • Replit Agent uses LangGraph to coordinate the multi-step “build this app for me” workflow — checkpointing through plan, scaffold, edit, run, and debug stages.
  • Elastic uses it for security workflows where an analyst can intervene mid-investigation.

The pattern across all of these: the agent runs longer than a request, needs to resume across restarts, and at some point a human has to inspect or override the plan. That’s the LangGraph sweet spot. None of these teams are doing it to get fancy graph syntax — they’re doing it to get checkpoints and interrupts for free.

Where it bites

A short list of things LangGraph does not solve, lest the post sound like a sponsorship:

  • It is still Python-first. The JS port exists and is improving, but the Python ecosystem is where features land first. If your agent surface is in TypeScript, you’re either calling a Python service or living with a slightly older feature set.
  • It is not opinionated about the model. This is mostly good: you can swap in Claude, GPT, Gemini, or a local model. It also means prompt engineering and eval are your problem. The graph just calls whichever LLM you wire in.
  • The state object is a footgun. You will, at some point, put something un-serialisable in the state (a database connection, a file handle) and the checkpointer will throw. The fix is discipline — state must be data, not handles.
  • Debugging a long graph run is hard without LangSmith or an equivalent. The graph’s whole pitch is “we persist everything that happened” — but reading raw checkpoint rows is no fun. The tracing UI is essentially required.
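The "state must be data, not handles" rule from the list above is easy to enforce mechanically. The sketch below is one way to do it, under assumptions: `assert_checkpointable`, `bad_node` and `good_node` are hypothetical helpers, and JSON is used as a deliberately strict stand-in for whatever serialiser your checkpointer really uses (LangGraph's own accepts more types than JSON does).

```python
import json
import sqlite3

def assert_checkpointable(update: dict) -> dict:
    """Fail fast if a node's update cannot be serialised as JSON."""
    try:
        json.dumps(update)
    except TypeError as exc:
        raise TypeError(f"state update is not plain data: {exc}") from exc
    return update

def bad_node(state: dict) -> dict:
    # Anti-pattern: a live connection ends up in the state.
    return {"conn": sqlite3.connect(state["db_path"])}

def good_node(state: dict) -> dict:
    # Keep only the path in state; open and close the handle inside the node.
    conn = sqlite3.connect(state["db_path"])
    try:
        value = conn.execute("SELECT 1").fetchone()[0]
    finally:
        conn.close()
    return {"value": value}
```

Wrapping node outputs in a check like this during tests surfaces the problem at the node that caused it, rather than later inside the checkpointer.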

These are real, but they’re the kind of pain you sign up for in exchange for not writing your own resumable workflow engine. Most teams have made that trade quietly and would not go back.

When not to use it

The honest negative case: if your agent is a single-shot Q&A box that takes a query, calls some tools, and returns an answer in under a few seconds — LangGraph is overkill. The augmented LLM pattern from Anthropic’s Building Effective Agents is a function. Wrap it in a Flask handler and ship it.
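For scale, this is roughly what "a function" means here. It is a minimal sketch under stated assumptions, not Anthropic's reference code: `answer`, the `model` callable, and the "TOOL name arg" reply convention are all hypothetical, and a fake model stands in for a real LLM so the example runs on its own.

```python
from typing import Callable

def answer(query: str,
           model: Callable[[str], str],
           tools: dict[str, Callable[[str], str]]) -> str:
    """At most one tool round, then a final answer. No state survives the call."""
    first = model(query)
    # Hypothetical convention: the model requests a tool as "TOOL <name> <arg>".
    if first.startswith("TOOL "):
        _, name, arg = first.split(" ", 2)
        observation = tools[name](arg)
        return model(f"{query}\nObservation: {observation}")
    return first

def fake_model(prompt: str) -> str:
    # Stand-in for an LLM: ask for the clock once, then answer from the observation.
    if "Observation:" in prompt:
        return "It is " + prompt.rsplit("Observation: ", 1)[1]
    return "TOOL clock now"

result = answer("What time is it?", fake_model, {"clock": lambda _: "12:00"})
```

Everything the agent knows lives in local variables for the length of one call, which is exactly why nothing here needs a checkpointer.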

The threshold is roughly this: if any state in your agent must outlive the worker process that produced it, you need LangGraph or something like it. Below that threshold, you don’t. Above it, the alternatives are all heavier (Temporal, custom Celery + state store, your own DAG engine).

What to take away

LangGraph is not winning on developer experience or on a clever DSL. It is winning because the agents that have actually shipped to production in the last two years all share the same structural needs — durable state, human approvals, resumable runs — and LangGraph is the lowest-friction way to get them.

If you’re starting a new agent in 2026, the working assumption should be: prototype as a function, move to LangGraph the day the state has to survive a process restart, and don’t apologise for it. The companies running real agent workloads made that move; they made it because the alternative was writing the orchestrator themselves.


Further reading: the official LangGraph docs, the persistence concepts page, and the human-in-the-loop guide are the three pages worth bookmarking. For the broader story of why this shape of system matters, Anthropic’s Building Effective Agents remains the canonical reference.
