We built the same agent three times: MAF, LangGraph, and ADK

There is no universally correct agent framework in 2026. There is the one your shop has gravity toward, and there are the ones you’d reach for given different constraints. Anyone telling you LangGraph is strictly better than ADK, or MAF is the right choice for everyone, has not built the same agent three different ways.

I have. Below is the honest version of that exercise: one problem — an invoice-processing agent with reasonable production constraints — built on LangGraph, Microsoft Agent Framework, and Google’s ADK. What each one made easy. What each one made painful. Where the cleanest design lives in each.

The problem

An invoice-processing agent for a mid-size company. The requirements are deliberately a little annoying, because that’s where frameworks earn or lose their keep:

Ingest a PDF invoice from email (Outlook in our case, but abstract).
Extract structured fields (vendor, line items, amounts, PO numbers).
Validate against the ERP — does this PO exist? Does the amount match?
If anything fails validation, pause and ask a human in Slack what to do.
On approval, write the invoice to the ERP and notify accounting.
The whole run must be auditable: every step, every tool call, every human decision recorded.

In my experience, roughly half of enterprise agent workloads have this shape: a long-running workflow with one or two genuine model-decision moments and a human in the loop. So: how does each framework take it?

Figure: The agent in the abstract. Most of the structure is deterministic; the LLM-driven parts are extraction and the “should we ask a human?” decision.
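As a framework-free reference point, that shape can be sketched in plain Python. Everything here is hypothetical scaffolding rather than code from any of the three builds: the `extract`, `validate`, `ask_human`, and `write` callables stand in for the vision model, the ERP lookup, the Slack prompt, and the ERP write, and they are passed in so the control flow is visible on its own.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InvoiceRun:
    pdf_url: str
    extracted: dict = field(default_factory=dict)
    validation_errors: list = field(default_factory=list)
    human_decision: Optional[str] = None
    erp_id: Optional[str] = None
    audit: list = field(default_factory=list)  # ordered step names, for the audit trail


def process_invoice(run, extract, validate, ask_human, write):
    """Deterministic skeleton; only `extract` is expected to call a model."""
    run.extracted = extract(run.pdf_url)
    run.audit.append("extract")

    run.validation_errors = validate(run.extracted)
    run.audit.append("validate")

    if run.validation_errors:
        # In this sketch, escalation is a plain check on the validation result.
        run.human_decision = ask_human(run.extracted, run.validation_errors)
        run.audit.append("human_gate")
        if run.human_decision != "approve":
            return run  # rejected: the run never reaches the ERP write

    run.erp_id = write(run.extracted)
    run.audit.append("write")
    return run
```

Each framework below is, in effect, a different answer to where this function lives and who persists `InvoiceRun` while the human is thinking.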

Build 1: LangGraph

The LangGraph version is structurally the closest to how I’d whiteboard this. Nodes for each step, a conditional edge from validate into either human_gate or directly to write, a Postgres checkpointer, an interrupt() call in the gate node.

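from typing import TypedDict

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import END, START, StateGraph
from langgraph.types import interrupt

# vision_llm, erp, and POSTGRES_URL are assumed to be defined elsewhere.

class State(TypedDict):
    pdf_url: str
    extracted: dict
    validation_errors: list[str]
    human_decision: str | None
    erp_id: str | None

def extract(state: State) -> dict:
    # Model call: pull structured fields out of the PDF.
    fields = vision_llm.invoke(state["pdf_url"])
    return {"extracted": fields}

def validate(state: State) -> dict:
    # Deterministic check against the ERP; an empty list means the invoice is clean.
    errors = erp.check_po(state["extracted"])
    return {"validation_errors": errors}

def human_gate(state: State) -> dict:
    # Pauses the run and persists state; the value passed to Command(resume=...)
    # becomes the return value of interrupt() when the run resumes.
    decision = interrupt({
        "errors": state["validation_errors"],
        "extracted": state["extracted"],
    })
    return {"human_decision": decision}

def write(state: State) -> dict:
    return {"erp_id": erp.create(state["extracted"])}

def needs_human(state: State) -> str:
    # Routing is plain code: returns the name of the next node.
    return "human_gate" if state["validation_errors"] else "write"

g = StateGraph(State)
g.add_node("extract", extract)
g.add_node("validate", validate)
g.add_node("human_gate", human_gate)
g.add_node("write", write)
g.add_edge(START, "extract")
g.add_edge("extract", "validate")
g.add_conditional_edges("validate", needs_human)
g.add_edge("human_gate", "write")
g.add_edge("write", END)

# from_conn_string() is a context manager, so the app must be used inside the block.
with PostgresSaver.from_conn_string(POSTGRES_URL) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    app = g.compile(checkpointer=checkpointer)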

What this made easy:

The human gate is one function call. interrupt() does exactly what you want — pause, persist, expose the payload, resume later with Command(resume=...). Wiring the Slack integration was a hundred lines on the other side; the framework side was trivial.
Crash recovery for free. The worker can die between any two nodes; the next worker reads the checkpoint and continues. We tested this by killing the pod mid-validation; the next pod resumed from the post-extract checkpoint.
The conditional edge is the conditional edge. No “agent decides via tool call whether to escalate” wizardry. Code decides, model decides only where it has to.
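The resume half of the Slack wiring might look like the sketch below. The Slack event shape, the `PENDING` mapping from message timestamp to thread ID, and the reaction names are all assumptions for illustration. `resume` is an injected callable so the routing logic can be exercised without LangGraph; in a real deployment it would wrap something like `app.invoke(Command(resume=decision), {"configurable": {"thread_id": thread_id}})`.

```python
from typing import Callable, Optional

# Hypothetical mapping: Slack message timestamp -> thread_id of the paused run.
PENDING: dict = {}

# Hypothetical reaction vocabulary; any other reaction is ignored.
REACTION_TO_DECISION = {"white_check_mark": "approve", "x": "reject"}


def on_reaction(event: dict, resume: Callable[[str, str], object]) -> Optional[str]:
    """Translate a Slack reaction into a resume call for the paused run.

    Returns the decision that was forwarded, or None if the event was ignored.
    """
    decision = REACTION_TO_DECISION.get(event.get("reaction", ""))
    thread_id = PENDING.get(event.get("message_ts", ""))
    if decision is None or thread_id is None:
        return None
    # Remove the entry first so a duplicate delivery cannot resume the run twice.
    del PENDING[event["message_ts"]]
    resume(thread_id, decision)
    return decision
```

The framework only needs the thread ID and the decision; everything else in the glue is ordinary webhook handling.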

What it made painful:

Slack integration is your job. LangGraph gives you the interrupt; you build the part that posts to a channel, waits for a reaction, and, when it arrives, calls app.invoke(Command(resume=...), config) with the paused run’s thread_id in the config. Between one and two hundred lines of glue.
The audit log is your job, too. LangSmith does a beautiful job of tracing, but if your auditor wants a structured per-step record in your own database, you write that.
Observability requires opinions. Either you commit to LangSmith or you wire OpenTelemetry yourself.
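A structured per-step record can be added by wrapping each node function before registering it. This is a minimal sketch and not a LangGraph feature: `audited`, `open_audit_db`, and the `audit_log` table with its columns are hypothetical, and SQLite stands in for whatever database the auditor expects.

```python
import functools
import json
import sqlite3
import time


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log "
        "(step TEXT, status TEXT, started_at REAL, payload TEXT)"
    )
    return conn


def audited(conn: sqlite3.Connection, step: str):
    """Wrap a node function so every call appends one row to audit_log."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state):
            started = time.time()
            status, update = "error", None
            try:
                update = fn(state)
                status = "ok"
                return update
            except Exception as exc:
                update = {"error": repr(exc)}
                raise
            finally:
                # Runs on success and on failure, so every attempt leaves a record.
                conn.execute(
                    "INSERT INTO audit_log (step, status, started_at, payload) "
                    "VALUES (?, ?, ?, ?)",
                    (step, status, started, json.dumps(update, default=str)),
                )
                conn.commit()
        return wrapper
    return decorator
```

Registration would then read `g.add_node("validate", audited(conn, "validate")(validate))`. Because interrupt() pauses a node by raising internally, wrapping human_gate this way would log the pause as an error row, so that node probably wants its own handling.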

Build 2: Microsoft Agent Framework

The MAF version is the most interesting one to write up, because it forces a design choice the other two don’t: where does the deterministic stop and the agent start?

The clean MAF design uses a Workflow as the outer skeleton, with a single Agent invoked for the extraction step (where the model genuinely has to think). Everything else — validation, the human gate, the ERP write — is a typed activity inside the workflow.

using System.Linq;
using Microsoft.Agents.AI;
using Microsoft.Agents.AI.Workflows;

// Deterministic skeleton; the only model-driven step is the extraction agent.
var workflow = new WorkflowBuilder()
    .AddActivity<DownloadPdf>("download")
    .AddAgent<InvoiceExtractor>("extract")
    .AddActivity<ValidateAgainstErp>("validate")
    // Route through a human approval only when validation produced errors.
    .AddConditional("needs_human",
        when: state => state.ValidationErrors.Any(),
        then: branch => branch.AddHumanApproval("human_gate"))
    .AddActivity<WriteToErp>("write")
    .AddActivity<NotifyAccounting>("notify")
    .Build();

// Start a run with the PDF location as the initial input.
var run = await workflow.StartAsync(new { PdfUrl = pdfUrl });

What this made easy:

The deterministic / non-deterministic split is enforced. The workflow runtime gives every activity an idempotency key and a durable history. The audit team loved this — every successful run is a deterministic replay of activities and their results, even if the embedded agent’s tool calls were stochastic.
The human approval primitive is a first-class node type. Same underlying pattern as LangGraph’s interrupt(), but it’s a typed node with a payload schema, so the Foundry UI can render the approval card without you writing any UI code.
Azure Application Insights traces the whole thing as one distributed trace. Activity spans, agent spans, tool spans, human-gate spans, all in one waterfall. This was the single most pleasant observability experience of the three.
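The idempotency-key-plus-history idea is not specific to MAF and can be sketched in a few lines of Python. This illustrates the general pattern under simplifying assumptions (an in-memory dict as the history, a key derived from run ID, step name, and input); it is not a description of how the MAF runtime is implemented, and `DurableRunner` is a hypothetical name.

```python
import hashlib
import json


class DurableRunner:
    """Runs each activity at most once per idempotency key and records results."""

    def __init__(self):
        # idempotency key -> recorded result (stand-in for durable storage)
        self.history = {}

    @staticmethod
    def key(run_id: str, step: str, payload: dict) -> str:
        blob = json.dumps([run_id, step, payload], sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def run(self, run_id: str, step: str, payload: dict, activity):
        k = self.key(run_id, step, payload)
        if k in self.history:
            # Replay: return the recorded result instead of repeating side effects.
            return self.history[k]
        result = activity(payload)
        self.history[k] = result
        return result
```

Replaying a run after a crash re-issues the same calls and gets the recorded results back, which is why a stochastic agent step can sit inside an otherwise deterministic replay without the ERP write happening twice.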

What it made painful:

The ceremony is real. Each activity is its own class. You don’t just write def validate(state): ... — you write a ValidateAgainstErp : IActivity<InvoiceState, ValidationResult> class with explicit input and output types.
Local development is heavier. You need either the local Durable Task emulator or a Foundry sandbox. LangGraph runs in a Python REPL.
You’re paying for things you don’t need if you’re not on Azure. The whole story assumes Foundry is in the picture. If you’re running MAF off-Azure, you give up most of what makes it pleasant.

The architectural payoff is real, though. When the compliance team asked for a guarantee that every invoice over a threshold gets a human approval and the approval is recorded, the answer was “look at the workflow definition, the AddConditional is the policy, here’s the trace.” That conversation was structurally easier in MAF than in the other two.
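The same guarantee can also be checked after the fact against a recorded run. The sketch below assumes a hypothetical history format, a list of step dicts with a `name` plus an optional `decision` or `amount`, which is not the actual MAF trace schema.

```python
def violates_approval_policy(steps: list, threshold: float) -> bool:
    """True if an over-threshold invoice was written without a prior approval."""
    approved = False
    for step in steps:
        if step["name"] == "human_gate" and step.get("decision") == "approve":
            approved = True
        if step["name"] == "write" and step.get("amount", 0) > threshold and not approved:
            return True
    return False
```

Running a check like this over every stored history turns the auditor’s question into a query rather than a code review.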

Build 3: Google ADK

The ADK version is the smallest of the three by line count, and the one that takes the most opinionated stance: this is a Gemini agent with tools, and you trust the model to drive the flow.

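from google.adk.agents import Agent
from google.adk.tools import FunctionTool

# gemini_vision, erp, and slack are assumed to be client objects defined elsewhere.
# ADK uses each function's docstring to describe the tool to the model.

def extract_invoice(pdf_url: str) -> dict:
    """Extract vendor, line items, amounts, and PO numbers from an invoice PDF."""
    return gemini_vision.extract(pdf_url)

def validate_po(invoice: dict) -> dict:
    """Check the extracted invoice against the ERP and return the validation result."""
    return erp.check_po(invoice)

def request_human_review(invoice: dict, errors: list[str]) -> str:
    """Ask a human in Slack how to handle an invoice that failed validation."""
    return slack.request_review(invoice, errors)

def write_invoice(invoice: dict) -> str:
    """Write the invoice to the ERP and return the new record's ID."""
    return erp.create(invoice)

invoice_agent = Agent(
    name="invoice_processor",
    model="gemini-2.5-pro",
    instruction=(
        "Process an invoice PDF. Extract fields, validate against the ERP, "
        "and write it. If validation fails, request human review *before* writing."
    ),
    tools=[
        FunctionTool(extract_invoice),
        FunctionTool(validate_po),
        FunctionTool(request_human_review),
        FunctionTool(write_invoice),
    ],
)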

You then deploy with adk deploy agent_engine and you’re done.

What this made easy:

Speed to first running version. Total code: under a hundred lines including the tools. The Agent Engine deployment was a single command. Whatever else you say about ADK, the “from idea to a working agent on production infra” story is the shortest of the three.
Gemini’s vision was native. The PDF-to-fields extraction used Gemini’s multimodal input directly; no separate vision pipeline.
Cloud Trace was free. Every tool call and every model turn showed up in Google Cloud Trace as a span. No setup.

What it made painful:

The flow is the model’s job, not the code’s. ADK is happy to let the LLM decide whether to call request_human_review or skip straight to write_invoice. The instruction says “request review before writing if validation fails” — but you are trusting the model to follow that instruction every time, including on edge-case invoices the prompt didn’t anticipate.
The auditor’s question — “how do we guarantee a human approval for invoices over $X?” — does not have a clean answer. You can add a check inside write_invoice itself, but now your domain logic is enforcing the policy because the framework can’t.
Deterministic re-runs aren’t really a thing. You can replay the prompt and tools, but you may get a different execution, possibly with different tool calls in a different order. The MAF workflow’s “this exact sequence of activities ran” property doesn’t exist here.
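The in-tool check mentioned above could look like this. It is a sketch under assumptions: `APPROVALS` is a hypothetical store that the Slack handler fills in when a human approves, `REVIEW_THRESHOLD` is an illustrative number, and `erp_create` stands in for the real ERP client. Unlike the original tool, this version returns a status dict rather than a bare ID, so that the model can see a refusal and react to it.

```python
REVIEW_THRESHOLD = 10_000.00  # illustrative policy threshold, not from the original build
APPROVALS: dict = {}          # hypothetical store: invoice_id -> approver


def erp_create(invoice: dict) -> str:
    # Stand-in for the real ERP client call.
    return "ERP-" + invoice["invoice_id"]


def write_invoice(invoice: dict) -> dict:
    """Write the invoice to the ERP, enforcing the human-approval policy in code."""
    needs_review = (
        invoice["total"] > REVIEW_THRESHOLD
        or bool(invoice.get("validation_errors"))
    )
    if needs_review and invoice["invoice_id"] not in APPROVALS:
        # Refuse regardless of what the model decided; nothing is written.
        return {"status": "blocked", "reason": "human approval required before writing"}
    return {"status": "ok", "erp_id": erp_create(invoice)}
```

This closes the gap, but the policy now lives in domain code rather than in anything the framework can show an auditor.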

For a different problem — one where the model genuinely should be choosing what to do step by step — ADK’s looseness is a feature. For this problem, it was a fight.

What each one was best at, concretely

After shipping all three, here’s the honest scorecard, which is not a feature matrix:

LangGraph won on the human-in-the-loop. The interrupt() primitive and the checkpoint-resume story are uniquely clean. If human approvals are central to your agent, this is the framework that thought about it hardest.
MAF won on the audit story. The deterministic Workflow + typed Agent split made the regulated-workload conversation easy. The Foundry tracing was the best UX of the three.
ADK won on time-to-first-deploy. The smallest amount of code, the most “Gemini just does the thing” moments, and a deploy step that genuinely works first time.

Nobody won on portability across providers. Each one’s superpowers disappear when you take it out of its native ecosystem.

A practical recommendation

If I were starting a new agent in 2026, with no constraints from an existing tech stack, this is the rough cut:

If the agent has to wait on humans, run for hours, or resume across deploys — LangGraph, with the checkpointer wired to your Postgres.
If the agent runs in a regulated enterprise on Azure or .NET, or the auditor is going to ask “show me the deterministic part” — MAF, with workflows as the spine and agents as the leaves.
If you’re on GCP, your model is Gemini, and you want to ship a prototype to production by the end of the week — ADK.

The frameworks have converged enough that none of them are wrong for any of these workloads. They’re each meaningfully more pleasant in their home ecosystem, and meaningfully heavier outside it.

What to take away

Three frameworks, one problem, three different defensible designs. The lesson, after writing all three:

The framework is a small fraction of the work. Tools, prompts, evaluation, and integration are the bulk. Every framework here got out of the way enough to ship.
Where the framework’s gravity lives is more important than its API. LangGraph’s gravity is in the state machine. MAF’s gravity is in the audit trail. ADK’s gravity is in the Gemini + GCP bundle. Pick the gravity that matches your real-world constraints.
There is no portable agent stack yet. MCP is making tools portable; A2A is making inter-agent calls portable; the orchestration layer itself is still tied to its vendor. That’ll probably change in the next two years. For now, pick the lane that suits the trip.

Further reading: the LangGraph human-in-the-loop concepts page, the MAF announcement post on DevBlogs, and the ADK documentation cover each framework in the depth this post couldn’t.