datarekha
Agents May 26, 2026

The three kinds of memory production agents actually use

Working memory, episodic memory, semantic memory — the cognitive science taxonomy that every agent memory startup borrows from. Here's what each one means in practice, what Letta, Mem0, and Zep actually ship, and why most production agents only need the cheapest one.

13 min read · by datarekha · memory, agents, architecture, production

Every pitch deck for an agent memory company shows the same diagram. There is a brain icon labelled “working memory” at the top, two filing cabinets labelled “episodic” and “semantic” underneath, and arrows between them suggesting a tidy flow from short-term to long-term storage. The diagram is borrowed wholesale from a 1972 cognitive psychology paper by Endel Tulving.

The thing nobody says on the slide is that most production agents only ever use the brain icon at the top. The filing cabinets exist, vendors will sell you ones with extra features, and the academic benchmarks reward using them. But the agents shipping in front of paying users — Cursor, Devin, Replit, ChatGPT, Claude — overwhelmingly route their state through a checkpointed scratchpad and call it a day.

This post unpacks the taxonomy, walks through what Letta, Mem0, and Zep actually ship under each category, and lays out the latency and cost numbers that explain why the simplest tier dominates production.

The three tiers, defined honestly

The Tulving taxonomy maps onto agent systems roughly like this:

  • Tier 1, working memory (“the current turn”): scratchpad, tool calls, plan-so-far, last N msgs. Retrieval: 0 ms. Cost: token tax only.
  • Tier 2, episodic memory (“past sessions”): summaries of prior conversations, events. Retrieval: 50-200 ms. Cost: vector + summariser.
  • Tier 3, semantic memory (“durable facts”): user prefs, business entities, world model. Retrieval: 100-500 ms. Cost: KG / graph + LLM.
The three tiers and their working costs. Tier 1 is free if you’re already paying for the prompt. Tier 3 has the steepest cost curve and the highest implementation complexity.

Working memory is the conversation buffer plus whatever scratchpad your agent maintains during a single task. The plan-so-far, the last N tool calls, the partially-written code, the JSON state the orchestrator is mutating between worker calls. It lives in the prompt or in a short-lived data structure your code passes around. Cost is just the token tax of including it in the next call.

Episodic memory is the recollection of past sessions, condensed. “Last Tuesday the user asked me to refactor auth.py, I did, they accepted my diff.” It’s a summary plus enough metadata (timestamp, session id, outcome) to make later retrieval coherent. The retrieval substrate is typically a vector store over per-session summaries, with a re-ranker on top.

Semantic memory is the durable, often-deduplicated fact store. “The user prefers tabs over spaces. Their default deployment target is AWS us-east-1. Their company’s primary KPI is weekly active developers.” This is the one everyone is building elaborate machinery for — knowledge graphs, entity resolvers, contradiction detectors — because the academic problem here is genuinely hard.

The catch: in production, the three tiers are not independent products. Most user-facing agents will read from working memory every turn, episodic memory rarely, and semantic memory almost never. The ratio is roughly 100:5:1 by call volume in the systems I’ve measured.

How the three big vendors actually approach this

Letta (formerly MemGPT)

Letta is the production version of the MemGPT research paper, which framed LLM memory as an operating-system problem: there’s a small “core memory” that’s always in context, a larger “recall memory” of past messages, and an “archival memory” of long-term facts. The agent pages between them using function calls it generates itself — memory_insert, memory_search, memory_replace.
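To make the paging idea concrete, here is a toy sketch of a small always-in-context core with overflow into an archive. The three function names come from the description above; the class, the character budget, the spill-to-archive rule, and the substring search are illustrative assumptions, not Letta's actual API or behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgentMemory:
    """Toy OS-style memory: a small core that is always in context, plus an archive."""

    core_limit_chars: int = 2000  # assumed budget for the always-in-context block
    core: list[str] = field(default_factory=list)
    archival: list[str] = field(default_factory=list)

    def memory_insert(self, text: str) -> bool:
        """Add a line to core memory; spill to the archive if the budget would be exceeded."""
        used = sum(len(line) for line in self.core)
        if used + len(text) > self.core_limit_chars:
            self.archival.append(text)
            return False
        self.core.append(text)
        return True

    def memory_replace(self, old: str, new: str) -> bool:
        """Rewrite the first core line containing `old`; report whether anything changed."""
        for i, line in enumerate(self.core):
            if old in line:
                self.core[i] = line.replace(old, new)
                return True
        return False

    def memory_search(self, query: str) -> list[str]:
        """Naive case-insensitive substring search over the archive."""
        q = query.lower()
        return [line for line in self.archival if q in line.lower()]

    def render_core(self) -> str:
        """Text that would be placed in the system prompt on every turn."""
        return "\n".join(self.core)


mem = ToyAgentMemory(core_limit_chars=40)
mem.memory_insert("User prefers tabs over spaces.")           # fits in core
mem.memory_insert("Default deploy target is AWS us-east-1.")  # over budget, archived
mem.memory_replace("tabs over spaces", "spaces over tabs")
print(mem.render_core())          # User prefers spaces over tabs.
print(mem.memory_search("aws"))   # ['Default deploy target is AWS us-east-1.']
```

In the real system the model decides when to call these functions; here the calls are hard-coded, which is exactly the judgment a production deployment has to trust or replace.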

The clean part of Letta’s design is that core memory is just text in the system prompt, hot-swappable by the agent. The messy part is that archival paging depends entirely on the model’s judgment about what’s worth promoting. Those promotion decisions are noisy, so most production deployments leave automatic consolidation into archival memory unimplemented.

Letta scores 83.2% on the LoCoMo long-conversation benchmark (or 74.0% in a simpler agent configuration on GPT-4o-mini), which is roughly state-of-the-art for the OS-style approach.

Mem0

Mem0 treats memory as an extraction-and-retrieval problem, not a paging problem. After each turn, an extraction LLM call identifies “memorable” facts and writes them to a vector store; on the next turn, a separate retrieval call surfaces only what’s relevant. There’s no notion of “core vs archival.” There is just the memory store, with a semantic retrieval interface.
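The write-then-read shape of that pipeline can be sketched in a few lines. This is not Mem0's API: `ToyMemoryStore`, `keyword_extractor`, and the word-overlap scoring are hypothetical stand-ins for the extraction LLM call and the vector search.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, used as a crude proxy for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyMemoryStore:
    def __init__(self, extract: Callable[[str], list[str]]):
        self.extract = extract          # stand-in for the extraction LLM call
        self.facts: list[str] = []

    def write(self, turn: str) -> None:
        """After each turn: extract memorable facts and store the new ones."""
        for fact in self.extract(turn):
            if fact not in self.facts:  # exact-match dedup only
                self.facts.append(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Before the next turn: surface the k facts with the most word overlap."""
        q = tokenize(query)
        scored = [(len(q & tokenize(f)), f) for f in self.facts]
        scored = [(s, f) for s, f in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [f for _, f in scored[:k]]


def keyword_extractor(turn: str) -> list[str]:
    """Hypothetical extractor: keeps sentences that mention a preference or habit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", turn) if s.strip()]
    return [s for s in sentences if re.search(r"\b(prefer|prefers|always)\b", s.lower())]


store = ToyMemoryStore(keyword_extractor)
store.write("Thanks for the fix. I prefer tabs over spaces. We always deploy to us-east-1.")
print(store.retrieve("which deploy region?"))  # ['We always deploy to us-east-1.']
```

Note that every `write` costs an extractor call even when nothing memorable was said, which is where the write-path cost of this design comes from.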

Mem0’s April 2025 paper reports the numbers production teams care about:

  • p50 retrieval latency of 0.88s on LoCoMo, 1.09s on LongMemEval, 1.00-1.05s on BEAM.
  • Token cost averaging under 7,000 tokens per retrieval call, versus 25,000+ for full-context baselines (a ~90% reduction).
  • Accuracy of 67.13% LLM-as-a-Judge on LoCoMo, with the graph-augmented Mem0g variant slightly higher.
  • 26% relative improvement over OpenAI’s built-in ChatGPT memory in their benchmark.

There is a fight in the academic literature about exactly how Mem0’s LoCoMo numbers compare to Letta’s — Letta has publicly disputed Mem0’s methodology for how they ran MemGPT as a baseline — and the honest read is that both systems are competitive on benchmark and the differences are smaller than either marketing department implies.

Zep

Zep is the purest example of the “semantic memory is the hard problem” thesis. The core is Graphiti, a temporally-aware knowledge graph that ingests conversation events and updates a graph of entities, relationships, and the time ranges over which each relationship was valid. When you ask “who was the customer’s account manager last quarter?” Zep can answer because its facts are timestamped, not overwritten.
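A minimal sketch of the "timestamped, not overwritten" idea, assuming a flat list of facts rather than a graph. The class names, the `account_manager` relation, and the people are hypothetical, and none of this is Graphiti's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TemporalFact:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still valid


class ToyTemporalStore:
    def __init__(self) -> None:
        self.facts: list[TemporalFact] = []

    def assert_fact(self, subject: str, relation: str, obj: str, as_of: date) -> None:
        """Close the currently open fact for (subject, relation), then append the new one."""
        for fact in self.facts:
            if (fact.subject == subject and fact.relation == relation
                    and fact.valid_to is None):
                fact.valid_to = as_of
        self.facts.append(TemporalFact(subject, relation, obj, valid_from=as_of))

    def query(self, subject: str, relation: str, at: date) -> Optional[str]:
        """Return the object that was valid on date `at`, or None if nothing was."""
        for fact in self.facts:
            if fact.subject != subject or fact.relation != relation:
                continue
            ended = fact.valid_to is not None and at >= fact.valid_to
            if fact.valid_from <= at and not ended:
                return fact.obj
        return None


store = ToyTemporalStore()
store.assert_fact("Acme", "account_manager", "Dana", date(2025, 1, 6))
store.assert_fact("Acme", "account_manager", "Lee", date(2025, 10, 1))
print(store.query("Acme", "account_manager", date(2025, 8, 15)))  # Dana
print(store.query("Acme", "account_manager", date(2025, 11, 1)))  # Lee
```

An overwrite-style store would answer "Lee" to both questions, which is the failure the temporal design exists to avoid.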

The Zep paper reports:

  • 94.8% on DMR (the original MemGPT benchmark), against MemGPT’s 93.4%.
  • Up to 18.5% accuracy improvement on LongMemEval with 90% latency reduction compared to a baseline RAG implementation.
  • Three subgraphs — episode, semantic entity, community — explicitly mapping to the Tulving tiers.

Zep’s bet is that temporal correctness matters in long-running agent deployments (CRM, customer success, healthcare) where state changes over time and you cannot just overwrite the previous fact. For pure question-answering agents the bet is weaker.

Why most agents only use tier 1

Here is the diagram nobody puts on a slide, but it’s the one that matches what ships:

Call volume by memory tier, typical production agent:

  • Working (prompt + scratchpad): ~95%
  • Episodic (prior-session summaries): ~4%
  • Semantic (durable facts / KG): ~1%
Indicative call-volume mix from observed production agents. The semantic tier is the most-discussed and the least-used.

The reasons are mostly mundane.

The prompt is already paying for working memory. Every agent already has a system prompt, tool descriptions, and conversation history. Adding a plan-so-far field or a JSON scratchpad costs a handful of extra tokens per turn and no new infrastructure. If you’re using LangGraph or any checkpointed framework, the checkpointer is your working memory; you already wrote it, just for resumability rather than for memory.

Episodic memory is mostly a UX feature, not a model feature. When ChatGPT “remembers” you mentioned your dog’s name last week, that is a short summary string injected into the system prompt. It is genuinely useful for chatty consumer products and largely useless for task-bound B2B agents where each session is goal-directed and self-contained.

Semantic memory is hard, and the failure mode is silent. A knowledge graph that confidently stores a contradicting fact, or a deduplication step that merges two distinct entities, will degrade agent quality in ways your eval set probably can’t catch. The cost of running and operating a semantic memory system — write paths, contradiction detection, schema evolution, GDPR deletion — is proportional to the complexity of your domain, and it is rarely worth it unless your domain is genuinely entity-rich (CRM, healthcare, identity).

The Anthropic context engineering guidance leans the same direction: most agent failures are not fixed by adding a memory layer. They are fixed by making the agent’s current context sharper — better tool descriptions, clearer system prompt, structured scratchpad — i.e. by investing in tier 1.

What the production-shippers actually do

If you look at the architectures of the agents most likely to be in front of a paying user in 2026:

  • Cursor / Composer. Working memory only. Cursor checkpoints the agent’s plan, file edits, and tool call history to a local SQLite database per project. There is no episodic memory across projects in the product; “Cursor remembers your style” is a function of having .cursor/rules files in your repo, which is just a static prompt augmentation.
  • Devin. A long-horizon working memory with explicit user-initiated promotion to a “memory” surface. The user can say “remember that we use Jest, not Mocha” and Devin will save that into a memory snippet the agent reads back at the start of future sessions. There is no automatic consolidation pipeline — the promotion is human-in-the-loop.
  • Claude / ChatGPT. A small set of user-pinned facts injected as a system-prompt prefix. Both ship a UI that shows you what’s “remembered.” Both let you delete entries. Neither uses a knowledge graph; both use a string-list-of-facts model with retrieval gated by a small LLM call.
  • Block’s Goose. Working memory only, with a “rolling summary” of the conversation when context gets long.
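The rolling-summary pattern in the last item is simple enough to sketch generically. This is an assumption about the general technique, not Goose's implementation: `RollingContext`, the character budget, and `truncating_summarizer` (a stand-in for an LLM summary call) are all hypothetical.

```python
from typing import Callable

Summarizer = Callable[[str, list[str]], str]


class RollingContext:
    """Keeps the most recent messages verbatim and folds older ones into a summary."""

    def __init__(self, summarize: Summarizer, max_chars: int = 4000, keep_last: int = 4):
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        self.summarize = summarize
        self.max_chars = max_chars
        self.keep_last = keep_last
        self.summary = ""
        self.messages: list[str] = []

    def _size(self) -> int:
        return len(self.summary) + sum(len(m) for m in self.messages)

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Over budget: summarise everything except the last `keep_last` messages.
        if self._size() > self.max_chars and len(self.messages) > self.keep_last:
            old = self.messages[:-self.keep_last]
            self.messages = self.messages[-self.keep_last:]
            self.summary = self.summarize(self.summary, old)

    def render(self) -> str:
        """The text that would go into the next prompt."""
        head = [f"Summary so far: {self.summary}"] if self.summary else []
        return "\n".join(head + self.messages)


def truncating_summarizer(previous: str, dropped: list[str]) -> str:
    """Hypothetical stand-in for an LLM call: keeps 10 characters of each dropped message."""
    pieces = ([previous] if previous else []) + [m[:10] for m in dropped]
    return " ".join(pieces)


ctx = RollingContext(truncating_summarizer, max_chars=50, keep_last=2)
for text in ("a" * 20, "b" * 20, "c" * 20):
    ctx.add(text)
print(ctx.summary)        # aaaaaaaaaa
print(len(ctx.messages))  # 2
```

The summary only helps if it is shorter than what it replaces, so a real summariser needs its own length cap.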

The pattern: shippers underbuild memory and overinvest in observability of the agent’s current state. The expensive memory products are purchased by enterprise customers who think they need them and adopted by AI infrastructure teams who know they probably don’t, and they mostly end up unused by the engineers actually building features.

When you genuinely do need tier 2 or 3

The honest minority of cases where semantic memory pays off:

  1. Personal assistants with months-long horizon. “Did I tell you about the dinner I had with Sarah?” needs a model that has stored Sarah, the dinner, and the relationship. ChatGPT’s memory is sufficient for casual use; Zep-like systems are warranted when the assistant is doing real work over many months and forgetting is a UX disaster.
  2. CRM and sales-engineering agents. “Update the account with this call’s notes” requires the agent to know what an “account” is, what fields exist, and what was already there. This is fundamentally a knowledge-graph problem.
  3. Healthcare and compliance. Temporal correctness (“when did this patient start that medication?”) is non-negotiable. Zep’s temporal subgraph is genuinely the right primitive here.
  4. Multi-tenant enterprise deployments where each tenant has hundreds of business facts that need to be visible to their agent and invisible to everyone else’s. The isolation boundary plus the volume justifies real storage.

If you’re not in one of those buckets, you almost certainly don’t need semantic memory yet. You need a better system prompt and a checkpointer.

The cost math nobody walks through

The conversation about memory architectures usually skips past the operational cost, which is the part that gets you in trouble at scale. Here is a concrete back-of-envelope estimate for a workload of 100,000 active users, each sending an average of 12 turns per day:

Monthly cost of each memory tier, 100K-user agent:

  • Tier 1 (prompt tokens only): ~$2k/mo
  • Tier 1+2 (+ vector store + summariser): ~$8-12k/mo
  • Tier 1+2+3 (+ KG / graph extraction): ~$20-30k/mo
Rough order-of-magnitude monthly cost for the same agent at different memory depths. Tier 3 adds an LLM call per turn for entity extraction plus a graph database; tier 2 is dominated by vector storage and per-session summarisation.
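To see what those monthly totals imply per turn, the arithmetic can be worked through directly. The user count, turns per day, and dollar ranges are the figures quoted above; the 30-day month is an assumption added here.

```python
USERS = 100_000
TURNS_PER_USER_PER_DAY = 12
DAYS_PER_MONTH = 30  # assumption: a 30-day month

turns_per_month = USERS * TURNS_PER_USER_PER_DAY * DAYS_PER_MONTH  # 36,000,000 turns

# (low, high) monthly cost in USD, taken from the estimates above
monthly_cost_usd = {
    "tier 1": (2_000, 2_000),
    "tier 1+2": (8_000, 12_000),
    "tier 1+2+3": (20_000, 30_000),
}


def cost_per_thousand_turns(monthly_usd: float) -> float:
    """Spread a monthly bill evenly over every turn, expressed per 1,000 turns."""
    return monthly_usd / turns_per_month * 1_000


baseline = monthly_cost_usd["tier 1"][0]
for name, (low, high) in monthly_cost_usd.items():
    print(
        f"{name}: ${cost_per_thousand_turns(low):.3f} to "
        f"${cost_per_thousand_turns(high):.3f} per 1,000 turns "
        f"({low / baseline:.0f}x to {high / baseline:.0f}x tier 1)"
    )
```

On these inputs tier 1 works out to roughly $0.06 per 1,000 turns, and the full three-tier stack lands at 10x to 15x that.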

The numbers are illustrative, but the structural point holds. Tier 3 roughly 10x’s the operational cost of an agent that does the same job with tier 1 alone. The premium is justifiable for the genuinely entity-rich workloads in the previous section. It is not justifiable for most chatbots, most agentic SaaS features, or most coding assistants.

The hidden cost most teams underestimate is the write path. Every turn that produces a memorable fact triggers an LLM call to extract entities, then a graph write, then sometimes a reconciliation pass to detect contradictions. Mem0’s paper is admirably specific about this — the extraction pass adds 200-400ms to write latency. For high-throughput agents the write path becomes the bottleneck before the read path does.

The contrarian opinion

The agent memory category is real and important — and also massively oversold for the workload most teams have. The Cambrian explosion of memory startups in 2024-25 reflects the academic incentive to publish on LongMemEval more than it reflects what production agents actually do.

For the median agent project in 2026, the right memory stack is:

  • Tier 1: a JSON scratchpad in your checkpointer, with explicit fields for plan, current step, and last error.
  • Tier 2: a single string field of “user-pinned facts” the user themselves manages, max 1 KB, injected into the system prompt.
  • Tier 3: nothing.
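The stack above fits in one small module. This is a sketch under stated assumptions: the three scratchpad fields and the 1 KB cap come from the list, while `Scratchpad`, `build_system_prompt`, and the prompt layout are hypothetical names and choices, and the JSON string is whatever your checkpointer already persists.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

MAX_PINNED_BYTES = 1024  # the 1 KB cap on user-pinned facts


@dataclass
class Scratchpad:
    """Tier 1: explicit working-memory fields, serialised as JSON for the checkpointer."""

    plan: list[str] = field(default_factory=list)
    current_step: int = 0
    last_error: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Scratchpad":
        return cls(**json.loads(raw))


def build_system_prompt(base: str, pinned_facts: str, pad: Scratchpad) -> str:
    """Tier 2 is a single user-managed string, injected verbatim if it fits the cap."""
    if len(pinned_facts.encode("utf-8")) > MAX_PINNED_BYTES:
        raise ValueError("pinned facts exceed 1 KB; ask the user to trim them")
    return (
        f"{base}\n\n"
        f"User-pinned facts:\n{pinned_facts}\n\n"
        f"Scratchpad (JSON):\n{pad.to_json()}"
    )


pad = Scratchpad(plan=["read auth.py", "write diff"], current_step=1,
                 last_error="tests failed")
restored = Scratchpad.from_json(pad.to_json())  # what a resume from checkpoint does
print(restored == pad)                          # True
print(build_system_prompt("You are a coding agent.", "We use Jest, not Mocha.", pad))
```

Rejecting oversized pinned facts instead of truncating them keeps the user in control of what the agent sees, which is the point of making the field user-managed.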

If that becomes the bottleneck, then invest in Mem0 or Zep. Not before. And when you do, treat the chosen memory system as infrastructure with a service-level objective — retrieval p95 budget, accuracy on a held-out set, cost per retrieval — not as a magic dependency that will make your agent feel smarter.
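Treating memory as infrastructure means the three objectives named above become a check you can run. A minimal sketch, assuming a nearest-rank p95 and placeholder thresholds; `MemorySLO` and its field names are hypothetical.

```python
import math
from dataclasses import dataclass


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class MemorySLO:
    p95_latency_ms: float              # placeholder retrieval latency budget
    min_accuracy: float                # fraction correct on a held-out set, 0..1
    max_cost_per_retrieval_usd: float  # placeholder cost budget

    def violations(self, latencies_ms: list[float], correct: int, total: int,
                   spend_usd: float) -> list[str]:
        """Return the breached objectives; an empty list means the SLO holds."""
        problems = []
        p95 = percentile(latencies_ms, 95)
        if p95 > self.p95_latency_ms:
            problems.append(f"p95 latency {p95:.0f} ms exceeds {self.p95_latency_ms:.0f} ms")
        accuracy = correct / total
        if accuracy < self.min_accuracy:
            problems.append(f"accuracy {accuracy:.2%} below {self.min_accuracy:.2%}")
        cost = spend_usd / len(latencies_ms)
        if cost > self.max_cost_per_retrieval_usd:
            problems.append(f"cost ${cost:.4f} per retrieval exceeds budget")
        return problems


slo = MemorySLO(p95_latency_ms=300, min_accuracy=0.8, max_cost_per_retrieval_usd=0.002)
slow_tail = [100.0] * 18 + [900.0] * 2  # two slow retrievals out of twenty
print(slo.violations(slow_tail, correct=90, total=100, spend_usd=0.02))
```

With twenty samples the nearest-rank p95 is the nineteenth value, so two slow retrievals are enough to breach the latency objective while one is not.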

The headline of the memory wars is that all three vendors built genuinely impressive systems. The footnote is that for the customer base they actually serve, the cheap tier is usually enough.


Further reading: the MemGPT paper and the more recent Letta v1 agent post lay out the OS-style position. Mem0’s arXiv preprint is the most quantitatively honest of the three vendor papers. Zep’s temporal-graph paper is the right read if your domain has real temporal structure. For Anthropic’s own framing of when to add memory at all, see their context management writeup.
