datarekha

Agent Memory — Working, Episodic & Semantic

How agents remember across turns and sessions. The cognitive-science taxonomy production systems borrow — working, episodic, semantic, procedural — what Letta, Mem0, and Zep actually ship, why memory is not RAG, and why most agents need far less of it than the hype implies.

8 min read · Intermediate · Agentic AI · Lesson 25 of 29

What you'll learn

  • The four memory types — working, episodic, semantic, procedural — and what each is for
  • Why "working memory" in production is just checkpointed thread state, not a magic buffer
  • How memory differs from RAG (and why the distinction is load-bearing)
  • The vendor landscape — Letta/MemGPT, Mem0, Zep/Graphiti — without the hype
  • Why most agents need only working memory plus maybe a thin slice of semantic

Before you start

The honest version of this topic is shorter than the marketing version. Memory in an agent is just state that survives — across turns, or across sessions. The interesting question is which state survives, how long, and where it lives. The field borrowed a taxonomy from cognitive science to name the answers, and that taxonomy — working, episodic, semantic, procedural — is now the reference vocabulary, anchored by the CoALA paper (Cognitive Architectures for Language Agents, 2023).

Four kinds of memory

Working (EPHEMERAL): the live context window / scratchpad
  • this turn’s inputs + reasoning
  • in practice: checkpointed thread state
  • clears when the session ends

Episodic (DURABLE): a log of past interactions / experiences
  • event histories, trajectories
  • retrieved later by similarity
  • “what happened last Tuesday?”

Semantic (DURABLE): de-contextualized facts & knowledge
  • user preferences, profile
  • vector store or knowledge graph
  • “Maya prefers aisle seats”

Procedural (FOUNDATIONAL): learned how-to / skills
  • implicit: baked into the weights
  • explicit: the agent’s code & prompts
  • designer-set; risky to mutate live

Figure: one word, “memory”, covering four mechanisms and three lifespans. Working memory is the only ephemeral one.

Read the diagram by lifespan, not by name. Working memory is the only ephemeral one — it is the live context the model reasons over this turn, and it is gone when the session ends. The other three persist. Episodic logs what happened (events, trajectories) and is recalled by similarity later. Semantic holds durable facts — the user’s name, their preferences — the de-contextualized things you want true tomorrow. Procedural is the agent’s how-to: mostly its code and prompts, plus whatever is implicit in the model weights. CoALA stresses procedural memory “must be initialized by the designer” — it is foundational and risky to rewrite at runtime, so most systems treat it as static.

A common pattern links the durable two: repeated episodic observations get consolidated into one semantic fact. The agent notices “user corrected the date format three times” (three episodes) and abstracts it into a single durable preference: “user prefers DD/MM/YYYY.”
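The consolidation step can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `Episode` fields, the `consolidate` function, and the threshold of three repeats are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One logged observation. `kind` and `value` are hypothetical fields."""
    kind: str   # e.g. "date_format_correction"
    value: str  # e.g. "DD/MM/YYYY"


def consolidate(episodes: list[Episode], min_repeats: int = 3) -> dict[str, str]:
    """Promote any (kind, value) seen at least `min_repeats` times to a fact."""
    counts = Counter((e.kind, e.value) for e in episodes)
    facts: dict[str, str] = {}
    for (kind, value), n in counts.most_common():
        # most_common() is sorted by count, so the first value stored per
        # kind is the most frequent one; setdefault keeps it.
        if n >= min_repeats:
            facts.setdefault(kind, value)
    return facts


episodes = [
    Episode("date_format_correction", "DD/MM/YYYY"),
    Episode("date_format_correction", "DD/MM/YYYY"),
    Episode("seat_request", "aisle"),
    Episode("date_format_correction", "DD/MM/YYYY"),
]
semantic_facts = consolidate(episodes)
print(semantic_facts)  # {'date_format_correction': 'DD/MM/YYYY'}
```

The single seat request stays episodic: one event is not yet evidence of a durable preference.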

The production reality: working memory is checkpointed state

Here is the part the cognitive-science framing obscures. In real systems, “working memory” is not a special biological buffer. It is checkpointed agent state, and you have already met it. In LangGraph persistence, short-term memory is thread-scoped state: a checkpointer (InMemorySaver, PostgresSaver, or a Redis-backed saver) snapshots the graph state after each super-step and restores it on the next invoke for the same thread_id. Start a new thread_id and the agent begins from empty state; the old thread’s snapshots still exist but are never loaded into the new one. That is working memory: ephemeral in spirit, but durably snapshotted so the agent can resume mid-conversation.
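To make the thread-scoping concrete, here is a toy sketch of the snapshot-and-restore idea. It deliberately does not use LangGraph's API: `ToyCheckpointer` and `run_turn` are hypothetical names, and a plain dict stands in for a real backing store.

```python
import copy


class ToyCheckpointer:
    """Hypothetical in-memory stand-in for a checkpointer (not LangGraph's API)."""

    def __init__(self) -> None:
        self._snapshots: dict[str, dict] = {}

    def save(self, thread_id: str, state: dict) -> None:
        # Deep-copy so later mutation of `state` cannot alter the snapshot.
        self._snapshots[thread_id] = copy.deepcopy(state)

    def load(self, thread_id: str) -> dict:
        # An unknown thread_id starts from empty state.
        return copy.deepcopy(self._snapshots.get(thread_id, {"messages": []}))


def run_turn(saver: ToyCheckpointer, thread_id: str, user_msg: str) -> dict:
    """Restore the thread's state, append one turn, snapshot it again."""
    state = saver.load(thread_id)
    state["messages"].append(user_msg)
    saver.save(thread_id, state)
    return state


saver = ToyCheckpointer()
run_turn(saver, "thread-1", "my name is Maya")
resumed = run_turn(saver, "thread-1", "book me a flight")  # same thread: resumes
fresh = run_turn(saver, "thread-2", "hello")               # new thread: starts empty
print(len(resumed["messages"]), len(fresh["messages"]))    # prints: 2 1
```

Nothing is deleted when `thread-2` starts; its state is simply keyed separately, which is all "the session ended" means here.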

This is what the overwhelming majority of agents actually rely on day to day. The elaborate stack is often unnecessary.

Memory is not RAG

This is the distinction that trips everyone up, so be precise about it.

RAG retrieves external documents at query time to answer a question, then forgets. It is stateless and read-only at query time: pull the top-k relevant chunks from an index, stuff them in context, answer “what does the document say?”, done. (The retrieval machinery — embeddings and vector search — is shared with memory, which is why the two get conflated.)

Memory persists the agent’s own state, history, and learned facts across sessions. Crucially, memory has a distinct write phase that RAG lacks: extract a fact, decide whether to store it, store it — and only then, later, read it back. Letta puts it bluntly: retrieval “is a tool for agent memory, [but] it is not memory in of itself.”

RAG: external knowledge, stateless
  “What does the document say?”
  • pulls chunks from a doc index
  • read-only at query time
  • no write phase; forgets after
  about the WORLD’s documents

Memory: the agent’s own state, stateful
  “What has this user told me?”
  • persists history & learned facts
  • distinct WRITE phase, then read
  • survives across sessions
  about the AGENT & the USER

Figure: same retrieval plumbing, opposite jobs. They are complementary: many agents use RAG for knowledge AND memory for personalization.

They are not rivals. A production agent often uses RAG to answer questions about a knowledge base and memory to personalize across sessions.
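The contrast is easiest to see side by side. The sketch below is a toy under stated assumptions: word overlap stands in for embedding similarity, a string-prefix check stands in for LLM-based fact extraction, and `rag_answer` and `UserMemory` are hypothetical names.

```python
from typing import Optional


def rag_answer(index: list[str], query: str, k: int = 2) -> list[str]:
    """Stateless read: rank chunks by word overlap (stand-in for vector search)."""
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]  # nothing is stored; the index is untouched


class UserMemory:
    """Stateful store with an explicit write phase before any read."""

    def __init__(self) -> None:
        self.facts: dict[str, str] = {}

    def write(self, message: str) -> bool:
        """Extract a candidate fact, decide whether to keep it, then store it."""
        prefix = "my name is "
        text = message.strip()
        if text.lower().startswith(prefix):  # toy extractor, not an LLM
            self.facts["name"] = text[len(prefix):].rstrip(".")
            return True
        return False  # nothing worth storing

    def read(self, key: str) -> Optional[str]:
        return self.facts.get(key)


handbook = ["Refunds are issued within 14 days.", "Premium plans include phone support."]
top = rag_answer(handbook, "how long do refunds take", k=1)

memory = UserMemory()
stored = memory.write("My name is Maya")
skipped = memory.write("What is the refund policy?")
print(top, memory.read("name"))
```

`rag_answer` can be called a thousand times and the system knows nothing new afterward; `write` changes what the agent will know in every later session.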

See it: run a session, then start a new one

The widget below is the whole lesson in miniature. Run a short sequence of interactions, watch which panel each piece of information lands in and how the context-window meter fills with working memory — then hit Start new session and see the split: working memory clears, the durable stores survive, and a retrieval step pulls them back into the fresh context window.

The aha: not all memory is the same. “Maya’s name” went to a durable semantic slot because you want it true forever; “booked a flight to Tokyo” went to episodic because it is an event you might recall later; the raw turns sat in working memory and evaporated the moment the session ended. Pick the memory type for the durability you need — and notice how little of it the durable side actually has to hold.
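For readers without the interactive version, the same split can be approximated in code. This is a simplified sketch, not the lesson's widget: the `Agent` class and its methods are hypothetical, and "retrieval" here just re-injects every semantic fact plus the most recent event rather than doing a similarity search.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    working: list[str] = field(default_factory=list)        # raw turns, this session only
    episodic: list[str] = field(default_factory=list)       # durable event log
    semantic: dict[str, str] = field(default_factory=dict)  # durable facts

    def observe(self, turn: str) -> None:
        self.working.append(turn)

    def remember_event(self, event: str) -> None:
        self.episodic.append(event)

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value

    def start_new_session(self) -> None:
        # Working memory is dropped, then a retrieval step re-seeds the
        # fresh context from the durable stores.
        self.working = [f"{k}: {v}" for k, v in self.semantic.items()]
        if self.episodic:
            self.working.append(f"last event: {self.episodic[-1]}")


agent = Agent()
agent.observe("my name is Maya")
agent.remember_fact("name", "Maya")
agent.observe("book a flight to Tokyo")
agent.remember_event("booked a flight to Tokyo")

agent.start_new_session()
print(agent.working)  # ['name: Maya', 'last event: booked a flight to Tokyo']
```

The raw turns are gone after the reset; only what was deliberately written to a durable store comes back.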

The vendor landscape (without the hype)

Three names come up constantly. They are genuinely good engineering; the trap is reaching for them before you need them.

| System | What it actually is | Memory model |
| --- | --- | --- |
| MemGPT / Letta | The OS-analogy framework. MemGPT (2023) introduced virtual context management, which pages data between the window and external storage via tool calls. Letta is the production framework descended from it. | Three tiers: core (editable in-context “memory blocks”, like RAM) / recall (searchable conversation history, auto-saved) / archival (external DB, vector or graph, queried on demand). The agent moves data between tiers itself. |
| Mem0 | A commercial memory layer you bolt onto an agent. | Two-phase pipeline per exchange: extraction (an LLM distills messages into candidate facts) then update (each candidate is matched by vector similarity, and an LLM picks ADD / UPDATE / DELETE / NOOP to stay consistent). Vector store + optional entity-relationship graph memory. |
| Zep / Graphiti | A temporal knowledge-graph memory service. | Bi-temporal: tracks both when an event occurred and when it was ingested; every edge carries a validity interval. When new facts conflict with old ones it invalidates the stale edge rather than deleting it, so history is preserved. |
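The update phase described for Mem0 can be illustrated with a toy. This is not Mem0's API: matching by dictionary key stands in for vector similarity, the rule-based `decide` function stands in for the LLM judge, and a `None` candidate is an assumed convention for "the user retracted this fact".

```python
from typing import Optional


def decide(existing: Optional[str], candidate: Optional[str]) -> str:
    """Rule-based stand-in for the LLM that picks the operation."""
    if existing is None:
        return "ADD" if candidate is not None else "NOOP"
    if candidate is None:
        return "DELETE"  # candidate retracts the stored fact
    if candidate == existing:
        return "NOOP"    # already known, nothing to do
    return "UPDATE"


def apply_candidates(store: dict[str, str], candidates: dict[str, Optional[str]]) -> list[str]:
    """Match each candidate to the store, pick an operation, apply it."""
    ops = []
    for key, value in candidates.items():
        op = decide(store.get(key), value)
        if op in ("ADD", "UPDATE"):
            store[key] = value
        elif op == "DELETE":
            del store[key]
        ops.append(f"{op} {key}")
    return ops


store = {"seat": "window", "diet": "vegetarian", "city": "Pune"}
ops = apply_candidates(
    store, {"seat": "aisle", "diet": "vegetarian", "city": None, "name": "Maya"}
)
print(ops)  # ['UPDATE seat', 'NOOP diet', 'DELETE city', 'ADD name']
```

The point of the four-way decision is consistency: without it, the store would accumulate both "window" and "aisle" and the agent would have to guess which is current.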

A few honest caveats. The “context window = RAM, external store = disk” analogy is MemGPT’s, and it is genuinely useful. Zep reports edging out MemGPT on the Deep Memory Retrieval benchmark (94.8% vs 93.4%) and large gains on the harder LongMemEval — but those are the vendor’s own paper figures, not independently reproduced here, so read them as “Zep reports,” not as neutral fact. For the open-source primitives, LangGraph/LangMem give you the two pieces directly: thread-scoped checkpointers (working memory) and a cross-thread store with namespaces and optional semantic search (long-term memory). Anthropic and OpenAI both ship first-party memory features too — Anthropic’s file-based memory tool lets the model keep a NOTES.md-style file in a /memories directory that persists across sessions (public beta).

If you want the longer tour of these architectures, the blog post “the three kinds of memory production agents actually use” goes deeper on the trade-offs.

Governance: the actually-hard part

Storing a fact is easy. The hard problems are what to remember, when to forget, and how to correct.

  • What to remember. You cannot keep everything — promote only salient facts. Mem0’s LLM-judged ADD/UPDATE/DELETE/NOOP is exactly this selective-promotion decision made automatically.
  • Correction & expiry. Better systems version rather than destroy. Zep invalidates stale edges via validity intervals (keeping the history); Mem0 issues UPDATE/DELETE. Blind overwrite loses the audit trail.
  • Context bloat. Stuffing the window with retrieved memories leaves the model less room to reason and measurably degrades quality. This is why the frontier mantra is context engineering — keep the minimal set of high-signal tokens, retrieve just in time. Bigger context windows shift this trade-off; they do not solve the governance problem.
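The "version rather than destroy" idea from the correction bullet can be sketched as follows. This is an illustration only, not Graphiti's schema or API: the `Edge` fields, `assert_fact`, and `current` are hypothetical, and the sketch assumes each (subject, predicate) pair holds a single value at a time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime                 # when the fact became true
    ingested_at: datetime                # when the system learned it
    valid_to: Optional[datetime] = None  # None means still valid


def assert_fact(edges: list[Edge], new: Edge) -> None:
    """Add a fact; close the validity interval of any conflicting edge."""
    for e in edges:
        same_slot = (e.subject, e.predicate) == (new.subject, new.predicate)
        if same_slot and e.valid_to is None and e.obj != new.obj:
            e.valid_to = new.valid_from  # invalidate, but keep for history
    edges.append(new)


def current(edges: list[Edge], subject: str, predicate: str) -> Optional[str]:
    """Return the value of the one edge that is still valid, if any."""
    for e in edges:
        if (e.subject, e.predicate) == (subject, predicate) and e.valid_to is None:
            return e.obj
    return None


edges: list[Edge] = []
assert_fact(edges, Edge("Maya", "plan", "basic", datetime(2024, 1, 1), datetime(2024, 1, 2)))
assert_fact(edges, Edge("Maya", "plan", "premium", datetime(2024, 6, 1), datetime(2024, 6, 3)))
print(current(edges, "Maya", "plan"), len(edges))  # premium 2
```

Both edges survive, so "what plan was Maya on in March?" remains answerable, which a blind overwrite would have made impossible.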

Quick check

Q1. A user says 'my name is Maya' in session 1. In session 2 (a fresh thread), the agent still greets her by name. Which memory type made that possible, and which one did NOT?

Q2. What is the load-bearing difference between memory and RAG?

Q3. TRANSFER: You are building a customer-support copilot. It must (a) answer policy questions from a 500-page handbook, and (b) remember each customer's plan tier and past tickets. A teammate proposes a full Letta-style core/recall/archival stack PLUS a Zep temporal graph PLUS Mem0 for everything. What is the most defensible critique?

Next

You now have the taxonomy and, more importantly, the judgment to use less of it than the hype implies. Two threads continue from here: the engineering of durable state lives in LangGraph persistence (checkpointers and stores), and the risk lives in memory poisoning — because the moment memory persists, it becomes something an attacker wants to write to.

Practice this in an interview

What is Retrieval-Augmented Generation (RAG) and why is it used?

RAG couples a retrieval step — fetching relevant documents from an external store — with a generative model so the LLM can answer questions about knowledge it was never trained on. It solves the stale-knowledge and hallucination problems without retraining. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.

How do function/tool calling and LLM agents work at a high level?

Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.

When should you use RAG vs fine-tuning vs a long-context model?

RAG is the default for dynamic, proprietary, or frequently updated knowledge. Fine-tuning is correct when you need to change the model's behavior, format, or domain-specific reasoning style — not just its knowledge. Long-context models are appropriate when your entire knowledge base fits in a single context window and latency is acceptable.

What chunking strategies exist for RAG and how do you choose between them?

Chunking splits source documents into retrievable units before embedding. The right strategy depends on document structure, query style, and the model's context window. Fixed-size chunks are simple but break mid-sentence; semantic or structural chunking preserves coherence; hierarchical chunking enables parent-document retrieval for richer context.
