What types of memory do agents use, and what is context engineering and compaction?

Agents use short-term memory (the working context window) and long-term memory stored in vector databases or files, often split into episodic, semantic, and procedural memory. Context engineering is the discipline of curating what goes into the limited context window, and compaction summarizes or prunes older history so the agent retains key information without overflowing the window or degrading from too much noise.
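
As a rough illustration of compaction, the sketch below keeps the most recent turns verbatim and collapses older ones into a single summary turn once a budget is exceeded. The names `estimate_tokens`, `summarize`, and `compact` are hypothetical, the word count is only a crude proxy for tokens, and the truncating summarizer stands in for what would usually be an LLM call.

```python
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: one "token" per whitespace-separated word.
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # Hypothetical summarizer: keeps the first four words of each turn.
    # A real system would usually ask an LLM for the summary instead.
    heads = [" ".join(t.split()[:4]) for t in turns]
    return "SUMMARY: " + " | ".join(heads)

def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """Return history unchanged if it fits the budget; otherwise replace
    everything except the last `keep_recent` turns with one summary turn."""
    total = sum(estimate_tokens(t) for t in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

history = [
    "user: my name is Maya and I live in Lisbon",
    "agent: nice to meet you Maya",
    "user: please book a flight to Tokyo next week",
    "agent: booked, confirmation sent",
]
compacted = compact(history, budget=15)
```

The design choice here is to summarize from the oldest end, on the assumption that recent turns matter most for the next step.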

What is an AI agent, and how does it differ from a single LLM call?

An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.
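
The loop described above can be sketched in a few lines. Everything here is illustrative: `fake_model` is a hard-coded stand-in for an LLM's decision, and the `TOOLS` registry with its single `add` tool is hypothetical.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

def fake_model(goal: str, observations: list[str]) -> tuple[str, str]:
    # Stand-in for an LLM decision; returns (action, argument).
    # With no observations yet it calls a tool; once it has a result it finishes.
    if not observations:
        return ("add", goal)
    return ("finish", observations[-1])

def run_agent(goal: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        action, arg = fake_model(goal, observations)  # reason and choose
        if action == "finish":
            return arg                                # goal met, stop looping
        result = TOOLS[action](arg)                   # act: call the chosen tool
        observations.append(result)                   # observe the result
    return "gave up: step limit reached"

answer = run_agent("2+3")
```

A single LLM call would correspond to one pass through `fake_model` with no tool execution and no repeat.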

What is Retrieval-Augmented Generation (RAG) and why is it used?

RAG couples a retrieval step, which fetches relevant documents from an external store, with a generative model so the LLM can answer questions about knowledge it was never trained on. It mitigates stale knowledge and reduces hallucination without retraining, though it does not eliminate either. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.

What is Retrieval-Augmented Generation (RAG) and how does a basic RAG pipeline work?

RAG augments an LLM by retrieving relevant documents from an external knowledge store at query time and feeding them into the prompt as grounding context. A basic pipeline chunks and embeds documents into a vector store, retrieves the top-k most similar chunks for a query, and the LLM generates an answer conditioned on them, reducing hallucination and keeping knowledge current.
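
A toy version of that pipeline, under heavy simplifying assumptions: the "embedding" is a bag-of-words count rather than a learned vector, the "vector store" is a Python list, and the final LLM call is omitted so only the assembled prompt is produced. All function names are hypothetical.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 8) -> list[str]:
    # Split into fixed-size word windows (real pipelines often overlap chunks).
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts instead of a learned vector.
    return Counter(w.strip(".,?").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    # Rank stored chunks by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = ("Refunds are issued within 14 days. Shipping takes 3 to 5 business days. "
        "Support is open on weekdays.")
index = [(c, embed(c)) for c in chunk(docs, size=6)]
question = "How long do refunds take?"
context = retrieve(question, index, k=1)
prompt = "Answer using only this context:\n" + "\n".join(context) + "\nQuestion: " + question
```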

Agent Memory — Working, Episodic & Semantic — Agentic AI

The honest version of this topic is shorter than the marketing version. Memory in an agent is just state that survives — across turns, or across sessions. The interesting question is which state survives, how long, and where it lives. The field borrowed a taxonomy from cognitive science to name the answers, and that taxonomy — working, episodic, semantic, procedural — is now the reference vocabulary, anchored by the CoALA paper (Cognitive Architectures for Language Agents, 2023).

Try: Memory-type explorer

Run a session, then start a new one: watch what survives

Each turn lands in a memory type. Working memory fills the live context window; durable facts get promoted to episodic or semantic stores. Start a new session and working memory clears while the durable stores do not.

Working (live context): the live context window / scratchpad. Ephemeral, checkpointed thread state that resets per session.
Episodic (interaction log): a log of past interactions, retrieved later by similarity.
Semantic (durable facts): durable facts about the user and the world, kept in a vector store or knowledge graph.
Procedural (how-to / skills): learned how-to. Mostly designer-provided code and prompts, rarely mutated at runtime.

Four kinds of memory

Same word, “memory” — four mechanisms, three lifespans. The dashed box is the only ephemeral one.

Read the diagram by lifespan, not by name. Working memory is the only ephemeral one — it is the live context the model reasons over this turn, and it is gone when the session ends. The other three persist. Episodic logs what happened (events, trajectories) and is recalled by similarity later. Semantic holds durable facts — the user’s name, their preferences — the de-contextualized things you want true tomorrow. Procedural is the agent’s how-to: mostly its code and prompts, plus whatever is implicit in the model weights. CoALA stresses procedural memory “must be initialized by the designer” — it is foundational and risky to rewrite at runtime, so most systems treat it as static.

A common pattern links episodic and semantic memory: repeated episodic observations get consolidated into one semantic fact. The agent notices “user corrected the date format three times” (three episodes) and abstracts it into a single durable preference: “user prefers DD/MM/YYYY.”
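
One way to picture that consolidation step is a simple count-and-promote rule. This is a sketch under a strong assumption: episodes have already been normalized to identical strings, whereas a real system would more likely group them by embedding similarity or ask an LLM to abstract the pattern. `Episode` and `consolidate` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Episode:
    session: int
    observation: str  # normalized description of what happened

def consolidate(episodes: list[Episode], min_repeats: int = 3) -> list[str]:
    """Promote any observation seen at least `min_repeats` times to a semantic fact."""
    counts = Counter(e.observation for e in episodes)
    return [obs for obs, n in counts.items() if n >= min_repeats]

episodic_log = [
    Episode(1, "user prefers DD/MM/YYYY"),
    Episode(2, "user asked about Tokyo flights"),
    Episode(3, "user prefers DD/MM/YYYY"),
    Episode(4, "user prefers DD/MM/YYYY"),
]
semantic_facts = consolidate(episodic_log)
```

The one-off Tokyo question stays in the episodic log; only the repeated observation crosses the threshold.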

The production reality: working memory is checkpointed state

Here is the part the cognitive-science framing obscures. In real systems, “working memory” is not a special biological buffer. It is checkpointed agent state, and you have already met it. In LangGraph persistence, short-term memory is thread-scoped state: a checkpointer (InMemorySaver, PostgresSaver, or a Redis-backed saver) snapshots the graph state at each super-step and restores it on the next invoke for the same thread_id. Start a new thread_id and the new thread begins from empty state; the old thread's snapshots are simply not loaded. That is working memory: ephemeral in spirit, but durably snapshotted so the agent can resume mid-conversation.

Checkpointed thread state is what the overwhelming majority of agents actually rely on day to day. The more elaborate memory stack is often not needed.
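
To make the thread-scoping concrete, here is a toy checkpointer in plain Python. It is deliberately not the LangGraph API: `ToyCheckpointer` and `invoke` are hypothetical names, and a single append stands in for running a graph step.

```python
import copy

class ToyCheckpointer:
    """Stores the latest state snapshot per thread_id (not the LangGraph API)."""

    def __init__(self) -> None:
        self._snapshots: dict[str, dict] = {}

    def save(self, thread_id: str, state: dict) -> None:
        # Deep-copy so later mutations do not alter the stored snapshot.
        self._snapshots[thread_id] = copy.deepcopy(state)

    def load(self, thread_id: str) -> dict:
        # An unknown thread gets fresh, empty working memory.
        return copy.deepcopy(self._snapshots.get(thread_id, {"messages": []}))

def invoke(checkpointer: ToyCheckpointer, thread_id: str, user_message: str) -> dict:
    state = checkpointer.load(thread_id)     # restore prior state for this thread
    state["messages"].append(user_message)   # one step of "work"
    checkpointer.save(thread_id, state)      # snapshot after the step
    return state

cp = ToyCheckpointer()
invoke(cp, "thread-1", "my name is Maya")
resumed = invoke(cp, "thread-1", "book a flight to Tokyo")
fresh = invoke(cp, "thread-2", "hello")
```

The same `thread_id` resumes with both messages; a different `thread_id` sees none of them, even though the first thread's snapshot still exists in storage.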

Memory is not RAG

This is the distinction that trips everyone up, so be precise about it.

RAG retrieves external documents at query time to answer a question, then forgets. It is stateless and read-only at query time: pull the top-k relevant chunks from an index, stuff them in context, answer “what does the document say?”, done. (The retrieval machinery — embeddings and vector search — is shared with memory, which is why the two get conflated.)

Memory persists the agent’s own state, history, and learned facts across sessions. Crucially, memory has a distinct write phase that RAG lacks: extract a fact, decide whether to store it, store it — and only then, later, read it back. Letta puts it bluntly: retrieval “is a tool for agent memory, [but] it is not memory in of itself.”

Same retrieval plumbing, opposite jobs. The two are complementary, not rivals: a production agent often uses RAG to answer questions about a knowledge base and memory to personalize across sessions.
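
The contrast can be sketched with two small classes. Both are hypothetical: `ReadOnlyIndex` uses keyword overlap in place of vector search, and `AgentMemory.extract` is a single regular expression standing in for an LLM-based extractor. The point is only that one side has no write path at query time and the other has an explicit extract, decide, store sequence.

```python
import re
from typing import Optional

class ReadOnlyIndex:
    """RAG side: documents are indexed ahead of time; queries never write."""

    def __init__(self, docs: list[str]) -> None:
        self._docs = list(docs)

    def search(self, query: str) -> list[str]:
        terms = set(query.lower().split())
        return [d for d in self._docs if terms & set(d.lower().split())]

class AgentMemory:
    """Memory side: a write phase (extract, decide, store) precedes any read."""

    def __init__(self) -> None:
        self.facts: dict[str, str] = {}

    def extract(self, message: str) -> Optional[tuple[str, str]]:
        # Hypothetical extractor: only recognizes "my name is X".
        match = re.search(r"my name is (\w+)", message, re.IGNORECASE)
        return ("name", match.group(1)) if match else None

    def write(self, message: str) -> bool:
        candidate = self.extract(message)
        if candidate is None:        # decide: nothing worth keeping
            return False
        key, value = candidate
        self.facts[key] = value      # store
        return True

    def read(self, key: str) -> Optional[str]:
        return self.facts.get(key)

handbook = ReadOnlyIndex(["Refunds are issued within 14 days"])
memory = AgentMemory()
stored = memory.write("Hi, my name is Maya")
ignored = memory.write("What is the refund policy?")
hits = handbook.search("refunds policy")
```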

Watch the session boundary

Trace one concrete story across two sessions. In session 1 the user says “my name is Maya,” books a flight to Tokyo, and chats. All of it sits in working memory, the live context window. As the salient facts surface, the agent writes the durable ones out: her name and “Maya prefers aisle seats” to semantic, “booked a Tokyo flight” to episodic. Then the session ends.

One story, two sessions: working memory clears on a new thread; the durable semantic and episodic stores survive and are retrieved back into the fresh window.

When session 2 opens on a new thread_id, working memory is empty — the live context cleared with the old session. The durable stores survive, and a retrieval step pulls the relevant facts back into the fresh window, so the agent greets Maya by name without her repeating it.

The aha: not all memory is the same. “Maya’s name” went to a durable semantic slot because you want it true forever; “booked a flight to Tokyo” went to episodic because it is an event you might recall later; the raw turns sat in working memory and evaporated the moment the session ended. Pick the memory type for the durability you need — and notice how little of it the durable side actually has to hold.
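
The two-session story can be traced in code. This is a minimal sketch with made-up names (`DurableStore`, `run_session`) and hard-coded promotion rules where a real agent would use an LLM extractor; it also loads every stored fact into the new session, whereas a real system would retrieve only what is relevant.

```python
class DurableStore:
    """Survives across sessions; holds semantic facts and episodic events."""

    def __init__(self) -> None:
        self.semantic: dict[str, str] = {}
        self.episodic: list[str] = []

def run_session(store: DurableStore, turns: list[str]) -> list[str]:
    # Working memory starts empty, then is seeded from the durable store.
    working = [f"{key}: {value}" for key, value in store.semantic.items()]
    working += [f"past event: {event}" for event in store.episodic]
    name_prefix = "my name is "
    for turn in turns:
        working.append(turn)
        # Hypothetical promotion rules standing in for an LLM extractor.
        if turn.startswith(name_prefix):
            store.semantic["name"] = turn[len(name_prefix):]
        if turn.startswith("book "):
            store.episodic.append(turn)
    return working  # the caller discards this when the session ends

store = DurableStore()
session1 = run_session(store, ["my name is Maya", "book a flight to Tokyo", "thanks!"])
session2 = run_session(store, ["hi again"])
```

The raw "thanks!" turn never reaches session 2, while the name and the booking do.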

The vendor landscape (without the hype)

Three names come up constantly. They are genuinely good engineering; the trap is reaching for them before you need them.

System	What it actually is	Memory model
MemGPT / Letta	The OS-analogy framework. MemGPT (2023) introduced virtual context management — pages data between the window and external storage via tool calls. Letta is the production framework descended from it.	Three tiers: core (editable in-context “memory blocks”, like RAM) / recall (searchable conversation history, auto-saved) / archival (external DB, vector or graph, queried on demand). The agent moves data between tiers itself.
Mem0	An open-source memory layer (with a hosted commercial platform) that you bolt onto an agent.	Two-phase pipeline per exchange: extraction (an LLM distills messages into candidate facts) then update (each candidate is matched by vector similarity, and an LLM picks `ADD` / `UPDATE` / `DELETE` / `NOOP` to stay consistent). Vector store + optional entity-relationship graph memory.
Zep / Graphiti	A temporal knowledge-graph memory service.	Bi-temporal: tracks both when an event occurred and when it was ingested; every edge carries a validity interval. When new facts conflict with old ones it invalidates the stale edge rather than deleting it — history is preserved.

A few honest caveats. The “context window = RAM, external store = disk” analogy is MemGPT’s, and it is genuinely useful. Zep reports edging out MemGPT on the Deep Memory Retrieval benchmark (94.8% vs 93.4%) and large gains on the harder LongMemEval — but those are the vendor’s own paper figures, not independently reproduced here, so read them as “Zep reports,” not as neutral fact. For the open-source primitives, LangGraph/LangMem give you the two pieces directly: thread-scoped checkpointers (working memory) and a cross-thread store with namespaces and optional semantic search (long-term memory). Anthropic and OpenAI both ship first-party memory features too — Anthropic’s file-based memory tool lets the model keep a NOTES.md-style file in a /memories directory that persists across sessions (public beta).
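
To give a feel for the update phase described for Mem0 in the table above, here is a rule-based imitation. It is not Mem0's code or API: exact key matching replaces vector similarity, fixed rules replace the LLM's judgment, and `decide_op` and `apply_candidate` are hypothetical names.

```python
from typing import Optional

def decide_op(store: dict[str, str], key: str, value: Optional[str]) -> str:
    """Pick an operation for one candidate fact against the existing store."""
    if key not in store:
        return "NOOP" if value is None else "ADD"
    if value is None:
        return "DELETE"   # candidate retracts an existing fact
    if store[key] == value:
        return "NOOP"     # already known, nothing to change
    return "UPDATE"       # same subject, new value

def apply_candidate(store: dict[str, str], key: str, value: Optional[str]) -> str:
    op = decide_op(store, key, value)
    if op in ("ADD", "UPDATE") and value is not None:
        store[key] = value
    elif op == "DELETE":
        del store[key]
    return op

facts: dict[str, str] = {}
ops = [
    apply_candidate(facts, "seat", "aisle"),   # new fact
    apply_candidate(facts, "seat", "aisle"),   # duplicate
    apply_candidate(facts, "seat", "window"),  # changed preference
    apply_candidate(facts, "seat", None),      # retraction
]
```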

If you want the longer tour of these architectures, the blog post “The three kinds of memory production agents actually use” goes deeper on the trade-offs.

Governance: the actually-hard part

Storing a fact is easy. The hard problems are what to remember, when to forget, and how to correct.

What to remember. You cannot keep everything — promote only salient facts. Mem0’s LLM-judged ADD/UPDATE/DELETE/NOOP is exactly this selective-promotion decision made automatically.
Correction & expiry. Better systems version rather than destroy. Zep invalidates stale edges via validity intervals (keeping the history); Mem0 issues UPDATE/DELETE. Blind overwrite loses the audit trail.
Context bloat. Stuffing the window with retrieved memories leaves the model less room to reason and measurably degrades quality. This is why the frontier mantra is context engineering — keep the minimal set of high-signal tokens, retrieve just in time. Bigger context windows shift this trade-off; they do not solve the governance problem.
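
The "version rather than destroy" idea can be sketched with validity intervals. This is a simplified, hypothetical structure (`Fact`, `VersionedFacts`), not Zep's implementation: it tracks only when a fact was valid, using integer timestamps, and leaves out the separate ingestion-time axis that a bi-temporal store would also record.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fact:
    subject: str
    value: str
    valid_from: int                 # logical timestamp when the fact became true
    valid_to: Optional[int] = None  # None means "still valid"

class VersionedFacts:
    def __init__(self) -> None:
        self.history: list[Fact] = []

    def assert_fact(self, subject: str, value: str, at: int) -> None:
        # Close the interval of any currently valid fact for this subject
        # instead of deleting it, so the audit trail is preserved.
        for fact in self.history:
            if fact.subject == subject and fact.valid_to is None:
                fact.valid_to = at
        self.history.append(Fact(subject, value, valid_from=at))

    def current(self, subject: str) -> Optional[str]:
        for fact in self.history:
            if fact.subject == subject and fact.valid_to is None:
                return fact.value
        return None

    def as_of(self, subject: str, at: int) -> Optional[str]:
        # Return the value whose interval [valid_from, valid_to) contains `at`.
        for fact in self.history:
            end = fact.valid_to if fact.valid_to is not None else float("inf")
            if fact.subject == subject and fact.valid_from <= at < end:
                return fact.value
        return None

facts = VersionedFacts()
facts.assert_fact("plan_tier", "free", at=1)
facts.assert_fact("plan_tier", "pro", at=5)
```

After the second write, `current` reports the new tier while `as_of` can still answer what was true earlier, which a blind overwrite could not.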

In one breath

“Memory” is just state that survives — across turns or sessions; the real questions are which state survives, how long, and where it lives.
Four types, read by lifespan: working (ephemeral live context = checkpointed thread state), episodic (durable event log, recalled by similarity), semantic (durable facts/preferences), procedural (the agent’s code/prompts/weights — foundational, designer-set).
Memory is not RAG: RAG reads external documents at query time and forgets; memory persists the agent’s own state with a distinct write phase (extract → decide → store), then reads it back later.
The honest thesis: most agents need only working memory plus a thin slice of semantic — full episodic + semantic + graph + procedural stacks (Letta, Mem0, Zep) are frequently over-engineering.
The hard part is governance — what to remember, when to forget/correct (version, don’t blind-overwrite), and context bloat — and durable memory is an attack surface (poisoning persists across sessions).

Quick check

Q1. A user says 'my name is Maya' in session 1. In session 2 (a fresh thread), the agent still greets her by name. Which memory type made that possible, and which one did NOT?

Q2. What is the load-bearing difference between memory and RAG?

Q3. TRANSFER: You are building a customer-support copilot. It must (a) answer policy questions from a 500-page handbook, and (b) remember each customer's plan tier and past tickets. A teammate proposes a full Letta-style core/recall/archival stack PLUS a Zep temporal graph PLUS Mem0 for everything. What is the most defensible critique?

You now have the taxonomy and, more importantly, the judgment to use less of it than the hype implies. Two threads continue from here: the engineering of durable state lives in LangGraph persistence (checkpointers and stores), and the risk lives in memory poisoning — because the moment memory persists, it becomes something an attacker wants to write to.
