Mem0 — a memory layer for agents
How Mem0 extracts salient facts from conversations and retrieves only the relevant ones at the next turn, giving agents durable personalized memory without stuffing the full history into every prompt.
What you'll learn
- Why LLMs are stateless and why naive context-stuffing is expensive and fragile
- What a memory layer is: extraction, storage, and relevance-ranked retrieval
- How to use Mem0's Python API to add memories and search them across sessions
The core problem: LLMs are stateless
Every call to a language model starts fresh. The model has no persistent state of its own — whatever context you send is all it knows about the world beyond its training data. This means “memory” is always your responsibility as the developer: you decide what goes into the prompt, and you pay (in tokens and latency) for every word of it.
The naive approach is context stuffing — prepending the full conversation history to every new turn. It works for short sessions but breaks down quickly:
- Cost scales linearly. Ten prior sessions of 2,000 tokens each means 20,000 tokens in the context before the user even types their first word.
- Relevance collapses. A month-old discussion about a cancelled order is almost never relevant to today’s shipping question, yet it still occupies prime context real estate.
- Context limits bite. Even very large windows have a ceiling, and the model's ability to use information buried in a long context often degrades well before you reach it.
What you actually want is selective long-term memory: keep the facts that matter, discard the noise, and inject only what is relevant for the current turn.
What Mem0 is
Mem0 is a memory layer — a service that sits between your agent and the underlying LLM, handling the extraction, storage, and retrieval of salient facts across sessions. The core idea:
- After (or during) a conversation turn, Mem0 passes the exchange to an LLM that extracts discrete memory items — compact facts like “User prefers vegetarian food” or “User’s timezone is Asia/Kolkata”.
- Those items are stored in a vector store (Mem0 manages this for you, or you can point it at your own).
- On the next turn, before you call your agent LLM, you query Mem0 with the current message. It performs a semantic search over stored memories and returns the top-k most relevant ones.
- You inject those memories into your system prompt. The agent sees only what is relevant — not the full history.
The extraction step is the key innovation. Rather than storing raw conversation turns, Mem0 distills them into discrete facts that can be searched efficiently and stay compact even after hundreds of sessions.
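The store-then-retrieve half of that flow can be sketched with nothing but the standard library. This is a toy illustration, not Mem0's implementation: `ToyMemoryStore`, `tokenize`, and `cosine` are hypothetical names, word counts stand in for real embeddings, and the facts are written by hand where Mem0 would use an extraction LLM.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words; a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemoryStore:
    """Hypothetical in-memory store: facts go in, the top-k similar facts come out."""

    def __init__(self) -> None:
        self._facts: list[tuple[str, Counter]] = []

    def add(self, fact: str) -> None:
        self._facts.append((fact, tokenize(fact)))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = tokenize(query)
        ranked = sorted(self._facts, key=lambda item: cosine(q, item[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]


store = ToyMemoryStore()
# In Mem0 an LLM produces these facts; here they are written by hand.
for fact in [
    "User prefers vegetarian food",
    "User's timezone is Asia/Kolkata",
    "User prefers dark mode",
]:
    store.add(fact)

top = store.search("Which food does the user like?", k=1)
print(top)
```

Word overlap is a poor proxy for meaning (it would miss "vegan" versus "vegetarian"), which is why a real memory layer relies on learned embeddings for the ranking step.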
Memory types
Mem0 organizes memories into three scopes, distinguished by the identifier each memory is tied to:
| Type | Scope | Example |
|---|---|---|
| User memory | Tied to a specific user ID | “Prefers dark mode”, “Vegetarian” |
| Agent memory | Tied to the agent itself | “Always greet users by first name” |
| Session memory | Scoped to a single run | “Currently troubleshooting login issue” |
In practice, user memory is the most commonly used — it is what lets your agent feel like it knows a returning customer.
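One way to picture the three scopes is as a filter over stored items. The sketch below is an illustration only and says nothing about how Mem0 filters internally; `MemoryItem` and `visible_memories` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    text: str
    scope: str     # "user", "agent", or "session"
    scope_id: str  # the user ID, agent ID, or run ID the item belongs to


def visible_memories(items, user_id: str, agent_id: str, run_id: str) -> list[str]:
    """Return the texts of items whose scope matches the current turn."""
    active = {("user", user_id), ("agent", agent_id), ("session", run_id)}
    return [item.text for item in items if (item.scope, item.scope_id) in active]


items = [
    MemoryItem("Prefers dark mode", "user", "alice"),
    MemoryItem("Vegetarian", "user", "bob"),
    MemoryItem("Always greet users by first name", "agent", "support-bot"),
    MemoryItem("Currently troubleshooting login issue", "session", "run-42"),
]

# Alice talking to support-bot in a new run: the old session memory drops out.
print(visible_memories(items, user_id="alice", agent_id="support-bot", run_id="run-43"))
```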
The architecture at a glance
Top row: a conversation turn is distilled into facts and stored. Bottom row: the next session retrieves only relevant memories and injects them into context.
Why this beats naive context-stuffing
| Approach | Cost per turn | Relevance | Session limit |
|---|---|---|---|
| Full history replay | O(total tokens) | Low — everything in | Hits window fast |
| Mem0 memory layer | O(k retrieved facts) | High (semantic match) | Prompt size depends on k, not on history length |
The extracted facts are typically one short sentence each. Even a user with 200 sessions might yield 40 distinct memory items. Injecting the top 5 relevant ones costs a few hundred tokens at most.
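The arithmetic behind that comparison is easy to check. The session figures below come from the earlier example (ten sessions of 2,000 tokens); the 20 tokens per fact is an assumption chosen to represent one short sentence, not a measured value.

```python
def replay_tokens(sessions: int, tokens_per_session: int) -> int:
    """Tokens prepended when every prior session is replayed verbatim."""
    return sessions * tokens_per_session


def memory_tokens(k: int, tokens_per_fact: int) -> int:
    """Tokens prepended when only the top-k extracted facts are injected."""
    return k * tokens_per_fact


replay = replay_tokens(sessions=10, tokens_per_session=2_000)
injected = memory_tokens(k=5, tokens_per_fact=20)  # 20 tokens per fact is assumed

print(f"full replay: {replay} tokens, top-5 memories: {injected} tokens")
```

The replay figure grows with every additional session, while the injected figure stays fixed as long as k does.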
The Mem0 Python API — core shape
The canonical workflow has two operations: add memories from a completed conversation, and search for relevant memories at the start of the next turn.
from mem0 import Memory
# Initialise — Mem0 manages the vector store and extraction LLM internally.
# Pass a config dict to customise the LLM provider or vector store backend.
# See current Mem0 docs for the full config schema.
m = Memory()
# --- After a conversation turn ---
# `messages` follows the standard OpenAI-style chat format.
messages = [
{"role": "user", "content": "I'm vegetarian and I live in Berlin."},
{"role": "assistant", "content": "Got it — I'll keep that in mind!"},
]
# add() sends the exchange to the extraction LLM, which distills facts,
# then stores them in the vector store under the given user_id.
result = m.add(messages, user_id="alice")
# result contains the memory items that were written (ids, text, etc.)
# --- At the start of the NEXT session ---
user_message = "Can you suggest a restaurant near me for tonight?"
# search() embeds the query and returns the top-k most semantically
# relevant stored memories for this user.
relevant = m.search(query=user_message, user_id="alice")
# relevant["results"] is a list of memory objects; inject their text into the prompt.
memory_block = "\n".join(item["memory"] for item in relevant["results"])
system_prompt = f"""You are a helpful assistant.
What you know about this user:
{memory_block}
"""
# Now call your agent LLM with system_prompt + the current user_message.
A few things worth noting about this shape:
- user_id is the key that scopes memories. Use a stable identifier; a database primary key works well.
- The extraction step happens inside add(). You do not need to manually decide which facts to store; the extraction LLM does that.
- search() returns ranked results, so you can take [:3] or [:5] to stay within a token budget.
- For agent-scoped or session-scoped memories, Mem0 also accepts agent_id and run_id parameters on both calls. Check the current docs for the exact keyword argument names, as these evolve across releases.
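Trimming the ranked results to a budget can be wrapped in a small helper. The sketch below assumes the result shape used in the snippet above (a dict with a "results" list whose entries carry a "memory" string), which may differ between Mem0 releases. `build_memory_block` is a hypothetical helper, and it uses a character count as a rough proxy for tokens.

```python
def build_memory_block(search_result: dict, max_items: int = 5, max_chars: int = 400) -> str:
    """Join the top-ranked memory texts, stopping at an item or character limit."""
    lines = []
    used = 0
    for item in search_result.get("results", [])[:max_items]:
        line = f"- {item['memory']}"
        if used + len(line) > max_chars:
            break
        lines.append(line)
        used += len(line) + 1  # +1 accounts for the joining newline
    return "\n".join(lines)


# Hand-written stand-in for a search() result; the shape is an assumption.
fake_result = {
    "results": [
        {"memory": "Is vegetarian"},
        {"memory": "Lives in Berlin"},
        {"memory": "Prefers dark mode"},
    ]
}

block = build_memory_block(fake_result, max_items=2)
print(block)
```

If you need an exact token budget, swap the character count for your model's tokenizer.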
Updating and deleting memories
Mem0 tracks memory items by ID. When the extraction LLM detects a contradiction (say, a user who previously said “I live in Berlin” now says “I just moved to Amsterdam”), Mem0 can update the existing item rather than create a duplicate. This conflict resolution is handled internally, but you can also call m.update() or m.delete() explicitly if you need programmatic control over what is stored.
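The update-instead-of-duplicate behaviour can be illustrated with a toy store. This is not how Mem0 detects contradictions (it relies on an LLM for that); here a hand-supplied `slot` label decides whether two facts conflict, and `ToyFactStore` and its methods are hypothetical names for this sketch.

```python
import uuid


class ToyFactStore:
    """Hypothetical store that overwrites a fact when a new one fills the same slot."""

    def __init__(self) -> None:
        self.items: dict[str, dict] = {}  # memory_id -> {"slot": ..., "text": ...}

    def upsert(self, slot: str, text: str) -> tuple[str, str]:
        """Return (event, memory_id), where event is "ADD" or "UPDATE"."""
        for memory_id, item in self.items.items():
            if item["slot"] == slot:
                item["text"] = text  # same slot: replace the old fact in place
                return "UPDATE", memory_id
        memory_id = str(uuid.uuid4())
        self.items[memory_id] = {"slot": slot, "text": text}
        return "ADD", memory_id

    def delete(self, memory_id: str) -> None:
        self.items.pop(memory_id, None)


store = ToyFactStore()
event1, id1 = store.upsert("home_city", "Lives in Berlin")
event2, id2 = store.upsert("home_city", "Lives in Amsterdam")
print(event1, event2, store.items[id2]["text"])
```

Keeping the same ID across the update means anything that referenced the old memory now sees the corrected fact.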
Putting it together in an agent loop
A realistic agent turn looks like this; call_llm is a placeholder for whichever model client you use:
def agent_turn(user_id: str, user_message: str) -> str:
    # 1. Retrieve relevant memories for this user
    relevant = m.search(query=user_message, user_id=user_id)
    memory_text = "\n".join(r["memory"] for r in relevant["results"][:5])

    # 2. Build the prompt with injected memories
    system = f"You are a helpful assistant.\n\nAbout this user:\n{memory_text}"

    # 3. Call your agent LLM (any provider)
    response = call_llm(system=system, user=user_message)

    # 4. Store the new exchange as memories for future sessions
    m.add(
        [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": response},
        ],
        user_id=user_id,
    )
    return response
The loop is the same regardless of which underlying LLM or framework you use. Mem0 is framework-agnostic — it works equally well with a raw OpenAI call, a LangChain chain, or a LangGraph agent.
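To exercise the loop without API keys or network access, you can swap in stand-ins for both the memory layer and the model. Everything below is hypothetical scaffolding for testing: `FakeMemory` copies only the add()/search() shape used in this article (no extraction, no ranking), and `call_llm` is a stub that reports how many memory lines it received.

```python
class FakeMemory:
    """Offline stand-in exposing the add()/search() shape used in this article."""

    def __init__(self) -> None:
        self._by_user: dict[str, list[str]] = {}

    def add(self, messages: list[dict], user_id: str) -> None:
        # No extraction LLM here: store each user utterance verbatim.
        for msg in messages:
            if msg["role"] == "user":
                self._by_user.setdefault(user_id, []).append(msg["content"])

    def search(self, query: str, user_id: str) -> dict:
        # No ranking here: return everything stored for the user, newest first.
        texts = reversed(self._by_user.get(user_id, []))
        return {"results": [{"memory": t} for t in texts]}


def call_llm(system: str, user: str) -> str:
    """Stub model: counts the memory lines that follow the 'About this user:' header."""
    memory_part = system.split("About this user:\n", 1)[1]
    count = len([line for line in memory_part.splitlines() if line])
    return f"(stub) saw {count} memories for: {user}"


def agent_turn(m, user_id: str, user_message: str) -> str:
    relevant = m.search(query=user_message, user_id=user_id)
    memory_text = "\n".join(r["memory"] for r in relevant["results"][:5])
    system = f"You are a helpful assistant.\n\nAbout this user:\n{memory_text}"
    response = call_llm(system=system, user=user_message)
    m.add(
        [
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": response},
        ],
        user_id=user_id,
    )
    return response


memory = FakeMemory()
first = agent_turn(memory, "alice", "I'm vegetarian and I live in Berlin.")
second = agent_turn(memory, "alice", "Suggest a restaurant near me.")
print(first)
print(second)
```

Passing the memory object in as a parameter, rather than reading a module-level `m`, is a design choice made here so the same function runs against the fake in tests and the real client in production.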