What types of memory do agents use, and what is context engineering and compaction?

Agents use short-term memory (the working context window) and long-term memory stored in vector databases or files, often split into episodic, semantic, and procedural memory. Context engineering is the discipline of curating what goes into the limited context window, and compaction summarizes or prunes older history so the agent retains key information without overflowing the window or degrading from too much noise.
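
As a rough illustration of compaction, the sketch below replaces older turns with a single summary message once a token budget is exceeded. The names compact, rough_token_count, and naive_summarize are hypothetical, the word-count token estimate is a crude assumption, and a real system would call an LLM to write the summary.

```python
from typing import Callable

Message = dict[str, str]


def rough_token_count(text: str) -> int:
    # Crude assumption: about one token per whitespace-separated word.
    return len(text.split())


def naive_summarize(messages: list[Message]) -> str:
    # Placeholder summarizer: a real system would call an LLM here.
    return " | ".join(m["content"][:40].strip() for m in messages)


def compact(
    history: list[Message],
    budget: int,
    keep_recent: int,
    summarize: Callable[[list[Message]], str] = naive_summarize,
) -> list[Message]:
    """Replace older turns with one summary once the budget is exceeded."""
    total = sum(rough_token_count(m["content"]) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # nothing to compact
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    summary = {
        "role": "system",
        "content": "Summary of earlier turns: " + summarize(older),
    }
    return [summary] + recent
```

The most recent turns are kept verbatim because they usually carry the detail the next step depends on; only the older tail is compressed.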

What is an AI agent, and how does it differ from a single LLM call?

An agent is an LLM placed in a loop where it reasons, chooses and calls tools or actions, observes the results, and repeats until a goal is met, rather than producing one response and stopping. The key differences are autonomy, tool use, memory and state, and multi-step control flow driven by the model's own decisions.
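
The loop described above can be sketched in a few lines. This is a minimal illustration, not a framework: scripted_model stands in for a real LLM call, and the TOOLS registry and its single "add" tool are hypothetical.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking a string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def scripted_model(transcript: list[str]) -> dict[str, str]:
    """Stand-in for an LLM: picks the next step from the transcript so far."""
    if not any(line.startswith("observation:") for line in transcript):
        return {"action": "add", "input": "2+3"}
    answer = transcript[-1].removeprefix("observation: ")
    return {"action": "finish", "input": answer}


def run_agent(goal: str, model=scripted_model, max_steps: int = 5) -> str:
    transcript = [f"goal: {goal}"]
    for _ in range(max_steps):
        step = model(transcript)  # reason and choose an action
        if step["action"] == "finish":
            return step["input"]  # goal met: stop looping
        result = TOOLS[step["action"]](step["input"])  # call the tool
        transcript.append(f"observation: {result}")  # observe, then repeat
    raise RuntimeError("step budget exhausted before the goal was met")
```

A single LLM call corresponds to one invocation of model; the agent is everything around it: the transcript that carries state, the tool dispatch, and the stopping condition.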

When the KV cache doesn't fit in GPU VRAM, what are your options?

The KV cache is working memory — it's re-read to generate every token — so it has to stay fast. When VRAM fills, you offload the least-active sessions down a memory hierarchy: GPU VRAM (active, ~3 TB/s), CPU RAM over PCIe (idle, ~50 GB/s), local SSD (long-idle), and networked storage (cold/durable only, never live decode). Idle sessions are parked lower and promoted back to VRAM on activity. The alternative is to drop the cache and recompute the prefill when the session returns; for long prompts, offloading and reloading usually beats recomputing attention over thousands of tokens.
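
To make the offload-versus-recompute trade-off concrete, here is a back-of-the-envelope calculation. The model dimensions in the usage note (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions chosen for illustration, and prefill throughput is left as a parameter because it depends heavily on hardware and model.

```python
def kv_cache_bytes(
    tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 cache
) -> int:
    # Factor of 2 because both keys and values are cached per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


def reload_seconds(cache_bytes: int, link_gb_per_s: float) -> float:
    # Time to copy the cache back to VRAM over a link of the given bandwidth.
    return cache_bytes / (link_gb_per_s * 1e9)


def recompute_seconds(tokens: int, prefill_tokens_per_s: float) -> float:
    # Time to rebuild the cache by re-running prefill over the prompt.
    return tokens / prefill_tokens_per_s
```

Under these assumptions an 8,000-token session holds kv_cache_bytes(8000, 32, 8, 128) = 1,048,576,000 bytes, about 1.05 GB, and reloading it over a ~50 GB/s PCIe link takes roughly 21 ms. Recomputing wins only if prefill can process those 8,000 tokens faster than that.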

How do state-space models like Mamba differ from attention, and when would you use one?

A state-space model carries a fixed-size hidden state forward through the sequence like a selective recurrence, giving O(N) time and constant per-step memory with no KV cache that grows with context. Attention instead compares every token to every other, which is O(N^2) but allows exact lookup of any past token. Mamba's gates are input-dependent, recovering much of attention's content-awareness; the trade-off is that a fixed state can't recall arbitrary far-back tokens as precisely. In practice, hybrids that interleave Mamba layers with a few attention layers give near-linear cost with near-attention quality.
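
The contrast can be seen in a toy scalar example. This is not Mamba's actual parameterization (which uses discretized state matrices); it is a minimal gated recurrence, written only to show a fixed-size state next to a history that grows with the sequence.

```python
import math


def selective_scan(xs: list[float]) -> list[float]:
    """Toy selective recurrence: one scalar state, input-dependent gate."""
    h = 0.0
    outputs = []
    for x in xs:
        gate = 1.0 / (1.0 + math.exp(-x))  # gate depends on the current input
        h = (1.0 - gate) * h + gate * x    # fixed-size state, O(1) per step
        outputs.append(h)
    return outputs


def causal_attention(xs: list[float]) -> list[float]:
    """Toy causal attention where query, key, and value all equal x."""
    outputs = []
    for t, q in enumerate(xs):
        past = xs[: t + 1]                 # the growing "KV cache"
        weights = [math.exp(q * k) for k in past]
        z = sum(weights)
        outputs.append(sum(w * v for w, v in zip(weights, past)) / z)
    return outputs
```

selective_scan touches each input once and keeps only h, so total work is O(N). causal_attention revisits every earlier position at every step, so total work is O(N^2), but it can weight any past value directly instead of relying on what survived in the state.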

Mem0 — a memory layer for agents — Agentic AI

The core problem: LLMs are stateless

Every call to a language model starts fresh. The model has no persistent state of its own — whatever context you send is all it knows about the world beyond its training data. This means “memory” is always your responsibility as the developer: you decide what goes into the prompt, and you pay (in tokens and latency) for every word of it.

The naive approach is context stuffing — prepending the full conversation history to every new turn. It works for short sessions but breaks down quickly:

Cost scales linearly. Ten prior sessions of 2,000 tokens each means 20,000 tokens in the context before the user even types their first word.
Relevance collapses. A month-old discussion about a cancelled order is almost never relevant to today’s shipping question, yet it still occupies prime context real estate.
Context limits bite. Even with very large windows, there is a ceiling, and retrieval quality often degrades long before you reach it.

What you actually want is selective long-term memory: keep the facts that matter, discard the noise, and inject only what is relevant for the current turn.

What Mem0 is

Mem0 is a memory layer — a service that sits between your agent and the underlying LLM, handling the extraction, storage, and retrieval of salient facts across sessions. The core idea:

After (or during) a conversation turn, Mem0 passes the exchange to an LLM that extracts discrete memory items — compact facts like “User prefers vegetarian food” or “User’s timezone is Asia/Kolkata”.
Those items are stored in a vector store (Mem0 manages this for you, or you can point it at your own).
On the next turn, before you call your agent LLM, you query Mem0 with the current message. It performs a semantic search over stored memories and returns the top-k most relevant ones.
You inject those memories into your system prompt. The agent sees only what is relevant — not the full history.
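
The four steps above can be sketched with a toy stand-in. ToyMemory is hypothetical and is not how Mem0 is implemented: extraction here just splits user sentences (Mem0 uses an LLM), and search uses word overlap instead of embeddings. It only illustrates the extract, store, search, inject flow.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyMemory:
    def __init__(self) -> None:
        self.items: dict[str, list[str]] = {}  # user_id -> stored facts

    def add(self, messages: list[dict], user_id: str) -> list[str]:
        # Steps 1-2: "extract" facts (naively, one per user sentence), store them.
        facts = []
        for m in messages:
            if m["role"] != "user":
                continue
            for sentence in re.split(r"(?<=[.!?])\s+", m["content"].strip()):
                if sentence:
                    facts.append(sentence)
        self.items.setdefault(user_id, []).extend(facts)
        return facts

    def search(self, query: str, user_id: str, limit: int = 3) -> list[str]:
        # Step 3: rank stored facts by Jaccard word overlap with the query.
        q = tokenize(query)
        scored = []
        for fact in self.items.get(user_id, []):
            f = tokenize(fact)
            score = len(q & f) / len(q | f) if q | f else 0.0
            scored.append((score, fact))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in scored[:limit] if score > 0]


def build_system_prompt(memories: list[str]) -> str:
    # Step 4: inject only the retrieved facts, not the full history.
    return "You are a helpful assistant.\n\nAbout this user:\n" + "\n".join(memories)
```

Swapping the sentence splitter for an LLM extraction call and the overlap score for embedding similarity gives the shape of a real memory layer; the interface stays the same.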

The extraction step is the key innovation. Rather than storing raw conversation turns, Mem0 distills them into discrete facts that can be searched efficiently and that stay compact even after hundreds of sessions.

Memory types

Mem0 organizes memories into three scopes, distinguished by what each memory is tied to:

Type	Scope	Example
User memory	Tied to a specific user ID	“Prefers dark mode”, “Vegetarian”
Agent memory	Tied to the agent itself	“Always greet users by first name”
Session memory	Scoped to a single run	“Currently troubleshooting login issue”

In practice, user memory is the most commonly used — it is what lets your agent feel like it knows a returning customer.

The architecture at a glance

Top row: a conversation turn is distilled into facts and stored. Bottom row: the next session retrieves only relevant memories and injects them into context.

Why this beats naive context-stuffing

Approach	Cost per turn	Relevance	Session limit
Full history replay	O(total tokens)	Low — everything in	Hits window fast
Mem0 memory layer	O(k extracted facts)	High — semantic match	Scales indefinitely

The extracted facts are typically one short sentence each. Even a user with 200 sessions might yield 40 distinct memory items. Injecting the top 5 relevant ones costs a few hundred tokens at most.
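
The arithmetic behind that comparison is simple enough to write down. The 20-tokens-per-fact figure in the usage note is an assumption for a one-sentence memory item; the session numbers come from the earlier context-stuffing example.

```python
def replay_tokens(sessions: int, tokens_per_session: int) -> int:
    # Full history replay: every prior session is re-sent on each turn.
    return sessions * tokens_per_session


def memory_tokens(top_k: int, tokens_per_fact: int) -> int:
    # Memory layer: only the top-k retrieved facts are injected.
    return top_k * tokens_per_fact
```

With ten prior sessions of 2,000 tokens, replay_tokens(10, 2000) is 20,000 tokens per turn, while memory_tokens(5, 20) is 100. The first number grows with every session; the second is bounded by k regardless of how long the user has been around.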

The Mem0 Python API — core shape

The canonical workflow has two operations: add memories from a completed conversation, and search for relevant memories at the start of the next turn.

from mem0 import Memory

# Initialise — Mem0 manages the vector store and extraction LLM internally.
# Pass a config dict to customise the LLM provider or vector store backend.
# See current Mem0 docs for the full config schema.
m = Memory()

# --- After a conversation turn ---
# `messages` follows the standard OpenAI-style chat format.
messages = [
    {"role": "user",      "content": "I'm vegetarian and I live in Berlin."},
    {"role": "assistant", "content": "Got it — I'll keep that in mind!"},
]

# add() sends the exchange to the extraction LLM, which distills facts,
# then stores them in the vector store under the given user_id.
result = m.add(messages, user_id="alice")
# result contains the memory items that were written (ids, text, etc.)

# --- At the start of the NEXT session ---
user_message = "Can you suggest a restaurant near me for tonight?"

# search() embeds the query and returns the top-k most semantically
# relevant stored memories for this user.
relevant = m.search(query=user_message, user_id="alice")

# relevant is a dict whose "results" key holds the ranked memory items;
# inject each item's text into the prompt.
memory_block = "\n".join(item["memory"] for item in relevant["results"])

system_prompt = f"""You are a helpful assistant.

What you know about this user:
{memory_block}
"""

# Now call your agent LLM with system_prompt + the current user_message.

A few things worth noting about this shape:

user_id is the key that scopes memories. Use a stable identifier — a database primary key works well.
The extraction step happens inside add(). You do not need to manually decide which facts to store; the extraction LLM does that.
search() returns ranked results — you can take [:3] or [:5] to stay within a token budget.
For agent-scoped or session-scoped memories, Mem0 also accepts agent_id and run_id parameters on both calls. Check the current docs for the exact keyword argument names, as these evolve across releases.
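
One way to keep injected memories within a token budget is a small helper that walks the ranked results and stops at whichever limit is hit first. select_memories is a hypothetical helper, not part of Mem0; it assumes the result shape used in the example above (a dict with a "results" list whose items carry a "memory" string) and uses word count as a crude proxy for tokens.

```python
def select_memories(
    search_result: dict,
    max_items: int = 5,
    token_budget: int = 200,
) -> list[str]:
    """Take top-ranked memory texts until the item or token limit is reached."""
    chosen: list[str] = []
    used = 0
    for item in search_result.get("results", [])[:max_items]:
        cost = len(item["memory"].split())  # crude word-count token estimate
        if used + cost > token_budget:
            break  # results are ranked, so stop rather than skip ahead
        chosen.append(item["memory"])
        used += cost
    return chosen
```

In the earlier example this would replace the plain join: memory_block = "\n".join(select_memories(relevant)).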

When a user later contradicts something already stored, the memory layer has two jobs: retrieve the relevant fact (the easy part, handled by semantic search) and reconcile the contradiction so the new statement replaces the old one rather than sitting beside it. The second is the hard one, and it’s what the next section is about.

Updating and deleting memories

Mem0 tracks memory items by ID. When the extraction LLM detects a contradiction — say, a user who previously said “I live in Berlin” now says “I just moved to Amsterdam” — Mem0 can update the existing item rather than create a duplicate. This conflict resolution is handled internally, but you can also call m.update() or m.delete() explicitly if you need programmatic control over what is stored.
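
The update-in-place idea can be sketched with a toy store. FactStore and its "slot" key are hypothetical: Mem0 relies on an LLM to judge whether a new fact contradicts an existing one, whereas this stand-in uses an exact-match slot name purely to show why tracking items by ID avoids duplicates.

```python
import itertools


class FactStore:
    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.items: dict[int, dict[str, str]] = {}  # id -> {"slot", "text"}

    def upsert(self, slot: str, text: str) -> tuple[int, str]:
        """Update the item occupying this slot, or add a new one."""
        for item_id, item in self.items.items():
            if item["slot"] == slot:
                item["text"] = text  # overwrite the stale fact in place
                return item_id, "UPDATE"
        item_id = next(self._ids)
        self.items[item_id] = {"slot": slot, "text": text}
        return item_id, "ADD"

    def delete(self, item_id: int) -> None:
        # Explicit removal, e.g. when a user asks for a fact to be forgotten.
        del self.items[item_id]
```

Storing "Lives in Berlin" and then "Lives in Amsterdam" under the same slot leaves one item with the same ID and the newer text, rather than two facts that disagree.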

Putting it together in an agent loop

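A realistic agent session looks like the sketch below, where m is the Memory instance created earlier and call_llm is a placeholder for whichever provider call you use: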

def agent_turn(user_id: str, user_message: str) -> str:
    # 1. Retrieve relevant memories for this user
    relevant = m.search(query=user_message, user_id=user_id)
    memory_text = "\n".join(r["memory"] for r in relevant["results"][:5])

    # 2. Build the prompt with injected memories
    system = f"You are a helpful assistant.\n\nAbout this user:\n{memory_text}"

    # 3. Call your agent LLM (any provider)
    response = call_llm(system=system, user=user_message)

    # 4. Store the new exchange as memories for future sessions
    m.add(
        [
            {"role": "user",      "content": user_message},
            {"role": "assistant", "content": response},
        ],
        user_id=user_id,
    )

    return response

The loop is the same regardless of which underlying LLM or framework you use. Mem0 is framework-agnostic — it works equally well with a raw OpenAI call, a LangChain chain, or a LangGraph agent.

In one breath

LLMs are stateless — every call starts fresh, so “memory” is the developer’s job, and naive context-stuffing (replay the whole history) blows up cost, buries relevance, and hits the window ceiling.
Mem0 is a memory layer: an extraction LLM distills each exchange into discrete facts, stores them as vectors, and at the next turn a semantic search returns only the top-k relevant ones to inject.
It scopes memory by user_id (most common), plus optional agent_id / run_id; add() and search() are the two core calls.
On contradictions it updates in place by item ID rather than duplicating — and you can also update() / delete() explicitly.
Memory is personal data that goes stale — build review/expiry, let users see and delete what’s stored, and treat it with full data-governance rigour.

Quick check

Q1. Why does injecting the full conversation history into every prompt fail at scale?

Q2. In Mem0, what does the extraction step produce?

Q3. You are building a travel-planning agent. A user told it six months ago that they hate layovers. Today they are booking a round-the-world trip with tight scheduling and may now accept a short connection. Which Mem0 concern does this scenario illustrate?

Durable memory is powerful — and a target. The moment memory persists, an attacker wants to write to it; agent security covers prompt injection and the least-privilege controls that contain it.
