The Knowledge Hub

AI Memory

Every AI agent forgets at the end of every session. What you build to prevent that shapes how well the system works six months in.

11 min read · Knowledge Hub module · by Kenny

Last reviewed June 2026

What this edition covers

Initial June 2026 edition — foundational concepts established
Anthropic memory tool (type: memory_20250818) documented as production-grade filesystem primitive
Mem0 April-2026 benchmarks summarised: 92.5 on LoCoMo, 94.4 on LongMemEval, 26% improvement over OpenAI built-in memory
100K-token cost inflection point named with concrete figures: ~6,900 tokens (Mem0) vs ~26,000 tokens (full-context), 91% latency reduction
Six open production gaps catalogued from Mem0 State of AI Agent Memory 2026

Every AI agent forgets at the end of every session. That’s not a flaw in any particular model or product — it’s the structural reality of how current AI systems work. The context window resets with every API call. Anything the agent learned, decided, or noticed during a session disappears when the session closes, unless the infrastructure around it explicitly persists that state.

For a single-session use of AI, this doesn’t matter much. For a system expected to run a brand’s operations alongside a team over months, it matters enormously. Memory is the mechanism that turns a capable AI tool into a system that actually improves over time.

The working memory boundary

Working memory is what an agent holds within its active context window during a single session: the current conversation, recently retrieved facts, task state. It resets with every API call. Persistent memory is information stored outside the context window in a durable medium — files, vector databases, graph stores, relational databases — and retrieved either at session start or on demand during a session.

The boundary between them is hard. There is no automatic promotion from working to persistent. The agent, or its infrastructure, must explicitly write state out before the context closes. This is a design decision, not an emergent property of the system.
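The explicit write-out can be sketched in a few lines. This is a minimal illustration, assuming a JSON file as the durable medium; the `SessionState` class, its fields, and the file name are hypothetical and not taken from any framework.

```python
import json
import tempfile
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class SessionState:
    """Working memory: exists only while this object is alive."""
    decisions: list = field(default_factory=list)
    corrections: list = field(default_factory=list)


def persist(state: SessionState, path: Path) -> None:
    """Explicitly promote working memory to persistent memory."""
    previous = json.loads(path.read_text()) if path.exists() else []
    previous.append(asdict(state))
    path.write_text(json.dumps(previous, indent=2))


def load_history(path: Path) -> list:
    """Read back whatever earlier sessions chose to write out."""
    return json.loads(path.read_text()) if path.exists() else []


store = Path(tempfile.mkdtemp()) / "memory.json"

# Session 1: the agent learns something, then writes it out before closing.
session = SessionState()
session.decisions.append("Use formal tone for investor updates")
persist(session, store)
del session  # the context closes; working memory is gone

# Session 2: nothing carries over except what was persisted.
history = load_history(store)
```

If the `persist` call is skipped, `history` comes back empty in the second session. Nothing in the system promotes state automatically.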

As of mid-2026, context windows are large: Claude Sonnet 4.6 and Gemini 2.5 Pro both support 1M tokens.¹ These are large enough to matter, but three forces make them insufficient as the sole memory mechanism for operational agents. First, cost: holding a 1M-token context in the KV cache consumes substantial GPU memory per request — from tens to well over a hundred gigabytes depending on the model and its precision — which makes full-context approaches expensive at operating scale.¹ Second, retrieval degradation: Stanford’s “lost in the middle” research showed that LLMs perform significantly worse at retrieving information placed in the middle of long contexts compared to information at the beginning or end — longer contexts don’t scale linearly in quality.² Third, temporal span: operational workflows run for months. No context window spans that. Cross-session memory requires persistent memory by definition.
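The GPU-memory range can be sanity-checked with the standard KV-cache size formula: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length × bytes per element. The model shapes below are hypothetical and chosen for illustration only; they are not published figures for Claude or Gemini.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int) -> float:
    """KV cache size for one request, in gigabytes (10**9 bytes)."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 1e9


ONE_MILLION = 1_000_000

# Hypothetical mid-size model with grouped-query attention and an 8-bit cache.
small = kv_cache_gb(layers=32, kv_heads=8, head_dim=128,
                    seq_len=ONE_MILLION, bytes_per_elem=1)

# Hypothetical larger model with a 16-bit cache.
large = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                    seq_len=ONE_MILLION, bytes_per_elem=2)

print(f"small: {small:.1f} GB, large: {large:.1f} GB")
# prints "small: 65.5 GB, large: 327.7 GB"
```

Under these assumed shapes, a single 1M-token request occupies tens to hundreds of gigabytes before any model weights are counted.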

The boundary cannot be wished away. It can only be designed for.

What RAG does and doesn’t do

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLM responses in stored information. Source documents are chunked, encoded as vector embeddings, indexed in a vector database, and retrieved by similarity at inference time. Retrieved content is injected into the prompt; the LLM generates a response conditioned on both its parametric knowledge and the retrieved material.

RAG works well for the problem it was designed to solve: document retrieval. Applied to agent memory, it’s necessary but structurally insufficient. There are three gaps, and they’re architectural, not implementation details.

RAG cannot update state. When a preference changes, a RAG system stores both the old and new preference as vectors. Retrieval surfaces both; the agent must arbitrate. This is memory accumulation, not memory management.

RAG retrieves by similarity, not truth or recency. The nearest-neighbour search surfaces semantically similar content regardless of whether that content is current, superseded, or contradicted by later information. A fact that was true six months ago retrieves just as well as one that’s true today.

RAG has no temporal model. Without explicit timestamp indexing and recency weighting, RAG-as-memory cannot answer “what did the user say most recently about X?” or “has this preference changed over time?”

RAG retrieves; it doesn’t remember. Production agent memory requires a management layer on top of retrieval — a system that extracts structured facts, resolves conflicts, tracks provenance, and manages staleness.
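A toy example makes the similarity-versus-recency gap visible. The similarity scores are hard-coded stand-ins for a vector search, and the 30-day half-life is an assumption chosen for illustration; a real management layer would also need conflict resolution and provenance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    text: str
    similarity: float      # hard-coded stand-in for a vector search score
    recorded_at: datetime


def recency_weight(recorded_at: datetime, now: datetime,
                   half_life_days: float = 30.0) -> float:
    """Exponential decay: a memory loses half its weight every half-life."""
    age_days = (now - recorded_at).total_seconds() / 86_400
    return 0.5 ** (age_days / half_life_days)


now = datetime(2026, 6, 1)
memories = [
    Memory("Client prefers a playful tone", 0.91, now - timedelta(days=180)),
    Memory("Client now wants a formal tone", 0.88, now - timedelta(days=3)),
]

# Plain retrieval: rank by similarity alone. The stale preference wins.
by_similarity = max(memories, key=lambda m: m.similarity)

# Managed retrieval: weight similarity by recency. The current one wins.
by_managed = max(
    memories,
    key=lambda m: m.similarity * recency_weight(m.recorded_at, now),
)
```

Recency weighting is only one possible management primitive. It does not delete the superseded preference; it only stops it from outranking the current one.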

This is why the production-grade memory frameworks covered here (Mem0, MIRIX, LangGraph Memory, Anthropic’s memory tool) build management primitives on top of a retrieval or storage layer rather than replacing it. Retrieval is one layer of memory; it still needs to be managed.

The cost inflection point

There’s a specific threshold where the cost economics of persistent memory change. Research published on arXiv found that “at a context length of 100,000 tokens, the memory system becomes cheaper after approximately ten interaction turns.”³ Below that threshold, passing full conversation history to a long-context model is cost-competitive and often more accurate. Above it, structured memory systems become materially cheaper, and the gap widens as context grows.

The Mem0 architecture — an open-source memory framework benchmarked across the LoCoMo, LongMemEval, and BEAM evaluation suites — makes the arithmetic concrete.⁴ At the same conversation length, Mem0 consumes approximately 6,900 tokens per query versus 26,000 tokens for full-context approaches. That’s roughly 73% fewer tokens per query. Latency drops by approximately 91% at the 95th percentile.
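The token arithmetic is easy to reproduce. The two per-query figures come from the paragraph above; the monthly query volume and the per-token price are hypothetical values added only to show scale, not quoted rates.

```python
MEM0_TOKENS_PER_QUERY = 6_900           # figure quoted above
FULL_CONTEXT_TOKENS_PER_QUERY = 26_000  # figure quoted above

reduction = 1 - MEM0_TOKENS_PER_QUERY / FULL_CONTEXT_TOKENS_PER_QUERY
print(f"token reduction: {reduction:.1%}")  # prints "token reduction: 73.5%"

# Hypothetical workload and price, for scale only.
QUERIES_PER_MONTH = 50_000
PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # assumed USD rate, not a quoted price


def monthly_input_cost(tokens_per_query: int) -> float:
    """Input-token spend per month under the assumed volume and price."""
    total_tokens = tokens_per_query * QUERIES_PER_MONTH
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


full_cost = monthly_input_cost(FULL_CONTEXT_TOKENS_PER_QUERY)  # 3900.0
mem0_cost = monthly_input_cost(MEM0_TOKENS_PER_QUERY)          # 1035.0
```

Whatever the real price and volume, the ratio between the two costs stays the same, because it depends only on tokens per query.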

For a system expected to run for months across an engagement, the 100K-token threshold arrives quickly in cumulative conversation history. Without structured memory, the choice is either to pay the cost-and-latency penalty of full-context approaches, or to experience the agent dropping earlier context to manage cost. Both outcomes are worse than designing for persistent memory from the start.

There’s a secondary lever here: prompt caching. Major providers heavily discount cached tokens. For agents with stable system prompts and repeatable retrieval patterns, prompt caching materially changes the per-call economics. Storage costs for a full engagement’s episodic memory are measured in megabytes and dollars per year. Inference costs for the same information, passed as raw context on every call, compound much faster. Storage is not the cost problem; inference is.

A taxonomy that helps in practice

Not all memory serves the same function. The MIRIX research (arXiv:2507.07957) proposed a six-type taxonomy for multi-agent systems that maps cleanly to what production teams actually need to store.⁵

Core memory holds persistent agent identity and key user facts — the persona block and the human block. It compacts at capacity. Episodic memory stores time-stamped events and interactions: what happened, when, who was involved, and a structured summary. Semantic memory stores abstract facts and entity relationships with source attribution. Procedural memory stores goal-directed workflows and task sequences — the agent’s accumulated know-how. Resource memory stores references to external documents and files. Knowledge vault holds sensitive data with sensitivity tagging.

The taxonomy matters because different agent types need different memory types. A validator agent running gate checks needs procedural memory — past gate outcomes and known failure patterns — more than episodic memory. A creator agent needs semantic memory heavily: client voice corrections, approved-versus-rejected copy patterns, lexicon enforcement history. A coordinator agent needs strong in-session working memory to track multi-agent orchestration state, plus episodic memory for handoff continuity. Treating all of these as one undifferentiated store degrades retrieval quality for all of them.
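One way to make the taxonomy operational is to tag every record with its type and give each agent role a priority list. The sketch below assumes the six type names described above; the `MemoryRecord` fields, the `ROLE_PROFILE` mapping, and the sample records are hypothetical and are not part of MIRIX.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class MemoryType(Enum):
    CORE = "core"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"
    RESOURCE = "resource"
    KNOWLEDGE_VAULT = "knowledge_vault"


@dataclass
class MemoryRecord:
    type: MemoryType
    content: str
    recorded_at: datetime
    source: str


# Hypothetical mapping: memory types each role loads, in priority order.
ROLE_PROFILE = {
    "validator": [MemoryType.PROCEDURAL, MemoryType.EPISODIC],
    "creator": [MemoryType.SEMANTIC],
    "coordinator": [MemoryType.EPISODIC],
}


def memories_for(role: str, store: list) -> list:
    """Filter the store to the role's types, highest-priority type first."""
    priority = ROLE_PROFILE[role]
    relevant = [r for r in store if r.type in priority]
    return sorted(relevant, key=lambda r: priority.index(r.type))


store = [
    MemoryRecord(MemoryType.EPISODIC, "Handoff to creator agent completed",
                 datetime(2026, 5, 4), "orchestrator"),
    MemoryRecord(MemoryType.SEMANTIC, "Client rejects exclamation marks",
                 datetime(2026, 5, 3), "client_review"),
    MemoryRecord(MemoryType.PROCEDURAL, "Gate 3 fails when alt text is missing",
                 datetime(2026, 5, 2), "gate_log"),
]

validator_view = memories_for("validator", store)
creator_view = memories_for("creator", store)
```

Keeping the type on each record means retrieval can be tuned per type later without migrating the store.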

MIRIX, on its own evaluation, reported 85.4% accuracy on LoCoMo, outperforming the next strongest baseline in that test by 8 points, and a 35% improvement over RAG baselines on a multimodal task, with a 99.9% storage reduction versus RAG for that task.⁵ The multi-agent model (eight specialised agents managing memory retrieval and update in parallel) is the notable architectural contribution for teams running systems with more than a handful of agents.

What shipped in 2025–2026

The major vendors shipped first-generation persistent memory infrastructure in 2025. The architecture is still consolidating, but the production primitives now exist.

Anthropic — memory tool (beta). Claude Managed Agents now support a memory tool with type identifier memory_20250818.⁶ It’s a filesystem-based primitive: the agent reads from a designated memory directory at session start, writes back to it during the session, and the operator controls the actual storage layer — local files, database, cloud storage, or encrypted vault. Anthropic chose client-side storage deliberately, giving developers complete control over where and how data is persisted. This matters for teams with data sovereignty requirements. The tool also includes a research-preview feature called “Dreaming” that lets the agent review past sessions to find patterns and support self-improvement; updates can be automatic or require operator review before landing. That feature is explicitly experimental and should not be treated as production-ready behaviour.
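Because storage is client-side, the operator writes the code that executes the memory operations Claude requests. The sketch below shows what such a handler might look like against a local directory. Treat the details as assumptions to verify against Anthropic’s current reference: the `name` value in the tool declaration, the `/memories` path prefix, and the command and field names (`view`, `create`, `str_replace`, `path`, `file_text`, `old_str`, `new_str`). The `LocalMemoryBackend` class and its return strings are hypothetical.

```python
import tempfile
from pathlib import Path

# Tool declaration using the type identifier named above; "name" is assumed.
MEMORY_TOOL = {"type": "memory_20250818", "name": "memory"}


class LocalMemoryBackend:
    """Hypothetical operator-side store mapping /memories to a local dir."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, virtual_path: str) -> Path:
        if virtual_path != "/memories" and not virtual_path.startswith("/memories/"):
            raise ValueError("path must be inside /memories")
        relative = virtual_path[len("/memories"):].lstrip("/")
        target = (self.root / relative).resolve()
        # Reject traversal such as /memories/../secrets.
        if target != self.root and self.root not in target.parents:
            raise ValueError("path escapes the memory directory")
        return target

    def handle(self, tool_input: dict) -> str:
        """Execute one memory command and return text for the tool result."""
        command = tool_input["command"]
        target = self._resolve(tool_input["path"])
        if command == "view":
            if target.is_dir():
                return "\n".join(sorted(p.name for p in target.iterdir()))
            return target.read_text()
        if command == "create":
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(tool_input["file_text"])
            return f"created {tool_input['path']}"
        if command == "str_replace":
            text = target.read_text()
            if tool_input["old_str"] not in text:
                raise ValueError("old_str not found")
            target.write_text(
                text.replace(tool_input["old_str"], tool_input["new_str"], 1))
            return f"updated {tool_input['path']}"
        raise ValueError(f"unsupported command: {command}")


backend = LocalMemoryBackend(Path(tempfile.mkdtemp()))
backend.handle({"command": "create", "path": "/memories/client_voice.md",
                "file_text": "Tone: playful\n"})
backend.handle({"command": "str_replace", "path": "/memories/client_voice.md",
                "old_str": "playful", "new_str": "formal"})
listing = backend.handle({"command": "view", "path": "/memories"})
contents = backend.handle({"command": "view",
                           "path": "/memories/client_voice.md"})
```

The path check matters because the model chooses the paths. Swapping the local directory for a database or an encrypted vault changes only the backend class, which is the control the client-side design is meant to give.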

OpenAI — cross-session memory extended (April 2025). ChatGPT’s memory feature was extended to reference all past conversations, not only explicitly saved memories.⁷ This is consumer-facing product memory rather than API-level agent memory; the architecture is not directly accessible to developers building on OpenAI’s Responses API. But the consumer-facing shift signals where the product is heading, and it changes what users of ChatGPT-based workflows can expect by default.

Open source. Mem0 published its production-ready architecture at ECAI 2025, establishing the first broad benchmark comparison across ten memory approaches.⁴ Mem0’s core mechanism is structured fact extraction from conversations — not raw storage — with graph-based representation for relational queries and multi-signal retrieval fusing semantic similarity, keyword matching, and entity matching into one score. Benchmark results from the April 2026 token-efficient algorithm: 92.5 on LoCoMo, 94.4 on LongMemEval, 26% relative improvement over OpenAI’s built-in memory on the LLM-as-a-Judge metric. LangMem (LangChain’s memory layer) underperformed Mem0 by approximately 8 points on LoCoMo in the MIRIX evaluation.
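Fusing several retrieval signals into one score can be illustrated with a weighted sum. This is a sketch of the general idea only: it is not Mem0’s algorithm, and the weights, the `Fact` fields, the tokenizer, and the hard-coded semantic scores are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    entities: set
    semantic_score: float  # stand-in for an embedding similarity in [0, 1]


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query: str, fact: Fact) -> float:
    """Fraction of query tokens that also appear in the fact."""
    q = tokenize(query)
    return len(q & tokenize(fact.text)) / len(q) if q else 0.0


def entity_score(query_entities: set, fact: Fact) -> float:
    """Fraction of query entities that the fact is about."""
    if not query_entities:
        return 0.0
    return len(query_entities & fact.entities) / len(query_entities)


def fused_score(query: str, query_entities: set, fact: Fact,
                weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted sum of semantic, keyword, and entity signals (assumed weights)."""
    w_sem, w_kw, w_ent = weights
    return (w_sem * fact.semantic_score
            + w_kw * keyword_score(query, fact)
            + w_ent * entity_score(query_entities, fact))


query = "What tone does Acme want for newsletters?"
query_entities = {"Acme"}
facts = [
    Fact("Acme wants a formal tone for newsletters", {"Acme"}, 0.80),
    Fact("Globex wants a formal tone for newsletters", {"Globex"}, 0.82),
]

semantic_only = max(facts, key=lambda f: f.semantic_score)
fused = max(facts, key=lambda f: fused_score(query, query_entities, f))
```

Here the semantic score alone slightly prefers the wrong client, and the entity signal corrects it. That is the kind of failure a fused score is meant to catch.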

The remaining gap is worth naming plainly. BEAM benchmark results reveal a 24% performance decline from 1M to 10M tokens: temporal abstraction degrades at scale. No current open-source or commercial system fully resolves this. The Mem0 State of AI Agent Memory 2026 report catalogues six open production gaps: temporal abstraction at scale, cross-session state tracking, privacy architecture and consent models, identity resolution across sessions, memory staleness management, and application-level evaluation that translates benchmark scores to domain-specific performance.⁸

What this means for systems that run over months

For AI-operated workflows — systems that run a brand’s operations alongside a team, not just respond to queries — the memory architecture question is operational, not academic.

An agent that cannot remember what it flagged last quarter will repeat work. A content cadence agent that doesn’t retain client voice corrections will drift back toward generic output. A validator that has no record of past gate outcomes starts each engagement as if it were the first. The compounding value of a well-run system depends on memory working correctly.

Three design principles hold across the frameworks reviewed here. First, load selectively at session start — retrieve a minimal relevant subset, not the full memory store. Enough to orient the agent; not enough to overwhelm the context or trigger retrieval degradation. Second, write back selectively — decisions, corrections, preference updates, factual discoveries. Write what’s worth carrying forward; don’t log everything. Third, type the memory by function — separate episodic events from semantic facts from procedural workflows, and design retrieval differently for each. A procedural memory for a validator agent is architecturally different from an episodic memory for a creator agent. One undifferentiated store serves neither well.
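The first two principles can be sketched as a budgeted load and a filtered write-back. Everything here is illustrative: the `Entry` fields, the set of kinds worth keeping, the relevance scores (assumed to come from an upstream retrieval step), and the four-characters-per-token estimate, which is a crude heuristic rather than a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str         # e.g. "decision", "correction", "preference", "chatter"
    text: str
    relevance: float  # assumed to come from an upstream retrieval step


WORTH_KEEPING = {"decision", "correction", "preference", "fact"}


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def load_for_session(store: list, token_budget: int) -> list:
    """Principle 1: take the most relevant entries that fit the budget."""
    selected, used = [], 0
    for entry in sorted(store, key=lambda e: e.relevance, reverse=True):
        cost = estimate_tokens(entry.text)
        if used + cost <= token_budget:
            selected.append(entry)
            used += cost
    return selected


def write_back(session_entries: list, store: list) -> None:
    """Principle 2: persist only entries worth carrying forward."""
    store.extend(e for e in session_entries if e.kind in WORTH_KEEPING)


store = [
    Entry("correction", "Avoid the word 'synergy' in client copy", 0.9),
    Entry("decision", "Newsletter ships on Tuesdays", 0.7),
    Entry("fact", "Office plants were watered in March", 0.1),
]

loaded = load_for_session(store, token_budget=20)

write_back(
    [Entry("preference", "Client prefers short subject lines", 0.8),
     Entry("chatter", "Thanks, talk soon", 0.2)],
    store,
)
```

The low-relevance fact stays in the store but is not loaded, and the chatter is never written. Both filters are policy decisions that the team owns, not behaviour the model supplies.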

There’s also a governance dimension. Memory is evidence. What an agent remembered, retrieved, and acted on is operationally significant for client-facing engagements. Memory reads should be logged alongside the actions they influenced. The source of a persistent memory matters: a fact recorded from a client interview carries different reliability than one inferred from a single session’s output. And retention policy must be explicit for client data: when an engagement ends, client-specific memories should be removable without contaminating the shared knowledge base that accumulates across engagements.
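These governance requirements translate into three small mechanisms: a source field on every memory, a read log tied to actions, and a purge keyed on engagement. The sketch below is a hypothetical in-memory store; the class, field names, and sample data are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Memory:
    id: str
    text: str
    source: str                   # e.g. "client_interview" or "inferred"
    engagement_id: Optional[str]  # None marks shared, cross-engagement knowledge


@dataclass
class AuditedStore:
    memories: dict = field(default_factory=dict)
    read_log: list = field(default_factory=list)

    def add(self, memory: Memory) -> None:
        self.memories[memory.id] = memory

    def read(self, memory_id: str, action: str) -> Memory:
        """Return a memory and record which action the read influenced."""
        memory = self.memories[memory_id]
        self.read_log.append({
            "memory_id": memory_id,
            "source": memory.source,
            "action": action,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return memory

    def purge_engagement(self, engagement_id: str) -> int:
        """Remove client-specific memories; shared knowledge is untouched."""
        doomed = [m.id for m in self.memories.values()
                  if m.engagement_id == engagement_id]
        for memory_id in doomed:
            del self.memories[memory_id]
        return len(doomed)


audited = AuditedStore()
audited.add(Memory("m1", "Client bans stock photography",
                   "client_interview", "acme-2026"))
audited.add(Memory("m2", "Short subject lines lift open rates",
                   "inferred", None))

audited.read("m1", action="rejected hero image draft")
removed = audited.purge_engagement("acme-2026")
```

The read log survives the purge in this sketch. Whether it should is itself a retention-policy question to settle explicitly.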

2026 is the first year in which memory infrastructure is a deliberate vendor-architecture choice, not something a team has to build entirely from scratch. The choice still matters. The inflection point, the taxonomy, and the remaining gaps are all real. But the tools now exist. The work is in designing for them.

The sibling modules in this hub explore what memory makes possible: how knowledge compounds over time in Knowledge Flywheel, how agents coordinate in Orchestration, and how system costs scale in Costs.

The Knowledge Hub is re-checked as the research moves — each module carries a ‘what changed’ note and its last-reviewed date at the top. Follow the modules you care about; skip the ones you don’t.
