Weekly: the orchestration stack consolidates

The agentic memory reading path, 2 of 5. Here is a number that should make you suspicious.

On the benchmark that the MemGPT team built to prove their memory system worked, a plain, dumb baseline, just stuffing the entire conversation into the context window, scored 94.4%. The fancy memory system scored 94.8%. Four-tenths of a point, for all that machinery. That number is not an embarrassment. It is a clue. It tells you that the benchmark was too easy, that the real problem lives somewhere the benchmark wasn’t looking, and that to build memory that earns its keep, you have to be ruthless about what you are actually measuring. This episode is about the systems people deploy to give agents a memory, the three architectures on our reading path, and the industry that has grown up around them.

Welcome back to the agentic memory deep dive. This is episode 2, the memory stack. Last episode, we built the scaffolding: the CoALA vocabulary, the storage-to-experience arc, the evaluation anxiety.

Today we get our hands dirty with three real systems, in roughly the order our reading path presents them. Zep, a temporal knowledge graph from a commercial vendor. A-MEM, a research system built on an unlikely inspiration, a German note-taking method from the 20th century. And Mem0, a system built explicitly for production deployment at scale. Then we widen out to the vendor landscape and the build-versus-buy decision that every team building agents now has to make.

The through line for this episode is a tension you will feel in all three systems and in the market around them. On one axis, how much structure should memory have? Flat text on one end, rich knowledge graphs on the other. On the other axis, how much should you spend to maintain it?

Because every bit of structure you add, every entity you extract, every graph edge you resolve, costs a model call, adds latency, and creates a new way to be wrong. Let us watch three teams make that trade differently.

Start with Zep, from a paper by Preston Rasmussen and colleagues at the company of the same name, built around an open source engine they call Graphiti.

Zep’s bet is that the missing ingredient in agent memory is time, and the way you capture time is a temporally aware knowledge graph. The architecture has three tiers, and the structure is worth understanding because it is a clean realization of the CoALA split from last episode.

At the bottom, an episode subgraph, the raw input data, messages, text, JSON, stored losslessly. This is the immutable record, the ground truth. On top of that, a semantic entity subgraph, the entities and the relationships between them, extracted from the episodes by a language model.

And at the top, a community subgraph, clusters of strongly connected entities, each with a high-level summary, giving the system a global view of the domain. Raw episodes at the bottom, extracted semantic facts in the middle, summarized communities at the top. The authors explicitly note that this dual storage, raw episodic data alongside derived semantic structure, mirrors the psychological distinction between episodic and semantic memory. CoALA in production.

But the real innovation, the thing Zep is actually selling, is what they call bitemporal modeling. Every fact in the graph carries two timelines. Timeline T is the chronological order of events in the world: when did this thing actually become true? Timeline T prime is the transactional order: when did the system learn about it? Keep both, and you can do something vector stores famously cannot. You can handle a fact that changes. When new information contradicts an existing edge, Zep uses a model to detect the conflict and invalidates the old edge by stamping it with an expiry time, rather than deleting it or letting both versions sit there equally valid. The old fact is still there, marked as having been true from this date to that date. The agent stops being confused about what is currently true.

The construction side has details worth knowing, because they’re where the cost and the failure modes live. When Zep ingests a message, it extracts entities, then runs an entity resolution step, embedding each name and doing both a similarity search and a full-text search against existing entities to decide whether this is a new entity or a duplicate of one already in the graph. It uses a reflection technique borrowed from the Reflexion work to cut hallucinations during extraction. It builds communities not with the heavy Leiden algorithm but with label propagation, specifically because label propagation can be updated incrementally as new data arrives instead of recomputing the whole community structure every time. Nearly every one of those steps involves a language model call, which is the hidden tax of the graph approach, and Zep’s engineering is largely about paying that tax as rarely as possible.

The retrieval side is a useful template, too. Zep runs three search methods in parallel: cosine semantic similarity for meaning, Okapi BM25 full-text search for exact words, and breadth-first graph search for contextual neighbors. Each targets a different kind of similarity: semantic, lexical, and structural, meaning nodes that sit close together in the graph. Then it re-ranks, and the menu of re-rankers is itself instructive. Reciprocal rank fusion, maximal marginal relevance, a graph distance re-ranker that favors facts near a chosen node, an episode mentions re-ranker that boosts frequently referenced facts, and at the top of the cost curve, a cross-encoder that scores every candidate against the query with full attention. This multi-channel, then re-rank pattern is, as we will keep seeing, the production default.

And the deeper design point. Zep stores raw episodes and derived semantic facts side by side, which the authors explicitly say mirrors how human memory keeps distinct events and general associations as separate but linked systems. Keep the raw, derive the structure, link them. That phrase will be the moral of the whole series.

The results.

On the Deep Memory Retrieval benchmark, Zep posts 94.8% against MemGPT’s 93.4%. Marginal. And the authors are refreshingly honest that the benchmark is the problem. Each conversation is only 60 messages, easily fitting in a modern context window, so a full context baseline nearly ties it. The real story is the harder benchmark, LongMemEval, with conversations averaging 115,000 tokens. There, Zep improves accuracy by up to 18.5% while cutting response latency by around 90%, because instead of feeding 115,000 tokens to the model every turn, it retrieves about 1,600. That is the actual pitch: not more accurate on a toy task, but comparable or better accuracy at a fraction of the tokens and latency on a realistic one.

The second system takes a completely different inspiration. A-MEM, by Wujiang Xu and colleagues, builds its memory on the Zettelkasten method, the slip-box note-taking system associated with the sociologist Niklas Luhmann, who used it to write an absurd number of books. The core idea of a Zettelkasten is that the value is not in the individual notes. It is in the links between them, and in the way the network reorganizes itself as it grows. A-MEM applies that to agent memory. When a new memory is added, the system does not just file it. It generates a structured note with contextual descriptions, keywords, and tags. Then it analyzes the existing memories, finds ones with meaningful similarity, and establishes links.

So the memory store is an interconnected network, not a flat list, and not a rigid, predefined schema. The part that makes it genuinely agentic, and the part worth dwelling on, is memory evolution. When a new memory comes in and links to older ones, it can trigger updates to those older memories, revising their contextual descriptions and attributes in light of the new information.

The network refines its own understanding over time. Picture it. You tell the agent in March you’re learning guitar, and in May you mention you’ve joined a band.

A flat store just appends the band fact. A-MEM, in principle, goes back and enriches the guitar memory with the new context, the two notes now linked and mutually informed.
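To make the guitar-and-band picture concrete, here is a minimal sketch of the add, link, and evolve loop. It is an illustration under loud assumptions, not A-MEM's implementation: the `Note` and `NoteStore` names are hypothetical, keyword overlap stands in for the embedding similarity and model judgment a real system would use, and "evolution" is reduced to appending the new information to the older note's context.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    note_id: int
    content: str
    keywords: set[str]
    context: str = ""                             # contextual description, revised over time
    links: set[int] = field(default_factory=set)  # ids of related notes


def jaccard(a: set[str], b: set[str]) -> float:
    """Keyword overlap; an assumed stand-in for embedding similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class NoteStore:
    def __init__(self, link_threshold: float = 0.2):
        self.notes: dict[int, Note] = {}
        self.link_threshold = link_threshold

    def add(self, content: str, keywords: set[str]) -> Note:
        new = Note(note_id=len(self.notes), content=content, keywords=keywords)
        for old in self.notes.values():
            if jaccard(new.keywords, old.keywords) >= self.link_threshold:
                # Link in both directions.
                new.links.add(old.note_id)
                old.links.add(new.note_id)
                # "Evolution": revise the older note in light of the new one.
                # A real system would ask a model to rewrite the description.
                old.context = f"{old.context} Later: {content}".strip()
        self.notes[new.note_id] = new
        return new


store = NoteStore()
guitar = store.add("User is learning guitar", {"music", "guitar", "hobby"})
band = store.add("User joined a band", {"music", "band", "guitar"})
print(guitar.links, "|", guitar.context)
```

Notice that the March note changed after the fact. That is the appeal, and it is also the risk: the same write path that enriches an old note can damage it.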

This is the storage-to-experience arc from last episode, made concrete. A-MEM is reaching for the reflection and experience stages, where memory is not a static archive but something that reorganizes itself over time. The authors tested across six foundation models and reported consistent improvement over prior state-of-the-art memory systems, and notably the gains held across both small and large models, suggesting the benefit comes from the organization scheme itself rather than from a single capable model carrying it.

There is a cost to the freedom, though, and it’s the mirror image of Zep’s. Zep’s structure is rigid, but predictable.

A-MEM’s structure is flexible, but emergent, which means its behavior is harder to audit, and its self-rewriting carries exactly the “useful memories become faulty” risk we’ll keep circling. Every time the agent revises an old note, it can also corrupt it. Flexibility and trustworthiness are in tension, and A-MEM sits firmly on the flexibility side.

Now, hold A-MEM next to Zep, and you can see the philosophical split in the whole field. Zep imposes structure: a defined three-tier graph, explicit entity and edge types, a formal bitemporal model. A-MEM grows structure: emergent links, self-organizing notes, evolution driven by the agent rather than a fixed schema. Both are knowledge graphs, in some loose sense. They are almost opposite design philosophies. And here is the dissent worth flagging, the one we’ll return to in episode 3.

There is a growing argument in the field that the whole industry took a wrong turn by converging on entity-relationship graphs and atomic facts at all. The claim is that extracting clean little facts from messy conversation is lossy, that it adds a hallucination-prone model step, and that some agents would be better served keeping the raw narrative.

A-MEM’s “keep evolving the notes” and the contrarian “just keep the raw trace” are both reactions to the same worry, that aggressive structure throws away something you needed.

The third system is the most explicitly commercial, and that is the point.

Mem0, by Prateek Chhikara and colleagues, puts the words production-ready and scalable right in the title. Its pipeline is the one most teams will recognize: dynamically extract salient information from the ongoing conversation, consolidate it, and retrieve it on demand. There is a base version and a graph-enhanced variant that adds relational structure. What makes Mem0 worth studying is not a novel data structure. It is the relentless focus on the production metrics that research papers usually ignore. They evaluate on the LoCoMo benchmark, and the headline numbers are about cost as much as accuracy. Mem0 reports a 26% relative improvement over OpenAI’s memory on a language-model-as-judge metric. But then, a 91% lower p95 latency, and more than 90% token cost savings versus the full context approach.

Sit with why those numbers are the real product. At 100 users, you can afford to stuff the whole history into context. At 100,000 users, each paying you nothing or close to it, a 90% token reduction is the difference between a viable business and a bonfire of API credits. Mem0’s contribution is to take memory seriously as a systems problem with a cost model, not just an accuracy problem with a leaderboard. The graph variant, notably, adds only about 2% overall accuracy over the base, which is itself an honest data point about how much the heavy structure actually buys you on this benchmark.

So across our three systems, you have three answers to the structure versus cost question.

Zep, maximal structure, justified by temporal reasoning. A-MEM, emergent, self-evolving structure, justified by adaptability.

Mem0, lean structure, justified by cost at scale. None is the universal right answer. They are points on a trade-off curve, and which one fits depends on whether your problem is dominated by changing facts, by open-ended learning, or by the bill.

Before the market, let’s get systematic, because there’s a field companion to all this research, a survey built from engineering write-ups and product launches rather than papers, and it organizes the whole space by design decision. Eleven of them. You don’t need all eleven, but a handful are the ones every team actually trips over, and they map cleanly onto the three systems we just covered.

Retrieval and ranking. The decision: one vector index, or multiple parallel channels fused together? The emerging production answer, the one Zep implements and Cloudflare shipped, is multi-channel with reciprocal rank fusion, not a single cosine lookup. And the sharp warning underneath it: semantic closeness is not relevance. Cosine similarity will cheerfully hand you something near your query in embedding space that is stale, or about the wrong user, or topically adjacent but useless, while missing the fact that actually mattered because it wasn’t phrased the way the query was.
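Reciprocal rank fusion is simple enough to sketch in a few lines. Each channel returns its own ranked list, and a document's fused score is the sum of 1 / (k + rank) across the lists it appears in. The channel outputs and memory ids below are made up for illustration, and k = 60 is the commonly used default for this formula, not a value taken from any of the systems discussed here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of three retrieval channels for one query.
semantic = ["m3", "m1", "m7"]  # cosine similarity over embeddings
lexical = ["m1", "m9", "m3"]   # BM25 full-text search
graph = ["m1", "m4"]           # breadth-first neighbors in the graph

fused = reciprocal_rank_fusion([semantic, lexical, graph])
print([doc_id for doc_id, _ in fused][:2])  # ['m1', 'm3']
```

The fusion only looks at ranks, never at raw scores, which is why it can combine cosine similarities, BM25 scores, and graph distances without normalizing any of them. A memory that several channels agree on, like `m1` here, rises to the top even if no single channel ranked it first.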

Consolidation and distillation. The decision: do you run a model on every turn to extract memory, or batch it lazily? Eager per-turn extraction is the single biggest cost driver in these systems. Lazy consolidation cuts the bill, but adds staleness. And the hard-won rule, reported independently by Slack’s engineers and by a research paper bluntly titled Useful Memories Become Faulty: when a model continuously rewrites its own memory, the memory degrades. Drift, context collapse, detail sanded off.

So keep the raw trace as ground truth, and treat the distilled version as a fallible, rebuildable layer. Exactly the lesson Zep’s dual storage encodes, and exactly the lesson that will detonate in episode 3, when we find raw trajectories beating distilled skills.
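Here is one way that rule could look in code, as a sketch rather than anyone's shipped design. The `TraceBackedMemory` class and the `keep_preferences` distiller are hypothetical names, and the toy distiller stands in for what would be a model call. The two things to notice are that the derived layer is always recomputed from the raw trace, never from its own previous output, and that batching makes it temporarily stale.

```python
from typing import Callable


class TraceBackedMemory:
    """Append-only raw trace plus a derived layer that can always be rebuilt."""

    def __init__(self, distill: Callable[[list[str]], list[str]], batch_size: int = 3):
        self._trace: list[str] = []    # ground truth; never edited in place
        self._derived: list[str] = []  # fallible, disposable layer
        self._distilled_upto = 0       # how much of the trace the derived layer covers
        self._distill = distill
        self._batch_size = batch_size

    def append(self, message: str) -> None:
        self._trace.append(message)
        # Lazy consolidation: only distill once a full batch has accumulated.
        if len(self._trace) - self._distilled_upto >= self._batch_size:
            self.rebuild()

    def rebuild(self) -> None:
        """Recompute the derived layer from the raw trace, not from its prior output."""
        self._derived = self._distill(list(self._trace))
        self._distilled_upto = len(self._trace)

    @property
    def derived(self) -> list[str]:
        return list(self._derived)

    @property
    def trace(self) -> tuple[str, ...]:
        return tuple(self._trace)


def keep_preferences(trace: list[str]) -> list[str]:
    """Toy distiller; a real one would be a model call."""
    return [m for m in trace if "prefer" in m.lower()]


memory = TraceBackedMemory(distill=keep_preferences, batch_size=3)
for msg in ["Hi there", "I prefer window seats", "Book me a flight"]:
    memory.append(msg)
print(memory.derived)  # ['I prefer window seats']
```

If the distiller turns out to be wrong, you swap it and call `rebuild()`. Nothing is lost, because the trace was never rewritten.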

Temporality. The decision: when a fact changes, do you version and supersede it, or silently overwrite? This is the number one field complaint about plain vector stores. They have no notion of supersession, so the old and new facts sit there equally retrievable, and the agent gets confused about what’s true now. Bitemporal modeling, Zep’s whole identity, is the answer builders switch backends to get.

Substrate. The decision: do you even need a vector database? The contrarian, deliberately boring answer from working engineers is often no. SQLite with full-text search over a transcript store goes remarkably far, and Git plus object storage as the memory layer gives you audit friendliness for free. Justify the heavy store before you reach for it.

And working memory and context. The reminder that long-term memory is just one of roughly seven things competing for the context window every step. So you cannot design memory in isolation from the broader context budget.

Those five, retrieval, consolidation, temporality, substrate, and context, are the decisions that separate a memory system that survives contact with production from one that quietly rots.
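Of those five, the temporality decision is the easiest to show in code. Below is a minimal sketch of supersede-instead-of-overwrite with two timelines per fact. The class and field names are illustrative, not Graphiti's API, and the conflict check is a deliberate simplification: it assumes a single-valued predicate and treats "same subject and predicate, different object" as a contradiction, where Zep is described as using a model to detect conflicts.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FactEdge:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                     # timeline T: when it became true in the world
    created_at: datetime                   # timeline T': when the system learned it
    invalid_at: Optional[datetime] = None  # timeline T: when it stopped being true
    expired_at: Optional[datetime] = None  # timeline T': when the system superseded it


class BitemporalStore:
    def __init__(self) -> None:
        self.edges: list[FactEdge] = []

    def assert_fact(self, subject: str, predicate: str, obj: str,
                    valid_at: datetime, learned_at: datetime) -> None:
        for edge in self.edges:
            # Simplified conflict rule (assumption): single-valued predicate.
            if (edge.subject == subject and edge.predicate == predicate
                    and edge.obj != obj and edge.expired_at is None):
                edge.invalid_at = valid_at    # stamp, do not delete
                edge.expired_at = learned_at
        self.edges.append(FactEdge(subject, predicate, obj, valid_at, learned_at))

    def current(self, subject: str, predicate: str) -> list[str]:
        return [e.obj for e in self.edges
                if e.subject == subject and e.predicate == predicate
                and e.expired_at is None]

    def as_of(self, subject: str, predicate: str, when: datetime) -> list[str]:
        """What was true in the world at `when`, according to what we know now."""
        return [e.obj for e in self.edges
                if e.subject == subject and e.predicate == predicate
                and e.valid_at <= when
                and (e.invalid_at is None or when < e.invalid_at)]


facts = BitemporalStore()
facts.assert_fact("alice", "works_at", "Acme",
                  valid_at=datetime(2023, 1, 1), learned_at=datetime(2023, 1, 5))
facts.assert_fact("alice", "works_at", "Globex",
                  valid_at=datetime(2024, 6, 1), learned_at=datetime(2024, 6, 10))
print(facts.current("alice", "works_at"))                       # ['Globex']
print(facts.as_of("alice", "works_at", datetime(2023, 7, 1)))   # ['Acme']
```

The old edge is still in the store, so the agent can answer both "where does Alice work" and "where did Alice work last summer". This sketch does not handle facts that arrive out of order, which is one of the cases a real implementation has to think hard about.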

Step out of the papers and into the market, because this is where state of industry actually lives. The managed memory market formed astonishingly fast across 2025 and 2026, almost in parallel with the surveys. The field companion to this research, a survey built from engineering write-ups and product launches, lays out the landscape: Mem0; Letta, which grew out of the MemGPT work; Cognee; Zep, with Graphiti; MemoryOS; and more arriving constantly. Cloudflare shipped agent memory with exactly the multi-channel plus reciprocal rank fusion retrieval pattern we saw in Zep. And shared team memory profiles became a headline feature.

And here is the catch that should shape any build-versus-buy decision. Every one of these frameworks ships its own bespoke storage and its own vocabulary. There is no shared wire format, which means migrating your memory from one framework to another today essentially means rebuilding from scratch. You are not choosing a library. You are choosing a representation, a retrieval strategy, and a set of governance choices, and you are marrying them. The principle the practitioners keep arriving at: memory quality equals schema quality, and if you can’t see or move the schema, you can’t really own it.

There is also a quieter contrarian movement in the industry that deserves airtime, because the vendor pitch can make it sound like you must adopt a heavy memory service.
The counter position voiced by working engineers is that many agents do not need a vector database at all. SQLite with full text search over a stored transcript goes remarkably far. Git plus object storage as the memory layer is a real pattern. Keep the immutable transcripts cheaply, derive memory on demand, and get audit friendliness for free.
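For a sense of how little code the boring option takes, here is a sketch of a transcript store using Python's built-in `sqlite3` module and SQLite's FTS5 full-text extension. The table layout and the `open_store`, `remember`, and `recall` helper names are assumptions for illustration, and the sketch assumes your SQLite build has FTS5 compiled in, which is common but not guaranteed.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # FTS5 virtual table; session_id and role are stored but not searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS transcript "
        "USING fts5(session_id UNINDEXED, role UNINDEXED, content)"
    )
    return conn


def remember(conn: sqlite3.Connection, session_id: str, role: str, content: str) -> None:
    """Append one message to the transcript; nothing is ever rewritten."""
    conn.execute(
        "INSERT INTO transcript (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[tuple[str, str]]:
    """Full-text search; bm25() gives lower scores to better matches."""
    rows = conn.execute(
        "SELECT role, content FROM transcript "
        "WHERE transcript MATCH ? ORDER BY bm25(transcript) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


conn = open_store()
remember(conn, "s1", "user", "I am learning guitar this spring")
remember(conn, "s1", "assistant", "Great, do you want practice tips?")
remember(conn, "s1", "user", "I joined a band last week")
print(recall(conn, "guitar"))
```

One caveat: the string passed to `MATCH` is parsed as FTS5 query syntax, so raw user text with quotes or operators needs escaping before it goes in. And this is lexical search only, so it will miss paraphrases, which is exactly the trade you are making for the simplicity.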

And a related warning: some frameworks advertised as local still phone home to a cloud model for the extraction step, so local and private is a claim to verify, not assume, especially if privacy was the whole reason you reached for it.

There is one more production lesson the field reports keep repeating, and it is blunt. “Just add a vector database” breaks once the agent runs for a while. The store accumulates, retrieval gets polluted, the agent starts repeating mistakes and drifting. Which is a perfect setup for the rest of the series, because every failure on that list, drift, staleness, repeated mistakes, is a memory problem that the three systems we covered are each trying, in their own way, to solve.

So how should you read the memory stack, having opened up three systems and the market around them? First, the structure question is the load-bearing one, and it has no default answer. Graph, vector, atomic facts, evolving notes, raw transcript. Each buys you something and costs you something.

Zep’s graph buys temporal reasoning at the cost of an extraction step. Mem0’s leaner approach buys cost savings at the cost of relational richness.

Ask what your actual failure mode is before you pick.

Second, time is the feature builders most consistently underrate and most consistently switch backends to get. Zep made bitemporal modeling its whole identity for a reason. If your domain has facts that change, and almost every real domain does, a memory with no notion of supersession will quietly poison itself.

Third, measure the system, not the demo. Mem0’s contribution is mostly that it reported p95 latency and token cost, the numbers that decide whether you can actually ship. A memory system that looks great on a five-message demo and falls over at 100,000 users has told you nothing useful.

And fourth, plan for the exit before you enter. No shared wire format means the framework you pick today is one you may be stuck rebuilding out of later. Keep your raw transcripts in something portable and boring, so that whatever clever memory layer you put on top is a derived, rebuildable thing rather than your only copy of the truth.

That last point, keep the raw, treat the clever layer as fallible, is going to come back with a vengeance in episode 3, because it turns out the same lesson governs not just facts, but skills.

Three systems, three philosophies. Zep bets on time and structure. A-MEM bets on emergent, self-organizing networks. Mem0 bets on lean memory and the production cost model. And the market around them is fast, fragmented, and locked in by the lack of any shared format.

Next episode, we move from remembering facts to remembering how to do things. Procedural memory and skill libraries, from Voyager building a library of executable skills in Minecraft, to Agent Workflow Memory inducing reusable routines for the web, to a brand new benchmark that delivers the most uncomfortable finding in the field: that raw experience often beats the polished skill you distilled from it. That one reshapes how you should think about every coding agent you use. See you there.


Highlights

Segment 1 — Typed contracts win

Segment 2 — Coding-agent benchmarks grow up

Segment 3 — Orchestration as a pipeline

Segment 4 — Retrieval evals under fire

Transcript

In this issue