Open thread
How should agents remember across long horizons?
An agent with no memory re-derives the world every turn; an agent that remembers the wrong things compounds its own mistakes instead. Most of the open work sits between those two failures: what to store, when to consolidate, and what to let the system forget. I mapped that literature into an explorer of 108 papers across nine themes and a five-part podcast, so the ideas stay learnable rather than only citable. The papers below are two recent threads I'm still pulling on.
The work
- Agentic Memory Systems (Learning)
- Agent Memory: Design Considerations (Learning)
- Enterprise Multi-Agent Reliability (Learning)
- mem (Project)
- Foundations (Learning)
Reading path
A generated path through two papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).
- CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection ↗
  Memory built from the agent's own experience, via contrastive reflection.
- SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents ↗
  Memory as a graph of tools, optimized jointly with the policy.
Open literature
From my ADS library
- Voyager: An Open-Ended Embodied Agent with Large Language Models ↗
- Reflexion: Language Agents with Verbal Reinforcement Learning ↗
- MemGPT: Towards LLMs as Operating Systems ↗
- A-MEM: Agentic Memory for LLM Agents ↗
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory ↗
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection ↗
Pulled from my curated ADS library.
From my Agentic Memory Systems
- Unified multi-type white-box harness
  No existing harness spans semantic, episodic, and procedural memory with stage-attributed diagnostics, explicit scoring targets (TIAP), fixed-answerer controls (EngramaBench), cross-system evaluation (GRAVITY), and confidence intervals. Flagship open-source contribution.
- Canonical procedural-memory benchmark
  SkillEvolBench and SEA-Eval are only days old and not yet consolidated; adopt the freeze-then-deploy arc plus No-Skill and Raw-Trajectory attribution controls ('Harness Updating ≠ Benefit').
- Forgetting / obsolescence metric suite
  Thinnest area: STALE (55.2% ceiling) and the AMC continual-RL metrics are the only hard numbers. Standardize a retention-curve and obsolescence-precision/recall protocol for textual memory.
- Synthetic-data realism metric
  OmniBehavior and REALTALK show the realism gap; build a fidelity score, controllable conflict/distractor injection (the MemConflict recipe), and QDC auditing.
- Shared memory-security harness
  RSR@k, ASR, and Benign metrics, plus over-refusal, cross-user leakage, and provenance violations, covering episodic and procedural stores (today's attacks only hit semantic stores).
- Unbenchmarked production axes
  Multi-user isolation, proactive memory use, and multimodal long-horizon recall.
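As a rough illustration of the retention-curve and obsolescence-precision/recall idea in the forgetting item above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Probe` schema, the bucketing by turns, and the function names are hypothetical and are not taken from STALE, AMC, or any other benchmark named here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One recall probe against a memory system (hypothetical schema)."""
    age_turns: int   # turns elapsed since the probed fact was written
    correct: bool    # whether the system answered the probe correctly


def retention_curve(probes, bucket_size=10):
    """Return {bucket_start: accuracy}, with buckets ordered by fact age."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in probes:
        bucket = (p.age_turns // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(p.correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


def obsolescence_precision_recall(flagged, truly_obsolete):
    """Precision and recall of the memory ids the system marked obsolete."""
    flagged, truly_obsolete = set(flagged), set(truly_obsolete)
    true_pos = len(flagged & truly_obsolete)
    # Empty denominators are reported as 0.0 (a convention chosen for this sketch).
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(truly_obsolete) if truly_obsolete else 0.0
    return precision, recall


if __name__ == "__main__":
    probes = [Probe(3, True), Probe(7, False), Probe(12, True), Probe(25, False)]
    print(retention_curve(probes))
    print(obsolescence_precision_recall({"m1", "m2", "m3"}, {"m2", "m3", "m4", "m5"}))
```

Keeping the two measures separate is deliberate in this sketch: a system can retain old facts well while still failing to retire the ones that have been superseded, and a single score would hide that.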