How should agents remember across long horizons?

An agent with no memory re-derives the world every turn; an agent that remembers the wrong things compounds its own mistakes. Most of the open work sits between those two failures: what to store, when to consolidate, and what to let the system forget. I mapped that literature into an explorer of 108 papers across nine themes and a five-part podcast, so the ideas stay learnable rather than only citable. The two papers in the reading path below are recent threads I'm still pulling on.

agent-memory · agents · retrieval

The work

Agentic Memory Systems · Learning
Agent Memory: Design Considerations · Learning
Enterprise Multi-Agent Reliability · Learning
mem · Project
Foundations · Learning

Reading path

A generated path through 2 papers, assembled using my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).

CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection ↗

2026

Memory as built from the agent's own experience, via contrastive reflection.

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents ↗

2026

Memory as a graph of tools, optimized jointly with the policy.

Open literature

Voyager: An Open-Ended Embodied Agent with Large Language Models ↗ Wang, Guanzhi et al. · 2023 · 1354 cites
Reflexion: Language Agents with Verbal Reinforcement Learning ↗ Shinn, Noah et al. · 2023 · 1062 cites
MemGPT: Towards LLMs as Operating Systems ↗ Packer, Charles et al. · 2023 · 501 cites
A-MEM: Agentic Memory for LLM Agents ↗ Xu, Wujiang et al. · 2025 · 407 cites
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory ↗ Chhikara, Prateek et al. · 2025 · 405 cites
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection ↗ Greshake, Kai et al. · 2023 · 305 cites

Pulled from my curated ADS library.

From my Agentic Memory Systems

108 papers · 9 themes

Open explorer ↗ On the site

Procedural & Skills · Reflection & Experience · Benchmarks · Eval Methodology · Synthetic Data · Architectures · Security & Governance · Applications & Personalization · Forgetting & Consolidation

Unified multi-type white-box harness

Nothing spans semantic + episodic + procedural with stage-attributed diagnostics, explicit scoring targets (TIAP), fixed-answerer controls (EngramaBench), cross-system eval (GRAVITY) and CIs. Flagship open-source contribution.

Canonical procedural-memory benchmark

SkillEvolBench/SEA-Eval are days-old & unconsolidated; adopt the freeze-then-deploy arc + No-Skill / Raw-Trajectory attribution controls ('Harness Updating ≠ Benefit').

Forgetting / obsolescence metric suite

Thinnest area: STALE (55.2% ceiling) + AMC continual-RL metrics are the only hard numbers; standardize a retention-curve + obsolescence-precision/recall protocol for textual memory.

Synthetic-data realism metric

OmniBehavior/REALTALK prove the gap; build a fidelity score + controllable conflict/distractor injection (MemConflict recipe) + QDC auditing.

Shared memory-security harness

RSR@k / ASR / Benign + over-refusal + cross-user leakage + provenance-violation, covering episodic/procedural stores (today's attacks only hit semantic).
Unbenchmarked production axes

Multi-user isolation, proactive memory use, multimodal long-horizon recall.
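
To make the forgetting / obsolescence gap above more concrete, here is a minimal sketch of what a retention curve and an obsolescence precision/recall pair could look like for a textual memory store. Everything in it is an assumption for illustration: the `MemoryItem` fields, the function names, and the age-bucketing scheme are hypothetical, and none of it reproduces the STALE or AMC metric definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical record for one stored textual memory."""
    item_id: str
    written_at: int   # step at which the item was stored
    obsolete: bool    # ground truth: a later fact has superseded it


def obsolescence_precision_recall(items, flagged_ids):
    """Score the system's 'obsolete' flags against ground truth."""
    flagged = set(flagged_ids)
    truly_obsolete = {m.item_id for m in items if m.obsolete}
    true_pos = len(flagged & truly_obsolete)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(truly_obsolete) if truly_obsolete else 0.0
    return precision, recall


def retention_curve(items, recalled_ids, now, bucket_size):
    """Fraction of still-valid items recalled, grouped by age bucket."""
    recalled = set(recalled_ids)
    totals, hits = {}, {}
    for m in items:
        if m.obsolete:
            continue  # retention is only measured on items that should survive
        bucket = (now - m.written_at) // bucket_size
        totals[bucket] = totals.get(bucket, 0) + 1
        hits[bucket] = hits.get(bucket, 0) + (1 if m.item_id in recalled else 0)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Toy data, invented for the example.
items = [
    MemoryItem("a", written_at=0, obsolete=True),
    MemoryItem("b", written_at=10, obsolete=False),
    MemoryItem("c", written_at=90, obsolete=False),
    MemoryItem("d", written_at=95, obsolete=True),
]
precision, recall = obsolescence_precision_recall(items, flagged_ids=["a", "b"])
curve = retention_curve(items, recalled_ids=["c"], now=100, bucket_size=50)
```

In this toy run the system flags one truly obsolete item and one valid item, so precision and recall are both 0.5, and the curve shows the recent valid item recalled while the older one is lost. Keeping the two measures separate is deliberate: a store can look good on retention simply by never forgetting anything, and the obsolescence scores are what expose that.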
