Open thread


How should agents remember across long horizons?

An agent with no memory re-derives the world every turn; an agent that remembers the wrong things compounds its own mistakes instead. Most of the open work sits between those two failures: what to store, when to consolidate, and what to let the system forget. I mapped that literature into an explorer of 108 papers across nine themes and a five-part podcast, so the ideas stay learnable rather than only citable. The papers below are two recent threads I'm still pulling on.

agent-memory · agents · retrieval

The work

Reading path

A generated path through 2 papers, assembled with my SciX literature tools (semantic embeddings, citation graphs, and reading-order signals).

  1. CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection

    2026

    Memory built from the agent's own experience, via contrastive reflection.

  2. SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

    2026

    Memory as a graph of tools, optimized jointly with the policy.

Open literature

From my ADS library

Agent Memory · 35 papers

Pulled from my curated ADS library.

From my Agentic Memory Systems explorer

108 papers · 9 themes
Procedural & Skills · Reflection & Experience · Benchmarks · Eval Methodology · Synthetic Data · Architectures · Security & Governance · Applications & Personalization · Forgetting & Consolidation
  • Unified multi-type white-box harness

    No existing harness spans semantic, episodic, and procedural memory with stage-attributed diagnostics, explicit scoring targets (TIAP), fixed-answerer controls (EngramaBench), cross-system evaluation (GRAVITY), and confidence intervals. A candidate flagship open-source contribution.

  • Canonical procedural-memory benchmark

    SkillEvolBench and SEA-Eval are only days old and not yet consolidated; adopt the freeze-then-deploy arc plus No-Skill and Raw-Trajectory attribution controls ('Harness Updating ≠ Benefit').

  • Forgetting / obsolescence metric suite

    The thinnest area: STALE (55.2% ceiling) and the AMC continual-RL metrics are the only hard numbers. Standardize a protocol for textual memory that pairs a retention curve with obsolescence precision and recall.

  • Synthetic-data realism metric

    OmniBehavior and REALTALK demonstrate the realism gap; build a fidelity score, controllable conflict and distractor injection (the MemConflict recipe), and QDC auditing.

  • Shared memory-security harness

    Report RSR@k, ASR, and Benign alongside over-refusal, cross-user leakage, and provenance-violation rates, and cover episodic and procedural stores (today's attacks only hit semantic stores).

  • Unbenchmarked production axes

    Multi-user isolation, proactive memory use, multimodal long-horizon recall.
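To make the forgetting gap above concrete, here is a minimal sketch of what a retention curve and an obsolescence precision/recall score could look like for textual memory. Everything in it is an assumption for illustration: the `MemoryProbe` schema, its field names, and both function names are hypothetical and not taken from STALE, AMC, or any other cited benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryProbe:
    """One probed memory item (hypothetical schema, for illustration only)."""
    age: int            # turns or sessions since the item was written
    is_obsolete: bool   # ground truth: a later event superseded this item
    was_dropped: bool   # system decision: item was forgotten or marked stale
    was_recalled: bool  # system answered correctly from this item when probed


def obsolescence_precision_recall(probes):
    """Score the system's forget decisions against ground-truth obsolescence."""
    dropped = [p for p in probes if p.was_dropped]
    obsolete = [p for p in probes if p.is_obsolete]
    true_pos = sum(1 for p in dropped if p.is_obsolete)
    precision = true_pos / len(dropped) if dropped else 0.0
    recall = true_pos / len(obsolete) if obsolete else 0.0
    return precision, recall


def retention_curve(probes):
    """Recall rate of still-valid items, grouped by item age."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in probes:
        if p.is_obsolete:
            continue  # only still-valid items count toward retention
        totals[p.age] += 1
        hits[p.age] += int(p.was_recalled)
    return {age: hits[age] / totals[age] for age in sorted(totals)}


# Toy data, invented for this example.
probes = [
    MemoryProbe(age=1, is_obsolete=False, was_dropped=False, was_recalled=True),
    MemoryProbe(age=1, is_obsolete=True, was_dropped=True, was_recalled=False),
    MemoryProbe(age=5, is_obsolete=False, was_dropped=True, was_recalled=False),
    MemoryProbe(age=5, is_obsolete=False, was_dropped=False, was_recalled=True),
    MemoryProbe(age=5, is_obsolete=True, was_dropped=False, was_recalled=False),
]

print(obsolescence_precision_recall(probes))  # (0.5, 0.5)
print(retention_curve(probes))                # {1: 1.0, 5: 0.5}
```

Reporting the two numbers together matters: a system that forgets nothing gets a flat retention curve but zero obsolescence recall, while one that forgets aggressively gets high recall at the cost of precision and retention.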
