Two Problems, One Architecture
Memory-augmented LLM agents store, update, and reuse information across sessions to extend beyond finite context windows, but two arXiv papers published this week show that the architecture introduces compounding failure modes at both training time and inference time [1][2]. Existing standard methods address neither failure mode, and both carry direct consequences for production deployments.
The first paper introduces Memory-R2, a training framework targeting biased credit assignment in multi-session reinforcement learning [1]. The second identifies a phenomenon called memory-induced tool-drift, operationalized through a benchmark named MEMDRIFT, which demonstrates that personality biases stored in memory silently skew tool-call parameters across seven frontier models [2]. Together, the papers map distinct but related vulnerabilities in the same underlying architecture.
Credit Assignment in Long-Horizon RL: The Memory-R2 Approach
The core problem Memory-R2 addresses stems from how memory interacts with group-relative policy optimization. In multi-session environments, different rollouts write, update, or delete different memories, meaning they no longer share the same intermediate memory state by the time trajectory-level rewards are computed. This breaks a foundational assumption of GRPO-style methods, which compare rollouts as if sampled from the same effective environment [1].
When that assumption fails, trajectory-level rewards produce noisy or biased credit signals for memory operations. Memory-R2’s core algorithm, LoGo-GRPO, addresses this through a combination of local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards. The local component introduces rerollouts that compare different memory-operation outcomes starting from the same intermediate memory state, producing fairer group comparisons and more precise supervision for memory construction [1].
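The local and global comparison can be illustrated with a small numeric sketch. It assumes the usual group-relative baseline (reward minus group mean, divided by group standard deviation) and a hypothetical weighted sum to blend the two signals; the function names, the `local_weight` parameter, and the reward values are illustrative and are not taken from the paper [1].

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def blended_advantage(global_adv: float, local_adv: float,
                      local_weight: float = 0.5) -> float:
    """Hypothetical blend of the two signals; the weighting is not from the paper."""
    return global_adv + local_weight * local_adv


# Global group: whole trajectories scored by a trajectory-level reward.
global_rewards = [1.0, 0.0, 1.0, 0.0]
global_adv = group_relative_advantages(global_rewards)

# Local group: rerollouts of one memory operation, all started from the
# same intermediate memory state, so they are directly comparable.
local_rewards = [0.9, 0.1, 0.5]
local_adv = group_relative_advantages(local_rewards)

# Credit for the first rollout's memory operation under the assumed blend.
credit = blended_advantage(global_adv[0], local_adv[0])
```

The point of the local group is that its baseline is computed only over rollouts that share a starting memory state, so a high or low local advantage reflects the memory operation itself rather than differences in what earlier sessions happened to store.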
To stabilize multi-step reinforcement learning across extended memory horizons, Memory-R2 also adopts a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions [1].
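A curriculum of this shape can be expressed as a step-indexed schedule. In the sketch below, only the horizons 8, 16, and 32 come from the paper [1]; the step boundaries and the name `horizon_for_step` are hypothetical.

```python
from typing import Sequence, Tuple

# (start_step, sessions_per_trajectory). The horizons follow the paper;
# the step boundaries 0 / 1000 / 2000 are placeholders for illustration.
DEFAULT_STAGES: Tuple[Tuple[int, int], ...] = ((0, 8), (1000, 16), (2000, 32))


def horizon_for_step(step: int,
                     stages: Sequence[Tuple[int, int]] = DEFAULT_STAGES) -> int:
    """Return the training horizon (in sessions) active at a given step.

    `stages` must be sorted by start_step in ascending order.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    horizon = stages[0][1]
    for start, sessions in stages:
        if step >= start:
            horizon = sessions
    return horizon


# Horizons seen at a few sample steps.
schedule_preview = [horizon_for_step(s) for s in (0, 999, 1000, 2500)]
```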
Co-Learning Memory Formation and Evolution
Beyond the credit assignment fix, Memory-R2 addresses a second structural challenge: memory formation and memory evolution have typically been treated as separate concerns. The framework jointly optimizes both through a shared-parameter co-learning design [1].
In this design, a fact extractor and a memory manager are both instantiated from the same LLM backbone, differentiated through role-specific prompts rather than separate model weights. The fact extractor handles memory construction, while the memory manager handles long-horizon updates. Sharing parameters across both roles allows the training signal from one to inform the other, providing a unified training paradigm for multi-session memory-augmented agents [1].
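The role split can be pictured as two thin wrappers around one model object. The sketch below is an assumption-laden illustration: `RoleAgent`, `echo_backbone`, and the prompt strings are hypothetical, and the stand-in backbone simply echoes the first prompt line instead of calling an LLM.

```python
from dataclasses import dataclass
from typing import Callable

# Any text-in, text-out model; a real system would wrap an LLM call here.
Backbone = Callable[[str], str]


@dataclass
class RoleAgent:
    """One backbone plus a role-specific prompt (no role-specific weights)."""
    backbone: Backbone
    role_prompt: str

    def __call__(self, content: str) -> str:
        return self.backbone(f"{self.role_prompt}\n\n{content}")


def echo_backbone(prompt: str) -> str:
    """Stand-in for the shared LLM: returns the first line of the prompt."""
    return prompt.splitlines()[0]


# Both roles hold a reference to the same backbone object, so any update
# to the backbone's parameters would affect both roles at once.
extractor = RoleAgent(echo_backbone, "ROLE: fact extractor. List new facts.")
manager = RoleAgent(echo_backbone, "ROLE: memory manager. Add, update, or delete.")

extractor_out = extractor("User moved to Lisbon in March.")
manager_out = manager("User moved to Lisbon in March.")
```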
Memory-Induced Tool-Drift and the MEMDRIFT Benchmark
The second paper examines what happens at inference time when personality-driven biases, such as cost-consciousness, impatience, or risk tolerance, are stored in an agent’s long-term memory and then influence tool calls in contexts where those biases are not applicable [2].
The authors operationalize this failure mode through MEMDRIFT, a benchmark of 105 scenarios spanning five bias dimensions and seven professional domains. Scenarios are generated through an automated adversarial pipeline. Performance is measured using deflection scores, a judge-scored metric capturing parameter deviation from unbiased baselines [2].
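One plausible way to summarize such a metric is the difference in mean judge score between runs with and without the biased memory. The sketch below assumes scores on a 1-to-5 scale and that aggregation; the function name and the sample scores are hypothetical, and the paper's exact aggregation may differ [2].

```python
from statistics import mean
from typing import List


def deflection_increase(baseline_scores: List[float],
                        biased_scores: List[float]) -> float:
    """Mean judge score with biased memory minus mean score without it.

    Scores are assumed to lie on a 1-to-5 scale, where a higher value
    means a larger deviation from the unbiased tool-call parameters.
    """
    for score in baseline_scores + biased_scores:
        if not 1.0 <= score <= 5.0:
            raise ValueError(f"score {score} is outside the 1-to-5 scale")
    return mean(biased_scores) - mean(baseline_scores)


# Hypothetical judge scores for the same scenarios, without and with a
# biased memory entry present.
baseline = [1.0, 1.5, 1.0]
biased = [4.0, 4.5, 3.5]
increase = deflection_increase(baseline, biased)
```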
Across seven frontier models, including models with extended reasoning capabilities, biased memories raised deflection scores by up to 3.6 points on a 1-to-5 scale. The paper also reports that tool-drift persists when memory management is handled by three production memory architectures, indicating the problem is not specific to a single memory system design [2].
Mechanistic Findings and Real-World Exposure
The paper goes beyond behavioral measurement to provide an activation-level account of why tool-drift occurs. Biased memories act as implicit steering vectors, pushing model activations along the same latent directions as explicit behavioral instructions. Additionally, biased memories redistribute attention away from task-relevant context and toward memory entries that share surface-level keyword overlap with the target parameter [2].
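The steering-vector claim amounts to saying that two activation shifts point in a similar direction. A toy way to check that is to compare the shift caused by a biased memory with the shift caused by an explicit instruction using cosine similarity. The three-dimensional vectors below are made up for illustration, and this is not a reproduction of the paper's analysis [2].

```python
import math
from typing import List, Sequence


def shift(base: Sequence[float], perturbed: Sequence[float]) -> List[float]:
    """Activation change relative to a neutral baseline run."""
    return [p - b for b, p in zip(base, perturbed)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy hidden-state vectors (hypothetical values, not measured activations).
neutral = [0.2, 0.1, 0.0]
with_instruction = [1.2, 0.1, 0.1]  # explicit behavioral instruction in prompt
with_memory = [0.9, 0.2, 0.0]       # biased memory entry in context

alignment = cosine(shift(neutral, with_instruction), shift(neutral, with_memory))
```

A value of `alignment` close to 1 would indicate that the memory pushes activations in nearly the same direction as the explicit instruction, which is the pattern the paper describes.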
To assess real-world exposure, the authors scanned 6,062 tools across 288 verified MCP servers. Of those, 608 tools were flagged as having susceptible parameters, and tool-drift was confirmed on a validated subset [2]. The scale of that scan provides a concrete estimate of how broadly the vulnerability extends across deployed tooling infrastructure.
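The reported counts translate into a simple prevalence figure. The arithmetic below uses only the three numbers quoted above; note that it measures tools flagged as susceptible, not tools on which drift was confirmed, since confirmation covered only a validated subset [2].

```python
tools_scanned = 6062   # tools examined in the scan
servers_scanned = 288  # verified MCP servers
tools_flagged = 608    # tools flagged as having susceptible parameters

flagged_share = tools_flagged / tools_scanned
tools_per_server = tools_scanned / servers_scanned

print(f"{flagged_share:.1%} of scanned tools flagged")      # prints "10.0% of scanned tools flagged"
print(f"{tools_per_server:.1f} tools per server on average")  # prints "21.0 tools per server on average"
```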
Implications for Production Memory Architectures
For teams building or operating memory-augmented agents, both papers point to gaps that standard engineering practices do not currently close.
On the training side, Memory-R2 shows that applying GRPO-style optimization directly to multi-session memory agents produces systematically biased credit signals, and proposes LoGo-GRPO as a corrective. The progressive curriculum and shared-parameter co-learning design are presented as components of a coherent training paradigm rather than isolated patches [1].
On the inference side, the MEMDRIFT findings carry a specific caution: prompt-based relevance instructions and memory filters reduce tool-drift but do not eliminate it [2]. That result shows the limits of purely prompt-level defenses and, according to the paper, motivates dedicated defenses at the intersection of memory management and tool-call generation. The persistence of drift across three production memory architectures and seven frontier models suggests the vulnerability is architectural rather than model-specific.
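For readers unfamiliar with what a memory filter looks like in practice, here is a minimal sketch that drops memory entries whose scope does not match the current task before any tool call is generated. The `MemoryEntry` type, the `scope` labels, and `filter_memories` are hypothetical and are not the defenses evaluated in the paper [2].

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    scope: str  # hypothetical label assigned when the memory is written


def filter_memories(entries: List[MemoryEntry], task_scope: str) -> List[MemoryEntry]:
    """Keep only entries whose scope label matches the current task."""
    return [entry for entry in entries if entry.scope == task_scope]


memories = [
    MemoryEntry("Prefers the cheapest option when shopping.", scope="personal"),
    MemoryEntry("Client contract requires encrypted backups.", scope="work"),
]

# Only work-scoped memories are passed on to tool-call generation.
visible = filter_memories(memories, task_scope="work")
```

A filter of this kind is only as good as its labels: a biased entry that carries the wrong scope passes straight through, which fits the reported finding that filtering reduces drift without eliminating it.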
FAQ
Q. Does Memory-R2 require separate model weights for the fact extractor and memory manager? No. Both components are instantiated from the same LLM backbone using role-specific prompts, not separate parameters. The shared-parameter co-learning design is central to the framework’s approach [1].
Q. Does tool-drift only affect smaller or less capable models? No. The MEMDRIFT benchmark tested seven frontier models, including those with extended reasoning capabilities, and found that biased memories raised deflection scores across all of them [2].
Q. Can prompt-based defenses fully mitigate memory-induced tool-drift? Not according to the paper’s findings. Prompt-based relevance instructions and memory filters reduced drift but did not eliminate it, which the authors cite as motivation for dedicated defenses beyond standard prompt engineering [2].
Q. How many real-world tools were found to be susceptible to tool-drift? The authors scanned 6,062 tools across 288 verified MCP servers and flagged 608 tools as having susceptible parameters, with tool-drift confirmed on a validated subset [2].
Q. What is the training horizon range used in Memory-R2’s progressive curriculum? The curriculum increases the training horizon from 8 sessions to 16 sessions to 32 sessions, intended to stabilize multi-step reinforcement learning over long memory horizons [1].
Key takeaways
- Memory-R2 introduces LoGo-GRPO, which combines local rerollouts from shared intermediate memory states with a global trajectory objective to correct biased credit assignment in multi-session RL training [1].
- A shared-parameter co-learning design in Memory-R2 jointly trains a fact extractor and memory manager from the same LLM backbone, unifying memory formation and memory evolution [1].
- MEMDRIFT benchmarks memory-induced tool-drift across 105 scenarios, five bias dimensions, and seven professional domains, finding deflection score increases of up to 3.6 points across seven frontier models [2].
- Mechanistic analysis shows biased memories function as implicit steering vectors and shift attention toward keyword-overlapping memory entries rather than task-relevant context [2].
- Prompt-based defenses reduce but do not eliminate tool-drift, and 608 of 6,062 scanned tools across 288 verified MCP servers were flagged as having susceptible parameters, indicating broad real-world exposure [2].