The Over-Refusal Problem in LLM Agent Safety

As LLM agents gain more sophisticated tool-use capabilities, operators face a persistent tension in safety design: tightening guardrails to block harmful requests tends to degrade performance on legitimate ones. Existing defenses reduce harmful outputs but frequently trigger what researchers call the over-refusal problem, where stricter safety settings cause agents to reject benign tasks they should handle [1]. For production deployments, this trade-off has practical consequences. An agent that refuses too broadly becomes unreliable for end users, while one tuned for utility leaves its attack surface exposed. Neither outcome is acceptable in high-stakes environments, and closing the gap between the two remains an open problem for agent operators.

What SafeHarbor Is

SafeHarbor is a guardrail framework designed to establish precise decision boundaries for LLM agents without any model retraining. Its authors describe it as a plug-and-play solution: it is layered onto an existing agent architecture at inference time rather than integrated into a training pipeline [1]. The core mechanism extracts context-aware defense rules through enhanced adversarial generation, and these rules replace the static guidelines that characterize most current safety approaches. Because the framework operates without fine-tuning, operators can apply it to already-deployed models, which reduces the cost and complexity of adoption. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor, lowering the barrier for teams evaluating it against their own workloads [1].
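
To make "layered on at inference time" concrete, the following is a minimal sketch of a wrapper that adds retrieved rules to the prompt while leaving the model untouched. It is not SafeHarbor's actual API: guarded_call, echo_model, toy_retriever, and the prompt layout are all hypothetical names chosen for illustration.

```python
from typing import Callable, List


def guarded_call(
    model: Callable[[str], str],
    retrieve_rules: Callable[[str], List[str]],
    user_request: str,
) -> str:
    """Prepend request-specific defense rules to the prompt.

    The model is only called, never modified, which is what makes this
    kind of guardrail training-free. (Hypothetical interface.)
    """
    rules = retrieve_rules(user_request)
    if rules:
        rule_block = "\n".join(f"- {rule}" for rule in rules)
        prompt = (
            f"Safety rules for this request:\n{rule_block}\n\n"
            f"Request: {user_request}"
        )
    else:
        prompt = f"Request: {user_request}"
    return model(prompt)


# Stand-ins for illustration only: a model that echoes its prompt and a
# retriever with a single hard-coded rule.
def echo_model(prompt: str) -> str:
    return prompt


def toy_retriever(request: str) -> List[str]:
    if "delete" in request.lower():
        return ["Confirm the target path is inside the user's workspace."]
    return []


if __name__ == "__main__":
    print(guarded_call(echo_model, toy_retriever, "Delete temp files"))
```

In a real deployment the echo model would be replaced by a call to the deployed LLM, and the retriever by a lookup into the rule memory described in the next section.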

How the Hierarchical Memory System Works

The framework’s central component is a local hierarchical memory system that stores defense rules and injects them dynamically at inference time [1]. Rather than applying a fixed set of safety instructions uniformly across all requests, the system retrieves rules that are contextually relevant to the specific input an agent is processing. This context-sensitive injection is what allows SafeHarbor to distinguish between ambiguous benign requests and genuinely malicious ones, a distinction that flat, static rule sets struggle to make reliably. The hierarchical structure organizes rules across multiple levels, enabling finer-grained matching between incoming queries and the appropriate defensive context.
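
As a rough illustration of context-sensitive retrieval from a hierarchy, the sketch below walks a small rule tree from general to specific and collects the rules along the best-matching path. The source does not specify how SafeHarbor matches requests to nodes; the keyword-overlap matching, the MemoryNode class, and the example rules here are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MemoryNode:
    """One node in a hypothetical rule hierarchy."""
    name: str
    keywords: Set[str]
    rules: List[str] = field(default_factory=list)
    children: List["MemoryNode"] = field(default_factory=list)


def overlap(node: MemoryNode, tokens: Set[str]) -> int:
    """Number of request tokens that match the node's keywords."""
    return len(node.keywords & tokens)


def retrieve(root: MemoryNode, request: str) -> List[str]:
    """Descend from the root, following the best-matching child at each
    level, and collect the rules of every node visited."""
    tokens = set(request.lower().split())
    collected: List[str] = []
    node = root
    while True:
        collected.extend(node.rules)
        matching = [c for c in node.children if overlap(c, tokens) > 0]
        if not matching:
            return collected
        node = max(matching, key=lambda c: overlap(c, tokens))


# Illustrative two-level hierarchy (assumed content, not from the paper).
root = MemoryNode("root", set(), [], [
    MemoryNode(
        "filesystem", {"file", "files", "delete", "directory"},
        ["Keep file operations inside the user's workspace."],
        [MemoryNode("deletion", {"delete", "remove"},
                    ["Refuse recursive deletion of system paths."])],
    ),
    MemoryNode("email", {"email", "send", "inbox"},
               ["Do not send mail to recipients the user did not name."]),
])

if __name__ == "__main__":
    print(retrieve(root, "delete old files in my project"))
```

A request that matches no branch receives no extra rules, which is one way a context-aware system can avoid burdening benign requests with unrelated restrictions.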

Self-Evolution via Information Entropy

Beyond static rule storage, SafeHarbor incorporates an information entropy-based self-evolution mechanism that continuously restructures the memory system as new cases are encountered [1]. The mechanism works through dynamic node splitting and merging. When a memory node accumulates cases that are sufficiently diverse, measured by information entropy, the node splits to create more granular categories. Conversely, nodes covering cases that are too similar merge to reduce redundancy. The result is a memory structure that adapts over time, refining its decision boundaries without requiring manual updates or retraining cycles. For operators managing agents that encounter evolving threat patterns, this self-optimization property reduces ongoing maintenance overhead.
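
The split and merge decisions can be sketched with Shannon entropy over the case labels stored in a node. The source does not give the exact entropy formulation, the thresholds, or how cases are labeled, so the function names, the label-based grouping, and the threshold values below are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def shannon_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the label distribution in a node."""
    if not labels:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    # sum of p * log2(1/p) over the observed labels
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def maybe_split(labels: List[str], split_threshold: float = 1.0) -> Dict[str, List[str]]:
    """Split a node into one group per label when its cases are too
    diverse; otherwise keep it as a single group. Threshold is assumed."""
    if shannon_entropy(labels) > split_threshold:
        groups: Dict[str, List[str]] = {}
        for label in labels:
            groups.setdefault(label, []).append(label)
        return groups
    return {"unsplit": list(labels)}


def should_merge(a: List[str], b: List[str], merge_threshold: float = 0.5) -> bool:
    """Merge two sibling nodes when their combined cases are nearly
    homogeneous, i.e. keeping them apart adds little. Threshold is assumed."""
    return shannon_entropy(a + b) < merge_threshold


if __name__ == "__main__":
    diverse = ["phishing", "malware", "data-exfil", "benign-admin"]
    print(shannon_entropy(diverse), sorted(maybe_split(diverse)))
    print(should_merge(["phishing", "phishing"], ["phishing"]))
```

Under this reading, a node holding four equally frequent case types has an entropy of 2 bits and would split, while two siblings that both hold the same case type have a combined entropy of 0 and would merge.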

Benchmark Results and Comparisons

Reported experiments evaluate SafeHarbor on two dimensions: ambiguous benign tasks and explicit malicious attacks [1]. On GPT-4o, the framework achieved a peak benign utility of 63.6%, meaning agents successfully completed that share of legitimate requests. At the same time, the refusal rate against harmful requests exceeded 93% [1]. The researchers describe these figures as state-of-the-art relative to prior methods, though the source does not enumerate the specific competing systems or their individual scores. The key operational result is the combination of a high refusal rate with meaningful benign utility, which suggests that the over-refusal trade-off can be reduced without sacrificing safety coverage.
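
For teams reproducing these two metrics on their own workloads, the arithmetic can be sketched as follows. The Outcome record, the score function, and the toy run are hypothetical; the counts are invented for illustration and are not the paper's data, and the benchmark's own scoring rules may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Outcome:
    """Result of one evaluated request (hypothetical record layout)."""
    is_harmful: bool   # ground-truth label of the request
    refused: bool      # whether the agent refused it
    completed: bool    # whether the agent completed the task


def score(outcomes: List[Outcome]) -> Tuple[float, float]:
    """Return (benign utility, harmful refusal rate) as fractions."""
    benign = [o for o in outcomes if not o.is_harmful]
    harmful = [o for o in outcomes if o.is_harmful]
    if not benign or not harmful:
        raise ValueError("need at least one benign and one harmful request")
    benign_utility = sum(o.completed for o in benign) / len(benign)
    refusal_rate = sum(o.refused for o in harmful) / len(harmful)
    return benign_utility, refusal_rate


# Toy run with invented outcomes: 4 benign requests, 2 harmful ones.
toy_run = [
    Outcome(is_harmful=False, refused=False, completed=True),
    Outcome(is_harmful=False, refused=False, completed=True),
    Outcome(is_harmful=False, refused=False, completed=True),
    Outcome(is_harmful=False, refused=True, completed=False),   # over-refusal
    Outcome(is_harmful=True, refused=True, completed=False),
    Outcome(is_harmful=True, refused=True, completed=False),
]

if __name__ == "__main__":
    utility, refusal = score(toy_run)
    print(f"benign utility: {utility:.1%}, refusal rate: {refusal:.1%}")
```

The fourth record is the over-refusal case: a benign request that the agent declined, which lowers benign utility without improving the refusal rate.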

Availability and Practical Applicability

The framework is designed to support a range of deployment scenarios. Because its architecture is training-free and plug-and-play, it is not tied by design to a specific model family or agent infrastructure, which makes it applicable to operators running different LLM backends [1]. The public code release lets engineering teams evaluate SafeHarbor against their own agent configurations before committing to integration. For teams currently relying on static system-prompt guardrails or rule-based filters, SafeHarbor offers a migration path toward dynamic, context-aware defense without the overhead of retraining or fine-tuning existing models. The entropy-based memory evolution also means the system can be deployed and then left to refine itself as it processes real traffic, rather than requiring periodic manual rule updates.

FAQ

Q. Does SafeHarbor require modifying or retraining the underlying LLM? No. The framework is explicitly described as training-free and plug-and-play, operating at inference time by injecting defense rules from its hierarchical memory system [1]. Operators can apply it to already-deployed models.

Q. Which models has SafeHarbor been tested on? The available research reports experiments on GPT-4o, where the system achieved 63.6% benign utility and a refusal rate exceeding 93% [1]. The source does not detail results on other specific models.

Q. How does the memory system stay current as new attack patterns emerge? SafeHarbor uses an information entropy-based self-evolution mechanism that performs dynamic node splitting and merging as new cases are encountered, continuously restructuring the memory without manual intervention [1].

Q. Is the code available for evaluation before full integration? Yes. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor, allowing teams to test the framework against their own workloads prior to production deployment [1].

Q. What is the practical trade-off operators should expect? Based on reported results, operators can expect a refusal rate above 93% on harmful requests alongside a peak benign utility of 63.6% when running on GPT-4o [1]. The source does not provide latency or throughput figures, so infrastructure overhead remains an open evaluation point.

Key takeaways

  • SafeHarbor is a training-free, plug-and-play guardrail framework that injects context-aware defense rules at inference time [1].
  • A local hierarchical memory system stores and dynamically retrieves rules matched to the specific context of each incoming request, enabling finer-grained safety decisions than static rule sets [1].
  • An information entropy-based mechanism continuously splits and merges memory nodes as new cases arrive, allowing the system to adapt to evolving threat patterns without manual updates [1].
  • On GPT-4o, SafeHarbor achieved a peak benign utility of 63.6% while maintaining a harmful-request refusal rate exceeding 93%, addressing the over-refusal trade-off that limits existing approaches [1].
  • Source code is publicly available, supporting evaluation across different agent architectures and deployment environments [1].