Nautilus Compass Detects LLM Agent Persona Drift Without Model Weights

The Persona Drift Problem in Long Coding Sessions

Production LLM coding agents degrade in predictable ways over extended sessions. They forget constraints the user specified earlier, repeat mistakes that were already corrected, and confabulate agreements that were never made [1]. The problem compounds as context windows grow and agents are asked to maintain consistent behavior across hours of interaction.

Existing technical solutions largely depend on access to model internals. Persona vector approaches, for instance, require direct inspection of model weights to track behavioral state. That requirement immediately disqualifies them from any deployment that runs against a closed API, which describes the majority of production coding agent setups using services such as Anthropic Claude or GPT-4 [1]. Operators running Claude Code or similar tools have had no verified, public mechanism for detecting when their agent has drifted from its configured behavioral profile.

What Nautilus Compass Is

Nautilus Compass is a black-box persona drift detector and agent memory layer designed specifically for production coding agents. The system operates without any access to model weights, making it compatible with closed-API deployments that white-box alternatives cannot reach [1].

The project ships as four integrated components running on a single daemon: a Claude Code plugin, an MCP 2024-11-05 A2A server, a command-line interface, and a REST API. The MCP server layer extends compatibility to Cursor, Cline, and Hermes clients. Code, anchor files, frozen test data, and audit-log tooling are released under an MIT license at github.com/chunxiaoxx/nautilus-compass [1].

How the Detection Method Works

The core detection mechanism operates entirely at the prompt-text layer, so indexing requires no generative LLM calls. When a user prompt arrives, the system computes cosine similarity between the BGE-m3 embedding of that prompt and the embeddings of a set of behavioral anchor texts that define the expected persona [1].

Those similarity scores are aggregated using a weighted top-k mean rather than a simple average, which gives higher influence to the most semantically relevant anchors for a given prompt. Because the pipeline embeds raw conversation text directly rather than first extracting structured facts through an LLM call, its index-time cost is substantially lower than that of extraction-based approaches. The researchers report an end-to-end reproduction cost of approximately $3.50, roughly 14 times cheaper than GPT-4o-judged stacks [1].
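
The scoring step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the toy vectors stand in for real embeddings, the function names are hypothetical, and the 1/rank weighting is an assumed scheme, since the article does not specify which weights the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)

def weighted_top_k_mean(scores, k=3):
    """Weighted mean of the k highest scores.

    ASSUMPTION: the score at rank r (1-based) gets weight 1/r.
    The weighting used by the actual system is not stated in the article.
    """
    top = sorted(scores, reverse=True)[:k]
    if not top:
        return 0.0
    weights = [1.0 / (rank + 1) for rank in range(len(top))]
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)

def anchor_score(prompt_vec, anchor_vecs, k=3):
    """Score one prompt embedding against all anchor embeddings."""
    similarities = [cosine(prompt_vec, anchor) for anchor in anchor_vecs]
    return weighted_top_k_mean(similarities, k)

# Toy 2-dimensional "embeddings" (hypothetical values for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
score = anchor_score([1.0, 0.0], anchors, k=2)
```

Compared with a plain average over all anchors, the top-k form keeps unrelated anchors from diluting the score, which matches the stated motivation of giving the most relevant anchors more influence.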

Benchmark Results and Honest Limitations

On a held-out test set constructed from real Claude Code session traces and labeled by an independent LLM judge, Nautilus Compass achieves a ROC AUC of 0.83 for drift detection [1]. The researchers describe this as the primary performance claim for the system’s core function.
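
For readers less familiar with the metric, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.83 means that ordering holds for about 83% of such pairs. The sketch below computes it directly from that definition. The labels and scores are invented for illustration and are not taken from the paper's test set.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison: fraction of (positive, negative)
    pairs where the positive scores higher, counting ties as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical data: 1 = drifted, 0 = not drifted.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 pairs ordered correctly
```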

The retrieval pipeline was also evaluated on two external benchmarks. On LongMemEval-S v0.8, the system scores 56.6%. On EverMemBench-Dynamic with a sample of 500 items, it scores 44.4%, which the paper reports as topping the four published baselines in EverMemBench Table 4 [1].

The authors are direct about the ceiling of the no-extraction design. The LongMemEval-S score of 56.6% sits roughly 30 percentage points below recent white-box leaders, which have reached above 90% on that benchmark. The paper frames that gap as the architectural cost of avoiding LLM-based fact extraction at index time, treating it as a known trade-off rather than a deficiency to be minimized [1]. Operators who require retrieval accuracy at the level of white-box systems will need a different architecture.

Audit Infrastructure and Deployment

Nautilus Compass includes a Merkle-chained audit log for anchor updates. The chaining structure makes the log tamper-evident: any modification to a prior anchor entry would break the chain and become detectable. This is relevant for production deployments where behavioral anchors may be updated over time and where compliance or reproducibility requirements demand a verifiable record of those changes [1].
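
The tamper-evidence property can be shown with a minimal hash chain, in which each entry's hash covers the previous entry's hash. This is a sketch of the general technique, not the project's log format: the field names, the SHA-256 choice, the JSON serialization, and the example anchor text are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry

def entry_hash(prev_hash, payload):
    """Hash an entry together with the hash of its predecessor."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(log, payload):
    """Append a new anchor update, chaining it to the last entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "prev": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    })

def verify(log):
    """Return the index of the first broken entry, or -1 if the chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        expected = entry_hash(prev_hash, entry["payload"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return -1

# Hypothetical anchor updates.
log = []
append_entry(log, {"anchor": "never push to main", "version": 1})
append_entry(log, {"anchor": "never push to main without review", "version": 2})
intact = verify(log)

# Retroactively editing the first entry breaks verification at index 0.
log[0]["payload"]["anchor"] = "pushing to main is fine"
broken_at = verify(log)
```

Because each hash depends on everything before it, an attacker who edits an old entry would also have to recompute every later hash, which is detectable whenever the latest hash has been recorded somewhere outside the log.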

Because all four components run on a single daemon, operators do not need to run separate services for the plugin, the MCP server, the CLI, and the REST API. MCP 2024-11-05 compatibility covers Cursor, Cline, and Hermes in addition to Claude Code [1].
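
In practice, the "2024-11-05" label is the protocol version string that an MCP client sends in its opening handshake. The sketch below builds that JSON-RPC initialize message as defined by the MCP specification. The client name and version are hypothetical, and nothing here assumes how the Compass server is launched, which transport it uses, or which tools it exposes, since the article does not describe those details.

```python
import json

def build_initialize_request(request_id, client_name, client_version):
    """Build an MCP initialize request for protocol revision 2024-11-05."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},  # an empty object declares no optional client features
            "clientInfo": {"name": client_name, "version": client_version},
        },
    }

# Hypothetical client identity for illustration.
message = build_initialize_request(1, "example-client", "0.0.1")
wire = json.dumps(message)
```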

Where Compass Fits Among Memory Layer Alternatives

The researchers conducted a survey of public agent memory layers as of May 2026, examining Mem0, Letta, Cognee, Zep, MemOS, and smrti alongside Nautilus Compass. Their finding is that Compass is the only verified public system in that group that does not call an LLM at index time to extract facts or construct a knowledge graph [1].

The other systems in the comparison generally use LLM calls during indexing to parse conversation text into structured representations, whether entity graphs, fact triples, or summarized memory entries. That approach tends to produce higher retrieval accuracy, as reflected in the 30-point LongMemEval-S gap, but it also increases index-time cost and introduces a dependency on an external model at write time. Nautilus Compass trades retrieval ceiling for index-time simplicity and closed-API compatibility, a position that has no direct equivalent among the surveyed alternatives [1].

FAQ

Q. Can Nautilus Compass be used with LLM providers other than Anthropic Claude? The system is designed as a black-box layer that requires no model weight access, so it is architecturally compatible with any closed-API provider. The paper's benchmarks, however, use Claude Code session traces. Separately, the MCP 2024-11-05 server makes Compass usable from Cursor, Cline, and Hermes clients [1].

Q. What is the practical cost of running Compass in production? The figure the researchers report is an end-to-end reproduction cost of approximately $3.50, which they describe as roughly 14 times cheaper than GPT-4o-judged stacks; it is a cost to reproduce the results, not a per-deployment running cost. The low figure reflects the absence of LLM calls at index time, since raw text is embedded directly rather than processed through a generative model [1].

Q. How does the 30-point retrieval gap affect real deployments? Operators who need retrieval accuracy above roughly 57% on long-context memory tasks will find the no-extraction architecture insufficient for those use cases. The authors explicitly frame the gap as the architectural ceiling of the design, not a bug, meaning it is unlikely to close without adding LLM-based extraction [1].

Q. Is the test data independent, or was it used during development? The benchmark test set was built from real Claude Code session traces and labeled by an independent LLM judge. The paper describes it as a held-out set, and the frozen test data is published alongside the code under the MIT license at the project repository [1].

Q. What does the Merkle-chained audit log actually protect against? The chaining structure ensures that any retroactive modification to an anchor entry would produce a detectable break in the chain. This is primarily relevant for compliance scenarios where operators need a verifiable record of how behavioral anchors have changed over time [1].

Key takeaways

Nautilus Compass detects persona drift in production LLM coding agents using only prompt-text cosine similarity with BGE-m3 embeddings, requiring no model weight access and enabling use with closed APIs such as Anthropic Claude.
The system achieves ROC AUC 0.83 on drift detection and 44.4% on EverMemBench-Dynamic, topping four published baselines, while sitting approximately 30 points below white-box leaders on LongMemEval-S.
Compared with six surveyed public agent memory layers (Mem0, Letta, Cognee, Zep, MemOS, smrti), Nautilus Compass is the only system verified to skip LLM calls at index time, reducing reproduction cost to roughly $3.50.
The single-daemon deployment covers a Claude Code plugin, MCP 2024-11-05 A2A server, CLI, and REST API, with Merkle-chained audit logging for tamper-evident anchor tracking.
The authors explicitly acknowledge the no-extraction design imposes a retrieval ceiling, framing the 30-point accuracy gap as an architectural trade-off rather than a solvable defect.