Tool-Calling Decisions Are Linearly Steerable in LLMs

The Problem: Silent Tool-Calling Failures

In production agentic systems, incorrect tool selection ranks among the most dangerous failure modes because the error is invisible at decision time. A model that routes a request to the wrong tool, whether sending an email instead of drafting one, or canceling a booking instead of querying it, produces consequences before any monitoring layer can intervene. Current approaches catch these mistakes only after execution, when the downstream effect has already propagated [1].

The absence of a pre-execution signal forces operators to rely on post-hoc logging, retry logic, or human review. None of those mechanisms prevent the initial wrong action. Researchers studying the internal geometry of tool-calling models have now identified a structural property that could change that calculus.

What the Research Found

A study probing 12 instruction-tuned language models across the Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 families, ranging from 270 million to 27 billion parameters, found that the identity of the tool a model intends to call is linearly readable inside its activations before any output token is generated [1].

The finding held across all 12 tested models, suggesting the property is not specific to a single architecture or training recipe. The researchers also found that base models, before instruction tuning, already encode the correct tool in their internal states: cosine readout from those states recovers 69 to 82 percent accuracy on the Berkeley Function-Calling Leaderboard benchmark, while base-model generation alone reaches only 2 to 10 percent. That gap implies pretraining forms the underlying representation, and instruction tuning subsequently connects it to the output vocabulary [1].
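A cosine readout of this kind can be sketched as a nearest-mean classifier. The following is a minimal illustration, not the authors' code: it assumes the readout compares one activation vector against per-tool mean activations, and it uses plain Python lists with made-up three-dimensional values in place of real hidden states. The tool names and the helpers `mean_vector`, `cosine`, and `cosine_readout` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_readout(activation, tool_means):
    """Return the tool whose mean activation is most cosine-similar."""
    return max(tool_means, key=lambda name: cosine(activation, tool_means[name]))

# Toy activations (assumed values), grouped by the tool that was correct.
examples = {
    "send_email": [[0.9, 0.1, 0.0], [1.1, -0.1, 0.2]],
    "draft_email": [[0.0, 1.0, 0.1], [0.2, 0.8, -0.1]],
}
tool_means = {name: mean_vector(vs) for name, vs in examples.items()}

predicted = cosine_readout([0.8, 0.2, 0.1], tool_means)
print(predicted)
```

In a real setup the vectors would come from a chosen layer and token position of the model; which layer and position the study used is not stated here, so that choice is left open.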

How the Steering Mechanism Works

The core intervention is a mean-difference vector. For any two tools in a model’s available set, the researchers computed the average internal activations the model produces when selecting each tool, then subtracted one mean from the other. Adding that difference vector to the model’s activations at inference time switches the selected tool at 77 to 100 percent accuracy on name-only single-turn prompts, with accuracy rising to 93 to 100 percent for models at 4 billion parameters and above [1].
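The mean-difference construction can be sketched in a few lines. This is an illustrative toy under stated assumptions: activations are plain lists with invented two-dimensional values, the direction is taken as target mean minus source mean, and the vector is added with a scale of 1.0. The tool names and the helpers `steering_vector`, `steer`, and `sq_dist` are hypothetical, and a real intervention would add the vector inside the model at a chosen layer.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(source_acts, target_acts):
    """Mean activation for the target tool minus mean for the source tool."""
    mu_source = mean_vector(source_acts)
    mu_target = mean_vector(target_acts)
    return [t - s for s, t in zip(mu_source, mu_target)]

def steer(activation, direction, scale=1.0):
    """Add the (optionally rescaled) steering direction to one activation."""
    return [a + scale * d for a, d in zip(activation, direction)]

def sq_dist(a, b):
    """Squared Euclidean distance, used here only to inspect the result."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy activations (assumed values) recorded when each tool was selected.
get_booking_acts = [[1.0, 0.0], [0.8, 0.2]]
cancel_booking_acts = [[0.0, 1.0], [0.2, 0.8]]

direction = steering_vector(get_booking_acts, cancel_booking_acts)
original = [1.0, 0.1]  # activation from a prompt that selects get_booking
steered = steer(original, direction)

mu_cancel = mean_vector(cancel_booking_acts)
print(sq_dist(original, mu_cancel), sq_dist(steered, mu_cancel))
```

The printed distances only show that the steered activation sits closer to the target tool's mean. Whether the model's output actually flips has to be checked by running generation with the modified activation.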

Critically, the JSON arguments generated after the tool name also conform to the new tool’s schema automatically. Because argument generation is autoregressive and conditioned on the tool name token, flipping the name is sufficient to redirect the full structured output. Operators do not need to separately rewrite the argument payload [1].

The causal effect concentrates along a single direction in the model’s representation space: the row of the output layer responsible for producing the target tool’s first token. A unit vector along that row, scaled to matched magnitude, already achieves 93 to 100 percent steering accuracy. The remaining activation components leave the tool choice largely untouched, indicating the signal is both localized and geometrically clean [1].
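The single-direction variant can be illustrated with a toy calculation. The sketch below assumes that "matched magnitude" means rescaling the unit vector to the norm of the mean-difference vector, and it treats a logit as a plain dot product between the activation and the output-layer row, ignoring any final normalization layer. All numbers and the helper names `row_steering_vector` and `logit` are hypothetical.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def row_steering_vector(unembed_row, reference_vector):
    """Unit vector along unembed_row, rescaled to the norm of reference_vector."""
    scale = norm(reference_vector) / norm(unembed_row)
    return [scale * x for x in unembed_row]

def logit(activation, unembed_row):
    """Toy logit for the target token: dot product with its output-layer row."""
    return sum(a * w for a, w in zip(activation, unembed_row))

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

# Assumed toy values: the output-layer row for the target tool's first token
# and a mean-difference vector for the same tool pair.
target_row = [3.0, 4.0, 0.0]
mean_diff = [0.6, 0.0, 0.8]

row_vec = row_steering_vector(target_row, mean_diff)
activation = [0.0, 0.0, 1.0]

gain_row = logit(add(activation, row_vec), target_row) - logit(activation, target_row)
gain_diff = logit(add(activation, mean_diff), target_row) - logit(activation, target_row)
print(gain_row, gain_diff)
```

For a fixed vector norm, the direction of the row itself gives the largest possible increase in that token's logit, which is one way to see why a single direction can carry most of the steering effect in this simplified picture.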

Localizing the Signal: Attention Heads and Layers

Activation patching experiments identified where in the network the tool-selection signal originates. The causal locus concentrates in a small set of mid- and late-layer attention heads rather than being distributed across the full model [1].
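The logic of activation patching can be shown on a deliberately simplified toy. The sketch below is not the study's procedure: it assumes each attention head writes an additive vector into a shared residual stream, that a fixed readout direction scores tool A against tool B, and that effects do not interact across heads, none of which holds exactly in a real transformer. The values and the helpers `logit_diff` and `patch_effects` are hypothetical.

```python
def logit_diff(head_outputs, readout):
    """Sum the head outputs and project onto a tool-A-minus-tool-B direction."""
    residual = [sum(col) for col in zip(*head_outputs)]
    return sum(r * w for r, w in zip(residual, readout))

def patch_effects(clean_heads, corrupt_heads, readout):
    """Fraction of the clean-vs-corrupt gap restored by patching each head alone."""
    base = logit_diff(corrupt_heads, readout)
    full = logit_diff(clean_heads, readout)
    effects = []
    for i in range(len(clean_heads)):
        patched = list(corrupt_heads)
        patched[i] = clean_heads[i]  # copy one head's output from the clean run
        effects.append((logit_diff(patched, readout) - base) / (full - base))
    return effects

# Toy head outputs (assumed values) for a prompt that calls tool A (clean)
# and a matched prompt that calls tool B (corrupt).
clean_heads = [[0.2, 0.1], [2.0, 0.0], [0.1, 0.3]]
corrupt_heads = [[0.1, 0.1], [-1.8, 0.0], [0.1, 0.3]]
readout = [1.0, 0.0]

effects = patch_effects(clean_heads, corrupt_heads, readout)
for i, effect in enumerate(effects):
    print(f"head {i}: {effect:.3f}")
```

A head whose patched value restores most of the gap is a candidate causal locus. In this toy, head 1 dominates, which mirrors the kind of concentration the study reports.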

The researchers also tested whether the linear readout was simply tracking topic or domain rather than tool identity. Using 14 same-domain tools from the tau-bench airline benchmark, all sharing the same subject matter, a within-topic probe recovered top-1 tool identity at 61 to 89 percent accuracy across five models in the 4 billion to 14 billion parameter range. That result rules out the interpretation that the linear signal used for readout and steering lies along a topic axis rather than a tool-identity axis [1].

Pre-Execution Error Detection

Beyond steering, the same per-tool activation means support a confidence signal that can flag likely errors before execution begins. The researchers measured the gap between the top-1 and top-2 tool scores derived from internal activations. On Gemma 3 12B and 27B, queries where that gap is smallest produce 14 to 21 times more wrong calls than queries where the gap is largest [1].

For operators, this gap functions as a pre-flight confidence check. A narrow margin between the leading and second-ranked tool in activation space signals that the model is uncertain, and the call can be held for review or re-routed before any action executes. This is structurally different from post-hoc error detection because the signal is available at the same moment the model is deciding, not after the decision has produced an effect.
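A pre-flight check of this kind might look like the sketch below. It assumes the per-tool scores have already been computed from activations (for example, as similarities to per-tool means) and that the operator has picked a margin threshold from held-out data. The scores, the threshold of 0.05, and the helpers `top2_margin` and `preflight` are hypothetical.

```python
def top2_margin(scores):
    """Return the best-scoring tool and its lead over the runner-up."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_tool, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_tool, best_score - second_score

def preflight(scores, threshold):
    """Decide whether to execute the call or hold it for review."""
    tool, margin = top2_margin(scores)
    action = "execute" if margin >= threshold else "hold_for_review"
    return tool, margin, action

# Toy per-tool scores (assumed values) for two incoming queries.
confident = {"get_booking": 0.91, "cancel_booking": 0.22, "send_email": 0.05}
uncertain = {"cancel_booking": 0.62, "get_booking": 0.60, "send_email": 0.10}

THRESHOLD = 0.05  # assumed; in practice tuned on labeled traffic
print(preflight(confident, THRESHOLD))
print(preflight(uncertain, THRESHOLD))
```

The threshold trades review load against missed errors, so it is better treated as an operating point to tune than as a constant. The sketch needs at least two tools in the menu.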

Implications for Agent Reliability

The findings open several practical directions for teams running tool-calling agents in production. Linear steerability means runtime intervention is mechanistically feasible: a monitoring layer with access to model activations could detect a likely wrong tool selection and apply a correction vector before the output token is committed [1].

The researchers note that the measurements were conducted in single-turn, fixed-menu settings, and that multi-turn agentic transfer is more fragile. That limitation is relevant for operators considering deployment: the technique’s reliability in multi-step agent loops, where tool menus and context accumulate across turns, has not been established at the same confidence levels as the single-turn results [1].

Nevertheless, the concentration of the tool-selection signal in a small number of attention heads and a single output-layer direction means instrumentation does not require full-model interpretability infrastructure. Teams already collecting intermediate activations for other monitoring purposes may be positioned to add tool-confidence scoring with relatively contained engineering effort.

FAQ

Q. Which model families and sizes does the research cover? The study tested 12 instruction-tuned models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1, with parameter counts from 270 million to 27 billion [1]. Steering accuracy on name-only single-turn prompts ranged from 77 to 100 percent across all models and from 93 to 100 percent for models at 4 billion parameters and above [1].

Q. Does steering the tool name also fix the argument JSON, or does that require separate handling? Argument JSON follows automatically. Because generation is autoregressive and conditioned on the tool name token, redirecting the name is sufficient for the arguments to conform to the new tool’s schema without additional intervention [1].

Q. Does this technique work in multi-turn agentic settings? The researchers explicitly flag multi-turn transfer as more fragile and discuss it in their limitations section. The high accuracy figures apply to single-turn, fixed-menu prompts [1]. Operators should not assume the same reliability in multi-step agent loops without further validation.

Q. How is the pre-execution confidence score computed? The score is the gap between the top-1 and top-2 tool scores, which are derived from internal activations using the same per-tool activation means that support steering. On Gemma 3 12B and 27B, queries with the smallest gap produce 14 to 21 times more wrong calls than queries with the largest gap [1], making the gap a practical threshold signal for routing uncertain queries to human review.

Q. Do base models (before instruction tuning) support this readout? Yes. Cosine readout from base model activations recovers 69 to 82 percent tool-identity accuracy on the Berkeley Function-Calling Leaderboard, compared to 2 to 10 percent from base-model generation alone, suggesting the representation is formed during pretraining [1].

Key takeaways

Tool identity is linearly encoded in activations across all 12 tested models from the Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 families, from 270M to 27B parameters.
Adding a mean-difference vector between two tools’ activations switches the selected tool at 77 to 100 percent accuracy, with argument JSON automatically conforming to the redirected tool’s schema.
The causal signal concentrates in a small set of mid- and late-layer attention heads and in a single output-layer row, making targeted instrumentation feasible without full-model interpretability infrastructure.
On Gemma 3 12B and 27B, queries with the narrowest gap between top-1 and top-2 tool scores produce 14 to 21 times more wrong calls than queries with the widest gap, enabling pre-execution confidence checks.
Results are validated in single-turn, fixed-menu settings; multi-turn agentic reliability is explicitly flagged as more fragile in the research limitations.