Why this matters

OpenSandbox reached the CNCF Landscape in 2024 [1], signaling that the cloud-native community now treats agent sandboxing as infrastructure-grade work rather than a nice-to-have. The core problem it addresses is real: LangGraph agents that call a code-interpreter tool execute arbitrary Python inside the same process and OS user as the agent runtime. One malicious or buggy generated snippet can read environment variables, open network sockets, or exhaust memory. Teams running multi-turn coding agents in production today either bolt on ad-hoc subprocess wrappers with no observability, or pay for hosted execution environments with opaque billing.

This tutorial takes a different path. Because OpenSandbox requires a running Docker daemon [1] that the tutorial sandbox cannot provide, the implementation here builds the same architectural pattern with Python’s subprocess module, POSIX resource limits on CPU time, memory, and open file descriptors, and a wall-clock timeout enforcer. These limits cap resource use; they do not filter syscalls the way seccomp does. The OpenTelemetry instrumentation layer is identical to what you would wire to a real OpenSandbox runtime: every tool call produces a span with attributes for exit code and stdout length, and the span’s start and end timestamps record wall-clock duration. The same span structure indexes the same way on Datadog or Honeycomb; only the exporter endpoint changes.

Prerequisites

  • Python 3.11 or 3.12
  • Familiarity with LangGraph’s node/edge model
  • An Anthropic or OpenAI API key (only needed for the live-agent step; all structural steps run without one)
  • Basic OpenTelemetry concepts (spans, exporters)
  • Docker Desktop or Docker Engine (optional; required only by the real OpenSandbox runtime [1] and not used in the sandbox-safe steps below)

Setup

Install all dependencies in one shot. The tutorial uses LangGraph for the agent graph, opentelemetry-sdk for tracing, and langchain-anthropic for the model client in the live step.

uv pip install langgraph langchain-core langchain-anthropic \
  opentelemetry-sdk opentelemetry-api \
  opentelemetry-exporter-otlp-proto-grpc

Verify the key packages are present:

from importlib.metadata import version
for pkg in ["langgraph", "opentelemetry-sdk", "langchain-core", "langchain-anthropic"]:
    print(f"{pkg}: {version(pkg)}")
print("env_check_ok")

Step 1: Build the Isolated Execution Backend

The execution backend is the heart of the system. It accepts a snippet of Python source, runs it in a fresh subprocess with hard limits on CPU time and memory, captures stdout/stderr, and returns a structured result. This mirrors the OpenSandbox Command execution API [1], which also returns exit code, stdout, and stderr as first-class fields.

# filename: sandbox_exec.py
import subprocess
import resource
import sys
import os
import textwrap
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str
    timed_out: bool


def _set_limits(cpu_seconds: int, mem_bytes: int) -> None:
    """Called in the child process before exec to apply resource limits."""
    # CPU time hard limit
    resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    # Virtual memory limit
    resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    # Cap the number of file descriptors the child may hold open at once
    resource.setrlimit(resource.RLIMIT_NOFILE, (32, 32))


def run_snippet(
    code: str,
    timeout: float = 10.0,
    cpu_seconds: int = 8,
    mem_mb: int = 128,
    allowed_env_keys: Optional[list[str]] = None,
) -> ExecResult:
    """
    Execute *code* in an isolated subprocess.

    Parameters
    ----------
    code:
        Python source to execute.
    timeout:
        Wall-clock timeout in seconds (SIGKILL after this).
    cpu_seconds:
        Hard CPU-time limit applied via RLIMIT_CPU inside the child.
    mem_mb:
        Virtual memory cap in megabytes applied via RLIMIT_AS.
    allowed_env_keys:
        Whitelist of environment variable names forwarded to the child.
        Defaults to a minimal safe set.
    """
    if allowed_env_keys is None:
        allowed_env_keys = ["PATH", "HOME", "LANG", "LC_ALL"]

    child_env = {k: os.environ.get(k, "") for k in allowed_env_keys if k in os.environ}

    # Write snippet to a temp file so the child argv is clean
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".py", delete=False, dir="/tmp"
    ) as f:
        f.write(textwrap.dedent(code))
        script_path = f.name

    mem_bytes = mem_mb * 1024 * 1024

    try:
        proc = subprocess.Popen(
            [sys.executable, script_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            env=child_env,
            preexec_fn=lambda: _set_limits(cpu_seconds, mem_bytes),
        )
        try:
            out, err = proc.communicate(timeout=timeout)
            return ExecResult(
                exit_code=proc.returncode,
                stdout=out.decode(errors="replace"),
                stderr=err.decode(errors="replace"),
                timed_out=False,
            )
        except subprocess.TimeoutExpired:
            proc.kill()
            out, err = proc.communicate()
            return ExecResult(
                exit_code=-1,
                stdout=out.decode(errors="replace"),
                stderr=err.decode(errors="replace"),
                timed_out=True,
            )
    finally:
        try:
            os.unlink(script_path)
        except OSError:
            pass

Quick smoke test to confirm the backend works:

from sandbox_exec import run_snippet

result = run_snippet("print(2 ** 10)")
assert result.exit_code == 0, f"unexpected exit: {result.stderr}"
assert result.stdout.strip() == "1024"
print(f"exit={result.exit_code} stdout={result.stdout.strip()}")

# Confirm env isolation: the child must NOT see PYTHONPATH or any injected key
result2 = run_snippet("import os; print(os.environ.get('PYTHONPATH', 'ABSENT'))")
assert "ABSENT" in result2.stdout
print(f"env_isolation={result2.stdout.strip()}")
print("sandbox_exec_ok")
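
To see one of these limits take effect, the sketch below applies the same RLIMIT_NOFILE cap of 32 to a child that keeps opening files until the kernel refuses. It assumes a POSIX system; count_openable_files and the CHILD program are hypothetical names for illustration and are not part of sandbox_exec.py.

```python
import resource
import subprocess
import sys

# Child program: open /dev/null repeatedly until the descriptor limit is hit,
# then report how many opens succeeded.
CHILD = """
import os
handles = []
try:
    while True:
        handles.append(open(os.devnull))
except OSError:
    print(len(handles))
"""


def count_openable_files(limit: int = 32) -> int:
    """Return how many files a child could open under RLIMIT_NOFILE=limit."""

    def apply_limit() -> None:
        # Runs in the child after fork and before exec, like _set_limits.
        resource.setrlimit(resource.RLIMIT_NOFILE, (limit, limit))

    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        preexec_fn=apply_limit,
        timeout=30,
    )
    return int(proc.stdout.strip())
```

The count comes out below the limit because stdin, stdout, and stderr already occupy descriptors before the snippet opens anything.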

Step 2: Wrap Execution in OpenTelemetry Spans

Every tool invocation gets its own span. The span carries attributes that mirror what OpenSandbox’s structured response returns [1]: exit code, whether the run timed out, stdout byte length, and the first 200 characters of stderr for quick triage.

# filename: otel_sandbox.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import StatusCode

from sandbox_exec import run_snippet, ExecResult
from typing import Optional


def configure_tracing(service_name: str = "langgraph-sandbox-agent") -> trace.Tracer:
    """Set up a console-exporting tracer. Swap ConsoleSpanExporter for
    OTLPSpanExporter to ship to Grafana Tempo, SigNoz, or any OTLP backend."""
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)


def traced_run(
    tracer: trace.Tracer,
    code: str,
    timeout: float = 10.0,
    cpu_seconds: int = 8,
    mem_mb: int = 128,
    allowed_env_keys: Optional[list[str]] = None,
) -> ExecResult:
    """Run *code* in the sandbox and emit a span for the invocation."""
    with tracer.start_as_current_span("sandbox.exec") as span:
        span.set_attribute("sandbox.code_length", len(code))
        span.set_attribute("sandbox.timeout_s", timeout)
        span.set_attribute("sandbox.mem_limit_mb", mem_mb)

        result = run_snippet(
            code,
            timeout=timeout,
            cpu_seconds=cpu_seconds,
            mem_mb=mem_mb,
            allowed_env_keys=allowed_env_keys,
        )

        span.set_attribute("sandbox.exit_code", result.exit_code)
        span.set_attribute("sandbox.timed_out", result.timed_out)
        # Encode before measuring so the attribute holds bytes, not characters
        span.set_attribute("sandbox.stdout_bytes", len(result.stdout.encode()))
        span.set_attribute("sandbox.stderr_preview", result.stderr[:200])

        if result.exit_code != 0 or result.timed_out:
            span.set_status(StatusCode.ERROR, description=result.stderr[:120])
        else:
            span.set_status(StatusCode.OK)

        return result
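
The attribute mapping can also be exercised without a tracer. The sketch below is a standalone illustration: span_attributes and is_error are hypothetical helpers, and ExecResult is redeclared so the block runs on its own. It measures stdout in encoded bytes, which differs from the character count as soon as the output contains non-ASCII text.

```python
from dataclasses import dataclass


@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str
    timed_out: bool


def span_attributes(result: ExecResult) -> dict:
    """Hypothetical helper: map an ExecResult to flat span attributes."""
    return {
        "sandbox.exit_code": result.exit_code,
        "sandbox.timed_out": result.timed_out,
        # Encode before measuring so the value is bytes, not characters.
        "sandbox.stdout_bytes": len(result.stdout.encode("utf-8")),
        # Keep only the first 200 characters of stderr for quick triage.
        "sandbox.stderr_preview": result.stderr[:200],
    }


def is_error(result: ExecResult) -> bool:
    """Mirror the status rule: a non-zero exit or a timeout means ERROR."""
    return result.exit_code != 0 or result.timed_out
```

For example, the six-character string "héllo\n" yields a stdout_bytes value of 7, because "é" takes two bytes in UTF-8.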

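Verify a span is emitted. The span still prints to the console; the check itself reads it back from an in-memory exporter:

from opentelemetry import trace
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from otel_sandbox import configure_tracing, traced_run

tracer = configure_tracing()

# ConsoleSpanExporter binds sys.stdout when the class is defined, so
# reassigning sys.stdout afterwards does not capture its output. Attach an
# in-memory exporter to the active provider and inspect finished spans instead.
memory_exporter = InMemorySpanExporter()
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(memory_exporter))

result = traced_run(tracer, "print('hello from sandbox')")

spans = memory_exporter.get_finished_spans()
assert result.exit_code == 0
assert any(s.name == "sandbox.exec" for s in spans), f"span missing: {spans}"
assert "sandbox.exit_code" in spans[-1].attributes
print("span_emission_ok")
print(f"snippet stdout: {result.stdout.strip()}")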

Step 3: Define the LangGraph Tool Node

The tool node wraps traced_run in the LangGraph ToolNode pattern. The agent can call run_python with a code string and receive the stdout (or an error message) as the tool result.

# filename: agent_tools.py
import json
from langchain_core.tools import tool
from otel_sandbox import traced_run, configure_tracing

# Module-level tracer; call set_tracer() after importing to replace it
_tracer = configure_tracing()


def set_tracer(t):
    global _tracer
    _tracer = t


@tool
def run_python(code: str) -> str:
    """Execute Python code in an isolated sandbox and return stdout.

    Args:
        code: Valid Python source code to execute.
    """
    result = traced_run(
        _tracer,
        code,
        timeout=10.0,
        cpu_seconds=8,
        mem_mb=128,
    )
    if result.timed_out:
        return "ERROR: execution timed out after 10 seconds"
    if result.exit_code != 0:
        return f"ERROR (exit {result.exit_code}): {result.stderr[:400]}"
    return result.stdout or "(no output)"

Confirm the tool schema is correct:

import json
from agent_tools import run_python

schema = run_python.args_schema.schema()
assert "code" in schema["properties"], f"missing 'code' in schema: {schema}"
print("tool_schema:", json.dumps(schema, indent=2))
print("tool_schema_ok")

Step 4: Assemble the LangGraph Agent Graph

The graph has two nodes: a model node that decides whether to call a tool and a tool node that dispatches to run_python. A conditional edge routes from the model node either to the tool node or to LangGraph’s END sentinel, which terminates the run. The model is injected at call time so the structural test runs without an API key.

# filename: agent_graph.py
from typing import Annotated, Sequence
from typing_extensions import TypedDict

from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, ToolMessage
from langchain_core.language_models import BaseChatModel
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

from agent_tools import run_python

TOOLS = [run_python]


class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]


def build_graph(model: BaseChatModel):
    """Build and compile the agent graph with the given model.

    The model is bound to TOOLS so it can emit tool-call messages.
    """
    bound_model = model.bind_tools(TOOLS)

    def call_model(state: AgentState) -> dict:
        response = bound_model.invoke(state["messages"])
        return {"messages": [response]}

    def should_continue(state: AgentState) -> str:
        last = state["messages"][-1]
        if isinstance(last, AIMessage) and last.tool_calls:
            return "tools"
        return END

    tool_node = ToolNode(TOOLS)

    builder = StateGraph(AgentState)
    builder.add_node("model", call_model)
    builder.add_node("tools", tool_node)
    builder.set_entry_point("model")
    builder.add_conditional_edges("model", should_continue, {"tools": "tools", END: END})
    builder.add_edge("tools", "model")

    return builder.compile()
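
The control flow this graph encodes can be sketched in plain Python. The block below is an illustration of the routing logic only, not the LangGraph API: messages are plain dicts, run_loop and scripted_model are hypothetical names, and max_steps is an assumed safety bound on the loop.

```python
from typing import Callable

Message = dict  # e.g. {"role": "ai", "content": "", "tool_calls": [...]}


def should_continue(messages: list[Message]) -> str:
    """Route to "tools" when the last message requests a tool call."""
    last = messages[-1]
    if last["role"] == "ai" and last.get("tool_calls"):
        return "tools"
    return "end"


def run_loop(
    model: Callable[[list[Message]], Message],
    tool: Callable[[str], str],
    messages: list[Message],
    max_steps: int = 10,
) -> list[Message]:
    """Alternate model and tool steps until the model stops calling tools."""
    messages = list(messages)
    for _ in range(max_steps):
        messages.append(model(messages))
        if should_continue(messages) == "end":
            break
        # Run every requested call and feed the results back to the model.
        for call in messages[-1]["tool_calls"]:
            messages.append({"role": "tool", "content": tool(call["code"])})
    return messages


def scripted_model(messages: list[Message]) -> Message:
    """Stand-in model: request one tool call, then answer with its result."""
    if messages[-1]["role"] == "tool":
        return {"role": "ai", "content": f"Result: {messages[-1]['content']}"}
    return {"role": "ai", "content": "", "tool_calls": [{"code": "print(2 ** 10)"}]}
```

With a fake tool that always returns "1024", the message roles come out as human, ai, tool, ai, which matches the model → tools → model → END path in the compiled graph.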

Verify the graph compiles and has the expected nodes, without touching any API:

from langchain_core.language_models import BaseChatModel
from langchain_core.messages import AIMessage
from agent_graph import build_graph


class _StubModel(BaseChatModel):
    """Minimal stub that never calls a real API."""

    @property
    def _llm_type(self) -> str:
        return "stub"

    def bind_tools(self, tools, **kwargs):
        # BaseChatModel.bind_tools raises NotImplementedError by default;
        # the structural test only needs the call to succeed.
        return self

    def _generate(self, messages, stop=None, run_manager=None, **kwargs):
        from langchain_core.outputs import ChatGeneration, ChatResult
        return ChatResult(generations=[ChatGeneration(message=AIMessage(content="ok"))])


graph = build_graph(_StubModel())
nodes = list(graph.get_graph().nodes.keys())
print("nodes:", nodes)
assert "model" in nodes
assert "tools" in nodes
print("graph_compile_ok")

Step 5: Run the Agent with a Real Model

This step requires an Anthropic API key. Set it before running:

export ANTHROPIC_API_KEY="sk-ant-your-key-here"

With the key set, invoke the agent on a task that requires code execution:

import os
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage
from agent_graph import build_graph

model = ChatAnthropic(model="claude-3-5-haiku-20241022", temperature=0)
graph = build_graph(model)

result = graph.invoke({
    "messages": [
        HumanMessage(
            content="Use the run_python tool to compute the first 10 Fibonacci numbers "
                    "and print them as a comma-separated list."
        )
    ]
})

final = result["messages"][-1].content
print("Agent response:", final)

Verify it works

Run the full verification script. It exercises the sandbox backend, the span emitter, and the graph structure in sequence, all without an API key.

import io
import sys
import json
from sandbox_exec import run_snippet
from otel_sandbox import configure_tracing, traced_run
from agent_tools import run_python
from agent_graph import build_graph
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import AIMessage
from langchain_core.outputs import ChatGeneration, ChatResult

# 1. Sandbox isolation
r = run_snippet("import os; print(os.environ.get('ANTHROPIC_API_KEY', 'ABSENT'))")
assert "ABSENT" in r.stdout, "env leak detected"
print("[PASS] env isolation")

# 2. Timeout enforcement
r2 = run_snippet("import time; time.sleep(60)", timeout=2.0)
assert r2.timed_out
print("[PASS] timeout enforcement")

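# 3. Span emission
# ConsoleSpanExporter binds sys.stdout at definition time, so swapping
# sys.stdout does not capture spans; read them from an in-memory exporter.
from opentelemetry import trace
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

tracer = configure_tracing()
mem = InMemorySpanExporter()
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(mem))
traced_run(tracer, "x = 1 + 1")
assert any(s.name == "sandbox.exec" for s in mem.get_finished_spans())
print("[PASS] span emission")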

# 4. Tool schema
schema = run_python.args_schema.schema()
assert "code" in schema["properties"]
print("[PASS] tool schema")

# 5. Graph structure
class _Stub(BaseChatModel):
    @property
    def _llm_type(self): return "stub"
    def bind_tools(self, tools, **kwargs):
        return self  # the base class raises NotImplementedError
    def _generate(self, messages, stop=None, run_manager=None, **kwargs):
        return ChatResult(generations=[ChatGeneration(message=AIMessage(content="ok"))])

graph = build_graph(_Stub())
assert "model" in graph.get_graph().nodes
assert "tools" in graph.get_graph().nodes
print("[PASS] graph structure")

print("\nAll verification checks passed.")

Troubleshooting

rlimit errors on macOS with low mem_mb values. macOS handles RLIMIT_AS differently from Linux; the child process may crash immediately if the limit is below the Python interpreter’s own startup footprint. Set mem_mb to at least 256 on macOS, or remove the RLIMIT_AS call from _set_limits and rely on the wall-clock timeout alone.
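
One way to apply the second option without editing the limits by hand is to choose them per platform. This is a minimal sketch under the assumption that skipping RLIMIT_AS on macOS is acceptable for your workload; limits_for and set_limits_portable are hypothetical names, not part of sandbox_exec.py.

```python
import resource
import sys


def limits_for(
    platform: str, cpu_seconds: int, mem_bytes: int
) -> dict[int, tuple[int, int]]:
    """Hypothetical helper: choose which rlimits to apply on a platform."""
    limits = {
        resource.RLIMIT_CPU: (cpu_seconds, cpu_seconds),
        resource.RLIMIT_NOFILE: (32, 32),
    }
    # Skip the address-space cap on macOS and rely on the wall-clock timeout.
    if platform != "darwin":
        limits[resource.RLIMIT_AS] = (mem_bytes, mem_bytes)
    return limits


def set_limits_portable(cpu_seconds: int, mem_bytes: int) -> None:
    """Variant of _set_limits intended for use as a preexec_fn."""
    for which, value in limits_for(sys.platform, cpu_seconds, mem_bytes).items():
        resource.setrlimit(which, value)
```

Call set_limits_portable only inside the child (through preexec_fn); calling it in the parent would cap the agent process itself.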

ModuleNotFoundError inside the sandbox snippet. The child process inherits only the whitelisted environment variables, not PYTHONPATH. Third-party packages installed in the parent’s virtualenv are still importable because sys.executable points to the same interpreter, but if you use a non-standard site-packages path, add it explicitly inside the snippet with sys.path.insert.

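Spans appear in the console but are out of order. SimpleSpanProcessor exports each span synchronously when it ends, so a child span prints before the parent span that encloses it. This ordering is expected; tracing backends reassemble the trace from span and parent IDs. If you switch to BatchSpanProcessor for a production pipeline, call provider.force_flush() before your process exits to drain the buffer.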

ChatAnthropic raises AuthenticationError. Constructing the client does not contact the API, so a missing, invalid, or revoked key typically surfaces as an authentication error on the first request rather than at construction time. Confirm ANTHROPIC_API_KEY is exported in the same shell session before running Step 5.

The agent loops without calling the tool. bind_tools must receive the same list that ToolNode was constructed with. In agent_graph.py both read from the TOOLS list, so if you add a second tool later, define it in agent_tools.py and append it to TOOLS rather than passing separate lists to bind_tools and ToolNode.

OpenSandbox Docker runtime not reachable. The real OpenSandbox server requires Docker and listens on localhost:8080 by default [1]. Run uvx opensandbox-server in a separate terminal, then replace run_snippet calls with the opensandbox Python SDK’s sandbox.command.run() method. The span attributes and LangGraph wiring in this tutorial remain unchanged.

Next steps

  • Swap in the real OpenSandbox runtime. Install opensandbox and opensandbox-cli [1], start the server with uvx opensandbox-server, and replace run_snippet with sandbox.command.run(). The tracer and graph code need no changes.
  • Ship spans to SigNoz or Grafana Tempo. Replace ConsoleSpanExporter in configure_tracing with OTLPSpanExporter(endpoint="http://localhost:4317") and start a local collector. The span attribute names stay the same.
  • Add per-tool egress policy. OpenSandbox’s network policy layer [1] lets you block outbound HTTP from inside a sandbox at the runtime level. Mirror this in the subprocess backend by adding a network parameter to run_snippet and using Linux network namespaces via unshare when network=False.
  • Parallelize tool calls. LangGraph supports fan-out edges. Route multiple run_python calls to a Send-based parallel tool node so the agent can execute independent snippets concurrently, each with its own span.
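
As a rough picture of what concurrent snippet execution involves, the sketch below runs independent snippets in parallel with a thread pool. It is plain Python, not the LangGraph Send API, and run_one and run_many are hypothetical names. It omits the resource limits on purpose: the Python documentation warns that preexec_fn is not safe to use in a threaded program, so a threaded backend would need another way to apply them.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(code: str, timeout: float = 10.0) -> str:
    """Run one snippet in its own interpreter process and return stdout."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout


def run_many(snippets: list[str], max_workers: int = 4) -> list[str]:
    """Run independent snippets concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, snippets))
```

Each snippet still gets its own process, so a crash in one does not affect the others.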

FAQ

How does the sandbox prevent a code snippet from reading environment variables?

The subprocess backend accepts an allowed_env_keys whitelist that defaults to a minimal safe set (PATH, HOME, LANG, LC_ALL). Only those keys are forwarded to the child process; all others, including ANTHROPIC_API_KEY and PYTHONPATH, are stripped before the child starts.
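
The filtering step can be checked in isolation. The sketch below is a standalone illustration with hypothetical helper names (filtered_env, child_sees) and a made-up variable DEMO_SECRET; it builds the child environment the same way run_snippet does and asks a child interpreter what it can see.

```python
import os
import subprocess
import sys


def filtered_env(allowed: list[str], source: dict[str, str]) -> dict[str, str]:
    """Keep only whitelisted keys that are present in source."""
    return {k: source[k] for k in allowed if k in source}


def child_sees(var: str, env: dict[str, str]) -> str:
    """Ask a child interpreter for var, returning 'ABSENT' if it is unset."""
    code = f"import os; print(os.environ.get({var!r}, 'ABSENT'))"
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        env=env,
        timeout=30,
    )
    return proc.stdout.strip()


# Simulate a parent environment that holds a secret.
parent_env = dict(os.environ, DEMO_SECRET="s3cret")
safe_env = filtered_env(["PATH", "HOME", "LANG", "LC_ALL"], parent_env)
```

Calling child_sees("DEMO_SECRET", safe_env) returns "ABSENT", while the same call with parent_env returns the secret.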

What resource limits are enforced on each sandbox execution?

The _set_limits function applies RLIMIT_CPU for hard CPU-time limits, RLIMIT_AS for virtual memory caps, and RLIMIT_NOFILE to restrict open file descriptors to 32. A separate wall-clock timeout (default 10 seconds) enforces maximum execution duration via SIGKILL.
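
The wall-clock timeout is needed in addition to RLIMIT_CPU because a sleeping or I/O-blocked process consumes almost no CPU time, so the CPU limit never fires. A minimal sketch of that case, using a hypothetical run_with_deadline helper:

```python
import subprocess
import sys
import time


def run_with_deadline(code: str, timeout: float) -> tuple[bool, float]:
    """Return (timed_out, elapsed_seconds) for a child run under a deadline."""
    start = time.monotonic()
    try:
        # subprocess.run kills the child itself when the timeout expires.
        subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        timed_out = True
    return timed_out, time.monotonic() - start
```

A snippet such as "import time; time.sleep(30)" run with timeout=1.0 comes back flagged as timed out after roughly one second, even though it used almost no CPU.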

How are sandbox executions traced in OpenTelemetry?

Each call to traced_run emits a span named ‘sandbox.exec’ with attributes for exit code, timeout status, stdout byte length, and stderr preview. The span status is set to ERROR if the exit code is nonzero or the run timed out, otherwise OK. The same span structure works with any OTLP backend by swapping the exporter.

Can this approach work with the real OpenSandbox runtime instead of subprocess?

Yes. The tutorial uses subprocess because the sandbox environment lacks Docker, but the OpenTelemetry instrumentation layer is identical. Replace run_snippet calls with OpenSandbox’s sandbox.command.run() method and the tracer and LangGraph graph code require no changes.

Why does the agent loop without calling the tool?

The model must be bound to the same tool list that ToolNode was constructed with. Both read from the TOOLS list in agent_graph.py, so add any new tool to that list to keep the two in sync.