Why this matters

LiteLLM’s proxy mode turns a single OpenAI-compatible endpoint into a gateway that can fan traffic across OpenAI, Mistral, Anthropic, and local vLLM instances [2]. Without a layer like this, multi-provider agent systems require provider-specific client code, separate retry logic, and ad-hoc cost accounting scattered across the codebase. When a provider goes down or a model gets deprecated, every call site needs a patch.

LangChain’s ChatOpenAI client, used here for tool calling [1], works with any OpenAI-compatible chat completions endpoint, which means the LiteLLM proxy slots in as a drop-in replacement. The combination lets you define routing rules, fallbacks, and budget caps in a single YAML file rather than in application code. This tutorial wires those two pieces together and adds a lightweight callback that writes cost and latency to a local JSONL file, giving you a structured audit trail without a commercial observability vendor.

Prerequisites

  • Python 3.11 or 3.12
  • At least one LLM API key (OpenAI or Mistral). A second key enables live fallback testing; the tutorial degrades gracefully with one.
  • Basic familiarity with LangChain agents and tool definitions
  • Docker is not required. The tutorial runs LiteLLM’s Router in-process through the Python SDK, so no Docker daemon is needed.

Setup

Install the required packages. The Router class and the Python SDK used in this tutorial ship in the main litellm package [2]; running the standalone proxy server additionally requires the litellm[proxy] extra.

uv pip install litellm langchain langchain-openai langchain-core

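Export your provider keys. The tutorial registers OpenAI and, when a key is present, Mistral under the same model name. If you only have an OpenAI key, leave MISTRAL_API_KEY unset; the code in Steps 4 and 5 then omits the Mistral entry. Copying the OpenAI key into MISTRAL_API_KEY would register a Mistral deployment that fails authentication.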

export OPENAI_API_KEY="sk-..."
export MISTRAL_API_KEY="..."
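
Before constructing anything, it can help to check which keys the current shell actually exposes. The sketch below is a minimal check that assumes only the two variable names used above; the helper name enabled_providers is illustrative and not part of LiteLLM or LangChain.

```python
import os
from typing import Mapping

# Environment variable per provider, as used in this tutorial.
KEY_VARS = {"openai": "OPENAI_API_KEY", "mistral": "MISTRAL_API_KEY"}


def enabled_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the providers whose key variable is set to a non-blank value."""
    return [name for name, var in KEY_VARS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    print("Providers with keys:", enabled_providers() or "none")
```

Passing a plain dict instead of os.environ makes the function easy to check without touching the real environment.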

Step 1: Write the LiteLLM proxy configuration

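The proxy reads a YAML file that declares models, routing settings, and optional budget limits. Entries that share a model_name form one group; with routing_strategy set to simple-shuffle, LiteLLM picks a deployment from that group at random for each request. The num_retries and timeout values in router_settings control how many times a failed call is retried and how long each call may take, in seconds [2].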

# filename: litellm_config.yaml
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: mistral-small
    litellm_params:
      model: mistral/mistral-small-latest
      api_key: os.environ/MISTRAL_API_KEY

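  # Two deployments share the name primary-agent-model, so the router can
  # serve a request from either provider. Remove the second entry if you
  # have no Mistral key.
  - model_name: primary-agent-model
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: primary-agent-model
    litellm_params:
      model: mistral/mistral-small-latest
      api_key: os.environ/MISTRAL_API_KEY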

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 30

litellm_settings:
  success_callback: []
  failure_callback: []
  drop_params: true
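
The api_key values above use the os.environ/NAME form, which tells LiteLLM to read the value from the environment when the config loads, so no secret is written into the file. The sketch below illustrates that convention only; resolve_env_ref is a hypothetical helper written for this example, not LiteLLM’s own loader.

```python
import os
from typing import Mapping

ENV_PREFIX = "os.environ/"


def resolve_env_ref(value: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve 'os.environ/NAME' to the value of NAME; pass other strings through."""
    if not value.startswith(ENV_PREFIX):
        return value
    name = value[len(ENV_PREFIX):]
    if name not in env:
        raise KeyError(f"environment variable {name} is not set")
    return env[name]


if __name__ == "__main__":
    sample_env = {"OPENAI_API_KEY": "sk-example"}
    print(resolve_env_ref("os.environ/OPENAI_API_KEY", sample_env))
    print(resolve_env_ref("literal-value", sample_env))
```

Failing loudly on a missing variable is a design choice for this sketch: it surfaces a configuration mistake at load time instead of on the first request.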

Step 2: Build the cost-and-latency callback

LangChain callbacks fire on every LLM call. The CostLatencyLogger below captures the wall-clock time around each call and asks LiteLLM’s completion_cost helper for the token cost. Each record is appended as a JSON line to traces.jsonl.

# filename: callbacks.py
import json
import time
from pathlib import Path
from typing import Any

from langchain_core.callbacks.base import BaseCallbackHandler

TRACE_FILE = Path("/workspace/traces.jsonl")


class CostLatencyLogger(BaseCallbackHandler):
    """Appends one JSON record per LLM call to traces.jsonl."""

    def __init__(self):
        super().__init__()
        self._start: dict[str, float] = {}

    def on_llm_start(
        self, serialized: dict[str, Any], prompts: list[str], **kwargs: Any
    ) -> None:
        run_id = str(kwargs.get("run_id", "unknown"))
        self._start[run_id] = time.perf_counter()

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", "unknown"))
        elapsed = time.perf_counter() - self._start.pop(run_id, time.perf_counter())

        usage = {}
        cost_usd = 0.0
        model_name = "unknown"

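        try:
            # ChatOpenAI reports token usage and the model name in
            # response.llm_output; generation_info is kept as a fallback.
            llm_output = getattr(response, "llm_output", None) or {}
            usage = llm_output.get("token_usage") or {}
            model_name = llm_output.get("model_name") or "unknown"
            if not usage:
                gen = response.generations[0][0]
                info = getattr(gen, "generation_info", None) or {}
                model_name = info.get("model", model_name)
                usage = info.get("usage", {})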

            # LiteLLM exposes completion_cost for known models
            import litellm
            if usage and model_name != "unknown":
                cost_usd = litellm.completion_cost(
                    completion_response={
                        "model": model_name,
                        "usage": {
                            "prompt_tokens": usage.get("prompt_tokens", 0),
                            "completion_tokens": usage.get("completion_tokens", 0),
                            "total_tokens": usage.get("total_tokens", 0),
                        },
                    }
                )
        except Exception:
            pass

        record = {
            "run_id": run_id,
            "model": model_name,
            "latency_s": round(elapsed, 4),
            "cost_usd": round(cost_usd, 8),
            "prompt_tokens": usage.get("prompt_tokens", 0),
            "completion_tokens": usage.get("completion_tokens", 0),
        }
        with TRACE_FILE.open("a") as fh:
            fh.write(json.dumps(record) + "\n")
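
The cost figure in each record comes down to token counts multiplied by per-token prices. The sketch below works through that arithmetic with invented prices; the numbers in HYPOTHETICAL_PRICES are placeholders for illustration, not real provider rates, and the callback itself relies on LiteLLM’s pricing lookup rather than on a table like this.

```python
# Prices are USD per one million tokens and are invented for this example.
HYPOTHETICAL_PRICES = {
    "example-model": {"input": 0.50, "output": 2.00},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return an estimated USD cost, or 0.0 when the model has no price entry."""
    price = HYPOTHETICAL_PRICES.get(model)
    if price is None:
        return 0.0
    return (
        prompt_tokens * price["input"] + completion_tokens * price["output"]
    ) / 1_000_000


if __name__ == "__main__":
    # 1000 prompt tokens and 500 completion tokens at the placeholder rates.
    print(f"{estimate_cost('example-model', 1000, 500):.6f} USD")
```

Returning 0.0 for an unknown model mirrors the callback’s behaviour of leaving cost_usd at zero when no price is available.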

Step 3: Define the agent tools

Two simple tools give the agent something to call. get_weather and calculate are pure Python functions decorated with @tool from LangChain [1]. In a real system these would call external APIs, but keeping them local means the tutorial runs without extra credentials.

# filename: tools.py
from langchain_core.tools import tool


@tool
def get_weather(city: str) -> str:
    """Return a mock current weather report for a city."""
    data = {
        "london": "12°C, overcast",
        "paris": "18°C, sunny",
        "new york": "22°C, partly cloudy",
        "tokyo": "25°C, humid",
    }
    return data.get(city.lower(), f"No weather data available for {city}.")


@tool
def calculate(expression: str) -> str:
    """Evaluate a simple arithmetic expression and return the result."""
    allowed = set("0123456789+-*/(). ")
    if not all(c in allowed for c in expression):
        return "Error: only basic arithmetic is supported."
    try:
        result = eval(expression, {"__builtins__": {}})  # noqa: S307
        return str(result)
    except Exception as exc:
        return f"Error: {exc}"
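
The character whitelist in calculate still admits the ** operator, because * is an allowed character, so an input such as 9**9**9 could tie up the CPU. One alternative is to walk the parsed syntax tree and accept only the operators you intend to support. The sketch below is a minimal version of that idea; safe_eval is a name chosen for this example and is not part of the tutorial’s tools.py.

```python
import ast
import operator

# Binary and unary operators the evaluator accepts; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / arithmetic on numeric literals without eval()."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError("only basic arithmetic is supported")

    return walk(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    print(safe_eval("17 * 43 + 100"))
    print(safe_eval("512 / 8"))
```

Because exponentiation, names, and function calls are not in the accepted node types, they raise ValueError instead of being executed.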

Step 4: Start the LiteLLM router in-process and wire the agent

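Instead of running a separate proxy server process, this step builds LiteLLM’s Router class directly inside the same Python process, which keeps the same routing and retry semantics without Docker or a background server [2]. One caveat applies to the LangChain side: ChatOpenAI sends HTTP requests, so it cannot call the in-process router’s completion method itself. For its traffic to pass through LiteLLM, base_url has to point at a running LiteLLM proxy, for example one started with litellm --config litellm_config.yaml. Step 5 exercises the in-process router directly and needs no server.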

# filename: agent.py
import json
import os
from pathlib import Path

import litellm
from litellm import Router
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

from callbacks import CostLatencyLogger
from tools import calculate, get_weather

# Silence LiteLLM's verbose logging for cleaner output
litellm.set_verbose = False

# Build the in-process router from the same model list the YAML would use.
# If MISTRAL_API_KEY is absent or empty, omit that entry gracefully.
model_list = [
    {
        "model_name": "primary-agent-model",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
        },
    },
]

mistral_key = os.environ.get("MISTRAL_API_KEY", "")
if mistral_key:
    model_list.append(
        {
            "model_name": "primary-agent-model",
            "litellm_params": {
                "model": "mistral/mistral-small-latest",
                "api_key": mistral_key,
            },
        }
    )

router = Router(
    model_list=model_list,
    routing_strategy="simple-shuffle",
    num_retries=2,
    timeout=30,
)


def build_agent():
    """Return a LangChain agent bound to the LiteLLM router."""
    llm = ChatOpenAI(
        model="primary-agent-model",
        temperature=0,
        callbacks=[CostLatencyLogger()],
        # ChatOpenAI sends HTTP requests, so it cannot call the in-process
        # Router directly. base_url must point at a running LiteLLM proxy;
        # LITELLM_PROXY_URL is an assumed variable name, and port 4000 is
        # assumed to be where the proxy listens.
        base_url=os.environ.get("LITELLM_PROXY_URL", "http://localhost:4000/v1"),
        api_key=os.environ.get("OPENAI_API_KEY", "placeholder"),
    )
    agent = llm.bind_tools([get_weather, calculate])
    return agent


def run_turn(agent, user_message: str) -> str:
    """Run one agent turn and return the final text response."""
    messages = [HumanMessage(content=user_message)]

    # First call: may produce a tool_call
    response = agent.invoke(messages)
    messages.append(response)

    # If the model requested tool calls, execute them and feed results back
    from langchain_core.messages import ToolMessage

    tool_map = {"get_weather": get_weather, "calculate": calculate}
    while response.tool_calls:
        for tc in response.tool_calls:
            tool_fn = tool_map.get(tc["name"])
            # Every tool call needs a reply, even for an unknown tool;
            # an unanswered call makes the next request invalid.
            result = (
                tool_fn.invoke(tc["args"]) if tool_fn else f"Unknown tool: {tc['name']}"
            )
            messages.append(
                ToolMessage(content=str(result), tool_call_id=tc["id"])
            )
        response = agent.invoke(messages)
        messages.append(response)

    return response.content

Step 5: Run the agent with a direct LiteLLM router call

The simplest server-free approach calls the LiteLLM router directly for each completion and runs the tool loop by hand. The block below uses router.completion() with explicit OpenAI-style tool schemas, so routing and retries are exercised without a live HTTP server.

# filename: run_agent.py
import json
import os
import time
from pathlib import Path

import litellm
from litellm import Router

from callbacks import CostLatencyLogger, TRACE_FILE
from tools import get_weather, calculate

litellm.set_verbose = False

# Build router
model_list = [
    {
        "model_name": "primary-agent-model",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
        },
    },
]
mistral_key = os.environ.get("MISTRAL_API_KEY", "")
if mistral_key:
    model_list.append(
        {
            "model_name": "primary-agent-model",
            "litellm_params": {
                "model": "mistral/mistral-small-latest",
                "api_key": mistral_key,
            },
        }
    )

router = Router(
    model_list=model_list,
    routing_strategy="simple-shuffle",
    num_retries=2,
    timeout=30,
)

TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return a mock current weather report for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a simple arithmetic expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    },
]


def run_with_router(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    t0 = time.perf_counter()

    response = router.completion(
        model="primary-agent-model",
        messages=messages,
        tools=TOOL_SCHEMAS,
        tool_choice="auto",
    )
    elapsed = time.perf_counter() - t0

    choice = response.choices[0]
    msg = choice.message
    messages.append(msg.model_dump() if hasattr(msg, "model_dump") else dict(msg))

    # Log cost + latency
    usage = response.usage if hasattr(response, "usage") else None
    cost_usd = 0.0
    try:
        cost_usd = litellm.completion_cost(completion_response=response)
    except Exception:
        pass

    record = {
        "model": response.model,
        "latency_s": round(elapsed, 4),
        "cost_usd": round(cost_usd, 8),
        "prompt_tokens": usage.prompt_tokens if usage else 0,
        "completion_tokens": usage.completion_tokens if usage else 0,
    }
    with TRACE_FILE.open("a") as fh:
        fh.write(json.dumps(record) + "\n")

    # Execute tool calls if present
    tool_calls = msg.tool_calls if hasattr(msg, "tool_calls") and msg.tool_calls else []
    while tool_calls:
        tool_map = {"get_weather": get_weather, "calculate": calculate}
        for tc in tool_calls:
            fn_name = tc.function.name
            fn_args = json.loads(tc.function.arguments)
            tool_fn = tool_map.get(fn_name)
            result = tool_fn.invoke(fn_args) if tool_fn else f"Unknown tool: {fn_name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": str(result),
            })

        t0 = time.perf_counter()
        response = router.completion(
            model="primary-agent-model",
            messages=messages,
            tools=TOOL_SCHEMAS,
            tool_choice="auto",
        )
        elapsed = time.perf_counter() - t0
        choice = response.choices[0]
        msg = choice.message
        messages.append(msg.model_dump() if hasattr(msg, "model_dump") else dict(msg))

        cost_usd = 0.0
        try:
            cost_usd = litellm.completion_cost(completion_response=response)
        except Exception:
            pass
        usage = response.usage if hasattr(response, "usage") else None
        record = {
            "model": response.model,
            "latency_s": round(elapsed, 4),
            "cost_usd": round(cost_usd, 8),
            "prompt_tokens": usage.prompt_tokens if usage else 0,
            "completion_tokens": usage.completion_tokens if usage else 0,
        }
        with TRACE_FILE.open("a") as fh:
            fh.write(json.dumps(record) + "\n")

        tool_calls = msg.tool_calls if hasattr(msg, "tool_calls") and msg.tool_calls else []

    return msg.content or ""


if __name__ == "__main__":
    questions = [
        "What is the weather like in Paris?",
        "Calculate 17 * 43 + 100",
        "What is the weather in Tokyo and what is 512 / 8?",
    ]
    for q in questions:
        print(f"Q: {q}")
        answer = run_with_router(q)
        print(f"A: {answer}\n")
    print(f"Traces written to {TRACE_FILE}")
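
run_with_router builds and writes the same trace record in two places. One way to remove that duplication is a small helper like the sketch below. The name append_trace and its parameters are assumptions made for illustration; the record fields match the ones the tutorial writes.

```python
import json
from pathlib import Path


def append_trace(
    path: Path,
    model: str,
    latency_s: float,
    cost_usd: float,
    prompt_tokens: int = 0,
    completion_tokens: int = 0,
) -> dict:
    """Append one trace record as a JSON line and return the record."""
    record = {
        "model": model,
        "latency_s": round(latency_s, 4),
        "cost_usd": round(cost_usd, 8),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
    # Create the parent directory so a fresh checkout does not fail here.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Each call site in the loop could then shrink to a single append_trace(TRACE_FILE, ...) call after timing the request.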

Verify it works

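This verification block runs without API keys. It calls the two tools directly and round-trips a synthetic record through the trace file, confirming the local plumbing works before you point it at a live provider. The make_fake_response and fake_completion helpers show how a canned tool-call response can be built for patching router.completion; the block as listed defines them but does not invoke them.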

import json
import time
from pathlib import Path
from unittest.mock import MagicMock, patch

# Modules under test; nothing in this block touches the network
from callbacks import TRACE_FILE
from tools import get_weather, calculate

# Clear any existing trace file
TRACE_FILE.unlink(missing_ok=True)

# Build a fake completion response that asks for get_weather
def make_fake_response(content=None, tool_calls=None, model="gpt-4o-mini"):
    msg = MagicMock()
    msg.content = content
    msg.tool_calls = tool_calls or []
    if hasattr(msg, "model_dump"):
        msg.model_dump.return_value = {
            "role": "assistant",
            "content": content,
            "tool_calls": [],
        }
    choice = MagicMock()
    choice.message = msg
    resp = MagicMock()
    resp.choices = [choice]
    resp.model = model
    resp.usage = MagicMock(prompt_tokens=10, completion_tokens=5)
    return resp


# Simulate: first call returns tool invocation, second returns final answer
call_count = 0

def fake_completion(**kwargs):
    global call_count
    call_count += 1
    if call_count == 1:
        tc = MagicMock()
        tc.id = "call_abc"
        tc.function = MagicMock()
        tc.function.name = "get_weather"
        tc.function.arguments = json.dumps({"city": "London"})
        return make_fake_response(tool_calls=[tc])
    else:
        return make_fake_response(content="The weather in London is 12°C, overcast.")


# Directly test the tool
weather_result = get_weather.invoke({"city": "London"})
assert "12" in weather_result, f"Unexpected: {weather_result}"
print(f"Tool result: {weather_result}")

calc_result = calculate.invoke({"expression": "17 * 43 + 100"})
assert calc_result == "831", f"Unexpected: {calc_result}"
print(f"Calc result: {calc_result}")

# Write a synthetic trace record and verify the file
record = {
    "model": "gpt-4o-mini",
    "latency_s": 0.312,
    "cost_usd": 0.00000420,
    "prompt_tokens": 120,
    "completion_tokens": 35,
}
with TRACE_FILE.open("a") as fh:
    fh.write(json.dumps(record) + "\n")

lines = TRACE_FILE.read_text().strip().splitlines()
assert len(lines) == 1
parsed = json.loads(lines[0])
assert parsed["model"] == "gpt-4o-mini"
assert "latency_s" in parsed
assert "cost_usd" in parsed

print(f"Trace file has {len(lines)} record(s)")
print(f"Sample record: {json.dumps(parsed, indent=2)}")
print("verify_ok")
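
The helpers above build canned responses, but the block never feeds them through a tool loop. The sketch below shows one way that could look without any network access: a stand-in completion function returns a tool call first and a final answer second, and a simplified loop dispatches the tool. Everything here (fake_completion, run_loop, the plain-dict message shape, and the local get_weather stub) is written for this example and only mirrors the structure of run_with_router; unlike the real loop, it does not append the assistant message to the history.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    """Local stub standing in for the tutorial's get_weather tool."""
    return {"london": "12°C, overcast"}.get(city.lower(), f"No data for {city}.")


def make_response(content=None, tool_calls=None):
    """Build an object shaped like a chat completion response."""
    message = SimpleNamespace(content=content, tool_calls=tool_calls or [])
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def fake_completion(messages):
    """Request a tool on the first turn, then answer from the tool result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        call = SimpleNamespace(
            id="call_abc",
            function=SimpleNamespace(
                name="get_weather", arguments=json.dumps({"city": "London"})
            ),
        )
        return make_response(tool_calls=[call])
    return make_response(content=f"London: {tool_results[-1]['content']}")


def run_loop(complete, tools, user_message, max_rounds=5):
    """Call `complete` until it stops requesting tools or max_rounds is hit."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_rounds):
        msg = complete(messages).choices[0].message
        if not msg.tool_calls:
            return msg.content or ""
        for tc in msg.tool_calls:
            fn = tools.get(tc.function.name)
            args = json.loads(tc.function.arguments)
            result = fn(**args) if fn else f"Unknown tool: {tc.function.name}"
            messages.append(
                {"role": "tool", "tool_call_id": tc.id, "content": str(result)}
            )
    raise RuntimeError("tool loop did not finish within max_rounds")


answer = run_loop(fake_completion, {"get_weather": get_weather}, "Weather in London?")
print(answer)  # London: 12°C, overcast
```

The max_rounds bound is a deliberate addition: it stops the loop if a model keeps requesting tools indefinitely.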

To run against live providers, set your API keys and execute:

export OPENAI_API_KEY="sk-your-key-here"
python /workspace/run_agent.py

After the run, inspect the trace file:

cat /workspace/traces.jsonl

Each line is a JSON object with model, latency_s, cost_usd, prompt_tokens, and completion_tokens. You can pipe it through jq for a summary:

jq -s '[.[] | .cost_usd] | add' /workspace/traces.jsonl
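
If jq is not installed, the same kind of summary can be produced with the standard library. The sketch below groups records by model and reports call count, total cost, and mean latency; summarize is a name chosen for this example, and it assumes each line carries the fields described above.

```python
import json
from collections import defaultdict
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict[str, dict[str, float]]:
    """Aggregate call count, total cost, and mean latency per model."""
    totals: dict[str, dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "cost_usd": 0.0, "latency_s": 0.0}
    )
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        bucket = totals[rec.get("model", "unknown")]
        bucket["calls"] += 1
        bucket["cost_usd"] += rec.get("cost_usd", 0.0)
        bucket["latency_s"] += rec.get("latency_s", 0.0)
    return {
        model: {
            "calls": b["calls"],
            "cost_usd": b["cost_usd"],
            "mean_latency_s": b["latency_s"] / b["calls"],
        }
        for model, b in totals.items()
    }


if __name__ == "__main__":
    sample = [
        '{"model": "gpt-4o-mini", "latency_s": 1.0, "cost_usd": 0.5}',
        '{"model": "gpt-4o-mini", "latency_s": 3.0, "cost_usd": 0.25}',
    ]
    print(json.dumps(summarize(sample), indent=2))
```

To run it on real data, pass an open file handle for traces.jsonl in place of the sample list.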

Troubleshooting

ModuleNotFoundError: No module named 'langchain_openai'

The packages must be installed before any import runs. Run uv pip install litellm langchain langchain-openai langchain-core and confirm the command exits with status 0.

litellm.exceptions.AuthenticationError on the first call

The OPENAI_API_KEY environment variable is empty or not exported. Run echo $OPENAI_API_KEY to confirm it is set in the current shell session. The export keyword is required; OPENAI_API_KEY=sk-... without export is invisible to child processes.

Router sends calls to Mistral

With simple-shuffle and both deployments registered under primary-agent-model, the router picks between OpenAI and Mistral at random, so seeing Mistral in a share of the records is expected. If Mistral serves every call, the OpenAI deployment is probably failing, for example with a 429 (rate limit) or a 5xx. Check the model field in traces.jsonl to confirm which provider actually served each request. To pin one provider during testing, remove the Mistral entry from model_list or leave MISTRAL_API_KEY unset.

litellm.exceptions.BadRequestError: tools not supported

Some model versions do not support the tools parameter. Ensure the model string in litellm_params is openai/gpt-4o-mini or another tool-capable model. Mistral’s mistral-small-latest supports tool calling; older mistral-tiny does not.

traces.jsonl is empty after a run

The trace file is written by the on_llm_end callback or the explicit TRACE_FILE.open("a") call in run_agent.py. If the agent errors out before completing a turn, no record is written. Check the exception traceback and confirm the TRACE_FILE path (/workspace/traces.jsonl) is writable.

Router raises ValueError: No models available

This points to a model_list with no usable entries, typically because the api_key values are empty or missing. Confirm both OPENAI_API_KEY and (if used) MISTRAL_API_KEY are non-empty strings before constructing the Router instance.

Next steps

  • Add a local vLLM endpoint: Start a vLLM server with vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8000 and add an entry with model: openai/mistralai/Mistral-7B-Instruct-v0.3 and api_base: http://localhost:8000/v1 to the router’s model_list. The same agent code routes to it without changes [2].
  • Budget guardrails: LiteLLM provides a litellm.BudgetManager class for tracking spend against a cap. How budgets attach to the Router or the proxy depends on the LiteLLM version, so check the current documentation before wiring one in.
  • Structured trace analysis: Feed traces.jsonl into a DuckDB query (SELECT model, AVG(latency_s), SUM(cost_usd) FROM read_ndjson_auto('traces.jsonl') GROUP BY model) to get per-provider cost and latency breakdowns across a batch run.
  • LangGraph integration: Replace the manual tool loop in run_agent.py with a langgraph.prebuilt.create_react_agent call. A ChatOpenAI instance whose base_url points at a running LiteLLM proxy can be passed as the model argument [1].
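
As a starting point for the budget idea, a spend cap can also be approximated in plain Python on top of the cost values already written to the trace file. The sketch below is a hypothetical illustration, not LiteLLM’s budget API: BudgetGuard and BudgetExceeded are names invented here, and the check runs after a call’s cost is known, so it blocks further spend instead of preventing the call that crosses the cap.

```python
class BudgetExceeded(RuntimeError):
    """Raised when recording a cost would push spend past the cap."""


class BudgetGuard:
    """Track cumulative spend per model against a fixed USD cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent: dict[str, float] = {}

    def remaining(self, model: str) -> float:
        """Return how much of the cap is still unspent for this model."""
        return self.cap_usd - self.spent.get(model, 0.0)

    def record(self, model: str, cost_usd: float) -> None:
        """Add a call's cost, refusing it if the cap would be exceeded."""
        if cost_usd > self.remaining(model):
            raise BudgetExceeded(f"{model} would exceed the {self.cap_usd} USD cap")
        self.spent[model] = self.spent.get(model, 0.0) + cost_usd


if __name__ == "__main__":
    guard = BudgetGuard(cap_usd=1.0)
    guard.record("primary-agent-model", 0.25)
    print(guard.remaining("primary-agent-model"))
```

Calling guard.record(...) right after each completion would let the tool loop stop cleanly once a model’s cap is reached.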

FAQ

How does LiteLLM proxy enable multi-provider routing without code changes?

LiteLLM’s Router class accepts a model list in which several deployments can share one model_name. The router picks among them per request and retries failed or timed-out calls, so adding or swapping a provider means editing the model list rather than the call sites. LangChain’s ChatOpenAI client gets the same benefit when its base_url points at a LiteLLM proxy, because the proxy exposes an OpenAI-compatible endpoint.

Can the LiteLLM proxy run without Docker or a separate server process?

Yes. The tutorial uses LiteLLM’s Router class directly inside the Python process via its SDK, avoiding the need for a Docker daemon or background HTTP server. The router exposes a completion method that can be called directly from the agent code.

What information does the cost and latency callback write to the trace file?

The CostLatencyLogger callback writes a JSON record per LLM call containing the run ID, the model name, wall-clock latency in seconds, computed cost in USD, and token counts for both prompt and completion. Each record is appended as a line to traces.jsonl for structured audit logging.

How does the agent execute tool calls when the model requests them?

After the model returns a response with tool_calls, the agent loops through each call, looks up the corresponding function in a tool map, invokes it with the provided arguments, and sends the result back to the model as a ToolMessage. This continues until the model returns a final text response without tool calls.

What happens if only one API key is available instead of two?

The tutorial degrades gracefully. If MISTRAL_API_KEY is empty or unset, the model list omits the Mistral entry and the router uses only the OpenAI deployment. Leave the variable unset rather than copying the OpenAI key into it, since any non-empty value registers a Mistral deployment that would then fail authentication.