Why this matters
LiteLLM’s proxy mode turns a single OpenAI-compatible endpoint into a gateway that can fan traffic across OpenAI, Mistral, Anthropic, and local vLLM instances [2]. Without a layer like this, multi-provider agent systems require provider-specific client code, separate retry logic, and ad-hoc cost accounting scattered across the codebase. When a provider goes down or a model gets deprecated, every call site needs a patch.
LangChain’s ChatOpenAI client [1] speaks the OpenAI chat completions protocol, so a LiteLLM proxy can stand in for the OpenAI endpoint by changing only the base URL. The combination lets you define routing rules, fallbacks, and budget caps in a single YAML file rather than in application code. This tutorial wires those two pieces together and records cost and latency for every call in a local JSONL file, giving you a structured audit trail without a commercial observability vendor.
Prerequisites
- Python 3.11 or 3.12
- At least one LLM API key (OpenAI or Mistral). A second key enables live fallback testing; the tutorial degrades gracefully with one.
- Basic familiarity with LangChain agents and tool definitions
- Docker is not required. The main path of this tutorial runs LiteLLM’s Router class in-process through the Python SDK, so no Docker daemon is needed.
Setup
Install the required packages. The base litellm package covers the Python SDK and the Router class. The standalone proxy server ships as the optional litellm[proxy] extra [2], so installing the extra covers both the gateway and the SDK.
uv pip install "litellm[proxy]" langchain langchain-openai langchain-core
Export your provider keys. The tutorial registers OpenAI and Mistral under one model group. If you only have an OpenAI key, leave MISTRAL_API_KEY unset: the router code skips the Mistral entry when that variable is empty. Do not put an OpenAI key in MISTRAL_API_KEY, because Mistral will reject it.
export OPENAI_API_KEY="sk-..."
export MISTRAL_API_KEY="..."
Step 1: Write the LiteLLM proxy configuration
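The proxy reads a YAML file that declares models, routing strategy, and optional budget limits. Deployments that share a model_name form one routing group. With routing_strategy set to simple-shuffle, LiteLLM picks a deployment from the group at random for each request, and num_retries controls how many times a failed request is retried before the error reaches the caller [2]. The group named primary-agent-model is the one the agent calls.
# filename: litellm_config.yaml
model_list:
  # Direct aliases, useful for calling one provider explicitly.
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: mistral-small
    litellm_params:
      model: mistral/mistral-small-latest
      api_key: os.environ/MISTRAL_API_KEY
  # Routing group used by the agent: both entries share one model_name.
  - model_name: primary-agent-model
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  # Remove this entry (and mistral-small above) if MISTRAL_API_KEY is not set.
  - model_name: primary-agent-model
    litellm_params:
      model: mistral/mistral-small-latest
      api_key: os.environ/MISTRAL_API_KEY

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 30

litellm_settings:
  success_callback: []
  failure_callback: []
  drop_params: true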
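One way to picture a shuffle strategy combined with a retry count is the toy model below. It is an illustration written for this tutorial, not LiteLLM's implementation: the function names, the optional weight field, and the use of RuntimeError to signal a provider failure are all assumptions.

```python
import random


def pick_deployment(deployments, rng):
    """Pick one deployment at random, honoring an optional 'weight' field."""
    weights = [d.get("weight", 1) for d in deployments]
    return rng.choices(deployments, weights=weights, k=1)[0]


def call_with_retries(deployments, send, num_retries=2, rng=None):
    """Call send() on a random deployment, re-picking after each failure.

    Makes at most num_retries + 1 attempts and re-raises the last error
    if every attempt fails.
    """
    rng = rng or random.Random()
    last_error = None
    for _ in range(num_retries + 1):
        deployment = pick_deployment(deployments, rng)
        try:
            return send(deployment)
        except RuntimeError as exc:  # stand-in for a 5xx or timeout
            last_error = exc
    raise last_error


if __name__ == "__main__":
    group = [
        {"name": "openai/gpt-4o-mini"},
        {"name": "mistral/mistral-small-latest"},
    ]

    def send(deployment):
        # Simulate an OpenAI outage: only the Mistral deployment answers.
        if deployment["name"].startswith("openai/"):
            raise RuntimeError("simulated 503")
        return f"served by {deployment['name']}"

    print(call_with_retries(group, send, num_retries=2))
```

Because each retry re-picks at random, a small retry budget can still exhaust itself on the failing deployment. That is the practical difference between a shuffled group and an ordered primary-then-fallback chain.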
Step 2: Build the cost-and-latency callback
LangChain callbacks fire on every LLM call. The CostLatencyLogger below measures the wall-clock time between on_llm_start and on_llm_end, reads the token usage that ChatOpenAI attaches to the result, and asks LiteLLM’s cost_per_token helper to price it. Each record is appended as a JSON line to traces.jsonl.
# filename: callbacks.py
import json
import time
from pathlib import Path
from typing import Any

import litellm
from langchain_core.callbacks.base import BaseCallbackHandler

TRACE_FILE = Path("/workspace/traces.jsonl")


class CostLatencyLogger(BaseCallbackHandler):
    """Appends one JSON record per LLM call to traces.jsonl."""

    def __init__(self) -> None:
        super().__init__()
        # run_id -> perf_counter() reading taken when the call started
        self._start: dict[str, float] = {}

    def on_llm_start(
        self, serialized: dict[str, Any], prompts: list[str], **kwargs: Any
    ) -> None:
        run_id = str(kwargs.get("run_id", "unknown"))
        self._start[run_id] = time.perf_counter()

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        run_id = str(kwargs.get("run_id", "unknown"))
        started = self._start.pop(run_id, None)
        elapsed = time.perf_counter() - started if started is not None else 0.0

        # ChatOpenAI reports token usage and the served model on
        # LLMResult.llm_output, not on generation_info.
        llm_output = getattr(response, "llm_output", None) or {}
        usage = llm_output.get("token_usage") or {}
        model_name = llm_output.get("model_name") or "unknown"
        prompt_tokens = int(usage.get("prompt_tokens") or 0)
        completion_tokens = int(usage.get("completion_tokens") or 0)

        cost_usd = 0.0
        if model_name != "unknown" and (prompt_tokens or completion_tokens):
            try:
                # cost_per_token returns (prompt_cost_usd, completion_cost_usd).
                prompt_cost, completion_cost = litellm.cost_per_token(
                    model=model_name,
                    prompt_tokens=prompt_tokens,
                    completion_tokens=completion_tokens,
                )
                cost_usd = prompt_cost + completion_cost
            except Exception:
                # Models missing from LiteLLM's price map are logged at zero cost.
                cost_usd = 0.0

        record = {
            "run_id": run_id,
            "model": model_name,
            "latency_s": round(elapsed, 4),
            "cost_usd": round(cost_usd, 8),
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
        }
        with TRACE_FILE.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
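The cost figure in each record comes down to token counts multiplied by per-token prices. A rough sketch of that arithmetic follows. The Price class, the function name, and the example rates are placeholders invented for illustration, not any provider's published pricing and not LiteLLM's price map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens (hypothetical example rates)."""

    input_per_million: float
    output_per_million: float


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int, price: Price) -> float:
    """Return the estimated cost of one call in USD."""
    total = (
        prompt_tokens * price.input_per_million
        + completion_tokens * price.output_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    example_price = Price(input_per_million=0.15, output_per_million=0.60)
    # 1,000 prompt tokens and 500 completion tokens at the placeholder rates.
    print(estimate_cost_usd(1000, 500, example_price))
```

Prompt and completion tokens are priced separately because providers usually charge more for generated tokens, which is why the trace records keep both counts.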
Step 3: Define the agent tools
Two simple tools give the agent something to call. get_weather and calculate are pure Python functions decorated with @tool from LangChain [1]. In a real system these would call external APIs, but keeping them local means the tutorial runs without extra credentials.
# filename: tools.py
from langchain_core.tools import tool


@tool
def get_weather(city: str) -> str:
    """Return a mock current weather report for a city."""
    data = {
        "london": "12°C, overcast",
        "paris": "18°C, sunny",
        "new york": "22°C, partly cloudy",
        "tokyo": "25°C, humid",
    }
    return data.get(city.lower(), f"No weather data available for {city}.")


@tool
def calculate(expression: str) -> str:
    """Evaluate a simple arithmetic expression and return the result."""
    allowed = set("0123456789+-*/(). ")
    if not all(c in allowed for c in expression):
        return "Error: only basic arithmetic is supported."
    try:
        result = eval(expression, {"__builtins__": {}})  # noqa: S307
        return str(result)
    except Exception as exc:
        return f"Error: {exc}"
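The character whitelist keeps names and attribute access out of eval, but it still lets through ** chains such as 9**9**9**9 that can stall the process. If that matters, one alternative is to walk the parsed expression tree and allow only the four basic operators. The sketch below uses the standard ast module. The function names are made up for this example, and it is a swapped-in technique, not what tools.py does.

```python
import ast
import operator

# Only these operators are evaluated; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_arithmetic(expression: str):
    """Evaluate +, -, *, / and parentheses without calling eval()."""
    tree = ast.parse(expression, mode="eval")  # raises SyntaxError on bad input
    return _eval_node(tree.body)


if __name__ == "__main__":
    print(safe_arithmetic("17 * 43 + 100"))
    print(safe_arithmetic("512 / 8"))
```

Exponentiation, function calls, and names all fall through to the ValueError branch, so the tool body can catch ValueError, SyntaxError, and ZeroDivisionError and return an error string to the model.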
Step 4: Wire the LangChain agent to the gateway
LangChain’s ChatOpenAI client talks HTTP, so it cannot call a Router object that lives in the same process. It needs an OpenAI-compatible URL. This step therefore points ChatOpenAI at a LiteLLM proxy serving litellm_config.yaml. Start the proxy in a second terminal with litellm --config litellm_config.yaml (this needs the litellm[proxy] extra); it listens on port 4000 by default. The base URL is read from LITELLM_BASE_URL, an environment variable name chosen for this tutorial. If you would rather not run a server at all, skip to Step 5, which calls the Router class in-process with the same routing semantics [2].
# filename: agent.py
import os

from langchain_core.messages import HumanMessage, ToolMessage
from langchain_openai import ChatOpenAI

from callbacks import CostLatencyLogger
from tools import calculate, get_weather

# LITELLM_BASE_URL is a name chosen for this tutorial, not a LiteLLM setting.
LITELLM_BASE_URL = os.environ.get("LITELLM_BASE_URL", "http://localhost:4000")
TOOL_MAP = {"get_weather": get_weather, "calculate": calculate}


def build_agent():
    """Return a tool-bound chat model that sends requests through the LiteLLM proxy."""
    llm = ChatOpenAI(
        model="primary-agent-model",  # routing group from litellm_config.yaml
        temperature=0,
        callbacks=[CostLatencyLogger()],
        base_url=LITELLM_BASE_URL,
        # The proxy holds the provider keys. Unless a master key is configured
        # on the proxy, a placeholder client key is enough here.
        api_key=os.environ.get("LITELLM_API_KEY", "sk-placeholder"),
    )
    return llm.bind_tools([get_weather, calculate])


def run_turn(agent, user_message: str) -> str:
    """Run one agent turn and return the final text response."""
    messages = [HumanMessage(content=user_message)]
    # First call: may produce tool calls instead of a final answer.
    response = agent.invoke(messages)
    messages.append(response)
    # Keep executing tools until the model answers without requesting any.
    while response.tool_calls:
        for tc in response.tool_calls:
            tool_fn = TOOL_MAP.get(tc["name"])
            # Every tool call needs a matching ToolMessage, even for an
            # unknown tool; otherwise the next request is rejected.
            if tool_fn is None:
                result = f"Unknown tool: {tc['name']}"
            else:
                result = tool_fn.invoke(tc["args"])
            messages.append(ToolMessage(content=str(result), tool_call_id=tc["id"]))
        response = agent.invoke(messages)
        messages.append(response)
    return response.content
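A tool loop that runs until the model stops asking for tools has no upper bound, so a confused model can keep it spinning. One common guard is a round limit. The sketch below shows the idea with plain dictionaries standing in for LangChain messages. The run_tool_loop name, the message shape, and the max_rounds default are assumptions made for this example.

```python
def run_tool_loop(call_model, tools, messages, max_rounds=5):
    """Alternate between model calls and tool execution, at most max_rounds times.

    call_model: callable taking the message list and returning an assistant dict
                with "content" and an optional "tool_calls" list.
    tools:      mapping of tool name -> plain Python callable.
    """
    for _ in range(max_rounds):
        reply = call_model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content") or ""
        for call in calls:
            fn = tools.get(call["name"])
            # Unknown tools still get a reply so every call id is answered.
            result = fn(**call["args"]) if fn else f"Unknown tool: {call['name']}"
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
            )
    raise RuntimeError(f"model still requesting tools after {max_rounds} rounds")


if __name__ == "__main__":
    # A scripted stand-in for the model: one tool request, then a final answer.
    scripted = iter(
        [
            {
                "role": "assistant",
                "content": None,
                "tool_calls": [{"id": "call_1", "name": "add", "args": {"a": 2, "b": 3}}],
            },
            {"role": "assistant", "content": "The sum is 5."},
        ]
    )
    history = [{"role": "user", "content": "What is 2 + 3?"}]
    print(run_tool_loop(lambda msgs: next(scripted), {"add": lambda a, b: a + b}, history))
```

Raising on the limit, instead of returning a partial answer, makes runaway loops visible in logs.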
Step 5: Run the agent with a direct LiteLLM router call
The simplest way to exercise routing without a live HTTP server is to call the LiteLLM router directly and feed each result through the tool loop. This script builds the router, sends every request through router.completion(), and writes one trace record per call.
# filename: run_agent.py
import json
import os
import time

import litellm
from litellm import Router

from callbacks import TRACE_FILE
from tools import calculate, get_weather

litellm.set_verbose = False

# Build the router. Deployments that share a model_name form one routing group.
model_list = [
    {
        "model_name": "primary-agent-model",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
        },
    },
]
# Add the Mistral deployment only when a key is present.
mistral_key = os.environ.get("MISTRAL_API_KEY", "")
if mistral_key:
    model_list.append(
        {
            "model_name": "primary-agent-model",
            "litellm_params": {
                "model": "mistral/mistral-small-latest",
                "api_key": mistral_key,
            },
        }
    )

router = Router(
    model_list=model_list,
    routing_strategy="simple-shuffle",
    num_retries=2,
    timeout=30,
)

TOOL_MAP = {"get_weather": get_weather, "calculate": calculate}
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return a mock current weather report for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a simple arithmetic expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    },
]


def call_and_log(messages: list):
    """Send one completion through the router and append a trace record."""
    t0 = time.perf_counter()
    response = router.completion(
        model="primary-agent-model",
        messages=messages,
        tools=TOOL_SCHEMAS,
        tool_choice="auto",
    )
    elapsed = time.perf_counter() - t0
    try:
        cost_usd = litellm.completion_cost(completion_response=response)
    except Exception:
        # Models missing from LiteLLM's price map are logged at zero cost.
        cost_usd = 0.0
    usage = getattr(response, "usage", None)
    record = {
        "model": response.model,
        "latency_s": round(elapsed, 4),
        "cost_usd": round(cost_usd, 8),
        "prompt_tokens": usage.prompt_tokens if usage else 0,
        "completion_tokens": usage.completion_tokens if usage else 0,
    }
    with TRACE_FILE.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return response


def run_with_router(user_message: str) -> str:
    """Run one agent turn through the router and return the final text."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = call_and_log(messages)
        msg = response.choices[0].message
        messages.append(msg.model_dump() if hasattr(msg, "model_dump") else dict(msg))
        tool_calls = getattr(msg, "tool_calls", None) or []
        if not tool_calls:
            return msg.content or ""
        # Execute each requested tool and feed the result back to the model.
        for tc in tool_calls:
            fn_name = tc.function.name
            fn_args = json.loads(tc.function.arguments)
            tool_fn = TOOL_MAP.get(fn_name)
            result = tool_fn.invoke(fn_args) if tool_fn else f"Unknown tool: {fn_name}"
            messages.append(
                {"role": "tool", "tool_call_id": tc.id, "content": str(result)}
            )


if __name__ == "__main__":
    questions = [
        "What is the weather like in Paris?",
        "Calculate 17 * 43 + 100",
        "What is the weather in Tokyo and what is 512 / 8?",
    ]
    for q in questions:
        print(f"Q: {q}")
        answer = run_with_router(q)
        print(f"A: {answer}\n")
    print(f"Traces written to {TRACE_FILE}")
Verify it works
This verification block runs without real API keys. It replaces router.completion with a mocked LiteLLM response, then drives run_with_router through one tool call and one final answer, confirming that the tool execution path and the trace-file writer work before you point the script at a live provider.
import json
import os
from unittest.mock import MagicMock, patch

# A placeholder key lets the Router be constructed; no request leaves the process.
os.environ.setdefault("OPENAI_API_KEY", "sk-test-placeholder")

import run_agent
from callbacks import TRACE_FILE
from tools import calculate, get_weather

# Clear any existing trace file
TRACE_FILE.unlink(missing_ok=True)


def make_fake_response(content=None, tool_calls=None, model="gpt-4o-mini"):
    """Build a stand-in for a LiteLLM completion response."""
    msg = MagicMock()
    msg.content = content
    msg.tool_calls = tool_calls or []
    msg.model_dump.return_value = {
        "role": "assistant",
        "content": content,
        "tool_calls": [],
    }
    choice = MagicMock()
    choice.message = msg
    resp = MagicMock()
    resp.choices = [choice]
    resp.model = model
    resp.usage = MagicMock(prompt_tokens=10, completion_tokens=5)
    return resp


# Simulate: first call returns a tool invocation, second returns the final answer
call_count = 0


def fake_completion(**kwargs):
    global call_count
    call_count += 1
    if call_count == 1:
        tc = MagicMock()
        tc.id = "call_abc"
        tc.function = MagicMock()
        tc.function.name = "get_weather"
        tc.function.arguments = json.dumps({"city": "London"})
        return make_fake_response(tool_calls=[tc])
    return make_fake_response(content="The weather in London is 12°C, overcast.")


# Directly test the tools
weather_result = get_weather.invoke({"city": "London"})
assert "12" in weather_result, f"Unexpected: {weather_result}"
print(f"Tool result: {weather_result}")
calc_result = calculate.invoke({"expression": "17 * 43 + 100"})
assert calc_result == "831", f"Unexpected: {calc_result}"
print(f"Calc result: {calc_result}")

# Drive the full loop with the router and the cost helper mocked out
with patch.object(run_agent.router, "completion", side_effect=fake_completion), \
        patch.object(run_agent.litellm, "completion_cost", return_value=0.0):
    answer = run_agent.run_with_router("What is the weather in London?")
assert "12" in answer, f"Unexpected: {answer}"
assert call_count == 2

# One trace record per mocked completion call
lines = TRACE_FILE.read_text().strip().splitlines()
assert len(lines) == 2
parsed = json.loads(lines[0])
assert parsed["model"] == "gpt-4o-mini"
assert "latency_s" in parsed
assert "cost_usd" in parsed
print(f"Trace file has {len(lines)} record(s)")
print(f"Sample record: {json.dumps(parsed, indent=2)}")
print("verify_ok")
To run against live providers, set your API keys and execute:
export OPENAI_API_KEY="sk-your-key-here"
python /workspace/run_agent.py
After the run, inspect the trace file:
cat /workspace/traces.jsonl
Each line is a JSON object with model, latency_s, cost_usd, prompt_tokens, and completion_tokens. You can pipe it through jq for a summary:
jq -s '[.[] | .cost_usd] | add' /workspace/traces.jsonl
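For a per-model breakdown without jq, the same file can be summarized in a few lines of Python. This is a sketch: the summarize_traces name and the output shape are choices made for this example, and it assumes each line carries the model, cost_usd, and latency_s fields described above.

```python
import json
from collections import defaultdict


def summarize_traces(lines):
    """Aggregate call count, total cost, and mean latency per model.

    lines: any iterable of JSONL strings, for example an open file handle.
    """
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "latency_s": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        bucket = totals[record.get("model", "unknown")]
        bucket["calls"] += 1
        bucket["cost_usd"] += float(record.get("cost_usd", 0.0))
        bucket["latency_s"] += float(record.get("latency_s", 0.0))
    return {
        model: {
            "calls": b["calls"],
            "total_cost_usd": b["cost_usd"],
            "avg_latency_s": b["latency_s"] / b["calls"],
        }
        for model, b in totals.items()
    }


if __name__ == "__main__":
    with open("/workspace/traces.jsonl", encoding="utf-8") as fh:
        for model, stats in summarize_traces(fh).items():
            print(model, stats)
```

Taking an iterable of lines instead of a path keeps the function easy to test and lets it read from a file, a pipe, or an in-memory list.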
Troubleshooting
ModuleNotFoundError: No module named 'langchain_openai': The packages are not installed in the environment that runs the script. Re-run the uv pip install command from Setup, confirm it finishes without errors, and check that python resolves to the same virtual environment uv installed into.
litellm.exceptions.AuthenticationError on the first call — The OPENAI_API_KEY environment variable is empty or not exported. Run echo $OPENAI_API_KEY to confirm it is set in the current shell session. The export keyword is required; OPENAI_API_KEY=sk-... without export is invisible to child processes.
Router sends calls to Mistral when OpenAI looks healthy: With simple-shuffle, deployments that share a model_name split traffic, so some calls land on Mistral even when OpenAI is up. A 429 (rate limit) or 5xx from OpenAI shifts more traffic to Mistral. Check the model field in traces.jsonl to confirm which provider served each request. If you want to pin to one provider during testing, remove the Mistral entry from model_list.
litellm.exceptions.BadRequestError: tools not supported — Some model versions do not support the tools parameter. Ensure the model string in litellm_params is openai/gpt-4o-mini or another tool-capable model. Mistral’s mistral-small-latest supports tool calling; older mistral-tiny does not.
traces.jsonl is empty after a run — The trace file is written by the on_llm_end callback or the explicit TRACE_FILE.open("a") call in run_agent.py. If the agent errors out before completing a turn, no record is written. Check the exception traceback and confirm the TRACE_FILE path (/workspace/traces.jsonl) is writable.
Router raises an error saying no models or deployments are available: The exact exception class and message vary by LiteLLM version. It usually means model_list is empty or every deployment in the group has failed recently, most often because an api_key value is empty or invalid. Confirm both OPENAI_API_KEY and (if used) MISTRAL_API_KEY are non-empty strings before constructing the Router instance.
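Several of the problems above trace back to missing environment variables, so a small preflight check before building the router can save a debugging round. The helper below is a sketch. Its name and return shape are invented for this example, and it treats OPENAI_API_KEY as required and MISTRAL_API_KEY as optional, matching the tutorial's setup.

```python
import os


def check_provider_keys(
    required=("OPENAI_API_KEY",),
    optional=("MISTRAL_API_KEY",),
    env=None,
):
    """Return (missing_required, skipped_optional) lists of variable names.

    A variable counts as absent when it is unset, empty, or whitespace only.
    """
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name, "").strip()]
    skipped = [name for name in optional if not env.get(name, "").strip()]
    return missing, skipped


if __name__ == "__main__":
    missing, skipped = check_provider_keys()
    if skipped:
        print(f"Optional keys not set, those providers are skipped: {skipped}")
    if missing:
        raise SystemExit(f"Missing required keys: {missing}")
    print("Provider keys look present.")
```

The check only confirms that the variables are non-empty. An invalid key still surfaces as an authentication error on the first request.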
Next steps
- Add a local vLLM endpoint: Start a vLLM server with vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8000 and add an entry with model: openai/mistralai/Mistral-7B-Instruct-v0.3 and api_base: http://localhost:8000/v1 to the router’s model_list. The same agent code routes to it without changes [2].
- Budget guardrails: LiteLLM includes a BudgetManager class for tracking spend against a cap. Check the LiteLLM documentation for how budgets attach to the router or proxy in your version before relying on them as a hard limit.
- Structured trace analysis: Feed traces.jsonl into a DuckDB query (SELECT model, AVG(latency_s), SUM(cost_usd) FROM read_ndjson_auto('traces.jsonl') GROUP BY model) to get per-provider cost and latency breakdowns across a batch run.
- LangGraph integration: Replace the manual tool loop in agent.py with a langgraph.prebuilt.create_react_agent call. A ChatOpenAI instance pointing at the LiteLLM proxy works as the model argument [1].
FAQ
How does LiteLLM proxy enable multi-provider routing without code changes?
LiteLLM groups deployments that share a model_name and routes each request to one of them. When a call fails or times out, the router retries up to num_retries times and can land on a different deployment in the group. Application code only ever names the group, so adding or swapping a provider is a configuration change, not a code change. LangChain’s ChatOpenAI client gets the same routing by pointing its base URL at a LiteLLM proxy.
Can the LiteLLM proxy run without Docker or a separate server process?
Docker is never required. The proxy itself is an HTTP server, so it runs as its own process, but the tutorial’s main path uses LiteLLM’s Router class directly inside the Python process via its SDK. The router exposes a completion method that the agent code calls directly.
What information does the cost and latency callback write to the trace file?
The CostLatencyLogger callback writes a JSON record per LLM call containing the model name, wall-clock latency in seconds, computed cost in USD, and token counts for both prompt and completion. Each record is appended as a line to traces.jsonl for structured audit logging.
How does the agent execute tool calls when the model requests them?
After the model returns a response with tool_calls, the agent loops through each call, looks up the corresponding function in a tool map, invokes it with the provided arguments, and sends the result back to the model as a ToolMessage. This continues until the model returns a final text response without tool calls.
What happens if only one API key is available instead of two?
The tutorial degrades gracefully. If MISTRAL_API_KEY is empty or unset, the router code skips the Mistral entry and uses only the OpenAI model. Retries then go back to OpenAI instead of a second provider, and everything else runs unchanged.