Why this matters
The AgenticRAG paper [2] reports that layering a lightweight agentic harness on top of standard retrieval infrastructure improves retrieval quality: 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline) and 0.96 factuality on WixQA. The most significant factor, per the ablation study, is the shift from single-shot retrieval to agentic tool use, which alone accounts for a 5.9x improvement [2].
Most production RAG pipelines today still use single-shot retrieval: embed the query, fetch top-k chunks, stuff them into a prompt, and generate. That architecture forces the retrieval stack to do all the grounding work in one pass, with no mechanism for the model to ask follow-up questions, navigate within a document, or validate a partial answer before committing. When a question requires synthesizing evidence from multiple documents or resolving ambiguous terminology, a single-shot pipeline returns an answer with no signal that its retrieval was incomplete. LangGraph’s graph primitives [1] give you the control flow to implement iterative retrieval without writing a custom state machine from scratch.
Prerequisites
- Python 3.11 or 3.12
- An OpenAI or Anthropic API key (blocks that call the LLM are marked skip_execution_reason; all other blocks run in the sandbox)
- Familiarity with async/await in Python
- Basic understanding of LangGraph nodes and edges [1]
Setup
Install the required packages. The tutorial uses langchain-openai for the LLM binding; swap in langchain-anthropic if you prefer Claude.
uv pip install langgraph langchain-openai langchain-core tiktoken numpy
Export your API key before running the agent steps:
export OPENAI_API_KEY="sk-..." # replace with your actual key
Step 1: Build the in-memory document corpus
The corpus is a small collection of plain-text documents stored in a Python dict. A real deployment would point at an enterprise search index, but an in-memory store is enough to demonstrate the agentic loop and run the eval harness without network dependencies.
# filename: corpus.py
from __future__ import annotations
import re
from typing import Any
DOCUMENTS: dict[str, str] = {
"doc_climate_001": """
Global average temperatures have risen by approximately 1.1 degrees Celsius
since pre-industrial times. The Intergovernmental Panel on Climate Change (IPCC)
projects that limiting warming to 1.5 C requires net-zero CO2 emissions by 2050.
Renewable energy capacity grew by 295 GW in 2022, the largest annual increase on record.
Solar photovoltaic installations accounted for 60 percent of that growth.
""",
"doc_climate_002": """
Arctic sea ice extent reached a record low in September 2012 at 3.41 million km2.
Since 1979, Arctic sea ice has declined at a rate of roughly 13 percent per decade.
Permafrost thaw releases methane, a greenhouse gas with 80 times the warming
potential of CO2 over a 20-year horizon.
""",
"doc_finance_001": """
The S&P 500 index returned 26.3 percent in 2023, recovering from a 19.4 percent
decline in 2022. The Federal Reserve raised the federal funds rate to a target range
of 5.25 to 5.50 percent in July 2023, the highest level in 22 years.
Inflation as measured by CPI fell from a peak of 9.1 percent in June 2022
to 3.4 percent by December 2023.
""",
"doc_finance_002": """
Venture capital investment in AI startups reached 91.9 billion USD in 2023,
representing 30 percent of all global VC funding. Generative AI companies
raised 25.2 billion USD, a 9x increase from 2022 levels.
The median pre-money valuation for Series A AI rounds was 42 million USD.
""",
"doc_biology_001": """
CRISPR-Cas9 gene editing was first demonstrated in human cells in 2013 by
the Zhang and Doudna-Charpentier groups. The first approved CRISPR therapy,
Casgevy, received FDA approval in December 2023 for sickle cell disease
and beta-thalassemia. The editing efficiency in clinical trials exceeded 90 percent.
""",
"doc_biology_002": """
The human genome contains approximately 3.2 billion base pairs encoding
around 20,000 protein-coding genes. Only about 1.5 percent of the genome
codes for proteins; the remainder includes regulatory elements, introns,
and regions of unknown function sometimes called 'dark matter' DNA.
""",
}


def _tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query: str, top_k: int = 3) -> list[dict[str, Any]]:
    """Minimal BM25-style TF-IDF search over DOCUMENTS."""
    k1, b = 1.5, 0.75
    query_tokens = set(_tokenize(query))
    doc_lengths = {doc_id: len(_tokenize(text)) for doc_id, text in DOCUMENTS.items()}
    avg_len = sum(doc_lengths.values()) / len(doc_lengths)
    scores: dict[str, float] = {}
    for doc_id, text in DOCUMENTS.items():
        tokens = _tokenize(text)
        # Term frequencies for this document
        tf: dict[str, int] = {}
        for t in tokens:
            tf[t] = tf.get(t, 0) + 1
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Document frequency: number of documents containing the term
            df = sum(1 for d in DOCUMENTS.values() if term in _tokenize(d))
            # Un-logged IDF ratio; always positive for this corpus
            idf = (len(DOCUMENTS) - df + 0.5) / (df + 0.5)
            # Length-normalized, saturating term frequency
            tf_norm = (tf[term] * (k1 + 1)) / (
                tf[term] + k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            )
            score += idf * tf_norm
        scores[doc_id] = score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [
        {"doc_id": doc_id, "score": round(score, 3), "snippet": DOCUMENTS[doc_id][:200].strip()}
        for doc_id, score in ranked[:top_k]
        if score > 0
    ]


def open_document(doc_id: str) -> str:
    """Return the full text of a document by ID."""
    return DOCUMENTS.get(doc_id, f"Document '{doc_id}' not found.")


def find_in_document(doc_id: str, keyword: str) -> list[str]:
    """Return sentences from a document that contain the keyword."""
    # Collapse line breaks so a phrase that wraps across two lines still matches.
    text = " ".join(DOCUMENTS.get(doc_id, "").split())
    sentences = re.split(r"(?<=[.!?])\s+", text)
    keyword_lower = keyword.lower()
    return [s for s in sentences if keyword_lower in s.lower()]
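Written out, the score that bm25_search computes for a query q and a document d is shown below, with k_1 = 1.5, b = 0.75, N the number of documents, df(t) the number of documents containing term t, and avgdl the average document length in tokens. Note that the IDF factor in this code is the raw ratio; textbook BM25 takes a logarithm of that ratio (or a variant of it), which is why the function is best described as BM25-style rather than exact BM25.

```latex
\mathrm{score}(q, d) = \sum_{t \in q \cap d} \mathrm{idf}(t)\,
  \frac{\mathrm{tf}(t, d)\,(k_1 + 1)}
       {\mathrm{tf}(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}
```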
Verify the search function works before wiring it into the agent:
from corpus import bm25_search, open_document, find_in_document

results = bm25_search("CRISPR gene editing FDA approval", top_k=2)
for r in results:
    print(r["doc_id"], r["score"])

sentences = find_in_document("doc_biology_001", "CRISPR")
print("\nMatching sentences:")
for s in sentences:
    print("  -", s)
print("\ncorpus_ok")
Step 2: Define the four tools
The AgenticRAG harness exposes four tools to the model: search, find, open, and summarize [2]. Each tool is a plain Python function wrapped with LangChain’s @tool decorator so LangGraph can bind it to the LLM. The open tool is named open_doc in the code below.
# filename: tools.py
from __future__ import annotations
from langchain_core.tools import tool
from corpus import bm25_search, open_document, find_in_document


@tool
def search(query: str) -> str:
    """Search the document corpus for passages relevant to the query.

    Returns a ranked list of document IDs with short snippets.
    Use this to discover which documents are relevant before opening them.
    """
    results = bm25_search(query, top_k=3)
    if not results:
        return "No documents matched the query."
    lines = []
    for r in results:
        lines.append(f"[{r['doc_id']}] score={r['score']}\n  {r['snippet']}")
    return "\n\n".join(lines)


@tool
def find(doc_id: str, keyword: str) -> str:
    """Find sentences containing a keyword inside a specific document.

    Use this to navigate within a document without reading the full text.

    Args:
        doc_id: The document identifier returned by the search tool.
        keyword: A word or short phrase to locate inside the document.
    """
    hits = find_in_document(doc_id, keyword)
    if not hits:
        return f"Keyword '{keyword}' not found in {doc_id}."
    return "\n".join(f"- {s}" for s in hits)


@tool
def open_doc(doc_id: str) -> str:
    """Open and return the full text of a document.

    Use this when snippets are insufficient and you need the complete content.

    Args:
        doc_id: The document identifier returned by the search tool.
    """
    return open_document(doc_id)


@tool
def summarize(doc_id: str, focus: str) -> str:
    """Return the portion of a document most relevant to a focus topic.

    This is a lightweight extractive summary: it returns sentences that
    contain any word from the focus phrase.

    Args:
        doc_id: The document identifier.
        focus: A topic or question to focus the summary on.
    """
    hits: list[str] = []
    for word in focus.split():
        word = word.strip(".,;:?!'\"")
        if not word:
            continue  # skip tokens that were pure punctuation
        for sentence in find_in_document(doc_id, word):
            if sentence not in hits:  # keep first-seen order, drop duplicates
                hits.append(sentence)
    if not hits:
        return f"No content about '{focus}' found in {doc_id}."
    return "\n".join(f"- {s}" for s in hits[:5])
Check the tool wrappers before building the graph:
from tools import search, find, open_doc, summarize

print(search.invoke({"query": "Arctic sea ice decline rate"}))
print("\ntools_ok")
Step 3: Build the LangGraph agent
The agent is a StateGraph with two nodes: agent (the LLM deciding what to do next) and tools (executing the chosen tool). A conditional edge out of agent routes to tools when the model requests a tool call, or exits to END when the model produces a final answer without one; a plain edge returns from tools to agent after each tool execution [1].
# filename: agent.py
from __future__ import annotations
import os
from typing import Annotated
from langchain_core.messages import BaseMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from typing_extensions import TypedDict
from tools import search, find, open_doc, summarize
TOOLS = [search, find, open_doc, summarize]
SYSTEM_PROMPT = """You are a research assistant with access to a document corpus.
Use the available tools to iteratively retrieve and validate information before answering.
Strategy:
1. Start with `search` to identify relevant documents.
2. Use `find` to locate specific facts within a document without reading everything.
3. Use `open_doc` when you need the full document text.
4. Use `summarize` to extract focused content on a sub-topic.
5. Repeat steps 1-4 as needed until you have sufficient evidence.
6. Provide a concise, grounded answer citing the document IDs you used.
Never guess. If the corpus does not contain the answer, say so explicitly."""


class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


def build_agent(model_name: str = "gpt-4o-mini") -> object:
    llm = ChatOpenAI(model=model_name, temperature=0)
    llm_with_tools = llm.bind_tools(TOOLS)

    def agent_node(state: AgentState) -> AgentState:
        # Prepend the system prompt on every turn; it is not stored in state.
        messages = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
        response = llm_with_tools.invoke(messages)
        return {"messages": [response]}

    def should_continue(state: AgentState) -> str:
        # Route to the tool node while the model keeps requesting tool calls.
        last = state["messages"][-1]
        if hasattr(last, "tool_calls") and last.tool_calls:
            return "tools"
        return END

    tool_node = ToolNode(TOOLS)
    graph = StateGraph(AgentState)
    graph.add_node("agent", agent_node)
    graph.add_node("tools", tool_node)
    graph.set_entry_point("agent")
    graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    graph.add_edge("tools", "agent")
    return graph.compile()
Verify the graph compiles and its structure is correct:
from agent import build_agent
app = build_agent()
print("Nodes:", list(app.get_graph().nodes.keys()))
print("agent_graph_ok")
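The routing rule can also be exercised without LangGraph or an API key. The sketch below is a plain-Python simulation, not LangGraph code: FakeAIMessage, run_loop, and the scripted replies are hypothetical stand-ins, and the "end" string takes the place of END. It only illustrates the agent -> tools -> agent cycle and how a step cap bounds it.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAIMessage:
    """Stand-in for an LLM reply: text plus zero or more requested tool calls."""
    content: str = ""
    tool_calls: list[dict] = field(default_factory=list)


def should_continue(last: FakeAIMessage) -> str:
    # Same rule as the conditional edge: any tool call means "keep going".
    return "tools" if last.tool_calls else "end"


def run_loop(scripted_replies: list[FakeAIMessage], max_steps: int = 10) -> list[str]:
    """Walk the agent -> tools -> agent cycle over pre-scripted replies."""
    trace: list[str] = []
    for step, reply in enumerate(scripted_replies):
        if step >= max_steps:
            trace.append("stopped: step limit reached")
            break
        trace.append("agent")
        if should_continue(reply) == "end":
            trace.append("end")
            break
        trace.append("tools")
    return trace


# Two tool-calling turns followed by a final answer with no tool call.
replies = [
    FakeAIMessage(tool_calls=[{"name": "search", "args": {"query": "Arctic sea ice"}}]),
    FakeAIMessage(tool_calls=[{"name": "find", "args": {"doc_id": "doc_climate_002", "keyword": "decade"}}]),
    FakeAIMessage(content="Roughly 13 percent per decade [doc_climate_002]."),
]
print(run_loop(replies))
```

The trace alternates between the two nodes until a reply arrives with an empty tool_calls list, which is the only exit condition the graph defines.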
Step 4: Run the agent on a sample question
This block requires a live OpenAI API key. It demonstrates the iterative tool-calling loop on a question that combines two facts, both of which are stated in doc_climate_002.
from langchain_core.messages import HumanMessage
from agent import build_agent
app = build_agent()
question = "What was the rate of Arctic sea ice decline per decade, and what greenhouse gas does permafrost thaw release?"
result = app.invoke({"messages": [HumanMessage(content=question)]})
for msg in result["messages"]:
    role = getattr(msg, "type", type(msg).__name__)
    content = msg.content if isinstance(msg.content, str) else str(msg.content)
    if content:
        print(f"[{role}] {content[:300]}")
        print()
Step 5: Build the evaluation harness
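The eval harness measures two metrics that loosely mirror the AgenticRAG paper [2]: recall@1 and factuality. Here recall@1 is a relaxed hit rate: a case scores 1.0 if a relevant document ID appears anywhere in the agent’s tool calls or tool results, regardless of rank. Factuality is the fraction of expected key-fact substrings found in the final answer.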
# filename: eval_harness.py
from __future__ import annotations
import re
from dataclasses import dataclass, field
from typing import Any
from langchain_core.messages import HumanMessage, ToolMessage
from agent import build_agent


@dataclass
class EvalCase:
    question: str
    relevant_doc_ids: list[str]  # docs that must be retrieved for recall@1
    key_facts: list[str]  # substrings that must appear in the final answer


EVAL_SET: list[EvalCase] = [
EvalCase(
question="What percentage of global VC funding did AI startups represent in 2023?",
relevant_doc_ids=["doc_finance_002"],
key_facts=["30"],
),
EvalCase(
question="When did the FDA approve the first CRISPR therapy and for which diseases?",
relevant_doc_ids=["doc_biology_001"],
key_facts=["December 2023", "sickle cell"],
),
EvalCase(
question="What was the S&P 500 return in 2023?",
relevant_doc_ids=["doc_finance_001"],
key_facts=["26.3"],
),
EvalCase(
question="How much of the human genome codes for proteins?",
relevant_doc_ids=["doc_biology_002"],
key_facts=["1.5"],
),
EvalCase(
question="What was the largest annual increase in renewable energy capacity and which technology led it?",
relevant_doc_ids=["doc_climate_001"],
key_facts=["295", "solar"],
),
]


def extract_retrieved_doc_ids(messages: list[Any]) -> set[str]:
    """Collect all doc_ids that appeared in tool call arguments or results."""
    ids: set[str] = set()
    for msg in messages:
        # Check tool call arguments
        if hasattr(msg, "tool_calls"):
            for tc in msg.tool_calls:
                args = tc.get("args", {})
                for v in args.values():
                    if isinstance(v, str) and v.startswith("doc_"):
                        ids.add(v)
        # Check tool message content for doc IDs
        if isinstance(msg, ToolMessage):
            found = re.findall(r"doc_[a-z]+_\d+", msg.content)
            ids.update(found)
    return ids


def recall_at_1(retrieved: set[str], relevant: list[str]) -> float:
    """1.0 if any relevant doc was retrieved, else 0.0."""
    return 1.0 if any(d in retrieved for d in relevant) else 0.0


def factuality_score(answer: str, key_facts: list[str]) -> float:
    """Fraction of key facts present in the answer (case-insensitive)."""
    if not key_facts:
        return 1.0
    hits = sum(1 for f in key_facts if f.lower() in answer.lower())
    return hits / len(key_facts)


def run_eval(model_name: str = "gpt-4o-mini") -> dict[str, Any]:
    app = build_agent(model_name)
    results = []
    for case in EVAL_SET:
        state = app.invoke({"messages": [HumanMessage(content=case.question)]})
        messages = state["messages"]
        # Final answer is the last AI message with non-empty text content
        final_answer = ""
        for msg in reversed(messages):
            content = msg.content if isinstance(msg.content, str) else ""
            if content and not getattr(msg, "tool_calls", None):
                final_answer = content
                break
        retrieved = extract_retrieved_doc_ids(messages)
        r1 = recall_at_1(retrieved, case.relevant_doc_ids)
        fact = factuality_score(final_answer, case.key_facts)
        results.append({
            "question": case.question[:60],
            "recall@1": r1,
            "factuality": round(fact, 2),
            "retrieved": sorted(retrieved),
            "answer_snippet": final_answer[:120],
        })
    avg_recall = sum(r["recall@1"] for r in results) / len(results)
    avg_factuality = sum(r["factuality"] for r in results) / len(results)
    return {
        "per_case": results,
        "avg_recall@1": round(avg_recall, 3),
        "avg_factuality": round(avg_factuality, 3),
    }
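To see what the two metrics reward, it can help to run their core checks on hand-written inputs. The sketch below uses only the standard library. The tool output, its scores, and both answers are made up for illustration, and number_match is a hypothetical stricter check that is not part of the harness. It shows the doc-ID extraction pattern at work and one known weakness of substring matching: a short fact such as "30" also matches inside "2030".

```python
import re

# Hypothetical output of the search tool; the scores are invented for illustration.
tool_output = (
    "[doc_finance_002] score=4.2\n  Venture capital investment in AI startups ...\n\n"
    "[doc_finance_001] score=1.1\n  The S&P 500 index returned ..."
)

# Same pattern the harness applies to ToolMessage content.
retrieved = set(re.findall(r"doc_[a-z]+_\d+", tool_output))
hit = 1.0 if "doc_finance_002" in retrieved else 0.0


def loose_match(answer: str, fact: str) -> bool:
    """Case-insensitive substring check, as in factuality_score."""
    return fact.lower() in answer.lower()


def number_match(answer: str, number: str) -> bool:
    """Hypothetical stricter check: the number must not be part of a longer number."""
    pattern = rf"(?<![\d.]){re.escape(number)}(?!\.?\d)"
    return re.search(pattern, answer) is not None


wrong_answer = "The share is expected to double by 2030."  # never states 30 percent
right_answer = "AI startups drew 30 percent of global VC funding."

print(sorted(retrieved), hit)
print(loose_match(wrong_answer, "30"), number_match(wrong_answer, "30"))
print(loose_match(right_answer, "30"), number_match(right_answer, "30"))
```

Whether the stricter check is worth adopting depends on how short the key facts in the eval set are; longer strings such as "December 2023" are much less prone to accidental matches.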
Step 6: Compare single-shot vs. agentic retrieval
To reproduce the paper’s core finding [2], run a baseline that does a single BM25 search and answers directly, then compare it to the full agentic pipeline.
# filename: baseline.py
from __future__ import annotations
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from corpus import bm25_search
from eval_harness import EVAL_SET, factuality_score


def run_baseline(model_name: str = "gpt-4o-mini") -> dict:
    llm = ChatOpenAI(model=model_name, temperature=0)
    results = []
    for case in EVAL_SET:
        # Single-shot: one search, stuff top-3 into the prompt, generate
        hits = bm25_search(case.question, top_k=3)
        context = "\n\n".join(
            f"[{h['doc_id']}]\n{h['snippet']}" for h in hits
        )
        messages = [
            SystemMessage(content="Answer the question using only the provided context."),
            HumanMessage(content=f"Context:\n{context}\n\nQuestion: {case.question}"),
        ]
        response = llm.invoke(messages)
        answer = response.content
        retrieved = {h["doc_id"] for h in hits}
        r1 = 1.0 if any(d in retrieved for d in case.relevant_doc_ids) else 0.0
        fact = factuality_score(answer, case.key_facts)
        results.append({
            "question": case.question[:60],
            "recall@1": r1,
            "factuality": round(fact, 2),
        })
    avg_recall = sum(r["recall@1"] for r in results) / len(results)
    avg_factuality = sum(r["factuality"] for r in results) / len(results)
    return {
        "per_case": results,
        "avg_recall@1": round(avg_recall, 3),
        "avg_factuality": round(avg_factuality, 3),
    }
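One caveat when reading the comparison: run_baseline builds its context from h['snippet'], which bm25_search truncates to the first 200 characters of each document, while the agent can read full text through open_doc. Part of any gap may therefore come from truncation rather than from single-shot versus agentic retrieval. The check below is a minimal sketch that copies the text of doc_climate_001 and assumes it is stored without leading indentation; SNIPPET_CHARS is a hypothetical constant mirroring the [:200] slice.

```python
# Text of doc_climate_001 as defined in corpus.py (leading newline included).
# Assumption: the lines carry no leading indentation inside the triple quotes.
DOC = """
Global average temperatures have risen by approximately 1.1 degrees Celsius
since pre-industrial times. The Intergovernmental Panel on Climate Change (IPCC)
projects that limiting warming to 1.5 C requires net-zero CO2 emissions by 2050.
Renewable energy capacity grew by 295 GW in 2022, the largest annual increase on record.
Solar photovoltaic installations accounted for 60 percent of that growth.
"""

SNIPPET_CHARS = 200  # hypothetical name for the slice length used in bm25_search
snippet = DOC[:SNIPPET_CHARS].strip()

# Key facts of the renewable-energy eval case.
key_facts = ["295", "solar"]

for fact in key_facts:
    in_doc = fact.lower() in DOC.lower()
    in_snippet = fact.lower() in snippet.lower()
    print(f"{fact!r}: in full document={in_doc}, in snippet={in_snippet}")
```

If the goal is to isolate the effect of the retrieval strategy, one option is to pass the full document text to the baseline prompt instead of the snippet and compare both variants.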
Verify it works
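This block checks the corpus, the tool wrappers, the eval metrics, and the graph structure without sending any request to the LLM. The agent and baseline runs that call the LLM are marked to skip. Note that build_agent() still constructs a ChatOpenAI client, which typically requires OPENAI_API_KEY to be set, so export a placeholder value if you do not have a real key.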
from corpus import bm25_search, find_in_document
from tools import search, find, open_doc, summarize
from eval_harness import (
EVAL_SET,
extract_retrieved_doc_ids,
recall_at_1,
factuality_score,
)
from agent import build_agent
# Verify corpus search
assert len(bm25_search("CRISPR FDA approval", top_k=2)) >= 1, "search returned nothing"
# Verify find
hits = find_in_document("doc_biology_001", "CRISPR")
assert len(hits) >= 1, "find returned nothing"
# Verify tool wrappers
result = search.invoke({"query": "Arctic sea ice"})
assert "doc_climate" in result, "search tool did not return climate doc"
find_result = find.invoke({"doc_id": "doc_climate_002", "keyword": "methane"})
assert "methane" in find_result.lower(), "find tool did not locate methane"
# Verify eval metrics
fake_retrieved = {"doc_finance_002"}
assert recall_at_1(fake_retrieved, ["doc_finance_002"]) == 1.0
assert recall_at_1(set(), ["doc_finance_002"]) == 0.0
assert factuality_score("AI raised 30 percent of VC", ["30"]) == 1.0
assert factuality_score("no relevant info", ["30"]) == 0.0
# Verify graph compiles
app = build_agent()
nodes = list(app.get_graph().nodes.keys())
assert "agent" in nodes and "tools" in nodes, f"unexpected nodes: {nodes}"
print(f"Eval set size: {len(EVAL_SET)} cases")
print(f"Graph nodes: {nodes}")
print("all_checks_passed")
To run the live eval against the API (requires your key to be set), execute:
import json
from eval_harness import run_eval
from baseline import run_baseline
baseline = run_baseline()
agentic = run_eval()
print("=== Baseline (single-shot) ===")
print(f" avg recall@1: {baseline['avg_recall@1']}")
print(f" avg factuality: {baseline['avg_factuality']}")
print("\n=== Agentic RAG ===")
print(f" avg recall@1: {agentic['avg_recall@1']}")
print(f" avg factuality: {agentic['avg_factuality']}")
print("\n=== Per-case breakdown (agentic) ===")
for r in agentic["per_case"]:
    print(f"  Q: {r['question']}")
    print(f"    recall@1={r['recall@1']} factuality={r['factuality']}")
    print(f"    retrieved={r['retrieved']}")
For reference, the AgenticRAG ablation study attributes a 5.9x improvement to the shift from single-shot retrieval to agentic tool use [2]. A six-document toy corpus will not necessarily reproduce that margin.
Troubleshooting
ModuleNotFoundError: No module named 'langchain_openai': Run uv pip install langchain-openai (note the hyphen). The package name on PyPI uses a hyphen but imports with an underscore.
The agent keeps calling tools without reaching END: The graph only exits when should_continue sees a reply with no tool calls, so a model that keeps requesting tools will cycle until LangGraph’s recursion limit (25 steps by default) is hit and a GraphRecursionError is raised. To fail faster, pass a lower limit in the run config: app.invoke({...}, config={"recursion_limit": 10}).
recall@1 is 0.0 for all cases in the baseline: BM25 on short snippets is sensitive to query phrasing. Try expanding the top_k argument in run_baseline from 3 to 5, or add synonyms to the query before searching.
factuality_score returns 0.0 even though the answer looks correct: The metric is substring-based and case-insensitive. If the model writes “26.3%” but key_facts contains "26.3", it should match. Check that the expected string is not wrapped in extra whitespace or punctuation in your EvalCase definition.
ToolMessage content is empty in extract_retrieved_doc_ids: Some tool errors return an empty string. Add a fallback: check msg.additional_kwargs or log the raw tool output to diagnose which tool is failing silently.
RateLimitError from OpenAI during eval: The eval loop runs 5 questions sequentially. Add import time; time.sleep(1) between cases in run_eval if you hit rate limits on a free-tier key.
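For the rate-limit case, a fixed one-second sleep is the simplest fix; an alternative is to retry with exponential backoff. The helper below is a generic standard-library sketch, not part of the tutorial’s code: call_with_backoff, FakeRateLimitError, and flaky_call are hypothetical names, and the sleep function is injectable so the demo runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: tuple[type[Exception], ...],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with a doubling delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


calls = {"n": 0}


def flaky_call() -> str:
    # Fails on the first two calls, then succeeds.
    calls["n"] += 1
    if calls["n"] <= 2:
        raise FakeRateLimitError("slow down")
    return "ok"


delays: list[float] = []  # record requested delays instead of actually sleeping
result = call_with_backoff(flaky_call, retry_on=(FakeRateLimitError,), sleep=delays.append)
print(result, delays)
```

In run_eval, the wrapped call would be the app.invoke(...) line, with retry_on set to the rate-limit exception your client raises (openai.RateLimitError in the current OpenAI Python SDK).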
Next steps
- Add a re-ranking step between search and find. A cross-encoder re-ranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2, loaded through sentence-transformers) can improve precision before the agent starts navigating documents.
- Keep state across invocations using LangGraph’s built-in checkpointing. Pass a MemorySaver to graph.compile(checkpointer=MemorySaver()) to resume interrupted retrieval loops [1]. MemorySaver holds checkpoints in process memory, so use a database-backed checkpointer if state must survive a restart.
- Scale the corpus by replacing bm25_search with a call to a real vector store (Chroma, Pinecone, or pgvector). The tool interface stays identical; only the retrieval backend changes.
- Extend the eval set with questions that require multi-document synthesis (e.g., “Compare AI VC investment growth to renewable energy capacity growth in 2022-2023”). These are the cases where the agentic loop’s iterative retrieval produces the largest gains over single-shot baselines [2].
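The backend swap mentioned above can be sketched with a small interface. This is one possible shape under stated assumptions, not the tutorial’s code: Retriever, KeywordRetriever, and format_hits are hypothetical names, and the keyword-overlap scorer merely stands in for whatever a vector store client would return. The point is that the formatting logic of the search tool depends only on the doc_id, score, and snippet keys.

```python
from typing import Any, Protocol


class Retriever(Protocol):
    """Anything that returns hits shaped like bm25_search results."""

    def search(self, query: str, top_k: int = 3) -> list[dict[str, Any]]: ...


class KeywordRetriever:
    """Toy backend: ranks documents by the number of shared query words."""

    def __init__(self, docs: dict[str, str]) -> None:
        self._docs = docs

    def search(self, query: str, top_k: int = 3) -> list[dict[str, Any]]:
        terms = set(query.lower().split())
        scored = []
        for doc_id, text in self._docs.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scored.append(
                    {"doc_id": doc_id, "score": float(overlap), "snippet": text[:200].strip()}
                )
        scored.sort(key=lambda h: h["score"], reverse=True)
        return scored[:top_k]


def format_hits(retriever: Retriever, query: str) -> str:
    """Same output format as the search tool, independent of the backend."""
    hits = retriever.search(query, top_k=3)
    if not hits:
        return "No documents matched the query."
    return "\n\n".join(
        f"[{h['doc_id']}] score={h['score']}\n  {h['snippet']}" for h in hits
    )


docs = {
    "doc_a": "Arctic sea ice has declined since 1979.",
    "doc_b": "The S&P 500 rose in 2023.",
}
backend = KeywordRetriever(docs)
print(format_hits(backend, "arctic ice"))
```

A vector-store-backed class would implement the same search method and map its results onto the three keys, leaving the agent and the eval harness unchanged.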
FAQ
How does agentic retrieval improve over single-shot RAG?
Agentic retrieval allows the model to iteratively search, navigate within documents, and validate partial answers before committing, rather than embedding a query once and stuffing top-k chunks into a prompt. The AgenticRAG paper shows this shift alone accounts for a 5.9x improvement in retrieval quality.
What are the four tools in the AgenticRAG pipeline?
The four tools are: search (find relevant documents by BM25 ranking), find (locate sentences containing a keyword within a document), open_doc (retrieve the full text of a document), and summarize (extract sentences focused on a topic).
How does the LangGraph agent decide when to stop retrieving?
The agent uses a conditional edge that checks whether the last message contains tool calls. If it does, the graph routes to the tools node; if not, the agent has produced a final answer and the graph exits to END.
What metrics does the evaluation harness measure?
The harness measures recall@1 (whether the correct document was retrieved at least once) and factuality (the fraction of expected key facts present in the final answer), mirroring the metrics in the AgenticRAG paper.
Can the corpus backend be swapped for a production vector store?
Yes. The tool interface remains identical; only the retrieval backend behind bm25_search needs to change. The Next steps section suggests Chroma, Pinecone, or pgvector as replacements, none of which requires modifying the agent or the eval harness.