Agentic RAG Pipeline with LangGraph and Iterative Tool Retrieval

Why this matters

The AgenticRAG paper [2] reports that layering a lightweight agentic harness on top of standard retrieval infrastructure improves retrieval quality: 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline) and 0.96 factuality on WixQA. The most significant factor, per the ablation study, is the shift from single-shot retrieval to agentic tool use, which alone accounts for a 5.9x improvement [2].

Most production RAG pipelines today still use single-shot retrieval: embed the query, fetch top-k chunks, stuff them into a prompt, and generate. That architecture forces the retrieval stack to do all the grounding work in one pass, with no mechanism for the model to ask follow-up questions, navigate within a document, or validate a partial answer before committing. When a question requires synthesizing evidence from multiple documents or resolving ambiguous terminology, a single-shot pipeline returns an answer with no signal that its retrieval was incomplete. LangGraph’s graph primitives [1] give you the control flow to implement iterative retrieval without writing a custom state machine from scratch.

Prerequisites

Python 3.11 or 3.12
An OpenAI or Anthropic API key (blocks that call the LLM are marked skip_execution_reason; all other blocks run in the sandbox)
Familiarity with async/await in Python
Basic understanding of LangGraph nodes and edges [1]

Setup

Install the required packages. The tutorial uses langchain-openai for the LLM binding; swap in langchain-anthropic if you prefer Claude.

uv pip install langgraph langchain-openai langchain-core tiktoken numpy

Export your API key before running the agent steps:

export OPENAI_API_KEY="sk-..."  # replace with your actual key

Step 1: Build the in-memory document corpus

The corpus is a small collection of plain-text documents stored in a Python dict. A real deployment would point at an enterprise search index, but an in-memory store is enough to demonstrate the agentic loop and run the eval harness without network dependencies.

# filename: corpus.py
from __future__ import annotations
import re
from typing import Any

DOCUMENTS: dict[str, str] = {
    "doc_climate_001": """
    Global average temperatures have risen by approximately 1.1 degrees Celsius
    since pre-industrial times. The Intergovernmental Panel on Climate Change (IPCC)
    projects that limiting warming to 1.5 C requires net-zero CO2 emissions by 2050.
    Renewable energy capacity grew by 295 GW in 2022, the largest annual increase on record.
    Solar photovoltaic installations accounted for 60 percent of that growth.
    """,
    "doc_climate_002": """
    Arctic sea ice extent reached a record low in September 2012 at 3.41 million km2.
    Since 1979, Arctic sea ice has declined at a rate of roughly 13 percent per decade.
    Permafrost thaw releases methane, a greenhouse gas with 80 times the warming
    potential of CO2 over a 20-year horizon.
    """,
    "doc_finance_001": """
    The S&P 500 index returned 26.3 percent in 2023, recovering from a 19.4 percent
    decline in 2022. The Federal Reserve raised the federal funds rate to a target range
    of 5.25 to 5.50 percent in July 2023, the highest level in 22 years.
    Inflation as measured by CPI fell from a peak of 9.1 percent in June 2022
    to 3.4 percent by December 2023.
    """,
    "doc_finance_002": """
    Venture capital investment in AI startups reached 91.9 billion USD in 2023,
    representing 30 percent of all global VC funding. Generative AI companies
    raised 25.2 billion USD, a 9x increase from 2022 levels.
    The median pre-money valuation for Series A AI rounds was 42 million USD.
    """,
    "doc_biology_001": """
    CRISPR-Cas9 gene editing was first demonstrated in human cells in 2013 by
    the Zhang and Doudna-Charpentier groups. The first approved CRISPR therapy,
    Casgevy, received FDA approval in December 2023 for sickle cell disease
    and beta-thalassemia. The editing efficiency in clinical trials exceeded 90 percent.
    """,
    "doc_biology_002": """
    The human genome contains approximately 3.2 billion base pairs encoding
    around 20,000 protein-coding genes. Only about 1.5 percent of the genome
    codes for proteins; the remainder includes regulatory elements, introns,
    and regions of unknown function sometimes called 'dark matter' DNA.
    """,
}


def _tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query: str, top_k: int = 3) -> list[dict[str, Any]]:
    """Minimal BM25-style TF-IDF search over DOCUMENTS."""
    k1, b = 1.5, 0.75
    query_tokens = set(_tokenize(query))
    doc_lengths = {doc_id: len(_tokenize(text)) for doc_id, text in DOCUMENTS.items()}
    avg_len = sum(doc_lengths.values()) / len(doc_lengths)

    scores: dict[str, float] = {}
    for doc_id, text in DOCUMENTS.items():
        tokens = _tokenize(text)
        tf: dict[str, int] = {}
        for t in tokens:
            tf[t] = tf.get(t, 0) + 1
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            df = sum(1 for d in DOCUMENTS.values() if term in _tokenize(d))
            idf = (len(DOCUMENTS) - df + 0.5) / (df + 0.5)
            tf_norm = (tf[term] * (k1 + 1)) / (
                tf[term] + k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
            )
            score += idf * tf_norm
        scores[doc_id] = score

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [
        {"doc_id": doc_id, "score": round(score, 3), "snippet": DOCUMENTS[doc_id][:200].strip()}
        for doc_id, score in ranked[:top_k]
        if score > 0
    ]


def open_document(doc_id: str) -> str:
    """Return the full text of a document by ID."""
    return DOCUMENTS.get(doc_id, f"Document '{doc_id}' not found.")


def find_in_document(doc_id: str, keyword: str) -> list[str]:
    """Return sentences from a document that contain the keyword."""
    text = DOCUMENTS.get(doc_id, "")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    keyword_lower = keyword.lower()
    return [s.strip() for s in sentences if keyword_lower in s.lower()]

Verify the search function works before wiring it into the agent:

from corpus import bm25_search, open_document, find_in_document

results = bm25_search("CRISPR gene editing FDA approval", top_k=2)
for r in results:
    print(r["doc_id"], r["score"])

sentences = find_in_document("doc_biology_001", "CRISPR")
print("\nMatching sentences:")
for s in sentences:
    print(" -", s)

print("\ncorpus_ok")

Step 2: Define the four tools

The AgenticRAG harness exposes four tools to the model: search, find, open, and summarize [2]. Each tool is a plain Python function wrapped with LangChain’s @tool decorator so LangGraph can bind it to the LLM.

# filename: tools.py
from __future__ import annotations
from langchain_core.tools import tool
from corpus import bm25_search, open_document, find_in_document


@tool
def search(query: str) -> str:
    """Search the document corpus for passages relevant to the query.
    Returns a ranked list of document IDs with short snippets.
    Use this to discover which documents are relevant before opening them.
    """
    results = bm25_search(query, top_k=3)
    if not results:
        return "No documents matched the query."
    lines = []
    for r in results:
        lines.append(f"[{r['doc_id']}] score={r['score']}\n  {r['snippet']}")
    return "\n\n".join(lines)


@tool
def find(doc_id: str, keyword: str) -> str:
    """Find sentences containing a keyword inside a specific document.
    Use this to navigate within a document without reading the full text.
    Args:
        doc_id: The document identifier returned by the search tool.
        keyword: A word or short phrase to locate inside the document.
    """
    hits = find_in_document(doc_id, keyword)
    if not hits:
        return f"Keyword '{keyword}' not found in {doc_id}."
    return "\n".join(f"- {s}" for s in hits)


@tool
def open_doc(doc_id: str) -> str:
    """Open and return the full text of a document.
    Use this when snippets are insufficient and you need the complete content.
    Args:
        doc_id: The document identifier returned by the search tool.
    """
    return open_document(doc_id)


@tool
def summarize(doc_id: str, focus: str) -> str:
    """Return the portion of a document most relevant to a focus topic.
    This is a lightweight extractive summary: it returns sentences that
    contain any word from the focus phrase.
    Args:
        doc_id: The document identifier.
        focus: A topic or question to focus the summary on.
    """
    hits = find_in_document(doc_id, focus.split()[0])
    if not hits:
        return f"No content about '{focus}' found in {doc_id}."
    return "\n".join(f"- {s}" for s in hits[:5])

from tools import search, find, open_doc, summarize

print(search.invoke({"query": "Arctic sea ice decline rate"}))
print("\ntools_ok")

Step 3: Build the LangGraph agent

The agent is a StateGraph with two nodes: agent (the LLM deciding what to do next) and tools (executing the chosen tool). A conditional edge routes back to agent after each tool call, or exits to END when the model produces a final answer without a tool call [1].

# filename: agent.py
from __future__ import annotations
import os
from typing import Annotated
from langchain_core.messages import BaseMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from typing_extensions import TypedDict
from tools import search, find, open_doc, summarize

TOOLS = [search, find, open_doc, summarize]

SYSTEM_PROMPT = """You are a research assistant with access to a document corpus.
Use the available tools to iteratively retrieve and validate information before answering.

Strategy:
1. Start with `search` to identify relevant documents.
2. Use `find` to locate specific facts within a document without reading everything.
3. Use `open_doc` when you need the full document text.
4. Use `summarize` to extract focused content on a sub-topic.
5. Repeat steps 1-4 as needed until you have sufficient evidence.
6. Provide a concise, grounded answer citing the document IDs you used.

Never guess. If the corpus does not contain the answer, say so explicitly."""


class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


def build_agent(model_name: str = "gpt-4o-mini") -> object:
    llm = ChatOpenAI(model=model_name, temperature=0)
    llm_with_tools = llm.bind_tools(TOOLS)

    def agent_node(state: AgentState) -> AgentState:
        messages = [SystemMessage(content=SYSTEM_PROMPT)] + state["messages"]
        response = llm_with_tools.invoke(messages)
        return {"messages": [response]}

    def should_continue(state: AgentState) -> str:
        last = state["messages"][-1]
        if hasattr(last, "tool_calls") and last.tool_calls:
            return "tools"
        return END

    tool_node = ToolNode(TOOLS)

    graph = StateGraph(AgentState)
    graph.add_node("agent", agent_node)
    graph.add_node("tools", tool_node)
    graph.set_entry_point("agent")
    graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    graph.add_edge("tools", "agent")

    return graph.compile()

Verify the graph compiles and its structure is correct:

from agent import build_agent

app = build_agent()
print("Nodes:", list(app.get_graph().nodes.keys()))
print("agent_graph_ok")

Step 4: Run the agent on a sample question

This block requires a live OpenAI API key. It demonstrates the iterative tool-calling loop on a question that requires synthesizing two documents.

from langchain_core.messages import HumanMessage
from agent import build_agent

app = build_agent()

question = "What was the rate of Arctic sea ice decline per decade, and what greenhouse gas does permafrost thaw release?"

result = app.invoke({"messages": [HumanMessage(content=question)]})

for msg in result["messages"]:
    role = getattr(msg, "type", type(msg).__name__)
    content = msg.content if isinstance(msg.content, str) else str(msg.content)
    if content:
        print(f"[{role}] {content[:300]}")
        print()

Step 5: Build the evaluation harness

The eval harness measures two metrics that mirror the AgenticRAG paper [2]: recall@1 (did the agent retrieve the correct document at least once?) and factuality (does the final answer contain the expected key facts?).

# filename: eval_harness.py
from __future__ import annotations
import re
from dataclasses import dataclass, field
from typing import Any
from langchain_core.messages import HumanMessage, ToolMessage
from agent import build_agent


@dataclass
class EvalCase:
    question: str
    relevant_doc_ids: list[str]  # docs that must be retrieved for recall@1
    key_facts: list[str]         # substrings that must appear in the final answer


EVAL_SET: list[EvalCase] = [
    EvalCase(
        question="What percentage of global VC funding did AI startups represent in 2023?",
        relevant_doc_ids=["doc_finance_002"],
        key_facts=["30"],
    ),
    EvalCase(
        question="When did the FDA approve the first CRISPR therapy and for which diseases?",
        relevant_doc_ids=["doc_biology_001"],
        key_facts=["December 2023", "sickle cell"],
    ),
    EvalCase(
        question="What was the S&P 500 return in 2023?",
        relevant_doc_ids=["doc_finance_001"],
        key_facts=["26.3"],
    ),
    EvalCase(
        question="How much of the human genome codes for proteins?",
        relevant_doc_ids=["doc_biology_002"],
        key_facts=["1.5"],
    ),
    EvalCase(
        question="What was the largest annual increase in renewable energy capacity and which technology led it?",
        relevant_doc_ids=["doc_climate_001"],
        key_facts=["295", "solar"],
    ),
]


def extract_retrieved_doc_ids(messages: list[Any]) -> set[str]:
    """Collect all doc_ids that appeared in tool call arguments or results."""
    ids: set[str] = set()
    for msg in messages:
        # Check tool call arguments
        if hasattr(msg, "tool_calls"):
            for tc in msg.tool_calls:
                args = tc.get("args", {})
                for v in args.values():
                    if isinstance(v, str) and v.startswith("doc_"):
                        ids.add(v)
        # Check tool message content for doc IDs
        if isinstance(msg, ToolMessage):
            found = re.findall(r"doc_[a-z]+_\d+", msg.content)
            ids.update(found)
    return ids


def recall_at_1(retrieved: set[str], relevant: list[str]) -> float:
    """1.0 if any relevant doc was retrieved, else 0.0."""
    return 1.0 if any(d in retrieved for d in relevant) else 0.0


def factuality_score(answer: str, key_facts: list[str]) -> float:
    """Fraction of key facts present in the answer (case-insensitive)."""
    if not key_facts:
        return 1.0
    hits = sum(1 for f in key_facts if f.lower() in answer.lower())
    return hits / len(key_facts)


def run_eval(model_name: str = "gpt-4o-mini") -> dict[str, Any]:
    app = build_agent(model_name)
    results = []

    for case in EVAL_SET:
        state = app.invoke({"messages": [HumanMessage(content=case.question)]})
        messages = state["messages"]

        # Final answer is the last AI message with non-empty text content
        final_answer = ""
        for msg in reversed(messages):
            content = msg.content if isinstance(msg.content, str) else ""
            if content and not getattr(msg, "tool_calls", None):
                final_answer = content
                break

        retrieved = extract_retrieved_doc_ids(messages)
        r1 = recall_at_1(retrieved, case.relevant_doc_ids)
        fact = factuality_score(final_answer, case.key_facts)

        results.append({
            "question": case.question[:60],
            "recall@1": r1,
            "factuality": round(fact, 2),
            "retrieved": sorted(retrieved),
            "answer_snippet": final_answer[:120],
        })

    avg_recall = sum(r["recall@1"] for r in results) / len(results)
    avg_factuality = sum(r["factuality"] for r in results) / len(results)

    return {
        "per_case": results,
        "avg_recall@1": round(avg_recall, 3),
        "avg_factuality": round(avg_factuality, 3),
    }

Step 6: Compare single-shot vs. agentic retrieval

To reproduce the paper’s core finding [2], run a baseline that does a single BM25 search and answers directly, then compare it to the full agentic pipeline.

# filename: baseline.py
from __future__ import annotations
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from corpus import bm25_search
from eval_harness import EVAL_SET, factuality_score


def run_baseline(model_name: str = "gpt-4o-mini") -> dict:
    llm = ChatOpenAI(model=model_name, temperature=0)
    results = []

    for case in EVAL_SET:
        # Single-shot: one search, stuff top-3 into the prompt, generate
        hits = bm25_search(case.question, top_k=3)
        context = "\n\n".join(
            f"[{h['doc_id']}]\n{h['snippet']}" for h in hits
        )
        messages = [
            SystemMessage(content="Answer the question using only the provided context."),
            HumanMessage(content=f"Context:\n{context}\n\nQuestion: {case.question}"),
        ]
        response = llm.invoke(messages)
        answer = response.content

        retrieved = {h["doc_id"] for h in hits}
        r1 = 1.0 if any(d in retrieved for d in case.relevant_doc_ids) else 0.0
        fact = factuality_score(answer, case.key_facts)

        results.append({
            "question": case.question[:60],
            "recall@1": r1,
            "factuality": round(fact, 2),
        })

    avg_recall = sum(r["recall@1"] for r in results) / len(results)
    avg_factuality = sum(r["factuality"] for r in results) / len(results)
    return {
        "per_case": results,
        "avg_recall@1": round(avg_recall, 3),
        "avg_factuality": round(avg_factuality, 3),
    }

Verify it works

This block runs the full comparison without an API key by exercising only the corpus and eval infrastructure. The agent and baseline blocks that call the LLM are marked to skip.

from corpus import bm25_search, find_in_document
from tools import search, find, open_doc, summarize
from eval_harness import (
    EVAL_SET,
    extract_retrieved_doc_ids,
    recall_at_1,
    factuality_score,
)
from agent import build_agent

# Verify corpus search
assert len(bm25_search("CRISPR FDA approval", top_k=2)) >= 1, "search returned nothing"

# Verify find
hits = find_in_document("doc_biology_001", "CRISPR")
assert len(hits) >= 1, "find returned nothing"

# Verify tool wrappers
result = search.invoke({"query": "Arctic sea ice"})
assert "doc_climate" in result, "search tool did not return climate doc"

find_result = find.invoke({"doc_id": "doc_climate_002", "keyword": "methane"})
assert "methane" in find_result.lower(), "find tool did not locate methane"

# Verify eval metrics
fake_retrieved = {"doc_finance_002"}
assert recall_at_1(fake_retrieved, ["doc_finance_002"]) == 1.0
assert recall_at_1(set(), ["doc_finance_002"]) == 0.0
assert factuality_score("AI raised 30 percent of VC", ["30"]) == 1.0
assert factuality_score("no relevant info", ["30"]) == 0.0

# Verify graph compiles
app = build_agent()
nodes = list(app.get_graph().nodes.keys())
assert "agent" in nodes and "tools" in nodes, f"unexpected nodes: {nodes}"

print(f"Eval set size: {len(EVAL_SET)} cases")
print(f"Graph nodes: {nodes}")
print("all_checks_passed")

To run the live eval against the API (requires your key to be set), execute:

import json
from eval_harness import run_eval
from baseline import run_baseline

baseline = run_baseline()
agentic = run_eval()

print("=== Baseline (single-shot) ===")
print(f"  avg recall@1:   {baseline['avg_recall@1']}")
print(f"  avg factuality: {baseline['avg_factuality']}")

print("\n=== Agentic RAG ===")
print(f"  avg recall@1:   {agentic['avg_recall@1']}")
print(f"  avg factuality: {agentic['avg_factuality']}")

print("\n=== Per-case breakdown (agentic) ===")
for r in agentic["per_case"]:
    print(f"  Q: {r['question']}")
    print(f"     recall@1={r['recall@1']}  factuality={r['factuality']}")
    print(f"     retrieved={r['retrieved']}")

The most significant factor is the shift from single-shot retrieval to agentic tool use, accounting for a 5.9x improvement in the AgenticRAG ablation study.

Troubleshooting

ModuleNotFoundError: No module named 'langchain_openai': Run uv pip install langchain-openai (note the hyphen). The package name on PyPI uses a hyphen but imports with an underscore.

The agent loops indefinitely without reaching END: The should_continue function checks last.tool_calls. If your LLM version returns tool calls in a different attribute, add a max_iterations guard: pass recursion_limit=10 to app.invoke({...}, config={"recursion_limit": 10}).

recall@1 is 0.0 for all cases in the baseline: BM25 on short snippets is sensitive to query phrasing. Try expanding the top_k argument in run_baseline from 3 to 5, or add synonyms to the query before searching.

factuality_score returns 0.0 even though the answer looks correct: The metric is substring-based and case-insensitive. If the model writes “26.3%” but key_facts contains "26.3", it should match. Check that the expected string is not wrapped in extra whitespace or punctuation in your EvalCase definition.

ToolMessage content is empty in extract_retrieved_doc_ids: Some tool errors return an empty string. Add a fallback: check msg.additional_kwargs or log the raw tool output to diagnose which tool is failing silently.

RateLimitError from OpenAI during eval: The eval loop runs 5 questions sequentially. Add import time; time.sleep(1) between cases in run_eval if you hit rate limits on a free-tier key.

Next steps

Add a re-ranking step between search and find. A cross-encoder re-ranker (e.g., sentence-transformers/cross-encoder/ms-marco-MiniLM-L-6-v2) can improve precision before the agent starts navigating documents.
Persist state across sessions using LangGraph’s built-in checkpointing. Pass a MemorySaver to graph.compile(checkpointer=MemorySaver()) to resume interrupted retrieval loops [1].
Scale the corpus by replacing bm25_search with a call to a real vector store (Chroma, Pinecone, or pgvector). The tool interface stays identical; only the retrieval backend changes.
Extend the eval set with questions that require multi-document synthesis (e.g., “Compare AI VC investment growth to renewable energy capacity growth in 2022-2023”). These are the cases where the agentic loop’s iterative retrieval produces the largest gains over single-shot baselines [2].

FAQ

How does agentic retrieval improve over single-shot RAG?

Agentic retrieval allows the model to iteratively search, navigate within documents, and validate partial answers before committing, rather than embedding a query once and stuffing top-k chunks into a prompt. The AgenticRAG paper shows this shift alone accounts for a 5.9x improvement in retrieval quality.

What are the four tools in the AgenticRAG pipeline?

The four tools are: search (find relevant documents by BM25 ranking), find (locate sentences containing a keyword within a document), open_doc (retrieve the full text of a document), and summarize (extract sentences focused on a topic).

How does the LangGraph agent decide when to stop retrieving?

The agent uses a conditional edge that checks whether the last message contains tool calls. If it does, the graph routes to the tools node; if not, the agent has produced a final answer and the graph exits to END.

What metrics does the evaluation harness measure?

The harness measures recall@1 (whether the correct document was retrieved at least once) and factuality (the fraction of expected key facts present in the final answer), mirroring the metrics in the AgenticRAG paper.

Can the corpus backend be swapped for a production vector store?

Yes. The tool interface remains identical; only the retrieval backend in bm25_search needs to change. The tutorial shows how to replace it with calls to Chroma, Pinecone, or pgvector without modifying the agent or eval harness.