Multimodal Retrieval Pipeline with LlamaIndex AgentMesh Trust Layer

Why this matters

LlamaIndex 0.14.15 shipped two capabilities simultaneously that, in combination, change how production multi-agent retrieval systems are built: multimodal prompt template primitives and the AgentMesh trust layer [1]. Before this release, wiring a multimodal ingestion agent to a downstream query agent required hand-rolled message passing with no enforcement of which agents could call which. AgentMesh closes that gap by introducing a declarative trust layer that gates inter-agent calls at runtime [1]. At the same time, the new MultimodalPromptTemplate and MultimodalChatPromptHelper primitives give the ingestion side a first-class API for mixing image and text content blocks, rather than the previous workaround of stuffing base64 strings into plain text nodes.

The combination matters because agentic systems that operate across perception modalities face out-of-distribution inputs by design [2]. A retrieval pipeline that ingests PDFs, screenshots, and diagrams alongside prose cannot rely on a single embedding pass. The two-agent split in this tutorial reflects that operational reality: one agent owns the messy ingestion boundary, the other owns the clean query interface, and AgentMesh ensures neither can be spoofed by an untrusted caller.

Prerequisites

Python 3.11 or 3.12
An OpenAI API key (the tutorial uses gpt-4o for multimodal ingestion and text-embedding-3-small for embeddings; an Anthropic key works with minor model substitutions)
Familiarity with LlamaIndex query engines and the VectorStoreIndex API
Basic understanding of vector stores and embedding-based retrieval

Setup

Install the required packages. The tutorial pins llama-index-core to 0.14.15 because the multimodal prompt template API and AgentMesh integration are new in that release [1]. Everything else resolves freely.

uv pip install "llama-index-core==0.14.15" \
    llama-index-agent-agentmesh \
    llama-index-llms-openai \
    llama-index-embeddings-openai \
    llama-index-embeddings-huggingface \
    llama-index-multi-modal-llms-openai \
    pillow \
    pytest

Export your API key. Every block that calls OpenAI carries a skip_execution_reason marker, so the sandbox does not attempt live API calls.

export OPENAI_API_KEY="sk-your-key-here"

Step 1: Synthetic multimodal corpus

Create a small corpus of text nodes and a synthetic image node. In a real pipeline these would come from a PDF parser or an image store. Here you generate them programmatically so the tutorial runs without external files.

# filename: corpus.py
import base64
import io

from llama_index.core.schema import ImageNode, TextNode
from PIL import Image, ImageDraw

def make_synthetic_image_b64(label: str) -> str:
    """Return a tiny PNG encoded as a base64 string."""
    img = Image.new("RGB", (64, 64), color=(30, 80, 160))
    draw = ImageDraw.Draw(img)
    draw.text((4, 24), label[:6], fill=(255, 255, 255))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


TEXT_NODES = [
    TextNode(
        text="The Eiffel Tower is a wrought-iron lattice tower in Paris, France, "
             "completed in 1889 as the entrance arch for the 1889 World's Fair.",
        metadata={"source": "wiki_eiffel", "modality": "text"},
        id_="node_eiffel_text",
    ),
    TextNode(
        text="The Colosseum in Rome is an elliptical amphitheatre built between "
             "70 and 80 AD. It could hold between 50,000 and 80,000 spectators.",
        metadata={"source": "wiki_colosseum", "modality": "text"},
        id_="node_colosseum_text",
    ),
    TextNode(
        text="The Great Wall of China stretches over 21,196 kilometres and was "
             "built across multiple dynasties starting in the 7th century BC.",
        metadata={"source": "wiki_great_wall", "modality": "text"},
        id_="node_great_wall_text",
    ),
    TextNode(
        text="Machu Picchu is a 15th-century Inca citadel located in the Eastern "
             "Cordillera of southern Peru at an elevation of 2,430 metres.",
        metadata={"source": "wiki_machu_picchu", "modality": "text"},
        id_="node_machu_picchu_text",
    ),
]

IMAGE_NODES = [
    ImageNode(
        image=make_synthetic_image_b64("Eiffel"),
        metadata={"source": "img_eiffel", "modality": "image",
                  "caption": "Photograph of the Eiffel Tower at dusk"},
        id_="node_eiffel_image",
    ),
]

ALL_NODES = TEXT_NODES + IMAGE_NODES

Verify the corpus module loads and the nodes are constructed correctly.

from corpus import ALL_NODES, TEXT_NODES, IMAGE_NODES

print(f"Text nodes : {len(TEXT_NODES)}")
print(f"Image nodes: {len(IMAGE_NODES)}")
print(f"Total nodes: {len(ALL_NODES)}")
print("corpus_ok")
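
A further check, sketched here on the assumption that every image payload is a base64-encoded PNG like the ones make_synthetic_image_b64 produces, is to decode the string and compare its first eight bytes with the PNG signature. The helper name is_png_b64 is hypothetical; it is not part of LlamaIndex or Pillow, and the payloads below are stand-ins rather than real images.

```python
import base64
import binascii

# The fixed eight-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png_b64(payload: str) -> bool:
    """Return True if payload is valid base64 that decodes to PNG-signed bytes."""
    try:
        raw = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return False
    return raw.startswith(PNG_SIGNATURE)


# Stand-in payloads; in the tutorial you would pass node.image for each ImageNode.
good = base64.b64encode(PNG_SIGNATURE + b"rest-of-file").decode()
bad = base64.b64encode(b"GIF89a...").decode()
print(is_png_b64(good), is_png_b64(bad), is_png_b64("not base64!"))
```

Running the same check over IMAGE_NODES before ingestion would catch a corrupted or mis-encoded payload before it reaches the vision LLM.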

Step 2: Multimodal prompt template for ingestion

The MultimodalPromptTemplate introduced in 0.14.15 [1] lets you define a prompt that carries both text and image content blocks. The templates in this step are plain PromptTemplate objects instead: the ingestion template receives the image as a base64 string variable and asks the vision LLM for a text summary of each image node, which is then stored alongside the image for hybrid retrieval.

# filename: mm_templates.py
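from llama_index.core.prompts import PromptTemplate
# Not used by the templates below; this import only fails fast when the
# installed llama-index-core predates the multimodal prompt primitives.
from llama_index.core.prompts.base import ImageBlockMessage  # noqa: F401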

# Multimodal ingestion prompt: summarise an image into indexable text.
# The {image_b64} variable is substituted at call time with the base64 PNG.
IMAGE_SUMMARY_TMPL_STR = (
    "You are an expert archivist. "
    "Describe the following image in two to three sentences suitable for "
    "full-text search. Focus on identifiable landmarks, colours, and context.\n"
    "Image (base64 PNG): {image_b64}"
)

IMAGE_SUMMARY_TMPL = PromptTemplate(IMAGE_SUMMARY_TMPL_STR)

# Text-only query prompt used by the query agent.
QUERY_TMPL_STR = (
    "You are a helpful assistant with access to a knowledge base about "
    "world landmarks.\n"
    "Context:\n"
    "{context_str}\n\n"
    "Question: {query_str}\n"
    "Answer concisely in one to two sentences."
)

QUERY_TMPL = PromptTemplate(QUERY_TMPL_STR)

from mm_templates import IMAGE_SUMMARY_TMPL, QUERY_TMPL

# Confirm template variable names are correct.
print("image tmpl vars:", IMAGE_SUMMARY_TMPL.template_vars)
print("query tmpl vars:", QUERY_TMPL.template_vars)
print("templates_ok")
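
To make the variable check concrete, here is a standard-library sketch of how the named placeholders in a format string can be listed and filled. It illustrates what the template_vars check is verifying; LlamaIndex's own implementation may differ, and extract_template_vars is a hypothetical helper name.

```python
from string import Formatter


def extract_template_vars(template: str) -> list[str]:
    """List the named {placeholders} in a format string, in order, without duplicates."""
    seen: list[str] = []
    for _literal, field_name, _spec, _conv in Formatter().parse(template):
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


# A shortened copy of the query template from mm_templates.py.
query_tmpl = (
    "Context:\n{context_str}\n\n"
    "Question: {query_str}\n"
    "Answer concisely in one to two sentences."
)
print(extract_template_vars(query_tmpl))
print(query_tmpl.format(context_str="The Colosseum is in Rome.", query_str="Where is it?"))
```

A misspelled placeholder shows up immediately in the extracted list, which is cheaper to catch here than as a KeyError at query time.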

Step 3: Ingestion agent

The ingestion agent builds a VectorStoreIndex from the corpus. For image nodes it calls the vision LLM to generate a text summary (using the multimodal template), then adds that summary as an additional TextNode so the vector index can embed it. In the sandbox this LLM call is skipped and a stub summary is used instead.

# filename: ingestion_agent.py
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.schema import TextNode
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.vector_stores import SimpleVectorStore
from corpus import ALL_NODES, IMAGE_NODES
from mm_templates import IMAGE_SUMMARY_TMPL


def summarise_image_node_stub(image_node) -> str:
    """Stub summary used when no LLM key is available."""
    caption = image_node.metadata.get("caption", "")
    return f"[Stub summary] {caption} Source: {image_node.metadata.get('source', '')}"


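def summarise_image_node_live(image_node, llm) -> str:
    """Call the vision LLM to summarise an image node.

    Note: the base64 payload is interpolated into the prompt text, so the
    model receives it as a string rather than as an image content block.
    """
    prompt = IMAGE_SUMMARY_TMPL.format(image_b64=image_node.image)
    response = llm.complete(prompt)
    return str(response)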


def build_index(use_live_llm: bool = False, llm=None, embed_model=None):
    """
    Ingest ALL_NODES into a VectorStoreIndex.
    Image nodes get a companion TextNode with an LLM-generated (or stub) summary.
    Returns the index.
    """
    nodes_to_index = list(ALL_NODES)

    for img_node in IMAGE_NODES:
        if use_live_llm and llm is not None:
            summary_text = summarise_image_node_live(img_node, llm)
        else:
            summary_text = summarise_image_node_stub(img_node)

        summary_node = TextNode(
            text=summary_text,
            metadata={
                "source": img_node.metadata["source"] + "_summary",
                "modality": "image_summary",
                "parent_image_id": img_node.node_id,
            },
            id_=img_node.node_id + "_summary",
        )
        nodes_to_index.append(summary_node)

    storage_context = StorageContext.from_defaults(
        docstore=SimpleDocumentStore(),
        index_store=SimpleIndexStore(),
        vector_store=SimpleVectorStore(),
    )

    if embed_model is None:
        # Offline fallback: a local HuggingFace model (downloaded on first use),
        # which needs the llama-index-embeddings-huggingface package.
        from llama_index.core.embeddings import resolve_embed_model
        embed_model = resolve_embed_model("local:BAAI/bge-small-en-v1.5")

    index = VectorStoreIndex(
        nodes=nodes_to_index,
        storage_context=storage_context,
        embed_model=embed_model,
        show_progress=False,
    )

    return index

Step 4: Query agent

The query agent wraps the index in a RetrieverQueryEngine and exposes two methods: retrieve(question), which returns the raw retrieved nodes, and answer(question), which returns a synthesised answer. It accepts the index as a dependency so the ingestion agent can hand it off cleanly.

# filename: query_agent.py
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.response_synthesizers import get_response_synthesizer
from mm_templates import QUERY_TMPL


class QueryAgent:
    """
    Thin wrapper around a RetrieverQueryEngine.
    Accepts an index built by the ingestion agent.
    """

    def __init__(self, index, llm=None, similarity_top_k: int = 3):
        retriever = VectorIndexRetriever(
            index=index,
            similarity_top_k=similarity_top_k,
        )
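        synth_kwargs = {}
        if llm is None:
            # Offline mode: use MockLLM so the synthesizer does not try to
            # resolve the default OpenAI LLM, which needs a valid API key.
            # Answers are then placeholder output, not real completions.
            from llama_index.core.llms import MockLLM
            llm = MockLLM()
        synth_kwargs["llm"] = llm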

        response_synthesizer = get_response_synthesizer(
            response_mode="compact",
            text_qa_template=QUERY_TMPL,
            **synth_kwargs,
        )

        self._engine = RetrieverQueryEngine(
            retriever=retriever,
            response_synthesizer=response_synthesizer,
        )

    def retrieve(self, question: str):
        """Return the raw retrieved nodes without synthesis."""
        return self._engine.retriever.retrieve(question)

    def answer(self, question: str) -> str:
        """Return a synthesised answer string."""
        response = self._engine.query(question)
        return str(response)

Step 5: AgentMesh trust layer

AgentMesh [1] introduces a declarative trust registry that controls which agents may invoke which. You define each agent as a named principal, declare the allowed call edges, and then either wrap your callables with @mesh.trusted_call or, as this tutorial does, call mesh.verify explicitly before a handoff. Any call that violates the registry raises AgentMeshTrustError at runtime.

# filename: mesh_config.py
try:
    from llama_index.agent.agentmesh import AgentMesh, AgentMeshTrustError
    AGENTMESH_AVAILABLE = True
except ImportError:
    # Graceful fallback when the optional package is not installed.
    AGENTMESH_AVAILABLE = False
    AgentMeshTrustError = RuntimeError


class FallbackMesh:
    """No-op mesh used when llama-index-agent-agentmesh is not installed."""

    def __init__(self, *args, **kwargs):
        pass

    def register_agent(self, name: str):
        # Mirrors the method that build_mesh() calls on the real mesh.
        pass

    def trusted_call(self, caller: str, callee: str):
        return lambda fn: fn

    def allow(self, caller: str, callee: str):
        pass

    def verify(self, caller: str, callee: str):
        pass


def build_mesh():
    if AGENTMESH_AVAILABLE:
        mesh = AgentMesh()
        # Register the two principals.
        mesh.register_agent("ingestion_agent")
        mesh.register_agent("query_agent")
        # Only the ingestion agent may hand off to the query agent.
        # The query agent may NOT call back into ingestion.
        mesh.allow(caller="ingestion_agent", callee="query_agent")
        return mesh
    else:
        return FallbackMesh()


MESH = build_mesh()

from mesh_config import MESH, AGENTMESH_AVAILABLE
print(f"AgentMesh available: {AGENTMESH_AVAILABLE}")
print("mesh_config_ok")
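
To illustrate the idea of a directed trust registry without depending on the optional package, the following is a minimal standard-library sketch. ToyMesh and ToyTrustError are hypothetical stand-ins written for this example; they mirror the method names used in mesh_config.py but are not the AgentMesh API and make no claim about how AgentMesh is implemented.

```python
import functools


class ToyTrustError(RuntimeError):
    """Raised when a caller is not allowed to invoke a callee."""


class ToyMesh:
    """Hypothetical stand-in for a trust registry; not the AgentMesh API."""

    def __init__(self) -> None:
        self._agents: set[str] = set()
        self._edges: set[tuple[str, str]] = set()

    def register_agent(self, name: str) -> None:
        self._agents.add(name)

    def allow(self, caller: str, callee: str) -> None:
        # Both ends must be known principals before an edge can be declared.
        for name in (caller, callee):
            if name not in self._agents:
                raise ToyTrustError(f"unknown agent: {name}")
        self._edges.add((caller, callee))

    def verify(self, caller: str, callee: str) -> None:
        # Edges are directed: (a, b) does not imply (b, a).
        if (caller, callee) not in self._edges:
            raise ToyTrustError(f"{caller} may not call {callee}")

    def trusted_call(self, caller: str, callee: str):
        """Decorator that runs verify() before every invocation."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                self.verify(caller, callee)
                return fn(*args, **kwargs)
            return wrapper
        return decorator


mesh = ToyMesh()
mesh.register_agent("ingestion_agent")
mesh.register_agent("query_agent")
mesh.allow(caller="ingestion_agent", callee="query_agent")


@mesh.trusted_call(caller="ingestion_agent", callee="query_agent")
def hand_off(index_name: str) -> str:
    return f"query agent received {index_name}"


print(hand_off("landmarks"))
try:
    mesh.verify(caller="query_agent", callee="ingestion_agent")
except ToyTrustError as exc:
    print("blocked:", exc)
```

The decorator form checks the edge on every call, while an explicit verify checks it once at the handoff point; which one fits depends on how often the callee is invoked.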

Step 6: Pipeline orchestrator

The orchestrator ties ingestion and query together, enforcing the trust boundary via the mesh before handing the index from one agent to the other.

# filename: pipeline.py
from ingestion_agent import build_index
from query_agent import QueryAgent
from mesh_config import MESH, AGENTMESH_AVAILABLE, AgentMeshTrustError


def run_pipeline(
    question: str,
    use_live_llm: bool = False,
    llm=None,
    embed_model=None,
) -> dict:
    """
    1. Ingestion agent builds the index.
    2. Mesh verifies the handoff is permitted.
    3. Query agent retrieves and answers.
    Returns a dict with 'answer', 'retrieved_sources', and 'retrieved_texts'.
    """
    # --- Ingestion agent phase ---
    index = build_index(
        use_live_llm=use_live_llm,
        llm=llm,
        embed_model=embed_model,
    )

    # --- Trust verification ---
    if AGENTMESH_AVAILABLE:
        try:
            MESH.verify(caller="ingestion_agent", callee="query_agent")
        except AgentMeshTrustError as exc:
            raise RuntimeError(f"AgentMesh blocked the handoff: {exc}") from exc

    # --- Query agent phase ---
    agent = QueryAgent(index=index, llm=llm)
    nodes = agent.retrieve(question)
    answer = agent.answer(question)

    return {
        "answer": answer,
        "retrieved_sources": [n.node.metadata.get("source") for n in nodes],
        "retrieved_texts": [n.node.get_content()[:120] for n in nodes],
    }

Step 7: Pytest evaluation harness

The harness runs the pipeline in offline mode (no LLM key needed) and checks that the correct source nodes are retrieved for each question. Retrieval accuracy is measured as the fraction of questions where the gold source appears in the top-k results.

# filename: test_pipeline.py
import pytest
from pipeline import run_pipeline

# Each entry: (question, gold_source_substring)
EVAL_CASES = [
    ("When was the Eiffel Tower completed?", "wiki_eiffel"),
    ("How many spectators could the Colosseum hold?", "wiki_colosseum"),
    ("How long is the Great Wall of China?", "wiki_great_wall"),
    ("Where is Machu Picchu located?", "wiki_machu_picchu"),
    ("What is the Eiffel Tower made of?", "wiki_eiffel"),
]


@pytest.fixture(scope="module")
def pipeline_results():
    """Run all eval cases once and cache results."""
    results = []
    for question, gold_source in EVAL_CASES:
        result = run_pipeline(question, use_live_llm=False)
        results.append({
            "question": question,
            "gold_source": gold_source,
            "retrieved_sources": result["retrieved_sources"],
            "answer": result["answer"],
        })
    return results


@pytest.mark.parametrize("case_idx,question,gold_source", [
    (i, q, g) for i, (q, g) in enumerate(EVAL_CASES)
])
def test_retrieval_hit(pipeline_results, case_idx, question, gold_source):
    """Gold source must appear in the retrieved sources for this question."""
    result = pipeline_results[case_idx]
    sources = result["retrieved_sources"]
    hit = any(gold_source in (s or "") for s in sources)
    assert hit, (
        f"Question: {question!r}\n"
        f"Expected source containing {gold_source!r}\n"
        f"Got: {sources}"
    )


def test_retrieval_accuracy(pipeline_results):
    """Overall hit-rate must be at least 80%."""
    hits = 0
    for r in pipeline_results:
        if any(r["gold_source"] in (s or "") for s in r["retrieved_sources"]):
            hits += 1
    accuracy = hits / len(pipeline_results)
    print(f"\nRetrieval accuracy: {accuracy:.0%} ({hits}/{len(pipeline_results)})")
    assert accuracy >= 0.80, f"Accuracy {accuracy:.0%} is below the 80% threshold"

Verify it works

Run the full pytest suite. The local embedding model (BAAI/bge-small-en-v1.5) is downloaded on first run (about 130 MB), so this block may take up to 60 seconds.

cd /workspace && python -m pytest test_pipeline.py -v --tb=short 2>&1 | tail -30

For a quick smoke test without pytest, run the pipeline directly and print the retrieved sources.

from pipeline import run_pipeline

result = run_pipeline("When was the Eiffel Tower completed?")
print("Answer  :", result["answer"][:200])
print("Sources :", result["retrieved_sources"])
print("smoke_test_ok")

Live LLM mode (requires API key)

Once you have OPENAI_API_KEY set, swap in the real models. This block is skipped in the sandbox.

import os
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from pipeline import run_pipeline

llm = OpenAI(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
)

result = run_pipeline(
    "What year was the Eiffel Tower built and what event prompted its construction?",
    use_live_llm=True,
    llm=llm,
    embed_model=embed_model,
)
print("Answer  :", result["answer"])
print("Sources :", result["retrieved_sources"])

Troubleshooting

ModuleNotFoundError: No module named 'llama_index.agent.agentmesh'

The llama-index-agent-agentmesh package is new in 0.14.15 [1] and is a separate install from llama-index-core. Run uv pip install llama-index-agent-agentmesh and confirm the version with pip show llama-index-agent-agentmesh.

ImportError from llama_index.core.prompts.base.ImageBlockMessage

The multimodal prompt primitives landed in llama-index-core==0.14.15 [1]. If you see this error, an older core version is installed. Run pip show llama-index-core and confirm the version is exactly 0.14.15.

Local embedding model download times out

The BAAI/bge-small-en-v1.5 model is about 130 MB. If your network is slow, pre-download it with python -c "from llama_index.core.embeddings import resolve_embed_model; resolve_embed_model('local:BAAI/bge-small-en-v1.5')" before running pytest.

Retrieval accuracy below 80%

With the local embedding model and stub image summaries this is unlikely, but if it happens, increase similarity_top_k in QueryAgent.__init__ from 3 to 5. The eval harness checks for source presence anywhere in the top-k list, so a larger k raises recall.
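
The effect of k on the hit rate can be sketched in a few lines of plain Python. The rankings below are made up for illustration and are not output from the pipeline; hit_rate_at_k is a hypothetical helper that applies the same substring match as the harness.

```python
def hit_rate_at_k(ranked_sources: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of questions whose gold source appears in the top-k ranked sources."""
    hits = sum(
        any(g in s for s in ranking[:k])
        for ranking, g in zip(ranked_sources, gold)
    )
    return hits / len(gold)


# Made-up rankings for three questions, best match first.
rankings = [
    ["wiki_eiffel", "img_eiffel_summary", "wiki_colosseum", "wiki_great_wall"],
    ["wiki_great_wall", "wiki_machu_picchu", "wiki_eiffel", "wiki_colosseum"],
    ["wiki_machu_picchu", "wiki_great_wall", "wiki_colosseum", "wiki_eiffel"],
]
gold = ["wiki_eiffel", "wiki_colosseum", "wiki_great_wall"]

for k in (1, 3, 4):
    print(k, round(hit_rate_at_k(rankings, gold, k), 2))
```

The hit rate can only stay level or rise as k grows. Because the tutorial corpus holds only six nodes, a k that approaches the corpus size makes the metric trivially 100%, so raise it sparingly.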

AgentMeshTrustError during the handoff

This fires if you call MESH.verify(caller="query_agent", callee="ingestion_agent") (the reverse direction), which is intentionally blocked. Check that your orchestrator always calls verify with caller="ingestion_agent" and callee="query_agent".

PIL.Image import error

The pillow package must be installed. Run uv pip install pillow and retry.

Next steps

Replace SimpleVectorStore with a persistent store such as Chroma or Qdrant so the index survives process restarts without re-ingestion.
Extend the multimodal ingestion agent to handle PDF pages by rendering each page to a PNG with pdf2image and passing it through IMAGE_SUMMARY_TMPL, then storing both the rendered image node and its summary node.
Add OpenTelemetry tracing via llama-index-observability-otel (also updated in 0.14.15 [1]) to capture per-agent span latency and export it to a local OpenTelemetry Collector for debugging slow retrieval paths.
Expand the eval harness with answer-quality checks: after each agent.answer() call, score the response against a reference answer, either with an LLM-as-judge or by asserting a minimum ROUGE-L or semantic-similarity threshold.
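
As a starting point for the answer-quality idea, here is a minimal sketch of a ROUGE-L F1 score computed from the longest common subsequence of whitespace tokens. It assumes lowercased tokens with no stemming or punctuation handling, so its scores will not match a full ROUGE implementation exactly. The function names, the example strings, and the 0.7 threshold are all hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens (no stemming)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the eiffel tower was completed in 1889"
answer = "the eiffel tower was finished in 1889"
score = rouge_l_f1(answer, reference)
print(round(score, 3))
assert score >= 0.7  # hypothetical threshold for a passing answer
```

A lexical score like this rewards word overlap rather than correctness, so it pairs best with short factual reference answers; paraphrased answers are better served by a semantic-similarity or judge-based check.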

Frequently Asked Questions

What is AgentMesh and how does it enforce trust between agents?

AgentMesh is a trust layer introduced in LlamaIndex 0.14.15 that uses a declarative registry to control which agents may invoke which other agents. It gates inter-agent calls at runtime and raises AgentMeshTrustError if a call violates the registered permissions, preventing spoofing by untrusted callers.

How does the multimodal ingestion agent handle image nodes?

In live mode, the ingestion agent sends each image node to a vision LLM (gpt-4o) using the IMAGE_SUMMARY_TMPL prompt and gets back a text summary; in offline mode, a stub summary built from the caption is used instead. Either way, the summary is stored as a companion TextNode alongside the image so the vector index can embed and retrieve it during queries.

What embedding model does the pipeline use for retrieval?

The pipeline defaults to a local embedding model (BAAI/bge-small-en-v1.5) for offline testing. In live mode, you pass OpenAI’s text-embedding-3-small to run_pipeline as embed_model.

How is retrieval accuracy measured in the pytest harness?

The harness runs five evaluation cases with known gold source documents and checks whether the correct source appears in the top-k retrieved results. Overall accuracy is the fraction of questions where the gold source is found, with a minimum threshold of 80%.

What happens if AgentMesh is not installed?

The pipeline includes a FallbackMesh class that acts as a no-op when llama-index-agent-agentmesh is unavailable, allowing the pipeline to run without trust enforcement but with all other functionality intact.