Why this matters

Eval harnesses for retrieval pipelines are now a first-order production concern. Without scored, trace-linked evaluations, teams iterate on chunking strategies, embedding models, and prompt templates in the dark: a change that improves one query cluster silently degrades another, and the regression only surfaces in user complaints. Braintrust’s scoring model links every LLM call to a numeric score and a stored trace, so you can diff two experiment runs the same way you diff two code commits. This tutorial builds that harness end-to-end: a FAISS-backed retrieval chain, two LLM-as-judge scorers (faithfulness and answer relevance), and a Braintrust experiment that persists results for comparison. The same pattern applies to any retrieval pipeline where you need to catch regressions before they reach users.

Prerequisites

  • Python 3.11 or later
  • An OpenAI API key (for embeddings and the judge LLM calls)
  • A Braintrust account (free tier is sufficient) and a Braintrust API key
  • Basic familiarity with LangChain chains and retrievers

Setup

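Install the required packages. langchain-community provides the FAISS vector store integration, langchain-openai provides the OpenAI wrappers, braintrust is the evaluation SDK, and autoevals supplies the LLM-as-judge scorers used in Step 4. The command below uses uv; plain pip install works with the same package list.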

uv pip install langchain-core langchain-community langchain-openai faiss-cpu openai braintrust autoevals

Export your API keys in the shell you will run the tutorial from. Every later command in that session inherits these variables.

export OPENAI_API_KEY="sk-your-openai-key-here"
export BRAINTRUST_API_KEY="your-braintrust-key-here"
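
A missing key otherwise only shows up when the first API call fails, so it can help to check the environment up front. The following is a minimal sketch of such a check; the helper name missing_keys and the REQUIRED_KEYS tuple are illustrative and not part of any library.

```python
import os

# Names of the variables this tutorial expects (taken from the export lines above).
REQUIRED_KEYS = ("OPENAI_API_KEY", "BRAINTRUST_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

# Typical use at the top of a script:
#   missing = missing_keys()
#   if missing:
#       raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dict instead of os.environ keeps the helper easy to test without touching the real environment.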

Step 1: Build the Document Corpus and Vector Store

The corpus here is a small set of hand-written passages about three topics: the water cycle, photosynthesis, and plate tectonics. In a real pipeline you would load PDFs or database rows; the structure is identical.

# filename: corpus.py
from langchain_core.documents import Document

DOCUMENTS = [
    Document(
        page_content=(
            "The water cycle describes the continuous movement of water on, above, "
            "and below Earth's surface. Evaporation converts liquid water into vapor, "
            "which rises, cools, and condenses into clouds. Precipitation returns water "
            "to the surface as rain or snow. Runoff and infiltration complete the cycle "
            "by returning water to rivers, lakes, and groundwater."
        ),
        metadata={"topic": "water_cycle"},
    ),
    Document(
        page_content=(
            "Photosynthesis is the process by which plants, algae, and some bacteria "
            "convert light energy into chemical energy stored as glucose. The reaction "
            "takes place primarily in chloroplasts and requires carbon dioxide, water, "
            "and sunlight. Oxygen is released as a byproduct. The overall equation is: "
            "6CO2 + 6H2O + light -> C6H12O6 + 6O2."
        ),
        metadata={"topic": "photosynthesis"},
    ),
    Document(
        page_content=(
            "Plate tectonics is the scientific theory explaining the large-scale motion "
            "of Earth's lithospheric plates. Plates move due to convection currents in "
            "the mantle. Their interactions cause earthquakes, volcanic activity, and "
            "the formation of mountain ranges. The theory unified earlier concepts such "
            "as continental drift and seafloor spreading."
        ),
        metadata={"topic": "plate_tectonics"},
    ),
    Document(
        page_content=(
            "Evaporation is a key phase of the water cycle. It occurs when water at the "
            "surface absorbs enough energy to transition from liquid to gas. Factors "
            "affecting evaporation rate include temperature, humidity, wind speed, and "
            "the surface area of the water body."
        ),
        metadata={"topic": "water_cycle"},
    ),
    Document(
        page_content=(
            "Chlorophyll is the pigment responsible for the green color of plants and "
            "for capturing light energy during photosynthesis. It absorbs light most "
            "efficiently in the blue and red wavelengths and reflects green light. "
            "Chlorophyll a and chlorophyll b are the two main types found in plants."
        ),
        metadata={"topic": "photosynthesis"},
    ),
]

Now build the FAISS index from those documents using OpenAI embeddings.

# filename: vector_store.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from corpus import DOCUMENTS

def build_vector_store():
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    store = FAISS.from_documents(DOCUMENTS, embeddings)
    return store

Step 2: Build the Retrieval Chain

The chain retrieves the top-k passages for a query, formats them into a context block, and calls GPT-4o-mini to produce an answer. It returns the answer, the formatted context string built from the retrieved passages, and the original question, so the scorer can check faithfulness against the context the model actually saw.

# filename: rag_chain.py
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from vector_store import build_vector_store

RAG_PROMPT = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a helpful assistant. Answer the question using ONLY the context "
        "provided below. If the context does not contain enough information, say "
        "'I don't have enough information to answer that.'\n\nContext:\n{context}",
    ),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def build_rag_chain(k: int = 2):
    store = build_vector_store()
    retriever = store.as_retriever(search_kwargs={"k": k})

    def retrieve_and_answer(question: str) -> dict:
        docs = retriever.invoke(question)
        context = format_docs(docs)
        llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
        chain = RAG_PROMPT | llm | StrOutputParser()
        answer = chain.invoke({"context": context, "question": question})
        return {
            "answer": answer,
            "context": context,
            "question": question,
        }

    return retrieve_and_answer
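
To make the "top-k" step concrete, here is a toy, dependency-free sketch of ranking documents by similarity to a query vector and keeping the best k. It is an illustration only: the three-dimensional vectors are invented, cosine similarity is chosen for readability, and the real FAISS index may rank by a different distance metric.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings: one made-up vector per passage.
TOY_VECTORS = {
    "water_cycle": [1.0, 0.1, 0.0],
    "evaporation": [0.9, 0.3, 0.0],
    "photosynthesis": [0.0, 1.0, 0.1],
    "plate_tectonics": [0.0, 0.1, 1.0],
}

query = [1.0, 0.2, 0.0]  # stands in for the embedding of a water-cycle question
print(top_k(query, TOY_VECTORS, k=2))  # ['water_cycle', 'evaporation']
```

Raising k pulls in lower-ranked passages, which is the kind of configuration change the eval harness is meant to measure.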

Step 3: Define the Eval Dataset

Each example in the dataset has an input (the question), an expected answer (the ground-truth reference), and a metadata field that records which topic the question covers. The expected answers are used by the answer-relevance scorer; the faithfulness scorer uses only the retrieved context and the generated answer.

# filename: eval_dataset.py
EVAL_EXAMPLES = [
    {
        "input": "What is the role of evaporation in the water cycle?",
        "expected": (
            "Evaporation converts liquid water into vapor, which rises and eventually "
            "condenses into clouds, driving the continuous movement of water through "
            "the water cycle."
        ),
        "metadata": {"topic": "water_cycle"},
    },
    {
        "input": "What inputs does photosynthesis require?",
        "expected": (
            "Photosynthesis requires carbon dioxide, water, and sunlight (light energy)."
        ),
        "metadata": {"topic": "photosynthesis"},
    },
    {
        "input": "Why do tectonic plates move?",
        "expected": (
            "Tectonic plates move because of convection currents in Earth's mantle."
        ),
        "metadata": {"topic": "plate_tectonics"},
    },
    {
        "input": "What wavelengths does chlorophyll absorb most efficiently?",
        "expected": (
            "Chlorophyll absorbs light most efficiently in the blue and red wavelengths."
        ),
        "metadata": {"topic": "photosynthesis"},
    },
    {
        "input": "What happens to water after precipitation?",
        "expected": (
            "After precipitation, water returns to the surface and then flows as runoff "
            "or infiltrates the ground, eventually reaching rivers, lakes, or groundwater."
        ),
        "metadata": {"topic": "water_cycle"},
    },
]
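
As the dataset grows, malformed examples become easy to miss. The sketch below checks the shape described above (string input and expected fields, dict metadata, no repeated questions). The function name validate_examples and the specific rules are assumptions for illustration, not a Braintrust requirement.

```python
def validate_examples(examples):
    """Return a list of problem descriptions; an empty list means the dataset looks well formed."""
    problems = []
    seen_inputs = set()
    for i, ex in enumerate(examples):
        # Both text fields must be present and non-blank.
        for key in ("input", "expected"):
            value = ex.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"example {i}: '{key}' must be a non-empty string")
        # Metadata is optional, but must be a dict when present.
        if not isinstance(ex.get("metadata", {}), dict):
            problems.append(f"example {i}: 'metadata' must be a dict")
        # Duplicate questions make per-question comparisons ambiguous.
        question = ex.get("input")
        if isinstance(question, str):
            if question in seen_inputs:
                problems.append(f"example {i}: duplicate input")
            seen_inputs.add(question)
    return problems
```

Calling validate_examples(EVAL_EXAMPLES) before an eval run is a cheap way to fail early, before any API call is made.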

Step 4: Write the Scorers

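Two scorers run on every example. The faithfulness scorer checks whether the generated answer is supported by the retrieved context, which flags hallucinated claims. The relevance scorer checks whether the answer actually addresses the question, using the ground-truth expected answer as a reference. Both use the Factuality scorer from autoevals, Braintrust’s open-source scorer library, which wraps an LLM-as-judge call and returns a score between 0 and 1. Factuality is designed to compare an answer against a reference answer, so passing the retrieved context as the reference is an approximation of a faithfulness check rather than a purpose-built one.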

# filename: scorers.py
from autoevals import Factuality

def score_faithfulness(output: dict, expected: str) -> dict:
    """
    Checks whether the answer is grounded in the retrieved context.
    Uses the context as the 'expected' reference for the Factuality scorer.
    """
    scorer = Factuality()
    result = scorer(
        output=output["answer"],
        expected=output["context"],
        input=output["question"],
    )
    return {"name": "faithfulness", "score": result.score, "metadata": result.metadata}

def score_answer_relevance(output: dict, expected: str) -> dict:
    """
    Checks whether the answer addresses the question, using the ground-truth
    expected answer as the reference.
    """
    scorer = Factuality()
    result = scorer(
        output=output["answer"],
        expected=expected,
        input=output["question"],
    )
    return {"name": "answer_relevance", "score": result.score, "metadata": result.metadata}
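
When testing the harness wiring without spending API calls, a deterministic stand-in scorer with the same return shape can be useful. The sketch below is not an autoevals scorer and not a substitute for an LLM judge: it is a hypothetical token-overlap measure, named score_token_overlap here, that reports what fraction of the expected answer's tokens appear in the generated answer.

```python
import re

def _tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score_token_overlap(output: dict, expected: str) -> dict:
    """Fraction of expected-answer tokens found in the answer, between 0.0 and 1.0."""
    expected_tokens = _tokens(expected)
    matched = expected_tokens & _tokens(output["answer"])
    # No reference tokens means there is nothing to score against.
    score = len(matched) / len(expected_tokens) if expected_tokens else None
    return {
        "name": "token_overlap",
        "score": score,
        "metadata": {"matched": sorted(matched)},
    }
```

Because it returns the same name, score, and metadata keys as the two scorers above, it can be swapped in for a dry run and swapped out again without touching the rest of the harness.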

Step 5: Wire the Braintrust Eval Harness

The harness iterates over the eval dataset, calls the RAG chain for each example, runs both scorers, and logs every result to a Braintrust experiment. Each call to experiment.log() creates a trace entry that links the input, output, scores, and metadata together in the Braintrust UI.

# filename: run_eval.py
import os
import braintrust
from rag_chain import build_rag_chain
from eval_dataset import EVAL_EXAMPLES
from scorers import score_faithfulness, score_answer_relevance

def run_eval(experiment_name: str = "rag-baseline"):
    api_key = os.environ.get("BRAINTRUST_API_KEY", "")
    project_name = "langchain-rag-eval"

    experiment = braintrust.init(
        project=project_name,
        experiment=experiment_name,
        api_key=api_key,
    )

    rag_chain = build_rag_chain(k=2)

    results = []
    for example in EVAL_EXAMPLES:
        question = example["input"]
        expected = example["expected"]

        # Call the chain; the single experiment.log() call below records one row per example.
        output = rag_chain(question)

        faith = score_faithfulness(output, expected)
        relevance = score_answer_relevance(output, expected)

        experiment.log(
            input=question,
            output=output["answer"],
            expected=expected,
            scores={
                faith["name"]: faith["score"],
                relevance["name"]: relevance["score"],
            },
            metadata=example.get("metadata", {}),
        )

        results.append({
            "question": question,
            "answer": output["answer"],
            "faithfulness": faith["score"],
            "answer_relevance": relevance["score"],
        })
        print(
            f"Q: {question[:60]}...\n"
            f"  faithfulness={faith['score']:.2f}  "
            f"relevance={relevance['score']:.2f}"
        )

    experiment.flush()
    print(f"\nExperiment '{experiment_name}' logged to project '{project_name}'.")
    return results

if __name__ == "__main__":
    run_eval()

Step 6: Run the Eval

This block executes the harness. It requires both API keys and makes real LLM calls, so run it only after exporting your keys.

from run_eval import run_eval

results = run_eval(experiment_name="rag-baseline")

# Print a summary table
print("\n--- Summary ---")
for r in results:
    print(
        f"{r['question'][:50]:50s} "
        f"faith={r['faithfulness']:.2f} "
        f"rel={r['answer_relevance']:.2f}"
    )
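
A per-question table gets hard to read as the dataset grows, so an aggregate view helps. The following sketch computes the mean and minimum of each score from the list that run_eval() returns. The function name summarize is an assumption for illustration, and None scores are skipped rather than counted as zero.

```python
from statistics import mean

def summarize(results, scorers=("faithfulness", "answer_relevance")):
    """Aggregate each scorer's values across results, ignoring missing (None) scores."""
    summary = {}
    for name in scorers:
        values = [r[name] for r in results if r.get(name) is not None]
        summary[name] = {
            "mean": mean(values) if values else None,
            "min": min(values) if values else None,
            "scored": len(values),  # how many examples actually received a score
        }
    return summary
```

The minimum is worth watching alongside the mean: one badly failing question can hide behind an otherwise healthy average.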

Verify it works

This block validates that the data modules load cleanly, the data structures are correct, and the required packages are importable, without making any API calls. Run it from the directory that contains corpus.py and eval_dataset.py.

import importlib.util
from pathlib import Path

# Verify module files exist and load from the current directory
modules_to_check = [
    "corpus",
    "eval_dataset",
]

for mod_name in modules_to_check:
    spec = importlib.util.spec_from_file_location(mod_name, Path.cwd() / f"{mod_name}.py")
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    print(f"OK: {mod_name} loaded")

# Verify corpus shape
from corpus import DOCUMENTS
assert len(DOCUMENTS) == 5, f"Expected 5 documents, got {len(DOCUMENTS)}"
assert all(hasattr(d, 'page_content') for d in DOCUMENTS), "Documents missing page_content"
print(f"OK: corpus has {len(DOCUMENTS)} documents")

# Verify eval dataset shape
from eval_dataset import EVAL_EXAMPLES
assert len(EVAL_EXAMPLES) == 5, f"Expected 5 examples, got {len(EVAL_EXAMPLES)}"
assert all('input' in e and 'expected' in e for e in EVAL_EXAMPLES), "Examples missing required keys"
print(f"OK: eval dataset has {len(EVAL_EXAMPLES)} examples")

# Verify braintrust and autoevals are importable
import braintrust
import autoevals
from importlib.metadata import version
print(f"OK: braintrust=={version('braintrust')}")
print(f"OK: autoevals=={version('autoevals')}")

# Verify langchain packages
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
print("OK: langchain_core imports")

from langchain_community.vectorstores import FAISS
print("OK: FAISS import")

print("\nAll checks passed.")

Comparing Experiments for Regression Detection

Once you have a baseline experiment logged, you can run a second experiment with a different configuration (more retrieved chunks, a different prompt, a different embedding model) and compare the two in the Braintrust UI. The experiment comparison view shows per-question score deltas, so you can see exactly which questions regressed and inspect the full trace for each.

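To run a second experiment, first change the configuration under test. Renaming the experiment does not change retrieval: run_eval() calls build_rag_chain(k=2), so edit that call to k=3 in run_eval.py before running the command below.

python -c "
from run_eval import run_eval
run_eval(experiment_name='rag-k3-retrieval')
"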

In the Braintrust UI, navigate to your project, select both experiments, and click “Compare”. Each row in the comparison table is a question; each column is a scorer. Red cells indicate regressions; green cells indicate improvements.
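
The same per-question comparison can be approximated locally from the lists that run_eval() returns, which is handy for a quick check before opening the UI. This is a sketch under assumptions: the function name score_deltas and the tolerance parameter are illustrative, and questions are matched by their exact text.

```python
def score_deltas(baseline, candidate, scorer="faithfulness", tolerance=0.0):
    """Compare two result lists question by question and flag score drops."""
    base_by_question = {r["question"]: r[scorer] for r in baseline}
    rows = []
    for r in candidate:
        question = r["question"]
        base_score = base_by_question.get(question)
        # Skip questions missing from the baseline or lacking a score on either side.
        if base_score is None or r[scorer] is None:
            continue
        delta = r[scorer] - base_score
        rows.append({
            "question": question,
            "delta": delta,
            "regressed": delta < -tolerance,
        })
    return rows
```

A small positive tolerance (for example 0.05) avoids flagging judge noise as a regression, since LLM-as-judge scores can vary slightly between runs.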

Troubleshooting

ModuleNotFoundError: No module named 'faiss': The faiss-cpu package installs the faiss C extension. If you see this error, confirm the install completed without errors. On some Linux builds, faiss-cpu requires libgomp; install it with apt-get install libgomp1 if you are running in a custom container.

AuthenticationError from OpenAI: The OPENAI_API_KEY environment variable is not set or is set to a placeholder. Export the real key before running any block that calls OpenAIEmbeddings or ChatOpenAI.

braintrust.errors.AuthenticationError or HTTP 401: The BRAINTRUST_API_KEY is missing or incorrect. Generate a new key at https://www.braintrust.dev/app/settings and re-export it.

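Scores are all None or 0.0: The autoevals scorers make OpenAI API calls internally, so a rate limit or exhausted quota affects scoring as well as generation, and scores may come back as None. Check your OpenAI usage dashboard and retry once the limit resets; this harness already runs examples sequentially, so there is no concurrency to lower. Note that the print statements in run_eval.py format scores with :.2f, which raises a TypeError if a score is None.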

experiment.log() raises ValueError: experiment not initialized: You called experiment.log() outside the scope of a braintrust.init() context. Confirm braintrust.init() returned successfully before calling log().

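FAISS index is empty or returns wrong documents: build_vector_store() runs when build_rag_chain() is called, not when vector_store is imported. If OPENAI_API_KEY is unset at that point, the embedding client raises an error rather than producing an empty index, so export the key before running the harness. If retrieval succeeds but returns off-topic passages, check the value of k and confirm the corpus contains a passage that answers the question.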

Next Steps

  • Add a chunking experiment: Split longer documents with RecursiveCharacterTextSplitter at different chunk sizes and compare faithfulness scores across experiments to find the optimal chunk size for your corpus.
  • Swap the embedding model: Replace text-embedding-3-small with text-embedding-3-large or an open-source model via langchain-huggingface, run the same eval, and compare retrieval quality through the answer-relevance score.
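  • Track latency: Measure time.perf_counter() deltas around the rag_chain call and log them as a metric or metadata field named latency_ms, rather than as a score, because Braintrust scores are expected to fall between 0 and 1. This tracks speed regressions alongside quality regressions.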
  • Integrate with CI: Call run_eval() in a GitHub Actions workflow on every pull request that touches the prompt template or retriever configuration, and fail the build if the mean faithfulness score drops below a threshold.
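
For the latency idea in the list above, the timing itself needs only the standard library. Below is a minimal sketch; the wrapper name timed_call is an assumption, and how the value is attached to the logged record is left to your logging setup.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed milliseconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Inside the eval loop this would replace the direct chain call:
#   output, latency_ms = timed_call(rag_chain, question)
```

time.perf_counter() is preferable to time.time() here because it is monotonic and unaffected by system clock adjustments.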

FAQ

How does the faithfulness scorer work in this setup?

The faithfulness scorer uses the Factuality LLM-as-judge from autoevals to check whether the generated answer is grounded in the retrieved context, which detects hallucinations rather than preventing them. It compares the answer against the actual retrieved context passages rather than a reference answer.

What does the answer relevance scorer measure?

The answer relevance scorer uses the Factuality scorer to check whether the generated answer actually addresses the question by comparing it against the ground-truth expected answer provided in the eval dataset.

How can you compare two experiment runs to detect regressions?

After logging a baseline experiment and a second experiment with different configuration (e.g., different retrieval k, prompt, or embedding model), you navigate to the Braintrust UI, select both experiments, and click Compare. The comparison table shows per-question score deltas with red cells indicating regressions and green cells indicating improvements.

What information does each trace entry store in Braintrust?

Each call to experiment.log() creates a trace entry that links the input question, generated output, both scorer results (faithfulness and answer relevance scores), and metadata such as the topic, enabling side-by-side comparison of experiment runs.

Why is the retrieved context passed to the faithfulness scorer?

The faithfulness scorer uses the retrieved context as the reference to verify that the answer is supported by actual retrieved documents, rather than checking against a separate expected answer. This catches hallucinations where the model generates plausible-sounding information not present in the retrieval results.