# RAG Pipeline with Retrieval Tracing and Eval Harness Using LangSmith

> Build a retrieval-augmented generation pipeline that traces every retrieval step in LangSmith and runs a repeatable eval harness scoring faithfulness and answer relevance across a fixed test set, producing a pass/fail report you can gate on in CI.

- Canonical URL: https://agentry.press/tutorial/rag-pipeline-with-retrieval-tracing-and-eval-harness-using-langsmith/
- Type: Tutorial
- Published: 2026-05-31
- By: agentry
- Tags: rag, langsmith, retrieval, evaluation, langchain, openai

---

## Why this matters

Production RAG systems fail in ways that are invisible without structured tracing. A retrieval step that returns semantically distant chunks, a reranker that silently drops context, or a generator that hallucinates despite good retrieval — none of these surface in aggregate accuracy numbers alone. Platforms like Tencent's WeKnora [1] demonstrate that turning raw documents into a queryable knowledge base is now table-stakes; the differentiator is knowing *why* a given query failed and being able to prove the pipeline improved after a change.

LangSmith's dataset and evaluation APIs let you attach a fixed test set to a pipeline and score every run against it. Without that harness, teams iterate on chunking strategies or embedding models and have no reproducible signal — only vibes from spot-checking. This tutorial wires the full loop: a FAISS-backed retriever, a generation chain, LangSmith tracing on every retrieval call, and an evaluator that scores faithfulness and answer relevance, then prints a pass/fail report you can read in CI output or in the LangSmith UI.

## Prerequisites

- Python 3.11 or 3.12
- A LangSmith account with an API key (`LANGSMITH_API_KEY`)
- An OpenAI API key (`OPENAI_API_KEY`)
- Familiarity with vector stores and basic LangChain concepts
- The `langsmith` CLI is optional; all operations here use the Python SDK

## Setup

Install the required packages. The `faiss-cpu` package is a prebuilt CPU-only wheel, so no CUDA toolkit is needed.

```bash
uv pip install langchain langchain-openai langchain-community faiss-cpu langsmith openai tiktoken
```

Export your credentials. The LangSmith SDK reads `LANGSMITH_API_KEY` and `LANGCHAIN_TRACING_V2` automatically.

```bash
export LANGSMITH_API_KEY="your-langsmith-api-key"
export OPENAI_API_KEY="your-openai-api-key"
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_PROJECT="rag-eval-tutorial"
```
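A missing variable tends to surface later as an authentication error deep inside a run, so it can help to check the environment up front. The sketch below is a hypothetical helper (`missing_env_vars` is not part of LangSmith or LangChain) that reports which of the variables exported above are unset or blank.

```python
import os
from typing import Mapping, Optional, Sequence

# Variables this tutorial exports; adjust the tuple if your setup differs.
REQUIRED_VARS = (
    "LANGSMITH_API_KEY",
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
)


def missing_env_vars(
    required: Sequence[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` at the top of a script and exiting when the list is non-empty is one way to fail fast before any API call is made.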

## Step 1: Build the document store

Create a small but realistic corpus. In a real project you would load PDFs or web pages; here you define the documents inline so the tutorial runs without external data.

```python
# filename: corpus.py
from langchain_core.documents import Document

DOCS = [
    Document(
        page_content=(
            "The transformer architecture was introduced in the paper "
            "'Attention Is All You Need' by Vaswani et al. in 2017. "
            "It relies entirely on self-attention mechanisms and dispenses "
            "with recurrence and convolutions."
        ),
        metadata={"source": "transformers_overview", "topic": "architecture"},
    ),
    Document(
        page_content=(
            "BERT (Bidirectional Encoder Representations from Transformers) "
            "is a pre-trained language model developed by Google in 2018. "
            "It is trained on masked language modelling and next-sentence prediction."
        ),
        metadata={"source": "bert_overview", "topic": "pretraining"},
    ),
    Document(
        page_content=(
            "GPT models use a decoder-only transformer architecture and are "
            "trained with a causal language modelling objective. "
            "GPT-3, released by OpenAI in 2020, has 175 billion parameters."
        ),
        metadata={"source": "gpt_overview", "topic": "pretraining"},
    ),
    Document(
        page_content=(
            "Retrieval-Augmented Generation (RAG) combines a retriever with "
            "a generative model. The retriever fetches relevant documents from "
            "a corpus; the generator conditions its output on those documents. "
            "RAG reduces hallucination by grounding responses in retrieved evidence."
        ),
        metadata={"source": "rag_overview", "topic": "rag"},
    ),
    Document(
        page_content=(
            "FAISS (Facebook AI Similarity Search) is a library for efficient "
            "similarity search and clustering of dense vectors. It supports "
            "exact and approximate nearest-neighbour search and runs on CPU and GPU."
        ),
        metadata={"source": "faiss_overview", "topic": "vector_search"},
    ),
    Document(
        page_content=(
            "LangSmith is an observability and evaluation platform for LLM "
            "applications. It captures traces of every LLM call, tool invocation, "
            "and retrieval step, and lets you run dataset-backed evaluations "
            "to score faithfulness, relevance, and correctness."
        ),
        metadata={"source": "langsmith_overview", "topic": "observability"},
    ),
]
```

Now build the FAISS index from those documents.

```python
# filename: vector_store.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from corpus import DOCS

def build_vector_store() -> FAISS:
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    store = FAISS.from_documents(DOCS, embeddings)
    return store
```
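Conceptually, the index answers one question: which stored vectors are closest to the query vector? The sketch below shows exact nearest-neighbour search by brute force over toy 3-dimensional vectors. It is not how FAISS is implemented internally, and the vectors and the `top_k` helper are made up for illustration; real embeddings from `text-embedding-3-small` have far more dimensions.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float], vectors: dict[str, Sequence[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors closest to the query."""
    ranked = sorted(vectors, key=lambda doc_id: l2_distance(query, vectors[doc_id]))
    return ranked[:k]


# Toy "embeddings" keyed by source names from corpus.py (values are invented).
TOY_VECTORS = {
    "bert_overview": [0.9, 0.1, 0.0],
    "gpt_overview": [0.8, 0.3, 0.1],
    "faiss_overview": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], TOY_VECTORS, k=2))
# → ['bert_overview', 'gpt_overview']
```

The `k` argument here plays the same role as the `k` passed to the retriever in the next step: it caps how many chunks reach the prompt.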

## Step 2: Build the traced RAG chain

The chain wraps the retriever and a generation step. LangChain's LCEL pipes emit spans automatically when `LANGCHAIN_TRACING_V2=true`, so every retrieval call and LLM invocation appears in LangSmith without manual instrumentation.

```python
# filename: rag_chain.py
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI
from vector_store import build_vector_store

RAG_PROMPT = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant. Answer the question using ONLY the "
            "provided context. If the context does not contain enough information, "
            "say 'I don't know based on the provided context.'\n\nContext:\n{context}",
        ),
        ("human", "{question}"),
    ]
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def build_rag_chain(k: int = 2):
    store = build_vector_store()
    retriever = store.as_retriever(search_kwargs={"k": k})

    chain = (
        {
            "context": retriever | RunnableLambda(format_docs),
            "question": RunnablePassthrough(),
        }
        | RAG_PROMPT
        | ChatOpenAI(model="gpt-4o-mini", temperature=0)
        | StrOutputParser()
    )
    return chain
```
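The dict at the head of the chain fans the incoming question out to two branches and merges the results into the prompt variables. The plain-Python sketch below mirrors that data flow with a stand-in retriever. `FakeDoc`, `fake_retriever`, and `prepare_prompt_inputs` are hypothetical names used only for illustration, and no LangChain code runs here.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document; only page_content is used."""
    page_content: str


def fake_retriever(question: str) -> list[FakeDoc]:
    # A real retriever would embed the question and query the index.
    return [
        FakeDoc("GPT-3 has 175 billion parameters."),
        FakeDoc("GPT models are decoder-only."),
    ]


def format_docs(docs: list[FakeDoc]) -> str:
    """Join chunk texts with blank lines, as rag_chain.py does."""
    return "\n\n".join(doc.page_content for doc in docs)


def prepare_prompt_inputs(question: str) -> dict[str, str]:
    """Mimic the dict step: one branch retrieves and formats, the other passes through."""
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }


inputs = prepare_prompt_inputs("How many parameters does GPT-3 have?")
print(inputs["context"])
```

The resulting dict has exactly the two keys that `RAG_PROMPT` expects, `context` and `question`, which is why the chain is invoked with a bare question string rather than a dict.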

## Step 3: Create the LangSmith dataset

A LangSmith dataset is a fixed collection of (input, expected output) pairs. The evaluator runs your chain against every example and scores each one. Create the dataset once; subsequent runs reuse it.

```python
# filename: create_dataset.py
from langsmith import Client

DATASET_NAME = "rag-eval-tutorial-dataset"

EXAMPLES = [
    {
        "inputs": {"question": "What architecture does BERT use?"},
        "outputs": {
            "answer": "BERT uses the transformer architecture and is trained with "
            "masked language modelling and next-sentence prediction."
        },
    },
    {
        "inputs": {"question": "How many parameters does GPT-3 have?"},
        "outputs": {"answer": "GPT-3 has 175 billion parameters."},
    },
    {
        "inputs": {"question": "What is RAG and how does it reduce hallucination?"},
        "outputs": {
            "answer": "RAG combines a retriever with a generative model. The retriever "
            "fetches relevant documents; the generator conditions its output on them. "
            "This grounds responses in retrieved evidence and reduces hallucination."
        },
    },
    {
        "inputs": {"question": "What is FAISS used for?"},
        "outputs": {
            "answer": "FAISS is used for efficient similarity search and clustering of "
            "dense vectors, supporting exact and approximate nearest-neighbour search."
        },
    },
    {
        "inputs": {"question": "Who introduced the transformer architecture?"},
        "outputs": {
            "answer": "The transformer architecture was introduced by Vaswani et al. in 2017 "
            "in the paper 'Attention Is All You Need'."
        },
    },
]


def get_or_create_dataset(client: Client) -> str:
    # Filter server-side by name instead of listing every dataset in the workspace.
    existing = list(client.list_datasets(dataset_name=DATASET_NAME))
    if existing:
        print(f"Dataset already exists: {existing[0].id}")
        return existing[0].id

    dataset = client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Fixed test set for RAG eval tutorial",
    )
    client.create_examples(
        inputs=[e["inputs"] for e in EXAMPLES],
        outputs=[e["outputs"] for e in EXAMPLES],
        dataset_id=dataset.id,
    )
    print(f"Created dataset: {dataset.id} with {len(EXAMPLES)} examples")
    return dataset.id
```
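Because the dataset is created once and then reused, a malformed example is easy to upload and awkward to fix later. A minimal sketch of a pre-upload check follows. `validate_examples` is a hypothetical helper, not a LangSmith API, and it assumes the `inputs.question` / `outputs.answer` shape used in `EXAMPLES` above.

```python
def validate_examples(examples: list[dict]) -> list[str]:
    """Return human-readable problems found in the examples; empty means OK."""
    problems = []
    seen_questions = set()
    for i, example in enumerate(examples):
        question = example.get("inputs", {}).get("question")
        answer = example.get("outputs", {}).get("answer")
        if not isinstance(question, str) or not question.strip():
            problems.append(f"example {i}: missing or empty inputs.question")
        elif question in seen_questions:
            problems.append(f"example {i}: duplicate question")
        else:
            seen_questions.add(question)
        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"example {i}: missing or empty outputs.answer")
    return problems


sample = [
    {"inputs": {"question": "What is FAISS used for?"}, "outputs": {"answer": "Similarity search."}},
    {"inputs": {"question": ""}, "outputs": {}},
]
print(validate_examples(sample))
```

Running the check on `EXAMPLES` before `client.create_examples` and aborting on a non-empty result would be one way to wire it in.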

## Step 4: Write the evaluators

Two evaluators score each (question, answer) pair. The faithfulness evaluator checks whether the answer contains only information consistent with the reference answer. This is a proxy for groundedness, because the target returns only the answer text and not the retrieved context. The relevance evaluator checks whether the answer addresses the question. Both use an LLM judge pattern, which is the standard approach when you lack ground-truth labels for every possible answer phrasing.

```python
# filename: evaluators.py
from langsmith.schemas import Run, Example
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

_judge = None


def _get_judge():
    global _judge
    if _judge is None:
        _judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return _judge


_FAITHFULNESS_PROMPT = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an evaluation assistant. Given a question, a generated answer, "
            "and a reference answer, score the generated answer on FAITHFULNESS: "
            "does it contain only information that is consistent with the reference? "
            "Reply with a single integer score: 1 (faithful) or 0 (unfaithful). "
            "Reply with ONLY the integer, nothing else.",
        ),
        (
            "human",
            "Question: {question}\n\nReference answer: {reference}\n\nGenerated answer: {answer}",
        ),
    ]
)

_RELEVANCE_PROMPT = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an evaluation assistant. Given a question and a generated answer, "
            "score the answer on RELEVANCE: does it directly address what the question asks? "
            "Reply with a single integer score: 1 (relevant) or 0 (not relevant). "
            "Reply with ONLY the integer, nothing else.",
        ),
        ("human", "Question: {question}\n\nGenerated answer: {answer}"),
    ]
)


def faithfulness_evaluator(run: Run, example: Example) -> dict:
    question = example.inputs.get("question", "")
    reference = (example.outputs or {}).get("answer", "")
    generated = run.outputs.get("output", "") if run.outputs else ""

    judge = _get_judge()
    chain = _FAITHFULNESS_PROMPT | judge
    result = chain.invoke(
        {"question": question, "reference": reference, "answer": generated}
    )
    try:
        score = int(result.content.strip())
    except ValueError:
        score = 0

    return {"key": "faithfulness", "score": score}


def relevance_evaluator(run: Run, example: Example) -> dict:
    question = example.inputs.get("question", "")
    generated = run.outputs.get("output", "") if run.outputs else ""

    judge = _get_judge()
    chain = _RELEVANCE_PROMPT | judge
    result = chain.invoke({"question": question, "answer": generated})
    try:
        score = int(result.content.strip())
    except ValueError:
        score = 0

    return {"key": "relevance", "score": score}
```
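The evaluators above parse the judge reply with a bare `int(...)`, so a reply such as "Score: 1" falls through to the default of 0. A more tolerant parser could look like the sketch below. `parse_binary_score` is a hypothetical helper, and the regular expression is one possible choice rather than an established convention.

```python
import re

# A standalone 0 or 1: not part of a longer number such as "10" or "0.5".
_BINARY_SCORE = re.compile(r"(?<![\d.])[01](?!\d|\.\d)")


def parse_binary_score(raw: str, default: int = 0) -> int:
    """Extract the first standalone 0/1 from a judge reply, else return `default`."""
    match = _BINARY_SCORE.search(raw)
    return int(match.group()) if match else default


print(parse_binary_score("Score: 1"))
# → 1
```

Swapping this in for the `try/except ValueError` block would keep the 0 fallback for unparseable replies while accepting common wrappers around the digit.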

## Step 5: Run the eval harness

The harness creates the dataset if it doesn't exist, runs the RAG chain against every example, collects scores from both evaluators, and prints a pass/fail report. A run passes if the mean faithfulness and relevance scores both meet the threshold.

```python
# filename: eval_harness.py
from langsmith import Client
from langsmith.evaluation import evaluate
from create_dataset import get_or_create_dataset, DATASET_NAME
from rag_chain import build_rag_chain
from evaluators import faithfulness_evaluator, relevance_evaluator

FAITHFULNESS_THRESHOLD = 0.8
RELEVANCE_THRESHOLD = 0.8


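_chain = None


def target(inputs: dict) -> dict:
    # Build the chain once and reuse it; rebuilding per example would
    # re-embed the whole corpus on every call.
    global _chain
    if _chain is None:
        _chain = build_rag_chain(k=2)
    answer = _chain.invoke(inputs["question"])
    return {"output": answer}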


def run_eval():
    client = Client()
    get_or_create_dataset(client)

    results = evaluate(
        target,
        data=DATASET_NAME,
        evaluators=[faithfulness_evaluator, relevance_evaluator],
        experiment_prefix="rag-tutorial",
        metadata={"model": "gpt-4o-mini", "retriever_k": 2},
        client=client,
    )

    # Collect scores from the results
    faithfulness_scores = []
    relevance_scores = []

    for result in results:
        eval_results = result.get("evaluation_results", {})
        for er in eval_results.get("results", []):
            if er.key == "faithfulness":
                faithfulness_scores.append(er.score if er.score is not None else 0)
            elif er.key == "relevance":
                relevance_scores.append(er.score if er.score is not None else 0)

    mean_faithfulness = (
        sum(faithfulness_scores) / len(faithfulness_scores)
        if faithfulness_scores
        else 0.0
    )
    mean_relevance = (
        sum(relevance_scores) / len(relevance_scores) if relevance_scores else 0.0
    )

    print("\n=== RAG Eval Report ===")
    print(f"Examples evaluated : {len(faithfulness_scores)}")
    print(f"Mean faithfulness  : {mean_faithfulness:.2f} (threshold {FAITHFULNESS_THRESHOLD})")
    print(f"Mean relevance     : {mean_relevance:.2f} (threshold {RELEVANCE_THRESHOLD})")

    faith_pass = mean_faithfulness >= FAITHFULNESS_THRESHOLD
    rel_pass = mean_relevance >= RELEVANCE_THRESHOLD
    overall = faith_pass and rel_pass

    print(f"Faithfulness       : {'PASS' if faith_pass else 'FAIL'}")
    print(f"Relevance          : {'PASS' if rel_pass else 'FAIL'}")
    print(f"Overall            : {'PASS' if overall else 'FAIL'}")
    print("======================\n")

    return overall


if __name__ == "__main__":
    passed = run_eval()
    raise SystemExit(0 if passed else 1)
```
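The gating arithmetic is easy to check in isolation. The sketch below restates it as a pure function with hard-coded scores. `gate` is a hypothetical name, and the score lists are invented rather than taken from a real run.

```python
from statistics import mean


def gate(scores: dict[str, list[float]], thresholds: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail per metric plus an 'overall' entry."""
    report = {}
    for metric, threshold in thresholds.items():
        values = scores.get(metric, [])
        # An empty score list counts as a failure rather than a vacuous pass.
        report[metric] = bool(values) and mean(values) >= threshold
    report["overall"] = all(report.values())
    return report


# Four of five answers judged faithful gives a mean of 0.8, which meets the threshold.
result = gate(
    {"faithfulness": [1, 1, 1, 1, 0], "relevance": [1, 1, 1, 1, 1]},
    {"faithfulness": 0.8, "relevance": 0.8},
)
print(result)
# → {'faithfulness': True, 'relevance': True, 'overall': True}
```

With five examples and a 0.8 threshold, one failed example still passes and two do not, so the gate's granularity depends directly on the size of the test set.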

## Step 6: Verify the module structure

Before running the full eval (which requires API keys), verify that all modules import cleanly. `build_rag_chain` constructs the FAISS index inside the function body and the evaluators create the judge model lazily, so the structural check below imports every module without triggering any API calls.

```python
import importlib
import sys

# Verify all modules are importable
modules = ["corpus", "vector_store", "rag_chain", "create_dataset", "evaluators", "eval_harness"]
for mod in modules:
    try:
        importlib.import_module(mod)
        print(f"OK: {mod}")
    except Exception as e:
        print(f"FAIL: {mod} -> {e}")
        sys.exit(1)

print("All modules imported successfully")
```

## Verify it works

With your API keys set, run the full eval harness:

```bash
python eval_harness.py
```

Expected output shape (scores will vary by model response):

```
Created dataset: <uuid> with 5 examples

=== RAG Eval Report ===
Examples evaluated : 5
Mean faithfulness  : 0.80 (threshold 0.8)
Mean relevance     : 1.00 (threshold 0.8)
Faithfulness       : PASS
Relevance          : PASS
Overall            : PASS
======================
```

The LangSmith UI at `https://smith.langchain.com` will show the experiment on the `rag-eval-tutorial-dataset` dataset page, with a name that starts with the `rag-tutorial` prefix passed to `evaluate`. Chain invocations made outside `evaluate` are traced to the `rag-eval-tutorial` project instead. Each run has a full trace: the retrieval step with the documents returned, the prompt sent to the LLM, and the generated answer. The experiment view shows per-example scores and aggregate metrics across runs, so you can compare a `k=2` retriever against `k=4` side by side.

To run the harness in CI, add this to your pipeline:

```yaml
- name: RAG eval
  env:
    LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    LANGCHAIN_TRACING_V2: "true"
    LANGCHAIN_PROJECT: rag-eval-tutorial
  run: python eval_harness.py
```

The script exits with code 1 if either threshold fails, which fails the CI step.

## Troubleshooting

**`langsmith.utils.LangSmithAuthError: API key not found`** — The `LANGSMITH_API_KEY` environment variable is not set or is empty. Run `echo $LANGSMITH_API_KEY` to confirm it is exported in the current shell session.

**`openai.AuthenticationError: Incorrect API key`** — The `OPENAI_API_KEY` value is wrong or expired. Verify it at `https://platform.openai.com/api-keys` and re-export.

**`ModuleNotFoundError: No module named 'faiss'`** — The `faiss-cpu` wheel did not install. Run `uv pip install faiss-cpu` again; on some platforms you may need `pip install faiss-cpu --no-binary :all:` to trigger a source build.

**Eval scores are all 0 even though answers look correct** — The LLM judge returned a non-integer string (e.g. "Score: 1"). The `try/except ValueError` in the evaluators defaults to 0 in that case. Inspect `result.content` by adding a `print` inside the evaluator to see the raw response, then tighten the prompt to enforce integer-only output.

**Dataset does not pick up edits to `EXAMPLES`** — `get_or_create_dataset` checks by name and returns early if the dataset exists, so examples are never duplicated, but changes to `EXAMPLES` are not uploaded either. If you want a fresh dataset, delete it in the LangSmith UI or change `DATASET_NAME` to a new string.

**Traces not appearing in LangSmith** — Confirm `LANGCHAIN_TRACING_V2=true` is exported in the shell that launches Python (not just set in a `.env` file that isn't loaded). Set it before the process starts: depending on the SDK version, the value may be read once and cached, so setting it partway through a run may not enable tracing for that process.

## Next steps

- **Swap the retriever**: Replace FAISS with a persistent store such as Chroma or pgvector. The chain interface stays identical; only `build_vector_store` changes.
- **Add a reranker**: Insert a cross-encoder reranking step between the retriever and the prompt. Trace the reranked document list as a separate span by wrapping it in a `@traceable` decorator from `langsmith`.
- **Extend the test set**: Load your real support tickets or user queries into the LangSmith dataset via `client.create_examples`. Larger datasets surface edge cases that a five-example set misses.
- **Compare experiments**: Run the harness with `k=2` and `k=4`, then use the LangSmith experiment comparison view to see which retriever depth improves faithfulness without hurting latency.
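
For the reranker step, the shape of the function matters more than the scoring model. The sketch below reorders candidate chunks with a toy token-overlap score standing in for a cross-encoder. `token_overlap` and `rerank` are hypothetical names, and a real cross-encoder would replace `score_fn`. In the actual pipeline, `rerank` is the function you would wrap for tracing.

```python
from typing import Callable, Sequence


def token_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


def rerank(
    query: str,
    texts: Sequence[str],
    score_fn: Callable[[str, str], float] = token_overlap,
    top_n: int = 2,
) -> list[str]:
    """Sort candidate texts by score, highest first, and keep the top_n."""
    return sorted(texts, key=lambda text: score_fn(query, text), reverse=True)[:top_n]


candidates = [
    "FAISS supports similarity search over dense vectors",
    "BERT is trained with masked language modelling",
    "GPT-3 has 175 billion parameters",
]
print(rerank("how is bert trained", candidates, top_n=1))
# → ['BERT is trained with masked language modelling']
```

Retrieving with a larger `k` and then keeping only `top_n` reranked chunks is the usual arrangement, since the reranker can only promote chunks the retriever already returned.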

## FAQ

### How does LangSmith tracing capture retrieval steps without manual instrumentation?

LangChain's LCEL pipes emit spans automatically when the LANGCHAIN_TRACING_V2 environment variable is set to true, so every retrieval call and LLM invocation appears in LangSmith without explicit instrumentation code.

### What do the faithfulness and relevance evaluators measure?

The faithfulness evaluator checks whether the generated answer contains only information consistent with the reference answer. The relevance evaluator checks whether the answer directly addresses what the question asks. Both use an LLM judge pattern.

### How can the eval harness be integrated into CI?

Add a CI step that sets the required environment variables (LANGSMITH_API_KEY, OPENAI_API_KEY, LANGCHAIN_TRACING_V2, LANGCHAIN_PROJECT) and runs the eval_harness.py script, which exits with code 1 if either the faithfulness or relevance threshold fails.

### What happens if the LLM judge returns a non-integer response?

The evaluators catch the `ValueError` and default the score to 0, which lowers the mean for that metric and can push it below the threshold. Inspecting the raw response and tightening the prompt to enforce integer-only output usually resolves this.

### How does the dataset persist across multiple eval runs?

The get_or_create_dataset function checks for an existing dataset by name and returns early if found, so subsequent runs reuse the same fixed test set without duplicating examples.

## References

1. https://github.com/Tencent/WeKnora