· Generative AI  · 5 min read

Evaluating Hallucinations: How to Trust Your RAG System

Your AI sounds confident, but is it lying? A deeply technical guide on building automated 'LLM-as-a-judge' evaluation pipelines using the RAG Triad and Python.

The biggest barrier to enterprise AI adoption is fear of hallucination. An LLM has no built-in notion of “I don’t know”: when it lacks a fact, it still predicts the most probable next token, confidently producing plausible-sounding but false information. In a consumer chatbot, this is amusing. In enterprise applications involving legal compliance, financial data, or medical advice, it is a catastrophic liability.

While Retrieval-Augmented Generation (RAG) mitigates hallucination by grounding the model in proprietary data, it does not eliminate it. To deploy Generative AI to production, you cannot rely on manual vibe-checks. You must implement rigorous, automated, programmatic evaluation.


1. The RAG Triad Framework

To measure trust mathematically, we break the RAG process down into three distinct evaluative links, known in the industry as the RAG Triad:

  1. Context Relevance (Query → Context): Did the Vector Database find useful documents? If the user asks about the HR leave policy, and the vector search returns the cafeteria menu, the Context Relevance score is 0.
  2. Groundedness (Context → Answer): Is the final answer entirely supported by the retrieved documents? If the AI invents a rule that isn’t in the provided text, Groundedness drops. This is the true measure of hallucination.
  3. Answer Relevance (Answer → Query): Did the AI actually answer the user’s specific question, or did it go on a tangent based on the retrieved context?
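
To make the three links concrete, here is a minimal sketch of how the scores for a single query might be held together in code. The names `TriadScores` and `weakest_link` are hypothetical (they do not come from TruLens or Ragas), and each score is assumed to be a float between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriadScores:
    """Hypothetical container for the three RAG Triad scores, each in [0, 1]."""
    context_relevance: float  # Query -> Context
    groundedness: float       # Context -> Answer
    answer_relevance: float   # Answer -> Query

    def weakest_link(self) -> str:
        """Return the name of the lowest-scoring link, i.e. where to look first."""
        scores = {
            "context_relevance": self.context_relevance,
            "groundedness": self.groundedness,
            "answer_relevance": self.answer_relevance,
        }
        return min(scores, key=scores.get)


# Made-up scores: retrieval found the right policy, but the answer invented a rule.
example = TriadScores(context_relevance=0.9, groundedness=0.3, answer_relevance=0.8)
print(example.weakest_link())
```

Reading the three scores side by side is what makes the Triad useful for debugging: a low Context Relevance points at retrieval, while a low Groundedness with good retrieval points at the generation step.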

2. Automated Evaluation: LLM-as-a-Judge

Traditional software unit tests (assert result == expected) do not work for non-deterministic text generation. Instead, the industry standard is the LLM-as-a-Judge pattern.

We use specialised frameworks such as TruLens or Ragas, which use a highly capable foundation model (such as GPT-4o or Claude 3.5 Sonnet) to score the outputs of your production application consistently against a fixed rubric.

The Judge Prompt

How does an LLM act as a judge? It uses a strict, zero-shot system prompt designed to force objective scoring. Here is an example of the prompt mechanics used to measure Groundedness:

You are an impartial, highly rigorous judge. 
Your task is to evaluate whether the GENERATED ANSWER is entirely grounded in the RETRIEVED CONTEXT.

Context: {retrieved_context}
Answer: {generated_answer}

Rate the answer on a scale of 0 to 1.
- Output 1.0 if the answer contains NO facts outside the context.
- Output 0.0 if the answer invents ANY fact not explicitly stated in the context.
Return ONLY the float value.
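
As a rough illustration of the mechanics, the sketch below fills in the template above and parses the judge's reply. It is not how TruLens or Ragas implement their judges. The function names (`parse_score`, `judge_groundedness`) and the `call_judge` callable are assumptions made for this example; in practice `call_judge` would wrap whichever LLM client you use.

```python
from typing import Callable

GROUNDEDNESS_PROMPT = """You are an impartial, highly rigorous judge.
Your task is to evaluate whether the GENERATED ANSWER is entirely grounded in the RETRIEVED CONTEXT.

Context: {retrieved_context}
Answer: {generated_answer}

Rate the answer on a scale of 0 to 1.
- Output 1.0 if the answer contains NO facts outside the context.
- Output 0.0 if the answer invents ANY fact not explicitly stated in the context.
Return ONLY the float value."""


def parse_score(raw: str) -> float:
    """Parse the judge's reply, rejecting anything that is not a float in [0, 1]."""
    score = float(raw.strip())  # Raises ValueError if the judge adds extra words
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score out of range: {score}")
    return score


def judge_groundedness(
    context: str, answer: str, call_judge: Callable[[str], str]
) -> float:
    """Fill the template, send it to the judge model, and return the parsed score."""
    prompt = GROUNDEDNESS_PROMPT.format(
        retrieved_context=context, generated_answer=answer
    )
    return parse_score(call_judge(prompt))


# Stub judge for demonstration only; a real one would call your LLM provider.
def fake_judge(prompt: str) -> str:
    return " 0.0\n"


score = judge_groundedness(
    "Employees get 10 days of leave.", "You get 30 days.", fake_judge
)
print(score)
```

Parsing strictly is a deliberate choice here: a judge that replies with a sentence instead of a number raises an error rather than silently producing a misleading score.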

3. Python Implementation with TruLens

Here is a concrete technical example of how to wrap a standard LangChain RAG pipeline with TruLens to calculate the RAG Triad automatically on every query.

import numpy as np
from trulens_eval import TruChain, Feedback, Tru
from trulens_eval.feedback.provider.openai import OpenAI
from langchain.chains import RetrievalQA  # The type of qa_chain, built elsewhere

tru = Tru()  # Manages the results database and dashboard
# Initialise the LLM provider acting as the judge
provider = OpenAI(model_engine="gpt-4o")

# Select the retrieved context chunks recorded from the chain's retriever.
# qa_chain is your existing LangChain RetrievalQA object.
context = TruChain.select_context(qa_chain)

# 1. Define Groundedness Feedback (Context -> Answer)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())  # Judge the answer against all chunks together
    .on_output()
)

# 2. Define Answer Relevance Feedback (Answer -> Query)
f_answer_relevance = (
    Feedback(provider.relevance)
    .on_input()
    .on_output()
)

# 3. Define Context Relevance Feedback (Query -> Context)
# Older trulens_eval releases name this method qs_relevance.
f_context_relevance = (
    Feedback(provider.context_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)  # Average the per-chunk scores
)

# Wrap your existing LangChain application
tru_recorder = TruChain(
    qa_chain,
    app_id='Enterprise_HR_Bot_v1',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance]
)

# Execute a query with the recorder active
with tru_recorder as recording:
    response = qa_chain.invoke({"query": "How many days of paternity leave do I get?"})

# TruLens intercepts the context and answer, runs the GPT-4o judge in the background,
# and logs the 0-1 scores to a local SQLite database or cloud dashboard.

4. CI/CD and The Golden Dataset

You cannot deploy changes to a RAG system blindly. Tweaking the system prompt, switching from text-embedding-3-small to a Cohere embedding model, or changing the chunk size from 500 to 1000 tokens can drastically alter the behaviour of the system.

To ensure stability, you must construct a Golden Dataset:

  • A curated CSV of 500-1,000 highly diverse, representative user queries.
  • The expected “ideal” answer (optional, but highly recommended for reference).
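
A minimal sketch of loading such a file is shown below. The column names `query` and `ideal_answer` and the `GoldenExample` type are assumptions for illustration, and the two sample rows are made up; adapt them to however your dataset is actually stored.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GoldenExample:
    """One row of the Golden Dataset (hypothetical schema)."""
    query: str
    ideal_answer: Optional[str]  # The reference answer is optional


def load_golden_dataset(csv_file) -> List[GoldenExample]:
    """Read golden examples from an open CSV file with the assumed columns."""
    examples = []
    for row in csv.DictReader(csv_file):
        query = (row.get("query") or "").strip()
        if not query:
            continue  # Skip blank rows rather than evaluating empty queries
        ideal = (row.get("ideal_answer") or "").strip()
        examples.append(GoldenExample(query=query, ideal_answer=ideal or None))
    return examples


# In-memory stand-in for a real file such as open("golden.csv", newline="").
sample = io.StringIO(
    "query,ideal_answer\n"
    "How many days of paternity leave do I get?,\n"
    "Can I carry over unused leave?,See the carry-over section of the HR policy.\n"
)
examples = load_golden_dataset(sample)
print(len(examples))
```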

The CI/CD Pipeline

In a mature architecture, this evaluation is integrated directly into your CI/CD pipeline (e.g., GitHub Actions).

When a developer opens a Pull Request proposing a change to the RAG architecture, the CI pipeline boots up an ephemeral environment, runs every Golden Dataset query through the new architecture, and triggers the LLM-as-a-judge evaluation.

If the average Groundedness score falls below the agreed threshold (for example, dropping from 0.92 to 0.85), the build fails and the PR is blocked. This gives Generative AI changes a quantitative safety net.
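
The gate itself can be very small. The sketch below assumes the per-query Groundedness scores have already been collected; the function name `groundedness_gate` and the `max_drop` tolerance of 0.02 are illustrative assumptions, not values from any framework. A small tolerance is included because judge scores vary slightly from run to run.

```python
from statistics import mean
from typing import Sequence


def groundedness_gate(
    scores: Sequence[float],
    baseline: float,
    max_drop: float = 0.02,  # Assumed tolerance for run-to-run judge noise
) -> bool:
    """Return True if the mean score is no more than max_drop below the baseline."""
    if not scores:
        raise ValueError("No scores to evaluate")
    return mean(scores) >= baseline - max_drop


# Made-up per-query scores for a candidate build, averaging 0.85.
candidate_scores = [0.9, 0.8, 0.85, 0.85]
passed = groundedness_gate(candidate_scores, baseline=0.92)
print("PASS" if passed else "FAIL")

# In a CI job, a failing gate would end the step with a non-zero exit code, e.g.:
# sys.exit(0 if passed else 1)
```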


5. The Production Feedback Loop

Evaluation does not stop at deployment. In production, you run a faster, cheaper judge (like Llama 3 8B) on a random sample of live user traffic to monitor performance.

If the programmatic Groundedness score for a live query drops below 0.80, the system automatically flags the conversation ID and sends it to a Slack channel for human review. This Human-in-the-Loop (HITL) process ensures that failures are caught, logged, and injected back into the Golden Dataset to prevent the same hallucination in the future.
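
The sampling and flagging logic can be sketched as follows. Everything here is illustrative: `monitor`, `judge`, and `notify` are hypothetical names, the 5% sample rate is an assumed default, and `notify` stands in for whatever posts to your Slack channel.

```python
import random
from typing import Callable, List, Optional

REVIEW_THRESHOLD = 0.80  # Scores below this are sent for human review


def monitor(
    conversation_id: str,
    context: str,
    answer: str,
    judge: Callable[[str, str], float],
    notify: Callable[[str], None],
    sample_rate: float = 0.05,  # Assumed: judge roughly 5% of live traffic
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Judge a random sample of traffic and flag low-groundedness conversations."""
    rng = rng or random.Random()
    if rng.random() >= sample_rate:
        return None  # Not sampled, so no judge call is made
    score = judge(context, answer)
    if score < REVIEW_THRESHOLD:
        notify(f"Low groundedness ({score:.2f}) for conversation {conversation_id}")
    return score


# Demonstration with stubs: sample everything and collect alerts in a list.
alerts: List[str] = []
result = monitor(
    "conv-123",
    "Employees get 10 days of leave.",
    "You get 30 days.",
    judge=lambda ctx, ans: 0.4,  # Stub for the cheaper judge model
    notify=alerts.append,        # Stub for the Slack webhook call
    sample_rate=1.0,
)
print(result, alerts)
```

Flagged conversations reviewed by a human can then be appended to the Golden Dataset, which closes the loop described above.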

(For a broader view of how these evaluation frameworks fit into multi-agent systems, read our comprehensive Enterprise Generative AI Guide).

