Trustworthy RAG with LlamaIndex and Cleanlab
Large Language Models (LLMs) occasionally hallucinate incorrect answers, especially for questions not well-supported within their training data. While organizations are rapidly adopting Retrieval Augmented Generation (RAG) to pair the power of LLMs with their own proprietary data, hallucinated/incorrect RAG responses remain a problem.
This tutorial demonstrates a solution to build trustworthy RAG applications: use Cleanlab to score the trustworthiness of every LLM response and diagnose why certain responses are untrustworthy via additional evaluations.
Cleanlab boosts the reliability of any LLM application by scoring when LLM responses are untrustworthy. Trust scoring happens in real-time and does not require any data labeling or model training work. Cleanlab also provides additional real-time Evals for other RAG components like the retrieved context, which help you diagnose why certain RAG responses are inaccurate. Cleanlab's detection enables you to automatically flag/prevent inaccurate responses from your RAG app, so you avoid losing your users' trust.
Setup
This tutorial requires:
- Cleanlab API Key: Sign up at tlm.cleanlab.ai/ to get a free key
- OpenAI API Key: To make completion requests to an LLM
Start by installing the required dependencies.
%pip install llama-index cleanlab-tlm
import os
from typing import List, ClassVar
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from cleanlab_tlm import TrustworthyRAG, Eval, get_default_evals
Initialize the OpenAI LLM and embedding model using your API key.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
llm = OpenAI(model="gpt-4o-mini")
embed_model = OpenAIEmbedding(embed_batch_size=10)
Now, we initialize Cleanlab's client with default configurations. You can achieve better detection accuracy and latency by adjusting optional configurations.
os.environ["CLEANLAB_TLM_API_KEY"] = "<your-cleanlab-api-key"
trustworthy_rag = (
TrustworthyRAG()
) # Optional configurations can improve accuracy/latency
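As a rough sketch of what such configuration could look like (the quality_preset value and options dict below are assumptions about Cleanlab's interface, not settings verified in this tutorial; consult the Cleanlab documentation for the exact parameters your version supports):
# Hedged sketch: optional TrustworthyRAG configuration.
# `quality_preset` and `options` values here are assumptions --
# check Cleanlab's docs for supported presets and option keys.
configured_trustworthy_rag = TrustworthyRAG(
    quality_preset="low",  # assumed preset trading accuracy for lower latency
    options={"model": "gpt-4o-mini"},  # assumed: which LLM powers the Evals
)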
Read data
This tutorial uses Nvidia’s Q1 FY2024 earnings report as an example data source for populating the RAG application's knowledge base.
!wget -nc 'https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md'
!mkdir -p ./data
!mv NVIDIA_Financial_Results_Q1_FY2024.md data/
--2025-05-07 16:13:28--  https://cleanlab-public.s3.amazonaws.com/Datasets/NVIDIA_Financial_Results_Q1_FY2024.md
Resolving cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)... 54.231.236.193, 16.182.70.65, 52.217.14.204, ...
Connecting to cleanlab-public.s3.amazonaws.com (cleanlab-public.s3.amazonaws.com)|54.231.236.193|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7379 (7.2K) [binary/octet-stream]
Saving to: ‘NVIDIA_Financial_Results_Q1_FY2024.md’

NVIDIA_Financial_Re 100%[===================>]   7.21K  --.-KB/s    in 0s

2025-05-07 16:13:28 (97.7 MB/s) - ‘NVIDIA_Financial_Results_Q1_FY2024.md’ saved [7379/7379]
with open(
"data/NVIDIA_Financial_Results_Q1_FY2024.md", "r", encoding="utf-8"
) as file:
data = file.read()
print(data[:200])
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024

NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago
Build a RAG pipeline
Now let's build a simple RAG pipeline with LlamaIndex. We have already initialized OpenAI as both the LLM and the embedding model.
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
Settings.llm = llm
Settings.embed_model = embed_model
Load Data and Create Index + Query Engine
Let's create an index from the document we just pulled above. We stick with the default index from LlamaIndex for this tutorial.
documents = SimpleDirectoryReader("data").load_data()
# Optional step since we're loading just one data file:
# exclude file_path metadata, which wouldn't add useful context
# for the LLM since our data source contains just 1 file
for doc in documents:
    doc.excluded_llm_metadata_keys.append("file_path")
index = VectorStoreIndex.from_documents(documents)
The generated index is used to power a query engine over the data.
query_engine = index.as_query_engine()
Note that Cleanlab is agnostic to the index and the query engine used for RAG, and is compatible with any choices you make for these components of your system.
In addition, you can use Cleanlab within any existing custom-built RAG pipeline (using any other LLM generator, streaming or not).
Cleanlab just needs the prompt sent to your LLM (including system instructions, retrieved context, user query, etc.) and the generated response.
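For instance, here is a minimal sketch of scoring a response from any custom pipeline via the same score() method used throughout this tutorial (the placeholder strings stand in for your pipeline's actual values):
# Minimal sketch: score output from any custom RAG pipeline.
# The placeholder strings stand in for your pipeline's actual values.
scores = trustworthy_rag.score(
    query="<user query>",
    context="<retrieved context>",
    response="<your LLM's generated response>",
    prompt="<the full prompt your LLM actually received>",
)
print(scores["trustworthiness"]["score"])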
We define an event handler that stores the prompt that LlamaIndex sends to the LLM. Refer to the instrumentation documentation for more details.
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation.events.llm import LLMPredictStartEvent
class PromptEventHandler(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    PROMPT_TEMPLATE: str = ""

    @classmethod
    def class_name(cls) -> str:
        return "PromptEventHandler"

    def handle(self, event: BaseEvent) -> None:
        # Store the prompt template whenever LlamaIndex starts an LLM call
        if isinstance(event, LLMPredictStartEvent):
            self.PROMPT_TEMPLATE = event.template.default_template.template
        self.events.append(event)
# Root dispatcher
root_dispatcher = get_dispatcher()
# Register event handler
event_handler = PromptEventHandler()
root_dispatcher.add_event_handler(event_handler)
For each query, we can fetch the prompt from event_handler.PROMPT_TEMPLATE. Let's see it in action.
Use our RAG application
Now that the vector database is loaded with text chunks and their corresponding embeddings, we can start querying it to answer questions.
query = "What was NVIDIA's total revenue in the first quarter of fiscal 2024?"
response = query_engine.query(query)
print(response)
NVIDIA's total revenue in the first quarter of fiscal 2024 was $7.19 billion.
This response is indeed correct for our simple query. Let's see the document chunks that LlamaIndex retrieved for this query, from which we can easily verify this response was right.
def get_retrieved_context(response, print_chunks=False):
if isinstance(response, list):
texts = [node.text for node in response]
else:
texts = [src.node.text for src in response.source_nodes]
if print_chunks:
for idx, text in enumerate(texts):
print(f"--- Chunk {idx + 1} ---\n{text[:200]}...")
return "\n".join(texts)
context_str = get_retrieved_context(response, True)
--- Chunk 1 ---
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024

NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago ...
--- Chunk 2 ---
- **Gross Margins**: GAAP and non-GAAP gross margins are expected to be 68.6% and 70.0%, respectively, plus or minus 50 basis points.
- **Operating Expenses**: GAAP and non-GAAP operating expenses are...
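Our event handler has also captured the prompt template that LlamaIndex filled in with this context and query before calling the LLM. For illustration, you can inspect it directly:
# Inspect the prompt template captured by our event handler
# (populated once the first query has run)
print(event_handler.PROMPT_TEMPLATE[:300])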
Add a Trust Layer with Cleanlab
Let's add a detection layer to catch untrustworthy RAG responses in real-time. TrustworthyRAG runs Cleanlab's state-of-the-art LLM uncertainty estimator, the Trustworthy Language Model, to provide a trustworthiness score indicating overall confidence that your RAG's response is correct.
TrustworthyRAG can simultaneously run additional evaluations to diagnose why responses are likely incorrect or other types of issues.
Let's see what Evals are run by default:
default_evals = get_default_evals()
for eval in default_evals:
print(f"{eval.name}")
context_sufficiency
response_groundedness
response_helpfulness
query_ease
Each Eval returns a score between 0-1 (higher is better) that assesses a different aspect of your RAG system:
- context_sufficiency: Evaluates whether the retrieved context contains sufficient information to completely answer the query. A low score indicates that key information is missing from the context (perhaps due to poor retrieval or missing documents).
- response_groundedness: Evaluates whether claims/information stated in the response are explicitly supported by the provided context.
- response_helpfulness: Evaluates whether the response effectively addresses the user query and appears helpful.
- query_ease: Evaluates whether the user query seems easy for an AI system to properly handle. Complex, vague, tricky, or disgruntled-sounding queries receive lower scores.
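If you don't need every Eval, you can initialize TrustworthyRAG with just a subset to reduce latency (a minimal sketch using the same evals parameter demonstrated later in this tutorial):
# Sketch: run only a subset of the default Evals to reduce latency
subset_evals = [e for e in get_default_evals() if e.name != "query_ease"]
subset_trustworthy_rag = TrustworthyRAG(evals=subset_evals)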
To run TrustworthyRAG, we need the prompt sent to the LLM, which includes the system message, retrieved chunks, the user's query, and the LLM's response. The event handler defined above provides this prompt. Let's define a helper function to run Cleanlab's detection.
# Helper function to run real-time Evals
def get_eval(query, response, event_handler, evaluator):
    # Get context used by the LLM to generate its response
    context = get_retrieved_context(response)
    # Get the prompt template used to build the full prompt
    pt = event_handler.PROMPT_TEMPLATE
    # Build the full prompt sent to the LLM
    full_prompt = pt.format(context_str=context, query_str=query)

    # Evaluate the response using TrustworthyRAG
    eval_result = evaluator.score(
        query=query,
        context=context,
        response=response.response,
        prompt=full_prompt,
    )

    print("Evaluation results:")
    for metric, value in eval_result.items():
        print(f"{metric}: {value['score']}")
# Helper function to run end-to-end RAG
def get_answer(query, evaluator=trustworthy_rag, event_handler=event_handler):
response = query_engine.query(query)
print(
f"Query:\n{query}\n\nTrimmed Context:\n{get_retrieved_context(response)[:300]}..."
)
print(f"\nGenerated response:\n{response.response}\n")
get_eval(query, response, event_handler, evaluator)
get_eval(query, response, event_handler, trustworthy_rag)
Evaluation results:
trustworthiness: 1.0
context_sufficiency: 0.9975124377856721
response_groundedness: 0.9975124378045552
response_helpfulness: 0.9975124367363073
query_ease: 0.9975071027792313
Analysis: The high trustworthiness score indicates this response is very trustworthy, i.e. non-hallucinated and likely correct. The context that was retrieved here is sufficient to answer this query, as reflected by the high context_sufficiency score. The high query_ease score indicates this is a straightforward query as well.
Now let’s run a challenging query that cannot be answered using the only document in our RAG application's knowledge base.
get_answer(
"How does the report explain why NVIDIA's Gaming revenue decreased year over year?"
)
Query:
How does the report explain why NVIDIA's Gaming revenue decreased year over year?

Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

Generated response:
The report indicates that NVIDIA's Gaming revenue decreased year over year by 38%, which is attributed to a combination of factors, although specific reasons are not detailed. The context highlights that the revenue for the first quarter was $2.24 billion, down from the previous year, while it did show an increase of 22% from the previous quarter. This suggests that while there may have been a seasonal or cyclical recovery, the overall year-over-year decline reflects challenges in the gaming segment during that period.

Evaluation results:
trustworthiness: 0.8018049078305449
context_sufficiency: 0.26134514055082803
response_groundedness: 0.8147481620994604
response_helpfulness: 0.28647897539109127
query_ease: 0.952132218665045
Analysis: The generator LLM avoids conjecture and provides a reliable response, as seen in the high trustworthiness score. The low context_sufficiency score reflects that the retrieved context was lacking, and the response doesn't actually answer the user's query, as indicated by the low response_helpfulness score.
Let’s see how our RAG system responds to another challenging question.
get_answer(
"How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?"
)
Query:
How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?

Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

Generated response:
NVIDIA's revenue decreased by $1.10 billion this quarter compared to the last quarter.

Evaluation results:
trustworthiness: 0.572441384819641
context_sufficiency: 0.9974990573223977
response_groundedness: 0.006136548076912901
response_helpfulness: 0.997512230771839
query_ease: 0.8018484929561781
Analysis: The generated response incorrectly states that NVIDIA's revenue decreased this quarter, when in fact the referenced report notes a 19% increase quarter-over-quarter. Cleanlab's low trustworthiness score helps us automatically catch this incorrect RAG response in real-time! To root-cause why this response was untrustworthy, we see the response_groundedness score is low, which indicates our LLM model is to blame for fabricating this false information.
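In production, you could act on these scores automatically. Here's a minimal sketch of threshold-based flagging (the 0.7 threshold and fallback message are illustrative assumptions, not Cleanlab recommendations):
# Sketch: gate responses on the trustworthiness score.
# The threshold and fallback message are illustrative assumptions.
TRUST_THRESHOLD = 0.7

def guarded_response(response_text, scores):
    # Serve a fallback whenever the response looks untrustworthy
    if scores["trustworthiness"]["score"] < TRUST_THRESHOLD:
        return "Sorry, I cannot confidently answer that based on the available documents."
    return response_text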
Let's try another one:
get_answer(
"If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?"
)
Query:
If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?

Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

Generated response:
If NVIDIA's Data Center segment maintains its quarter-over-quarter growth rate of 18% from Q1 FY2024 for the next four quarters, the projected revenue for the next four quarters can be calculated as follows:
1. Q1 FY2024 revenue: $4.28 billion
2. Q2 FY2024 projected revenue: $4.28 billion * 1.18 = $5.05 billion
3. Q3 FY2024 projected revenue: $5.05 billion * 1.18 = $5.95 billion
4. Q4 FY2024 projected revenue: $5.95 billion * 1.18 = $7.02 billion
5. Q5 FY2024 projected revenue: $7.02 billion * 1.18 = $8.27 billion
Now, summing these revenues for the five quarters (including Q1 FY2024):
- Total projected revenue = $4.28 billion + $5.05 billion + $5.95 billion + $7.02 billion + $8.27 billion = $30.57 billion
Therefore, the projected annual revenue for the Data Center segment would be approximately $30.57 billion.

Evaluation results:
trustworthiness: 0.23124932848015411
context_sufficiency: 0.9299227307108295
response_groundedness: 0.31247206392894905
response_helpfulness: 0.9975055879546202
query_ease: 0.7724662723193096
Analysis: Reviewing the generated response, we find it overstates the projected revenue (it sums up five quarters of financials, including Q1 itself). Again, Cleanlab helps us automatically catch this incorrect response via its low trustworthiness score. Based on the additional Evals, the root cause of this issue again appears to be the LLM model failing to ground its response in the retrieved context.
Custom Evals
You can also specify custom evaluations to assess specific criteria, and combine them with the default evaluations for comprehensive/tailored assessment of your RAG system.
For instance, here's how to create and run a custom eval that checks the conciseness of the generated response.
conciseness_eval = Eval(
name="response_conciseness",
criteria="Evaluate whether the Generated response is concise and to the point without unnecessary verbosity or repetition. A good response should be brief but comprehensive, covering all necessary information without extra words or redundant explanations.",
response_identifier="Generated Response",
)
# Combine default evals with a custom eval
combined_evals = get_default_evals() + [conciseness_eval]
# Initialize TrustworthyRAG with combined evals
combined_trustworthy_rag = TrustworthyRAG(evals=combined_evals)
get_answer(
"What significant transitions did Jensen comment on?",
evaluator=combined_trustworthy_rag,
)
Query:
What significant transitions did Jensen comment on?

Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre...

Generated response:
Jensen Huang commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.

Evaluation results:
trustworthiness: 0.9810004109697261
context_sufficiency: 0.9902170786836257
response_groundedness: 0.9975123614036665
response_helpfulness: 0.9420916924086002
query_ease: 0.5334109647649754
response_conciseness: 0.842668665703559
Replace your LLM with Cleanlab's
Beyond evaluating responses already generated from your LLM, Cleanlab can also generate responses and evaluate them simultaneously (using one of many supported models).
This replaces your own LLM within your RAG system and can be more convenient/accurate/faster.
Let's replace our OpenAI LLM with a call to Cleanlab's endpoint instead:
query = "How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?"
relevant_chunks = query_engine.retrieve(query)
context = get_retrieved_context(relevant_chunks)
print(f"Query:\n{query}\n\nTrimmed Context:\n{context[:300]}")
pt = event_handler.PROMPT_TEMPLATE
full_prompt = pt.format(context_str=context, query_str=query)
result = trustworthy_rag.generate(
query=query, context=context, prompt=full_prompt
)
print(f"\nGenerated Response:\n{result['response']}\n")
print("Evaluation Scores:")
for metric, value in result.items():
if metric != "response":
print(f"{metric}: {value['score']}")
Query:
How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?

Trimmed Context:
# NVIDIA Announces Financial Results for First Quarter Fiscal 2024 NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter. - **Quarterly revenue** of $7.19 billion, up 19% from the pre

Generated Response:
NVIDIA's revenue for the first quarter of fiscal 2024 was $7.19 billion, and for the previous quarter (Q4 FY23), it was $6.05 billion. Therefore, the revenue increased by $1.14 billion from the previous quarter, not decreased. So, the revenue did not decrease this quarter vs last quarter; it actually increased by $1.14 billion.

Evaluation Scores:
trustworthiness: 0.6810414232214796
context_sufficiency: 0.9974887437375295
response_groundedness: 0.9975116791816968
response_helpfulness: 0.3293002430120912
query_ease: 0.33275910932109172
While it remains hard to build a RAG application that accurately answers any possible question, you can easily use Cleanlab to deploy a trustworthy RAG application that at least flags answers likely to be inaccurate. Learn more about optional configurations you can adjust to improve accuracy/latency in the Cleanlab documentation.