Retrieval-augmented generation (RAG) evaluators

A retrieval-augmented generation (RAG) system tries to generate the most relevant answer consistent with grounding documents in response to a user's query. At a high level, a user's query triggers a search retrieval in the corpus of grounding documents to provide grounding context for the AI model to generate a response. It's important to evaluate:

  • The relevance of the retrieval results to the user's query: use Document Retrieval if you have query-specific document relevance labels (query relevance judgments, or qrels) for more accurate measurements. Use Retrieval if you only have the retrieved context, don't have such labels, and can tolerate a less fine-grained measurement.
  • The consistency of the generated response with respect to the grounding documents: use Groundedness if you want the option to customize the definition of groundedness in our open-source LLM-judge prompt, or Groundedness Pro if you want a straightforward definition.
  • The relevance of the final response to the query: use Relevance if you don't have ground truth, and Response Completeness if you have ground truth and don't want your response to miss critical information.

A good way to think about Groundedness and Response Completeness: groundedness captures the precision aspect of the response (it shouldn't contain content outside of the grounding context), whereas response completeness captures the recall aspect (it shouldn't miss critical information compared to the expected response, or ground truth).
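
To make the distinction concrete, here's a minimal sketch (assuming the model_config defined in the following section) that scores the same response with both evaluators: a response that stays within the grounding context but omits part of the expected answer can score high on groundedness yet low on response completeness.

from azure.ai.evaluation import GroundednessEvaluator, ResponseCompletenessEvaluator

# model_config is the AzureOpenAIModelConfiguration shown in the next section.
groundedness = GroundednessEvaluator(model_config=model_config, threshold=3)
completeness = ResponseCompletenessEvaluator(model_config=model_config, threshold=3)

context = "Background: Marie Curie was born in Warsaw on November 7, 1867."
response = "Marie Curie was born in Warsaw."  # grounded in the context, but omits the date
ground_truth = "Marie Curie was born in Warsaw on November 7, 1867."

# Precision aspect: is everything in the response supported by the context?
print(groundedness(query="When and where was Marie Curie born?", context=context, response=response))

# Recall aspect: does the response cover everything in the expected answer?
print(completeness(response=response, ground_truth=ground_truth))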

Model configuration for AI-assisted evaluators

For reference in the following snippets, the AI-assisted quality evaluators (except for Groundedness Pro) use a model configuration for the LLM-judge:

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

Tip

We recommend using o3-mini for a balance of reasoning capability and cost efficiency.

Retrieval

Retrieval quality is very important given its upstream role in RAG: if retrieval quality is poor and the response requires corpus-specific knowledge, there's less chance your model gives you a satisfactory answer. RetrievalEvaluator uses an LLM to measure the textual quality of retrieval results without requiring ground truth (also known as query relevance judgments). This provides value compared to DocumentRetrievalEvaluator, which measures NDCG, XDCG, fidelity, and other classical information retrieval metrics that require ground truth. This metric focuses on how relevant the context chunks (encoded as a string) are to the query and whether the most relevant chunks are surfaced at the top of the list.

Retrieval example

from azure.ai.evaluation import RetrievalEvaluator

retrieval = RetrievalEvaluator(model_config=model_config, threshold=3)
retrieval(
    query="Where was Marie Curie born?", 
    context="Background: 1. Marie Curie was born in Warsaw. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist. ",
)

Retrieval output

The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (3 by default), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

{
    "retrieval": 5.0,
    "gpt_retrieval": 5.0,
    "retrieval_reason": "The context contains relevant information that directly answers the query about Marie Curie's birthplace, with the most pertinent information placed at the top. Therefore, it fits the criteria for a high relevance score.",
    "retrieval_result": "pass",
    "retrieval_threshold": 3
}

Document retrieval

Retrieval quality plays an upstream role in RAG: if retrieval quality is poor and the response requires corpus-specific knowledge, there's less chance your model gives you a satisfactory answer. Therefore, it's important to use DocumentRetrievalEvaluator not only to evaluate retrieval quality but also to optimize your search parameters for RAG.

  • Document Retrieval evaluator measures how well the RAG system retrieves the correct documents from the document store. As a composite evaluator for RAG scenarios with ground truth, it computes a set of search quality metrics that are useful for debugging your RAG pipelines:
Metric | Category | Description
Fidelity | Search Fidelity | How well the top n retrieved chunks reflect the content for a given query; the number of good documents returned out of the total number of known good documents in the dataset
NDCG | Search NDCG | How close the ranking is to an ideal order in which all relevant items appear at the top of the list
XDCG | Search XDCG | How good the results are within the top-k documents, regardless of how other documents in the index are scored
Max Relevance N | Search Max Relevance | The maximum relevance in the top-k chunks
Holes | Search Label Sanity | The number of documents with missing query relevance judgments (ground truth)
  • To optimize your RAG with a "parameter sweep," you can use these metrics to calibrate the search parameters for optimal RAG results. Generate retrieval results for the various search parameters you're interested in testing, such as search algorithms (vector, semantic), top_k, and chunk sizes. Then use DocumentRetrievalEvaluator to find the search parameters that yield the highest retrieval quality, as sketched after this list.
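
The following sketch illustrates the parameter-sweep idea. It's only a sketch: run_search is a hypothetical stand-in for your own retrieval call (returning documents in the retrieved_documents format from the example below), and query and retrieval_ground_truth are assumed to be a test query and its qrels-style labels like those in that example; only the DocumentRetrievalEvaluator calls come from the SDK.

from itertools import product
from azure.ai.evaluation import DocumentRetrievalEvaluator

# Hypothetical sweep values; substitute whatever parameters your search system exposes.
search_algorithms = ["vector", "semantic"]
top_k_values = [3, 5, 10]

evaluator = DocumentRetrievalEvaluator(ground_truth_label_min=0, ground_truth_label_max=4)

sweep_results = {}
for algorithm, top_k in product(search_algorithms, top_k_values):
    # run_search is a placeholder for your own retrieval call; it should return a list of
    # {"document_id": ..., "relevance_score": ...} dicts like retrieved_documents below.
    retrieved_documents = run_search(query, algorithm=algorithm, top_k=top_k)
    sweep_results[(algorithm, top_k)] = evaluator(
        retrieval_ground_truth=retrieval_ground_truth,
        retrieved_documents=retrieved_documents,
    )

# Keep the configuration with the best fidelity (or NDCG) on your evaluation set.
best_config = max(sweep_results, key=lambda cfg: sweep_results[cfg]["fidelity"])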

Document retrieval example

from azure.ai.evaluation import DocumentRetrievalEvaluator

# these query_relevance_labels are given by your human or LLM judges.
retrieval_ground_truth = [
    {
        "document_id": "1",
        "query_relevance_label": 4
    },
    {
        "document_id": "2",
        "query_relevance_label": 2
    },
    {
        "document_id": "3",
        "query_relevance_label": 3
    },
    {
        "document_id": "4",
        "query_relevance_label": 1
    },
    {
        "document_id": "5",
        "query_relevance_label": 0
    },
]
# the min and max of the label scores are inputs to document retrieval evaluator
ground_truth_label_min = 0
ground_truth_label_max = 4

# these relevance scores come from your search retrieval system
retrieved_documents = [
    {
        "document_id": "2",
        "relevance_score": 45.1
    },
    {
        "document_id": "6",
        "relevance_score": 35.8
    },
    {
        "document_id": "3",
        "relevance_score": 29.2
    },
    {
        "document_id": "5",
        "relevance_score": 25.4
    },
    {
        "document_id": "7",
        "relevance_score": 18.8
    },
]

document_retrieval_evaluator = DocumentRetrievalEvaluator(
    ground_truth_label_min=ground_truth_label_min,
    ground_truth_label_max=ground_truth_label_max,
    ndcg_threshold=0.5,
    xdcg_threshold=50.0,
    fidelity_threshold=0.5,
    top1_relevance_threshold=50.0,
    top3_max_relevance_threshold=50.0,
    total_retrieved_documents_threshold=50,
    total_ground_truth_documents_threshold=50,
)
document_retrieval_evaluator(retrieval_ground_truth=retrieval_ground_truth, retrieved_documents=retrieved_documents)   

Document retrieval output

All numerical scores have higher_is_better=True, except for holes and holes_ratio, which have higher_is_better=False. Given the numerical threshold set for each metric, we also output "pass" if the score >= threshold, or "fail" otherwise.

{
    "ndcg@3": 0.6461858173,
    "xdcg@3": 37.7551020408,
    "fidelity": 0.0188438199,
    "top1_relevance": 2,
    "top3_max_relevance": 2,
    "holes": 30,
    "holes_ratio": 0.6000000000000001,
    "holes_higher_is_better": False,
    "holes_ratio_higher_is_better": False,
    "total_retrieved_documents": 50,
    "total_groundtruth_documents": 1565,
    "ndcg@3_result": "pass",
    "xdcg@3_result": "pass",
    "fidelity_result": "fail",
    "top1_relevance_result": "fail",
    "top3_max_relevance_result": "fail",
    # omitting more fields ...
}

Groundedness

It's important to evaluate how grounded the response is in the context, because AI models can fabricate content or generate irrelevant responses. GroundednessEvaluator measures how well the generated response aligns with the given context (the grounding source) and doesn't fabricate content outside of it. This metric captures the precision aspect of response alignment with the grounding source. A lower score means the response is irrelevant to the query or fabricates inaccurate content outside the context. This metric is complementary to ResponseCompletenessEvaluator, which captures the recall aspect of response alignment with the expected response.

Groundedness example

from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config=model_config, threshold=3)
groundedness(
    query="Is Marie Curie is born in Paris?", 
    context="Background: 1. Marie Curie is born on November 7, 1867. 2. Marie Curie is born in Warsaw.",
    response="No, Marie Curie is born in Warsaw."
)

Groundedness output

The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (3 by default), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

{
    "groundedness": 5.0,  
    "gpt_groundedness": 5.0,
    "groundedness_reason": "The RESPONSE accurately answers the QUERY by confirming that Marie Curie was born in Warsaw, which is supported by the CONTEXT. It does not include any irrelevant or incorrect information, making it a complete and relevant answer. Thus, it deserves a high score for groundedness.",
    "groundedness_result": "pass", 
    "groundedness_threshold": 3
}

Groundedness Pro

AI systems can fabricate content or generate irrelevant responses outside the given context. Powered by Azure AI Content Safety, GroundednessProEvaluator detects whether the generated text response is consistent or accurate with respect to the given context in a retrieval-augmented generation question-and-answering scenario. It checks whether the response adheres closely to the context in order to answer the query, avoiding speculation or fabrication, and outputs a binary label.

Groundedness Pro example

import os
from azure.ai.evaluation import GroundednessProEvaluator
from dotenv import load_dotenv
load_dotenv()

## Option 1: Use an Azure AI Foundry Hub project
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}
## Option 2: Use an Azure AI Foundry project endpoint, for example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
azure_ai_project = os.environ.get("AZURE_AI_PROJECT")

groundedness_pro = GroundednessProEvaluator(azure_ai_project=azure_ai_project)
groundedness_pro(
    query="Was Marie Curie born in Paris?", 
    context="Background: 1. Marie Curie was born on November 7, 1867. 2. Marie Curie was born in Warsaw.",
    response="No, Marie Curie was born in Warsaw."
)

Groundedness Pro output

The label field is True if all content in the response is completely grounded in the context, and False otherwise. Use the reason field to understand the judgment behind the score.

{
    "groundedness_pro_reason": "All Contents are grounded",
    "groundedness_pro_label": True
}

Relevance

It's important to evaluate the final response because AI models can generate responses that are irrelevant to the user's query. To address this, you can use RelevanceEvaluator, which measures how effectively a response addresses a query. It assesses the accuracy, completeness, and direct relevance of the response based solely on the given query. Higher scores mean better relevance.

Relevance example

from azure.ai.evaluation import RelevanceEvaluator

relevance = RelevanceEvaluator(model_config=model_config, threshold=3)
relevance(
    query="Is Marie Curie is born in Paris?", 
    response="No, Marie Curie is born in Warsaw."
)

Relevance output

The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (3 by default), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

{
    "relevance": 4.0,
    "gpt_relevance": 4.0, 
    "relevance_reason": "The RESPONSE accurately answers the QUERY by stating that Marie Curie was born in Warsaw, which is correct and directly relevant to the question asked.",
    "relevance_result": "pass", 
    "relevance_threshold": 3
}

Response completeness

AI responses can omit critical information that the expected answer contains. Given a ground truth response, ResponseCompletenessEvaluator captures the recall aspect of response alignment with the expected response. This is complementary to GroundednessEvaluator, which captures the precision aspect of response alignment with the grounding source.

Response completeness example

from azure.ai.evaluation import ResponseCompletenessEvaluator

response_completeness = ResponseCompletenessEvaluator(model_config=model_config, threshold=3)
response_completeness(
    response="Based on the retrieved documents, the shareholder meeting discussed the operational efficiency of the company and financing options.",
    ground_truth="The shareholder meeting discussed the compensation package of the company CEO."
)

Response completeness output

The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (3 by default), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

{
    "response_completeness": 1,
    "response_completeness_result": "fail",
    "response_completeness_threshold": 3,
    "response_completeness_reason": "The response does not contain any relevant information from the ground truth, which specifically discusses the CEO's compensation package. Therefore, it is considered fully incomplete."
}