在本地使用 Azure AI 评估 SDK 来评估生成式 AI 应用程序

2025-05-20

重要

本文中标记了“（预览版）”的项目目前为公共预览版。此预览版未提供服务级别协议，不建议将其用于生产工作负载。某些功能可能不受支持或者受限。有关详细信息，请参阅 Microsoft Azure 预览版补充使用条款。

若要在生成式 AI 应用程序应用于大型数据集时全面评估其性能，可以使用 Azure AI 评估 SDK 在开发环境中评估生成式 AI 应用程序。假设有一个测试数据集或目标，生成式 AI 应用程序代系通过基于数学的指标以及 AI 辅助的质量和安全评估器进行量化度量。内置或自定义评估器可以提供对应用程序功能和限制的全面见解。

在本文中，你将学习如何通过本地使用 Azure AI 评估 SDK，在单行数据和较大测试数据集上对应用程序目标运行评估器，然后在 Azure AI 项目中跟踪结果和评估日志。

入门指南

首先从 Azure AI 评估 SDK 安装评估器包：

pip install azure-ai-evaluation

注释

有关详细信息，请参阅 Azure AI 评估 SDK 的 API 参考文档。

内置评估器

类别	计算器
常规用途	`CoherenceEvaluator`、`FluencyEvaluator`、`QAEvaluator`
文本相似性	`SimilarityEvaluator`、`F1ScoreEvaluator`、`BleuScoreEvaluator`、`GleuScoreEvaluator`、`RougeScoreEvaluator`、`MeteorScoreEvaluator`
Retrieval-Augmented Generation （RAG）	`RetrievalEvaluator`、`DocumentRetrievalEvaluator`、`GroundednessEvaluator`、`GroundednessProEvaluator`、`RelevanceEvaluator`、`ResponseCompletenessEvaluator`
风险和安全	`ViolenceEvaluator`、`SexualEvaluator`、`SelfHarmEvaluator`、`HateUnfairnessEvaluator`、`IndirectAttackEvaluator`、`ProtectedMaterialEvaluator`、`UngroundedAttributesEvaluator`、`CodeVulnerabilityEvaluator`、`ContentSafetyEvaluator`
Agentic	`IntentResolutionEvaluator`、`ToolCallAccuracyEvaluator`、`TaskAdherenceEvaluator`
Azure OpenAI	`AzureOpenAILabelGrader`、`AzureOpenAIStringCheckGrader`、`AzureOpenAITextSimilarityGrader`、`AzureOpenAIGrader`

内置的质量和安全指标采用查询和响应对，以及特定评估器的附加信息。

内置评估程序的数据要求

内置评估器可以接受查询和响应对和/或 jsonl 格式的对话列表。

对话和单轮文本支持	对话和单轮文本和图像支持	仅为文本提供单轮支持
`GroundednessEvaluator`、`GroundednessProEvaluator`、`RetrievalEvaluator`、`DocumentRetrievalEvaluator`、`RelevanceEvaluator`、`CoherenceEvaluator`、`FluencyEvaluator`、`ResponseCompletenessEvaluator`、`IndirectAttackEvaluator`、`AzureOpenAILabelGrader`、`AzureOpenAIStringCheckGrader`、`AzureOpenAITextSimilarityGrader`、`AzureOpenAIGrader`	`ViolenceEvaluator`、`SexualEvaluator`、`SelfHarmEvaluator`、`HateUnfairnessEvaluator`、`ProtectedMaterialEvaluator`、`ContentSafetyEvaluator`	`UngroundedAttributesEvaluator`、`CodeVulnerabilityEvaluator`、`ResponseCompletenessEvaluator`、`SimilarityEvaluator`、`F1ScoreEvaluator`、`RougeScoreEvaluator`、`GleuScoreEvaluator`、`BleuScoreEvaluator`、`MeteorScoreEvaluator`、`QAEvaluator`

注释

除 SimilarityEvaluator 以外的 AI 辅助质量评估程序带有原因字段。它们采用包括思维链推理在内的技术来生成对分数的解释。因此，由于评估质量的提高，它们会在生成过程中消耗更多的标记使用量。具体而言，对于所有 AI 辅助评估程序，评估程序生成的 max_token 已设置为 800（对于 RetrievalEvaluator，该值则设置为 1600，以适应更长的输入）。

注释

Azure OpenAI 评分员需要一个模板，用于描述其输入列如何转换为评分员使用的“真实”输入。示例：如果有两个名为“query”和“response”的输入，并且采用如下格式的模板： {{item.query}}，则仅使用查询。类似地，你可以使用类似于 {{item.conversation}} 的语句来接受对话输入，但系统处理该输入的功能取决于如何配置评分器的其余部分以期待该输入。

有关代理评估程序的数据要求的详细信息，请转到使用 Azure AI 评估 SDK 在本地运行代理评估。

对文本的单轮支持

所有内置评估器在字符串中的查询和响应对中采用单轮输入，例如：

from azure.ai.evaluation import RelevanceEvaluator

query = "What is the cpital of life?"
response = "Paris."

# Initializing an evaluator
relevance_eval = RelevanceEvaluator(model_config)
relevance_eval(query=query, response=response)

若要使用本地评估运行批处理评估或上传数据集以运行云评估，需要以 .jsonl 格式表示数据集。上述单轮数据（查询和响应对）等效于数据集行，如下所示（我们以三行为例）：

{"query":"What is the capital of France?","response":"Paris."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"Blue."}

评估测试数据集可以包含以下内容，具体取决于每个内置计算器的要求：

查询：发送到生成式 AI 应用程序的查询
响应：生成式 AI 应用程序生成的查询响应
上下文：生成的响应所基于的源（即基础文档）
基本事实：由用户/人类生成的响应（作为真实答案）

若要查看每个评估器所需的内容，可以在内置评估器文档中了解详细信息。

文本中的对话支持

对于支持文本对话的评估程序，可以提供 conversation 作为输入，这是一个包含 messages 列表（包括 content、role，并可选择包括 context）的 Python 字典。

Python 中的两回合对话示例：

conversation = {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": None
        }
        ]
}

若要使用本地评估运行批处理评估或上传数据集以运行云评估，需要以 .jsonl 格式表示数据集。上一个对话等效于 .jsonl 文件中如下所示的数据集行：

{"conversation":
    {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": null
        }
        ]
    }
}

我们的评估程序明白，对话的第一轮以查询-响应格式提供来自 query 的有效 user、来自 context 的 assistant 以及来自 response 的 assistant。然后将按轮次评估对话，结果会按所有轮次聚合以得出对话分数。

注释

在第二轮中，即使 context 为 null 或一个缺失键，它也会被解释为空字符串而不是失败并出现错误，这可能会导致误导性结果。强烈建议你验证评估数据以符合数据要求。

对于对话模式，下面提供了一个 GroundednessEvaluator 的示例：

# Conversation mode
import json
import os
from azure.ai.evaluation import GroundednessEvaluator, AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

# Initializing Groundedness and Groundedness Pro evaluators
groundedness_eval = GroundednessEvaluator(model_config)

conversation = {
    "messages": [
        { "content": "Which tent is the most waterproof?", "role": "user" },
        { "content": "The Alpine Explorer Tent is the most waterproof", "role": "assistant", "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight." },
        { "content": "How much does it cost?", "role": "user" },
        { "content": "$120.", "role": "assistant", "context": "The Alpine Explorer Tent is $120."}
    ]
}

# alternatively, you can load the same content from a .jsonl file
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(json.dumps(groundedness_conv_score, indent=4))

对于对话输出，每轮次结果都存储在一个列表中，并且整体会话分数 'groundedness': 4.0 是基于这些轮次计算的平均值：

{
    "groundedness": 5.0,
    "gpt_groundedness": 5.0,
    "groundedness_threshold": 3.0,
    "evaluation_per_turn": {
        "groundedness": [
            5.0,
            5.0
        ],
        "gpt_groundedness": [
            5.0,
            5.0
        ],
        "groundedness_reason": [
            "The response accurately and completely answers the query by stating that the Alpine Explorer Tent is the most waterproof, which is directly supported by the context. There are no irrelevant details or incorrect information present.",
            "The RESPONSE directly answers the QUERY with the exact information provided in the CONTEXT, making it fully correct and complete."
        ],
        "groundedness_result": [
            "pass",
            "pass"
        ],
        "groundedness_threshold": [
            3,
            3
        ]
    }
}

注释

我们强烈建议用户迁移其代码以使用不带前缀的键（例如 groundedness.groundedness），从而允许代码支持更多评估程序模型。

对于支持图像和多模式图像与文本中的对话的评估程序，可以在 conversation 中传入图像 URL 或 base64 编码图像。

下面是受支持的场景示例：

输入多个图像与文本，生成图像或文本
仅输入文本，生成相应的图像
仅输入图像，生成相应的文本

from pathlib import Path
from azure.ai.evaluation import ContentSafetyEvaluator
import base64

# instantiate an evaluator with image and multi-modal support
safety_evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)

# example of a conversation with an image URL
conversation_image_url = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}

# example of a conversation with base64 encoded images
base64_image = ""

with Path.open("Image1.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

conversation_base64 = {
    "messages": [
        {"content": "create an image of a branded apple", "role": "user"},
        {
            "content": [{"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{base64_image}"}}],
            "role": "assistant",
        },
    ]
}

# run the evaluation on the conversation to output the result
safety_score = safety_evaluator(conversation=conversation_image_url)

目前，图像和多模式评估程序支持：

单轮对话（对话仅包含 1 条用户消息和 1 条助手消息）
对话仅包含 1 条系统消息
对话有效负载不得超过 10 MB（包括图像）
绝对 URL 和 Base64 编码图像
单轮对话中的多个图像
JPG/JPEG、PNG、GIF 文件格式

设置

对于除 GroundednessProEvaluator （预览版）以外的 AI 辅助质量评估程序，必须在 gpt-35-turbo 中指定一个GPT 模型（gpt-4、gpt-4-turbo、gpt-4o、gpt-4o-mini 或 model_config）来充当评判员，从而对评估数据进行评分。我们同时支持 Azure OpenAI 或 OpenAI 模型配置架构。我们建议使用不处于预览版本的 GPT 模型，以获得最佳性能和评估程序可分析的响应。

注释

强烈建议将 gpt-3.5-turbo 替换为评估器模型的 gpt-4o-mini，因为根据 OpenAI，后者更便宜、更有能力，而且速度同样快。

请确保至少具有 Azure OpenAI 资源的 Cognitive Services OpenAI User 角色，以便使用 API 密钥进行推理调用。若要了解有关权限的详细信息，请参阅 Azure OpenAI 资源的权限。

对于所有风险与安全评估器和 GroundednessProEvaluator（预览版），必须提供 model_config 信息，而不是 azure_ai_project 中的 GPT 部署。这会通过 Azure AI 项目访问后端评估服务。

AI 辅助内置评估器的提示

我们开放了质量评估器的提示，并在我们的评估器库和 Azure AI 评估 Python SDK 存储库中发布，以增加透明度。安全评估器和 GroundednessProEvaluator （由 Azure AI 内容安全提供支持）除外。这些提示充当语言模型执行其评估任务的说明，这需要对指标及其关联的评分标准进行人工友好的定义。我们强烈建议用户根据其场景具体情况自定义定义和评分标准。请参阅自定义评估程序中的详细信息。

复合评估器

复合评估器是内置评估器，它将单个质量或安全指标组合在一起，可为查询响应对或聊天消息轻松提供各种现成指标。

复合评估器	包含	DESCRIPTION
`QAEvaluator`	`GroundednessEvaluator`、`RelevanceEvaluator`、`CoherenceEvaluator`、`FluencyEvaluator`、`SimilarityEvaluator`、`F1ScoreEvaluator`	将所有质量评估器组合为查询和响应对的单个组合指标输出
`ContentSafetyEvaluator`	`ViolenceEvaluator`、`SexualEvaluator`、`SelfHarmEvaluator`、`HateUnfairnessEvaluator`	将所有安全评估器组合为查询和响应对的单个组合指标输出

使用 `evaluate()` 对测试数据集进行本地评估

在单行数据上抽样检查内置或自定义评估器后，可以对整个测试数据集组合使用多个评估器和 evaluate() API。

Azure AI Foundry 项目的设置前提步骤

如果这是首次运行评估并将其记录到 Azure AI Foundry 项目，则可能需要执行一些额外的设置步骤。

创建存储帐户并将其连接到资源级别的 Azure AI Foundry 项目。此 bicep 模板预配存储帐户，并使用密钥身份验证将存储帐户连接到 Foundry 项目。
确保连接的存储帐户有权访问所有项目。
如果使用 Microsoft Entra ID 连接到存储帐户，请确保在 Azure 门户中向帐户和 Foundry 项目资源授予存储 Blob 数据所有者的 MSI（Microsoft 标识）权限。

评估数据集并将结果记录到 Azure AI Foundry

为了确保 evaluate() 能够正确分析数据，必须指定列映射，以便将列从数据集映射到评估器接受的关键字。在本例中，我们指定了 query、response 和 context 的数据映射。

from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "groundedness": groundedness_eval,
        "answer_length": answer_length
    },
    # column mapping
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${data.context}",
                "response": "${data.response}"
            } 
        }
    },
    # Optionally provide your Azure AI Foundry project information to track your evaluation results in your project portal
    azure_ai_project = azure_ai_project,
    # Optionally provide an output path to dump a json of metric summary, row level data and metric and Azure AI project URL
    output_path="./myevalresults.json"
)

小窍门

获取链接的 result.studio_url 属性的内容，以便在 Azure AI 项目中查看记录的评估结果。

评估器在字典中输出结果，其包含聚合 metrics 和行级数据和指标。输出示例：

{'metrics': {'answer_length.value': 49.333333333333336,
             'groundedness.gpt_groundeness': 5.0, 'groundedness.groundeness': 5.0},
 'rows': [{'inputs.response': 'Paris is the capital of France.',
           'inputs.context': 'Paris has been the capital of France since '
                                  'the 10th century and is known for its '
                                  'cultural and historical landmarks.',
           'inputs.query': 'What is the capital of France?',
           'outputs.answer_length.value': 31,
           'outputs.groundeness.groundeness': 5,
           'outputs.groundeness.gpt_groundeness': 5,
           'outputs.groundeness.groundeness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.context': 'Albert Einstein developed the theory of '
                                  'relativity, with his special relativity '
                                  'published in 1905 and general relativity in '
                                  '1915.',
           'inputs.query': 'Who developed the theory of relativity?',
           'outputs.answer_length.value': 51,
           'outputs.groundeness.groundeness': 5,
           'outputs.groundeness.gpt_groundeness': 5,
           'outputs.groundeness.groundeness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'The speed of light is approximately 299,792,458 '
                            'meters per second.',
           'inputs.context': 'The exact speed of light in a vacuum is '
                                  '299,792,458 meters per second, a constant '
                                  "used in physics to represent 'c'.",
           'inputs.query': 'What is the speed of light?',
           'outputs.answer_length.value': 66,
           'outputs.groundeness.groundeness': 5,
           'outputs.groundeness.gpt_groundeness': 5,
           'outputs.groundeness.groundeness_reason': 'The response to the query is supported by the context.'}],
 'traces': {}}

`evaluate()` 的要求：

evaluate() API 对它接受的数据格式以及它处理评估程序参数键名称的方式有一些要求，以便 Azure AI 项目中的评估结果图表正确显示。

数据格式

evaluate() API 仅接受 JSONLines 格式的数据。对于所有内置评估器，evaluate() 需要采用以下格式的数据以及所需的输入字段。请参阅上一部分，了解内置评估器所需的数据输入。一行的示例如下所示：

{
  "query":"What is the capital of France?",
  "context":"France is in Europe",
  "response":"Paris is the capital of France.",
  "ground_truth": "Paris"
}

评估程序参数格式

传入内置评估程序时，请务必在 evaluators 参数列表中指定正确的关键字映射。下表是在记录到 Azure AI 项目时内置评估程序的结果显示在 UI 中所需的关键字映射。

计算器	关键字参数
`GroundednessEvaluator`	"groundedness"
`GroundednessProEvaluator`	"groundedness_pro"
`RetrievalEvaluator`	"retrieval"
`RelevanceEvaluator`	“相关性”
`CoherenceEvaluator`	“一致性”
`FluencyEvaluator`	"fluency"
`SimilarityEvaluator`	"similarity"
`F1ScoreEvaluator`	"f1_score"
`RougeScoreEvaluator`	“rouge”
`GleuScoreEvaluator`	“gleu”
`BleuScoreEvaluator`	“bleu”
`MeteorScoreEvaluator`	“meteor”
`ViolenceEvaluator`	“暴力”
`SexualEvaluator`	"sexual"
`SelfHarmEvaluator`	"self_harm"
`HateUnfairnessEvaluator`	"hate_unfairness"
`IndirectAttackEvaluator`	"indirect_attack"
`ProtectedMaterialEvaluator`	"protected_material"
`CodeVulnerabilityEvaluator`	“代码漏洞”
`UngroundedAttributesEvaluator`	"ungrounded_attributes"
`QAEvaluator`	"qa"
`ContentSafetyEvaluator`	"content_safety"

下面是设置 evaluators 参数的示例：

result = evaluate(
    data="data.jsonl",
    evaluators={
        "sexual":sexual_evaluator
        "self_harm":self_harm_evaluator
        "hate_unfairness":hate_unfairness_evaluator
        "violence":violence_evaluator
    }
)

目标上的本地评估

如果有要运行然后评估的查询列表，则 evaluate() 还支持 target 参数，该参数可将查询发送到应用程序以收集回答，然后对生成的查询和响应运行评估器。

目标可以是目录中的任何可调用类。在这种情况下，我们有一个具有可调用类askwiki.py的 Python 脚本askwiki()，我们可以将其设置为目标。假设我们有可以发送到简单 askwiki 应用的查询的数据集，则我们可以评估输出的有据性。确保在 "column_mapping" 中为数据指定正确的列映射。可以使用 "default" 来为所有评估器指定列映射。

下面是“data.jsonl”中的内容：

{"query":"When was United Stated found ?", "response":"1776"}
{"query":"What is the capital of France?", "response":"Paris"}
{"query":"Who is the best tennis player of all time ?", "response":"Roger Federer"}

from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "groundedness": groundedness_eval
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}"
                "context": "${outputs.context}"
                "response": "${outputs.response}"
            } 
        }
    }
)

通过