你当前正在访问 Microsoft Azure Global Edition 技术文档网站。如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站，请访问 https://docs.azure.cn。

Azure OpenAI 评分者（预览版）

2025-05-20

重要

本文中标记了“（预览版）”的项目目前为公共预览版。此预览版未提供服务级别协议，不建议将其用于生产工作负载。某些功能可能不受支持或者受限。有关详细信息，请参阅 Microsoft Azure 预览版补充使用条款。

Azure OpenAI 评分者是 Azure AI Foundry SDK 中提供的一组新的评估评分员，旨在评估 AI 模型及其输出的性能。这些评分员，包括标签评分员、字符串检查器、文本相似性和常规评分程序，都可以在本地或远程运行。每个评分员在评估 AI 模型/模型输出的不同方面都具有特定用途。

AI 辅助评分器的模型配置

为了在以下代码片段中参考，AI 辅助分级程序使用模型配置，如下所示：

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get["AZURE_API_KEY"],
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

标签分级器

AzureOpenAILabelGrader 使用自定义提示指示模型根据定义的标签对输出进行分类。它返回结构化结果，并说明选择每个标签的原因。

注释

建议使用 Azure OpenAI GPT o3-mini 获得最佳结果。

下面是以下代码片段中使用的示例 data.jsonl ：

[
    {
        "query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
        "ground_truth": "Choosing an in-network provider helps you save money and ensures better, more personalized care. [Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer better coverage, and support continuity of care, leading to more effective and personalized treatment. [Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What should you do when choosing an in-network provider for your health care needs?",
        "ground_truth": "Check with Northwind Health Plus to confirm the provider is in-network, as this helps reduce costs.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "To choose an in-network provider, confirm they are part of your plan using the Northwind website or by calling their office. Also consider ___location, availability, and office hours. [Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What range of in-network providers does Northwind Health Plus offer?",
        "ground_truth": "Northwind Health Plus offers a wide range of in-network providers, including doctors, specialists, hospitals, and pharmacies.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus offers a wide range of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]. The exact range of in-network providers may vary depending on the area, but the Northwind Health Plus website provides a searchable directory of all in-network providers in your area [Northwind_Health_Plus_Benefits_Details.pdf]. It is important to check with Northwind Health Plus to ensure that the provider you are considering is in-network [Northwind_Health_Plus_Benefits_Details.pdf]."
    },
    {
        "query": "What is the limitation of in-network providers?",
        "ground_truth": "The limitation of in-network providers is that they may not always be available in every area.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "The limitation of in-network providers is that they may not accept the amount of payment offered by Northwind Health, which means you may be responsible for a greater portion of the cost [Northwind_Standard_Benefits_Details.pdf]. Additionally, out-of-network providers may not offer additional services or discounts that are available to in-network providers [Northwind_Standard_Benefits_Details.pdf]. It is recommended to choose an in-network provider whenever possible to ensure the best value for your health care expenses [Northwind_Standard_Benefits_Details.pdf].\n\nSources:\n- Northwind_Standard_Benefits_Details.pdf"
    },
    {
        "query": "What resource does Northwind Health Plus provide to find in-network providers in your area?",
        "ground_truth": "The Northwind Health Plus website offers a searchable directory of all in-network providers in your area. This directory is regularly updated, so you can be sure that you are choosing from in-network providers that are available.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus provides a variety of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]."
    }
]

标签分级程序示例

from azure.ai.evaluation import AzureOpenAILabelGrader, evaluate

data_file_name="data.jsonl"

#  Evaluation criteria: Determine if the response column contains texts that are "too short", "just right", or "too long" and pass if it is "just right"
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[{"content": "{{item.response}}", "role": "user"}
           {"content":"Any text including space that's more than 600 characters are too long, less than 500 characters are too short; 500 to 600 characters are just right.", "role":"user", "type": "message"}],
    labels=["too short", "just right", "too long"],
    passing_labels=["just right"],
    model="gpt-4o",
    name="label",
)

label_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "label": label_grader
    },
)

标签分级器输出

对于数据文件中包含的每个示例数据集，如果输出与定义的传递标签匹配，则True返回或返回评估结果False。对于score事例1.0True，则score为0.0False案例。模型提供数据contentoutputs.label.sample标签的原因。

'outputs.label.sample':
...
...
    'output': [{'role': 'assistant',
      'content': '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.","conclusion":"The provided text contains 575 characters."},{"description":"Evaluate if the character count falls within the given ranges (greater than 600 too long, less than 500 too short, 500 to 600 just right).","conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],"result":"just right"}'}],
...
...
'outputs.label.label_result': 'pass',
'outputs.label.passed': True,
'outputs.label.score': 1.0

除了单个数据评估结果之外，评分器还返回一个指示总体数据集通过率的指标。

'metrics': {'label.pass_rate': 0.2}, #1/5 in this case

字符串检查器

将输入文本与引用值进行比较，检查是否与可选的不区分大小写的完全匹配项或部分匹配项。适用于灵活的文本验证和模式匹配。

字符串检查器示例

from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Evaluation criteria: Pass if the query column contains "What is"
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="starts with what is",
    operation="like", # "eq" for equal, "ne" for not equal, "like" for contain, "ilike" for case insensitive contain
    reference="What is",
)

string_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "string": string_grader
    },
)

字符串检查器输出

对于数据文件中包含的每个示例数据集，如果输入文本与定义的模式匹配规则匹配，则True返回或返回评估结果False。对于score事例1.0True，则score为0.0False案例。

'outputs.string.string_result': 'pass',
'outputs.string.passed': True,
'outputs.string.score': 1.0

评分器还返回指示总体数据集通过率的指标。

'metrics': {'string.pass_rate': 0.4}, #2/5 in this case

文本相似性

使用相似性指标（如fuzzy_match、BLEU或ROUGEMETEOR）评估接近输入文本与引用值匹配的方式。用于评估文本质量或语义接近性。

文本相似性示例

from azure.ai.evaluation import AzureOpenAITextSimilarityGrader

# Evaluation criteria: Pass if response column and ground_truth column similarity score >= 0.5 using "fuzzy_match"
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    evaluation_metric="fuzzy_match", # support evaluation metrics including: "fuzzy_match", "bleu", "gleu", "meteor", "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l", "cosine",
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "similarity": sim_grader
    },
)
evaluation

文本相似性输出

对于数据文件中包含的每组示例数据，将生成数字相似性分数。此分数（范围为 0 到 1）表示相似性程度，分数越高表示相似性越高。此外，返回或True返回评估结果False，表示相似性分数是否符合或超过指定阈值，具体取决于在评分器中定义的评估指标。

'outputs.similarity.similarity_result': 'pass',
'outputs.similarity.passed': True,
'outputs.similarity.score': 0.6117136659436009