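Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
The Azure OpenAI graders are a new set of evaluation graders available in the Azure AI Foundry SDK, aimed at evaluating the performance of AI models and their outputs. These graders, including the label grader, string checker, text similarity, and general grader, can be run locally or remotely. Each grader serves a specific purpose in assessing a different aspect of an AI model or its output.
Model configuration for AI-assisted graders
For reference in the following code snippets, the AI-assisted graders use a model configuration as follows: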
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv

# Load the AZURE_* settings from a local .env file into the environment.
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
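The configuration above reads four environment variables, and a missing one shows up only later as a KeyError or a None value. The following is a minimal sketch of a pre-flight check that uses only the standard library. The helper name `missing_settings` is hypothetical and not part of the SDK.

```python
import os

# Environment variable names read by the model configuration snippet above.
REQUIRED_SETTINGS = (
    "AZURE_ENDPOINT",
    "AZURE_API_KEY",
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_API_VERSION",
)


def missing_settings(env=None):
    """Return the required names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_missing = missing_settings({"AZURE_ENDPOINT": "https://example.invalid"})
```

Calling `missing_settings()` with no argument checks the real process environment, so it can run right after `load_dotenv()`.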
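Label grader
AzureOpenAILabelGrader uses your custom prompt to instruct a model to classify outputs based on labels that you define. It returns structured results with an explanation of why each label was chosen.
Note
We recommend using Azure OpenAI GPT o3-mini for best results.
Here is the sample data.jsonl used in the following code snippets: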
[
{
"query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
"ground_truth": "Choosing an in-network provider helps you save money and ensures better, more personalized care. [Northwind_Health_Plus_Benefits_Details-3.pdf]",
"response": "Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer better coverage, and support continuity of care, leading to more effective and personalized treatment. [Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]"
},
{
"query": "What should you do when choosing an in-network provider for your health care needs?",
"ground_truth": "Check with Northwind Health Plus to confirm the provider is in-network, as this helps reduce costs.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
"response": "To choose an in-network provider, confirm they are part of your plan using the Northwind website or by calling their office. Also consider location, availability, and office hours. [Northwind_Standard_Benefits_Details.pdf]"
},
{
"query": "What range of in-network providers does Northwind Health Plus offer?",
"ground_truth": "Northwind Health Plus offers a wide range of in-network providers, including doctors, specialists, hospitals, and pharmacies.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
"response": "Northwind Health Plus offers a wide range of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]. The exact range of in-network providers may vary depending on the area, but the Northwind Health Plus website provides a searchable directory of all in-network providers in your area [Northwind_Health_Plus_Benefits_Details.pdf]. It is important to check with Northwind Health Plus to ensure that the provider you are considering is in-network [Northwind_Health_Plus_Benefits_Details.pdf]."
},
{
"query": "What is the limitation of in-network providers?",
"ground_truth": "The limitation of in-network providers is that they may not always be available in every area.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
"response": "The limitation of in-network providers is that they may not accept the amount of payment offered by Northwind Health, which means you may be responsible for a greater portion of the cost [Northwind_Standard_Benefits_Details.pdf]. Additionally, out-of-network providers may not offer additional services or discounts that are available to in-network providers [Northwind_Standard_Benefits_Details.pdf]. It is recommended to choose an in-network provider whenever possible to ensure the best value for your health care expenses [Northwind_Standard_Benefits_Details.pdf].\n\nSources:\n- Northwind_Standard_Benefits_Details.pdf"
},
{
"query": "What resource does Northwind Health Plus provide to find in-network providers in your area?",
"ground_truth": "The Northwind Health Plus website offers a searchable directory of all in-network providers in your area. This directory is regularly updated, so you can be sure that you are choosing from in-network providers that are available.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
"response": "Northwind Health Plus provides a variety of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]."
}
]
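The sample above is shown as one JSON array, while the `.jsonl` extension conventionally means one JSON object per line. Assuming the evaluation expects the JSON Lines layout (this article does not state it), the following is a minimal sketch of converting an array into that layout. The function name `array_to_jsonl` and the shortened records are illustrative only.

```python
import json


def array_to_jsonl(json_array_text):
    """Convert a JSON array of objects into JSON Lines text (one object per line)."""
    records = json.loads(json_array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps escapes embedded newlines as \n, so each record stays on one line.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records) + "\n"


# Shortened, illustrative records with the same three columns as the sample data.
sample = (
    '[{"query": "q1", "ground_truth": "g1\\n[doc.pdf]", "response": "r1"},'
    ' {"query": "q2", "ground_truth": "g2", "response": "r2"}]'
)
jsonl_text = array_to_jsonl(sample)
```

Writing `jsonl_text` to `data.jsonl` would then give one record per line with the `query`, `ground_truth`, and `response` columns intact.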
Label grader example
from azure.ai.evaluation import AzureOpenAILabelGrader, evaluate

data_file_name = "data.jsonl"

# Evaluation criteria: Determine if the response column contains texts that are "too short", "just right", or "too long" and pass if it is "just right"
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[
        {"content": "{{item.response}}", "role": "user"},
        {"content": "Any text including space that's more than 600 characters are too long, less than 500 characters are too short; 500 to 600 characters are just right.", "role": "user", "type": "message"},
    ],
    labels=["too short", "just right", "too long"],
    passing_labels=["just right"],
    model="gpt-4o",
    name="label",
)

label_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "label": label_grader
    },
)
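The `{{item.response}}` placeholder in the example refers to the `response` column of each data row. The following is a minimal sketch of how such a placeholder could be resolved against one row. It is an assumption about the behavior, not the SDK's actual implementation, and `render_template` is a hypothetical name.

```python
import re

# Matches placeholders such as {{item.response}} or {{ item.query }}.
PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_template(template, item):
    """Replace each {{item.<column>}} placeholder with that column's value from one row."""
    def substitute(match):
        column = match.group(1)
        if column not in item:
            raise KeyError(f"row has no column named {column!r}")
        return str(item[column])

    return PLACEHOLDER.sub(substitute, template)


# Illustrative row with the same column names as the sample data.
row = {
    "query": "What is the limitation of in-network providers?",
    "response": "They may not be available in every area.",
}
rendered = render_template("{{item.response}}", row)
```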
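Label grader output
For each set of sample data in the data file, an evaluation result of True or False is returned, indicating whether the output matches one of the defined passing labels. The score is 1.0 for True cases and 0.0 for False cases. The model's reason for choosing the label is given in content under outputs.label.sample.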
'outputs.label.sample':
...
...
'output': [{'role': 'assistant',
'content': '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.","conclusion":"The provided text contains 575 characters."},{"description":"Evaluate if the character count falls within the given ranges (greater than 600 too long, less than 500 too short, 500 to 600 just right).","conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],"result":"just right"}'}],
...
...
'outputs.label.label_result': 'pass',
'outputs.label.passed': True,
'outputs.label.score': 1.0
In addition to the individual evaluation results, the grader returns a metric that indicates the overall dataset pass rate.
'metrics': {'label.pass_rate': 0.2}, #1/5 in this case
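The `content` value shown under `outputs.label.sample` is itself a JSON string. The following is a minimal sketch of extracting the chosen label and the model's reasoning from it, using the sample content above verbatim. The variable names are illustrative.

```python
import json

# The assistant message content from the sample output above, copied verbatim.
content = (
    '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.",'
    '"conclusion":"The provided text contains 575 characters."},'
    '{"description":"Evaluate if the character count falls within the given ranges '
    '(greater than 600 too long, less than 500 too short, 500 to 600 just right).",'
    '"conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],'
    '"result":"just right"}'
)

parsed = json.loads(content)
label = parsed["result"]
conclusions = [step["conclusion"] for step in parsed["steps"]]

# Same passing label as in the label grader example.
passed = label in ["just right"]
```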
String checker
Compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. Useful for flexible text validation and pattern matching.
String checker example
from azure.ai.evaluation import AzureOpenAIStringCheckGrader, evaluate

# Evaluation criteria: Pass if the query column contains "What is"
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="starts with what is",
    operation="like",  # "eq" for equal, "ne" for not equal, "like" for contain, "ilike" for case insensitive contain
    reference="What is",
)

string_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "string": string_grader
    },
)
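String checker output
For each set of sample data in the data file, an evaluation result of True or False is returned, indicating whether the input text matches the defined pattern-matching rule. The score is 1.0 for True cases and 0.0 for False cases.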
'outputs.string.string_result': 'pass',
'outputs.string.passed': True,
'outputs.string.score': 1.0
The grader also returns a metric that indicates the overall dataset pass rate.
'metrics': {'string.pass_rate': 0.4}, #2/5 in this case
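The four operations named in the example comment can be approximated locally. The following is a minimal sketch, not the grader's implementation, that applies the `like` operation to the five `query` values in the sample data and arrives at the same 2/5 pass rate. The function name `string_check` is hypothetical.

```python
def string_check(input_text, reference, operation):
    """Local approximation of the operations listed in the example comment."""
    if operation == "eq":
        return input_text == reference
    if operation == "ne":
        return input_text != reference
    if operation == "like":
        return reference in input_text
    if operation == "ilike":
        return reference.lower() in input_text.lower()
    raise ValueError(f"unsupported operation: {operation}")


# The five query values from the sample data.
queries = [
    "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
    "What should you do when choosing an in-network provider for your health care needs?",
    "What range of in-network providers does Northwind Health Plus offer?",
    "What is the limitation of in-network providers?",
    "What resource does Northwind Health Plus provide to find in-network providers in your area?",
]
passed = [string_check(query, "What is", "like") for query in queries]
pass_rate = sum(passed) / len(passed)  # 2 of 5 queries contain "What is"
```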
Text similarity
Evaluates how closely input text matches a reference value by using similarity metrics such as fuzzy_match, BLEU, ROUGE, or METEOR. Useful for assessing text quality or semantic closeness.
Text similarity example
from azure.ai.evaluation import AzureOpenAITextSimilarityGrader, evaluate

# Evaluation criteria: Pass if response column and ground_truth column similarity score >= 0.5 using "fuzzy_match"
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    # Supported evaluation metrics include: "fuzzy_match", "bleu", "gleu", "meteor", "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l", "cosine"
    evaluation_metric="fuzzy_match",
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "similarity": sim_grader
    },
)
sim_grader_evaluation
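Text similarity output
For each set of sample data in the data file, a numerical similarity score is generated. This score ranges from 0 to 1 and indicates the degree of similarity, with higher scores representing greater similarity. In addition, an evaluation result of True or False is returned, indicating whether the similarity score meets or exceeds the specified threshold for the evaluation metric defined in the grader.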
'outputs.similarity.similarity_result': 'pass',
'outputs.similarity.passed': True,
'outputs.similarity.score': 0.6117136659436009
The grader also returns a metric that indicates the overall dataset pass rate.
'metrics': {'similarity.pass_rate': 0.4}, #2/5 in this case
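The score-then-threshold pattern described above can be illustrated locally. The following is a minimal sketch that uses `difflib.SequenceMatcher` from the Python standard library as a stand-in metric. It is not the grader's `fuzzy_match` implementation, so its scores will differ from the service's, and `similarity_check` is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity_check(candidate, reference, pass_threshold=0.5):
    """Score two strings in the range 0 to 1 and compare the score to a threshold.

    SequenceMatcher is a stand-in metric, not the grader's fuzzy_match.
    """
    score = SequenceMatcher(None, candidate, reference).ratio()
    return {"score": score, "passed": score >= pass_threshold}


identical = similarity_check("in-network providers", "in-network providers")
unrelated = similarity_check("abc", "xyz")
```

As with the grader, a result passes when its score meets or exceeds `pass_threshold`.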
General grader
Advanced users can import or define a custom grader and integrate it into the AOAI general grader. This allows evaluations based on specific areas of interest beyond the existing AOAI graders. The following example imports the OpenAI StringCheckGrader and constructs it to run as an AOAI general grader on the Foundry SDK.
Example:
from openai.types.graders import StringCheckGrader
from azure.ai.evaluation import AzureOpenAIGrader, evaluate

# Define a string check grader config directly using the OAI SDK
# Evaluation criteria: Pass if query column contains "Northwind"
oai_string_check_grader = StringCheckGrader(
    input="{{item.query}}",
    name="contains hello",
    operation="like",
    reference="Northwind",
    type="string_check"
)

# Plug that into the general grader
general_grader = AzureOpenAIGrader(
    model_config=model_config,
    grader_config=oai_string_check_grader
)

evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "general": general_grader,
    },
)
evaluation
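Output
For each set of sample data in the data file, the general grader returns a numerical score, a float from 0 to 1, where a higher score is better. Given a numerical threshold defined as part of the custom grader, the output is also True if the score >= threshold, or False otherwise.
For example: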
'outputs.general.general_result': 'pass',
'outputs.general.passed': True,
'outputs.general.score': 1.0
In addition to the individual evaluation results, the grader returns a metric that indicates the overall dataset pass rate.
'metrics': {'general.pass_rate': 0.4}, #2/5 in this case
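The pass rate above can be recomputed from the per-row flags, which also shows which rows failed. The following is a minimal sketch that assumes the evaluation result is a dict with a `rows` list whose entries carry an `outputs.<name>.passed` flag. The `rows` key is an assumption not shown in this article, `summarize` is a hypothetical helper, and the data is a hand-written stand-in rather than real service output.

```python
def summarize(result, grader_name):
    """Recompute the pass rate and collect the indexes of failing rows for one grader.

    Assumes result["rows"] is a list of dicts that carry an
    "outputs.<grader_name>.passed" flag (an assumption, see the lead-in).
    """
    key = f"outputs.{grader_name}.passed"
    flags = [bool(row.get(key)) for row in result["rows"]]
    failing = [index for index, ok in enumerate(flags) if not ok]
    pass_rate = sum(flags) / len(flags) if flags else 0.0
    return {"pass_rate": pass_rate, "failing_rows": failing}


# Hand-written stand-in: only the third and fifth sample queries contain "Northwind".
mock_result = {
    "rows": [
        {"outputs.general.passed": False},
        {"outputs.general.passed": False},
        {"outputs.general.passed": True},
        {"outputs.general.passed": False},
        {"outputs.general.passed": True},
    ],
}
summary = summarize(mock_result, "general")
```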