Azure OpenAI Graders (preview)

2025-05-19

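Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Azure OpenAI Graders are a new set of evaluation graders available in the Azure AI Foundry SDK, aimed at evaluating the performance of AI models and their outputs. These graders, including the label grader, string checker, text similarity, and general grader, can be run locally or remotely. Each grader serves a specific purpose in assessing a different aspect of AI model output.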

Model configuration for AI-assisted graders

For reference in the following code snippets, the AI-assisted graders use this model configuration:

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
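Because `os.environ.get` returns `None` for an unset variable, a missing value can go unnoticed until the configuration is actually used. The following is a minimal sketch of a pre-flight check, assuming the four variable names from the snippet above. The helper `missing_env_vars` is hypothetical and not part of the SDK.

```python
import os

# Variable names taken from the configuration snippet above.
REQUIRED_VARS = (
    "AZURE_ENDPOINT",
    "AZURE_API_KEY",
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_API_VERSION",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All model configuration variables are set.")
```

Running this check before building the configuration reports every missing name at once, rather than failing on the first one.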

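Label grader

AzureOpenAILabelGrader uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with an explanation of why each label was chosen.

Note

For best results, we recommend using Azure OpenAI GPT o3-mini.

An example of the data.jsonl used in the following code snippets: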

[
    {
        "query": "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
        "ground_truth": "Choosing an in-network provider helps you save money and ensures better, more personalized care. [Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Choosing the right provider is key to maximizing your health insurance benefits. In-network providers reduce costs, offer better coverage, and support continuity of care, leading to more effective and personalized treatment. [Northwind_Health_Plus_Benefits_Details.pdf][Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What should you do when choosing an in-network provider for your health care needs?",
        "ground_truth": "Check with Northwind Health Plus to confirm the provider is in-network, as this helps reduce costs.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "To choose an in-network provider, confirm they are part of your plan using the Northwind website or by calling their office. Also consider location, availability, and office hours. [Northwind_Standard_Benefits_Details.pdf]"
    },
    {
        "query": "What range of in-network providers does Northwind Health Plus offer?",
        "ground_truth": "Northwind Health Plus offers a wide range of in-network providers, including doctors, specialists, hospitals, and pharmacies.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus offers a wide range of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]. The exact range of in-network providers may vary depending on the area, but the Northwind Health Plus website provides a searchable directory of all in-network providers in your area [Northwind_Health_Plus_Benefits_Details.pdf]. It is important to check with Northwind Health Plus to ensure that the provider you are considering is in-network [Northwind_Health_Plus_Benefits_Details.pdf]."
    },
    {
        "query": "What is the limitation of in-network providers?",
        "ground_truth": "The limitation of in-network providers is that they may not always be available in every area.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "The limitation of in-network providers is that they may not accept the amount of payment offered by Northwind Health, which means you may be responsible for a greater portion of the cost [Northwind_Standard_Benefits_Details.pdf]. Additionally, out-of-network providers may not offer additional services or discounts that are available to in-network providers [Northwind_Standard_Benefits_Details.pdf]. It is recommended to choose an in-network provider whenever possible to ensure the best value for your health care expenses [Northwind_Standard_Benefits_Details.pdf].\n\nSources:\n- Northwind_Standard_Benefits_Details.pdf"
    },
    {
        "query": "What resource does Northwind Health Plus provide to find in-network providers in your area?",
        "ground_truth": "The Northwind Health Plus website offers a searchable directory of all in-network providers in your area. This directory is regularly updated, so you can be sure that you are choosing from in-network providers that are available.\n[Northwind_Health_Plus_Benefits_Details-3.pdf]",
        "response": "Northwind Health Plus provides a variety of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies [Northwind_Health_Plus_Benefits_Details.pdf]."
    }
]
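The listing above is formatted as a JSON array, whereas a JSON Lines file stores one JSON object per line. If your data starts out as an array, a conversion along the following lines may help. This is a sketch using only the standard library; the helper `array_to_jsonl` and the two shortened records are hypothetical.

```python
import json


def array_to_jsonl(records):
    """Serialize a list of dicts as JSON Lines: one compact object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


# Two shortened, hypothetical records with the same three keys as above.
records = [
    {"query": "What is X?", "ground_truth": "X is A.", "response": "X is A."},
    {"query": "What range?", "ground_truth": "Wide.\n[doc.pdf]", "response": "Wide."},
]

jsonl_text = array_to_jsonl(records)

# Writing the file is left commented out so the sketch has no side effects:
# with open("data.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)

# Round trip: every line parses back to the original record.
parsed = [json.loads(line) for line in jsonl_text.strip().split("\n")]
```

Newlines inside values are escaped by `json.dumps`, so each record stays on a single line.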

Label grader example

from azure.ai.evaluation import AzureOpenAILabelGrader, evaluate

data_file_name="data.jsonl"

#  Evaluation criteria: Determine if the response column contains texts that are "too short", "just right", or "too long" and pass if it is "just right"
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[{"content": "{{item.response}}", "role": "user"},
           {"content":"Any text including space that's more than 600 characters are too long, less than 500 characters are too short; 500 to 600 characters are just right.", "role":"user", "type": "message"}],
    labels=["too short", "just right", "too long"],
    passing_labels=["just right"],
    model="gpt-4o",
    name="label",
)

label_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "label": label_grader
    },
)

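Label grader output

For each set of sample data in your data file, an evaluation result of True or False is returned, indicating whether the output matches the defined passing labels. The score is 1.0 for True cases and 0.0 for False cases. The reason the model assigned the label is in the content field under outputs.label.sample.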

'outputs.label.sample':
...
...
    'output': [{'role': 'assistant',
      'content': '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.","conclusion":"The provided text contains 575 characters."},{"description":"Evaluate if the character count falls within the given ranges (greater than 600 too long, less than 500 too short, 500 to 600 just right).","conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],"result":"just right"}'}],
...
...
'outputs.label.label_result': 'pass',
'outputs.label.passed': True,
'outputs.label.score': 1.0

Apart from the individual data evaluation results, the grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'label.pass_rate': 0.2}, #1/5 in this case
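The `content` field in `outputs.label.sample` is itself a JSON string, so the model's reasoning can be extracted with the standard library. The following sketch assumes the `steps` and `result` shape seen in the sample output above, which may vary by model or SDK version. The helper `explain_label` is hypothetical.

```python
import json

# The 'content' string copied from the sample output above.
content = (
    '{"steps":[{"description":"Calculate the number of characters in the user\'s input including spaces.",'
    '"conclusion":"The provided text contains 575 characters."},'
    '{"description":"Evaluate if the character count falls within the given ranges '
    '(greater than 600 too long, less than 500 too short, 500 to 600 just right).",'
    '"conclusion":"The character count falls between 500 and 600, categorized as \'just right.\'"}],'
    '"result":"just right"}'
)


def explain_label(content, passing_labels):
    """Parse the grader's JSON content into label, pass flag, and reasons."""
    parsed = json.loads(content)
    label = parsed["result"]
    return {
        "label": label,
        "passed": label in passing_labels,
        "reasons": [step["conclusion"] for step in parsed["steps"]],
    }


result = explain_label(content, passing_labels=["just right"])
for reason in result["reasons"]:
    print("-", reason)
print(result["label"], result["passed"])
```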

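String checker

Compares input text to a reference value, checking for an exact or partial match with optional case insensitivity. Useful for flexible text validation and pattern matching.

String checker example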

from azure.ai.evaluation import AzureOpenAIStringCheckGrader

# Evaluation criteria: Pass if the query column contains "What is"
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="contains what is",
    operation="like", # "eq" for equal, "ne" for not equal, "like" for contain, "ilike" for case insensitive contain
    reference="What is",
)

string_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "string": string_grader
    },
)

String checker output

For each set of sample data in your data file, an evaluation result of True or False is returned, indicating whether the input text matches the defined pattern-matching rule. The score is 1.0 for True cases and 0.0 for False cases.

'outputs.string.string_result': 'pass',
'outputs.string.passed': True,
'outputs.string.score': 1.0

The grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'string.pass_rate': 0.4}, #2/5 in this case
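To see why the pass rate is 2/5, the four operations can be approximated locally. The following is a sketch based only on the operation descriptions in the code comment above ("eq", "ne", "like", "ilike"). It is not the service's implementation, and `string_check` is a hypothetical helper.

```python
def string_check(text, reference, operation):
    """Local approximation of the four operations described above."""
    if operation == "eq":      # exact match
        return text == reference
    if operation == "ne":      # not equal
        return text != reference
    if operation == "like":    # case-sensitive contains
        return reference in text
    if operation == "ilike":   # case-insensitive contains
        return reference.lower() in text.lower()
    raise ValueError(f"unsupported operation: {operation!r}")


# The five queries from data.jsonl.
queries = [
    "What is the importance of choosing the right provider in getting the most value out of your health insurance plan?",
    "What should you do when choosing an in-network provider for your health care needs?",
    "What range of in-network providers does Northwind Health Plus offer?",
    "What is the limitation of in-network providers?",
    "What resource does Northwind Health Plus provide to find in-network providers in your area?",
]

results = [string_check(q, "What is", "like") for q in queries]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

Only the first and fourth queries contain "What is", which gives the 0.4 pass rate reported above.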

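Text similarity

Evaluates how closely the input text matches a reference value, using a similarity metric such as fuzzy_match, BLEU, ROUGE, or METEOR. Useful for assessing text quality or semantic closeness.

Text similarity example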

from azure.ai.evaluation import AzureOpenAITextSimilarityGrader

# Evaluation criteria: Pass if response column and ground_truth column similarity score >= 0.5 using "fuzzy_match"
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    evaluation_metric="fuzzy_match", # support evaluation metrics including: "fuzzy_match", "bleu", "gleu", "meteor", "rouge_1", "rouge_2", "rouge_3", "rouge_4", "rouge_5", "rouge_l", "cosine",
    input="{{item.response}}",
    name="similarity",
    pass_threshold=0.5,
    reference="{{item.ground_truth}}",
)

sim_grader_evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "similarity": sim_grader
    },
)
sim_grader_evaluation

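Text similarity output

For each set of sample data in your data file, a numerical similarity score is generated. The score ranges from 0 to 1, and a higher score indicates greater similarity. An evaluation result of True or False is also returned, indicating whether the similarity score, computed with the evaluation metric defined in the grader, meets or exceeds the specified threshold.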

'outputs.similarity.similarity_result': 'pass',
'outputs.similarity.passed': True,
'outputs.similarity.score': 0.6117136659436009

The grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'similarity.pass_rate': 0.4}, #2/5 in this case
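The relationship between score and threshold can be illustrated offline. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. It is not the grader's fuzzy_match metric and will generally produce different scores. The helpers `similarity` and `passes` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(text, reference):
    """Ratio in [0, 1] from difflib; NOT the service's fuzzy_match metric."""
    return SequenceMatcher(None, text, reference).ratio()


def passes(score, pass_threshold=0.5):
    """Pass when the score meets or exceeds the threshold."""
    return score >= pass_threshold


# Short illustrative strings: "abcd" and "abef" share the block "ab",
# so the ratio is 2 * 2 / (4 + 4) = 0.5.
score = similarity("abcd", "abef")
print(score, passes(score))
```

A score exactly equal to the threshold passes, matching the "meets or exceeds" rule described above.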

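General grader

Advanced users can import or define a custom grader and integrate it into the AOAI general grader. This allows evaluations based on specific areas of interest beyond the existing AOAI graders. The following example imports the OpenAI StringCheckGrader and constructs it to run as an AOAI general grader in the Foundry SDK.

Example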

from openai.types.graders import StringCheckGrader
from azure.ai.evaluation import AzureOpenAIGrader
 
# Define a string check grader config directly using the OAI SDK
# Evaluation criteria: Pass if query column contains "Northwind"
oai_string_check_grader = StringCheckGrader(
    input="{{item.query}}",
    name="contains Northwind",
    operation="like",
    reference="Northwind",
    type="string_check"
)
# Plug that into the general grader
general_grader = AzureOpenAIGrader(
    model_config=model_config,
    grader_config=oai_string_check_grader
)
evaluation = evaluate(
    data=data_file_name,
    evaluators={
        "general": general_grader,
    },
)
evaluation

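Output

For each set of sample data in your data file, the general grader returns a numerical score, a float between 0 and 1, where a higher score is better. Given a numerical threshold defined as part of the custom grader, it also outputs True if the score >= threshold, or False otherwise.

For example: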

'outputs.general.general_result': 'pass',
'outputs.general.passed': True,
'outputs.general.score': 1.0

Apart from the individual data evaluation results, the grader also returns a metric indicating the overall dataset pass rate.

'metrics': {'general.pass_rate': 0.4}, #2/5 in this case
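The threshold rule and the pass-rate metric can be sketched together. The rows below are hypothetical. Their scores follow which of the five sample queries contain "Northwind" (the third and fifth), and the key names mirror the sample output above. The helpers `apply_threshold` and `pass_rate` and the threshold value are assumptions for illustration only.

```python
def apply_threshold(scores, threshold):
    """Build per-row results: passed is True when score >= threshold."""
    return [
        {"outputs.general.score": s, "outputs.general.passed": s >= threshold}
        for s in scores
    ]


def pass_rate(rows, key):
    """Fraction of rows whose boolean `key` is True (0.0 for no rows)."""
    flags = [bool(row[key]) for row in rows]
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical scores: 1.0 where the query contains "Northwind", else 0.0.
rows = apply_threshold([0.0, 0.0, 1.0, 0.0, 1.0], threshold=1.0)
rate = pass_rate(rows, "outputs.general.passed")
print(rate)
```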
