Evaluate your generative AI application locally with the Azure AI Evaluation SDK

2025-05-20

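Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

To thoroughly assess the performance of your generative AI application when it's applied to a substantial dataset, you can evaluate it in your development environment with the Azure AI Evaluation SDK. Given either a test dataset or a target, your application's performance is quantitatively measured with both mathematical-based metrics and AI-assisted quality and safety evaluators. Built-in or custom evaluators can give you comprehensive insights into the application's capabilities and limitations.

In this article, you learn how to run evaluators on a single row of data, how to run built-in evaluators locally on a larger test dataset against an application target by using the Azure AI Evaluation SDK, and how to track the results and evaluation logs in your Azure AI project.

Get started

First, install the evaluators package from the Azure AI Evaluation SDK: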

pip install azure-ai-evaluation

Note

For more information, see the API reference documentation for the Azure AI Evaluation SDK.

Built-in evaluators

Category	Evaluators
General purpose	`CoherenceEvaluator`, `FluencyEvaluator`, `QAEvaluator`
Textual similarity	`SimilarityEvaluator`, `F1ScoreEvaluator`, `BleuScoreEvaluator`, `GleuScoreEvaluator`, `RougeScoreEvaluator`, `MeteorScoreEvaluator`
Retrieval-augmented generation (RAG)	`RetrievalEvaluator`, `DocumentRetrievalEvaluator`, `GroundednessEvaluator`, `GroundednessProEvaluator`, `RelevanceEvaluator`, `ResponseCompletenessEvaluator`
Risk and safety	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator`, `UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ContentSafetyEvaluator`
Agentic	`IntentResolutionEvaluator`, `ToolCallAccuracyEvaluator`, `TaskAdherenceEvaluator`
Azure OpenAI	`AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader`

Built-in quality and safety metrics take in query and response pairs, along with additional information for specific evaluators.

Data requirements for built-in evaluators

Built-in evaluators can accept query and response pairs, a list of conversations in .jsonl format, or both.

Conversation and single-turn support for text	Conversation and single-turn support for text and image	Single-turn support for text only
`GroundednessEvaluator`, `GroundednessProEvaluator`, `RetrievalEvaluator`, `DocumentRetrievalEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `ResponseCompletenessEvaluator`, `IndirectAttackEvaluator`, `AzureOpenAILabelGrader`, `AzureOpenAIStringCheckGrader`, `AzureOpenAITextSimilarityGrader`, `AzureOpenAIGrader`	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `ContentSafetyEvaluator`	`UngroundedAttributesEvaluator`, `CodeVulnerabilityEvaluator`, `ResponseCompletenessEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`, `QAEvaluator`

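Note

AI-assisted quality evaluators, except for SimilarityEvaluator, come with a reason field. They employ techniques such as chain-of-thought reasoning to generate an explanation for the score. As a result, they consume more tokens during generation in exchange for improved evaluation quality. Specifically, max_token for evaluator generation is set to 800 for all AI-assisted evaluators (and to 1600 for RetrievalEvaluator, to accommodate longer inputs).

Note

Azure OpenAI graders require a template that describes how their input columns are turned into the real input that the grader uses. For example, if you have two inputs called "query" and "response" and a template formatted as {{item.query}}, only the query is used. Similarly, you could have something like {{item.conversation}} to accept a conversation input, but whether the system can handle it depends on how you configure the rest of the grader to expect that input.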
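To make the template behavior concrete, here is a minimal sketch of placeholder substitution in plain Python. It is an illustration only, not the grader's actual implementation: the function name `render_template` and the regular expression are assumptions introduced for this example.

```python
import re

# Matches placeholders such as {{item.query}} or {{ item.response }}.
_PLACEHOLDER = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_template(template: str, item: dict) -> str:
    """Replace each {{item.<column>}} placeholder with the value from `item`.

    Hypothetical helper: it only illustrates that columns not named in the
    template never reach the rendered grader input.
    """
    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in item:
            raise KeyError(f"template references missing column: {column}")
        return str(item[column])

    return _PLACEHOLDER.sub(substitute, template)


row = {"query": "What is the capital of France?", "response": "Paris."}

# Only the query column is referenced, so the response is ignored.
only_query = render_template("{{item.query}}", row)

# Referencing both columns makes both available to the grader.
both = render_template("Q: {{item.query}} A: {{item.response}}", row)
```

With the first template the response column is never used, which matches the behavior described in the note above.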

For more information on data requirements for agentic evaluators, see Run agent evaluations locally with the Azure AI Evaluation SDK.

Single-turn support for text

All built-in evaluators take single-turn inputs as query and response pairs in strings. For example:

from azure.ai.evaluation import RelevanceEvaluator

query = "What is the capital of France?"
response = "Paris."

# Initializing an evaluator.
# model_config is the judge model configuration described in the Setup section.
relevance_eval = RelevanceEvaluator(model_config)
relevance_eval(query=query, response=response)

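To run batch evaluations by using local evaluation, or to upload your dataset to run a cloud evaluation, you need to represent the dataset in .jsonl format. The previous single-turn data (a query and response pair) is equivalent to one line of a dataset like the following (three lines are shown as an example):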

{"query":"What is the capital of France?","response":"Paris."}
{"query":"What atoms compose water?","response":"Hydrogen and oxygen."}
{"query":"What color is my shirt?","response":"Blue."}
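As a minimal sketch, the following standard-library code writes query and response pairs to a .jsonl file, one JSON object per line. The helper name `write_jsonl` and the output location are assumptions made for this example; they are not part of the SDK.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, path):
    """Write one compact JSON object per line (the JSON Lines layout)."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


pairs = [
    {"query": "What is the capital of France?", "response": "Paris."},
    {"query": "What atoms compose water?", "response": "Hydrogen and oxygen."},
    {"query": "What color is my shirt?", "response": "Blue."},
]

# Hypothetical output location; a temp directory keeps the sketch side-effect free.
out_path = Path(tempfile.gettempdir()) / "single_turn_example.jsonl"
write_jsonl(pairs, out_path)
```

Each dictionary becomes exactly one line, so the file can be read back with `json.loads` line by line.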

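The evaluation test dataset can contain the following, depending on the requirements of each built-in evaluator:

Query: The query sent to the generative AI application.
Response: The response to the query generated by the generative AI application.
Context: The source on which the generated response is based (that is, the grounding documents).
Ground truth: The response generated by a user or human as the true answer.

To see what each evaluator requires, refer to the built-in evaluators documentation.

Conversation support for text

For evaluators that support conversations for text, you can provide conversation as input: a Python dictionary that holds a list of messages (each with content, role, and optionally context).

Here's an example of a two-turn conversation in Python: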

conversation = {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": None
        }
        ]
}

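To run batch evaluations by using local evaluation, or to upload your dataset to run a cloud evaluation, you need to represent the dataset in .jsonl format. The previous conversation is equivalent to one dataset line in a .jsonl file like the following (shown here across several lines for readability; in the file, each record occupies a single line):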

{"conversation":
    {
        "messages": [
        {
            "content": "Which tent is the most waterproof?", 
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant", 
            "context": "From our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight."
        },
        {
            "content": "How much does it cost?",
            "role": "user"
        },
        {
            "content": "The Alpine Explorer Tent is $120.",
            "role": "assistant",
            "context": null
        }
        ]
    }
}

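The evaluators understand that the first turn of the conversation provides a valid query from user, context from assistant, and response from assistant in the query-response format. Conversations are then evaluated per turn, and the results are aggregated over all turns for a conversation score.

Note

In the second turn, even if context is null or a missing key, it's interpreted as an empty string instead of raising an error, which might lead to misleading results. We strongly recommend that you validate your evaluation data to comply with the data requirements.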
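The per-turn reading of a conversation, and the validation recommended in the note, can be sketched in plain Python. This is an illustration under stated assumptions, not the SDK's internal logic: the helper `split_turns` and the `context_missing` flag are hypothetical names introduced here.

```python
def split_turns(conversation):
    """Pair each user message with the assistant message that follows it.

    A null or absent context is normalized to an empty string (mirroring the
    behavior described in the note) and flagged so it can be caught before
    an evaluation run.
    """
    turns = []
    pending_query = None
    for message in conversation["messages"]:
        if message["role"] == "user":
            pending_query = message["content"]
        elif message["role"] == "assistant" and pending_query is not None:
            context = message.get("context")
            turns.append(
                {
                    "query": pending_query,
                    "response": message["content"],
                    "context": context or "",
                    "context_missing": not context,
                }
            )
            pending_query = None
    return turns


conversation = {
    "messages": [
        {"content": "Which tent is the most waterproof?", "role": "user"},
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant",
            "context": "The alpine explorer tent is the most waterproof.",
        },
        {"content": "How much does it cost?", "role": "user"},
        {"content": "The Alpine Explorer Tent is $120.", "role": "assistant", "context": None},
    ]
}

turns = split_turns(conversation)
# Indexes of turns that would be evaluated against an empty context.
turns_without_context = [i for i, turn in enumerate(turns) if turn["context_missing"]]
```

Running a check like this before evaluation surfaces turns that would otherwise be scored against an empty context.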

For conversation mode, here's an example for GroundednessEvaluator:

# Conversation mode
import json
import os
from azure.ai.evaluation import GroundednessEvaluator, AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_ENDPOINT"),
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

# Initializing the Groundedness evaluator
groundedness_eval = GroundednessEvaluator(model_config)

conversation = {
    "messages": [
        { "content": "Which tent is the most waterproof?", "role": "user" },
        { "content": "The Alpine Explorer Tent is the most waterproof", "role": "assistant", "context": "From the our product list the alpine explorer tent is the most waterproof. The Adventure Dining Table has higher weight." },
        { "content": "How much does it cost?", "role": "user" },
        { "content": "$120.", "role": "assistant", "context": "The Alpine Explorer Tent is $120."}
    ]
}

# alternatively, you can load the same content from a .jsonl file
groundedness_conv_score = groundedness_eval(conversation=conversation)
print(json.dumps(groundedness_conv_score, indent=4))

For conversation outputs, per-turn results are stored in a list, and the overall conversation score 'groundedness': 5.0 is the average over the turns:

{
    "groundedness": 5.0,
    "gpt_groundedness": 5.0,
    "groundedness_threshold": 3.0,
    "evaluation_per_turn": {
        "groundedness": [
            5.0,
            5.0
        ],
        "gpt_groundedness": [
            5.0,
            5.0
        ],
        "groundedness_reason": [
            "The response accurately and completely answers the query by stating that the Alpine Explorer Tent is the most waterproof, which is directly supported by the context. There are no irrelevant details or incorrect information present.",
            "The RESPONSE directly answers the QUERY with the exact information provided in the CONTEXT, making it fully correct and complete."
        ],
        "groundedness_result": [
            "pass",
            "pass"
        ],
        "groundedness_threshold": [
            3,
            3
        ]
    }
}

Note

We strongly recommend that you migrate your code to use the key without prefixes (for example, groundedness.groundedness) so that your code can support more evaluator models.
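As a small sketch of that migration, the following plain-Python helpers read a conversation result by its unprefixed key, fall back to the `gpt_`-prefixed key only when the unprefixed one is absent, and recompute the turn average. The helper names are hypothetical; the sample dictionary is an abbreviated copy of the output shown above.

```python
def get_score(result, metric):
    """Prefer the unprefixed key; fall back to the legacy gpt_-prefixed key."""
    if metric in result:
        return result[metric]
    legacy_key = f"gpt_{metric}"
    if legacy_key in result:
        return result[legacy_key]
    raise KeyError(f"no score found for metric: {metric}")


def mean_of_turns(result, metric):
    """Average the per-turn scores stored under evaluation_per_turn."""
    per_turn = result["evaluation_per_turn"][metric]
    return sum(per_turn) / len(per_turn)


# Abbreviated copy of the conversation output shown above.
sample = {
    "groundedness": 5.0,
    "gpt_groundedness": 5.0,
    "evaluation_per_turn": {"groundedness": [5.0, 5.0]},
}

overall = get_score(sample, "groundedness")
recomputed = mean_of_turns(sample, "groundedness")
```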

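For evaluators that support conversations for images and multi-modal image and text, you can pass in image URLs or base64-encoded images in conversation.

Here are examples of supported scenarios:

Multiple images with text input to image or text generation
Text-only input to image generation
Image-only input to text generation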

from pathlib import Path
from azure.ai.evaluation import ContentSafetyEvaluator
import base64

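# instantiate an evaluator with image and multi-modal support
# (azure_cred and project_scope are your credential and Azure AI project information; see the Setup section)
safety_evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)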

# example of a conversation with an image URL
conversation_image_url = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}

# example of a conversation with base64 encoded images
base64_image = ""

with Path("Image1.jpg").open("rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

conversation_base64 = {
    "messages": [
        {"content": "create an image of a branded apple", "role": "user"},
        {
            "content": [{"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{base64_image}"}}],
            "role": "assistant",
        },
    ]
}

# run the evaluation on the conversation to output the result
safety_score = safety_evaluator(conversation=conversation_image_url)

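Currently, the image and multi-modal evaluators support:

Single turn only (a conversation can have only one user message and one assistant message)
Conversations with at most one system message
Conversation payloads smaller than 10 MB (including images)
Absolute URLs and Base64-encoded images
Multiple images in a single turn
JPG/JPEG, PNG, and GIF file formats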
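Some of the constraints above can be checked locally before calling an evaluator. The following sketch is an assumption-laden illustration, not SDK behavior: the helper `check_image_conversation` is hypothetical, the 10 MB limit is interpreted as 10 × 1024 × 1024 bytes, and the payload size is approximated by the JSON-serialized conversation (so images referenced by URL are not counted).

```python
import json

# Assumption: "10 MB" is read as 10 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024


def check_image_conversation(conversation):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    roles = [message["role"] for message in conversation["messages"]]

    if roles.count("system") > 1:
        problems.append("more than one system message")
    if roles.count("user") != 1 or roles.count("assistant") != 1:
        problems.append("expected exactly one user and one assistant message")

    # Approximation: size of the serialized conversation, inline base64 images included.
    size = len(json.dumps(conversation).encode("utf-8"))
    if size >= MAX_PAYLOAD_BYTES:
        problems.append(f"payload is {size} bytes, limit is {MAX_PAYLOAD_BYTES}")
    return problems


single_turn = {
    "messages": [
        {"role": "system", "content": [{"type": "text", "text": "You understand images."}]},
        {"role": "user", "content": [{"type": "text", "text": "Describe this image."}]},
        {"role": "assistant", "content": [{"type": "text", "text": "A red apple."}]},
    ]
}
two_turns = {"messages": single_turn["messages"] + single_turn["messages"][1:]}

ok_report = check_image_conversation(single_turn)
bad_report = check_image_conversation(two_turns)
```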

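Setup

For AI-assisted quality evaluators, except for GroundednessProEvaluator (preview), you must specify a GPT model (gpt-35-turbo, gpt-4, gpt-4-turbo, gpt-4o, or gpt-4o-mini) in your model_config to act as the judge that scores the evaluation data. Both Azure OpenAI and OpenAI model configuration schemas are supported. For the best performance and parseable responses with these evaluators, we recommend using GPT models that aren't in preview.

Note

We strongly recommend that you replace gpt-3.5-turbo with gpt-4o-mini for your evaluator model, because according to OpenAI the latter is cheaper, more capable, and just as fast.

To make inference calls with the API key, make sure that you have at least the Cognitive Services OpenAI User role for the Azure OpenAI resource. For more information about permissions, see Permissions for an Azure OpenAI resource.

For all risk and safety evaluators and GroundednessProEvaluator (preview), you must provide your azure_ai_project information instead of a GPT deployment in model_config. The evaluators use it to access the back-end evaluation service through your Azure AI project.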
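As a hedged sketch, the project information can be assembled from environment variables as shown below. It assumes the dictionary form of azure_ai_project with the keys subscription_id, resource_group_name, and project_name; confirm the accepted forms in the API reference for your SDK version. The environment variable names and the helper `load_project_scope` are hypothetical.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
_ENV_NAMES = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZURE_RESOURCE_GROUP",
    "project_name": "AZURE_PROJECT_NAME",
}


def load_project_scope(env=None):
    """Build the project dictionary, failing early if a value is missing."""
    env = os.environ if env is None else env
    missing = [name for name in _ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[name] for key, name in _ENV_NAMES.items()}


# Example with placeholder values instead of real identifiers.
example_env = {
    "AZURE_SUBSCRIPTION_ID": "<subscription-id>",
    "AZURE_RESOURCE_GROUP": "<resource-group>",
    "AZURE_PROJECT_NAME": "<project-name>",
}
project_scope = load_project_scope(example_env)
```

The resulting dictionary is what the examples in this article pass as azure_ai_project, together with a credential object such as one from the azure-identity package.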

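Prompts for AI-assisted built-in evaluators

For transparency, we open-source the prompts of our quality evaluators in the Evaluator Library and the Azure AI Evaluation Python SDK repository, except for the safety evaluators and GroundednessProEvaluator (powered by Azure AI Content Safety). These prompts serve as instructions for a language model to perform its evaluation task, which requires a human-friendly definition of the metric and its associated scoring rubrics. We highly recommend that you customize the definitions and grading rubrics to your scenario specifics. For more information, see Custom evaluators.

Composite evaluators

Composite evaluators are built-in evaluators that combine individual quality or safety metrics, so that you get a wide range of metrics out of the box for both query and response pairs and chat messages.

Composite evaluator	Contains	Description
`QAEvaluator`	`GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `F1ScoreEvaluator`	Combines all the quality evaluators for a single output of combined metrics for query and response pairs
`ContentSafetyEvaluator`	`ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`	Combines all the safety evaluators for a single output of combined metrics for query and response pairs

Local evaluation on test datasets using `evaluate()`

After you spot-check your built-in or custom evaluators on a single row of data, you can combine multiple evaluators with the evaluate() API on an entire test dataset.

Prerequisite setup steps for Azure AI Foundry projects

If this is your first time running evaluations and logging them to your Azure AI Foundry project, you might need to complete a few additional setup steps:

Create a storage account and connect it to your Azure AI Foundry project at the resource level. This Bicep template provisions a storage account and connects it to your Foundry project with key authentication.
Make sure that the connected storage account has access to all projects.
If you connected your storage account with Microsoft Entra ID, make sure to grant MSI (Microsoft Identity) permissions for Storage Blob Data Owner to both your account and the Foundry project resource in the Azure portal.

Evaluate on a dataset and log results to Azure AI Foundry

To ensure that evaluate() can correctly parse the data, you must specify a column mapping that maps the columns of the dataset to the keywords that the evaluators accept. In this case, we specify the data mapping for query, response, and context.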

from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "groundedness": groundedness_eval,
        "answer_length": answer_length
    },
    # column mapping
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.queries}",
                "context": "${data.context}",
                "response": "${data.response}"
            } 
        }
    },
    # Optionally provide your Azure AI Foundry project information to track your evaluation results in your project portal
    azure_ai_project = azure_ai_project,
    # Optionally provide an output path to dump a json of metric summary, row level data and metric and Azure AI project URL
    output_path="./myevalresults.json"
)
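The call above references answer_length without defining it. One plausible shape for it is a custom code-based evaluator: a callable that receives the mapped columns as keyword arguments and returns a dictionary of metrics. The class below is a hypothetical sketch under that assumption; the name `AnswerLengthEvaluator` and the returned key "value" are chosen only so that the output key reads answer_length.value, as in the sample output in this article.

```python
class AnswerLengthEvaluator:
    """Hypothetical custom evaluator: reports the character length of a response."""

    def __call__(self, *, response: str, **kwargs):
        # The returned key ("value") is combined with the evaluator's name in the
        # evaluators dictionary, giving a column such as answer_length.value.
        return {"value": len(response)}


answer_length = AnswerLengthEvaluator()

# Spot-check the evaluator on a single row before using it in a batch run.
single_row_result = answer_length(response="Paris is the capital of France.")
```

The spot check returns a length of 31 for this response, which agrees with the first row of the sample output in this article.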

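Tip

Get the contents of the result.studio_url property for a link to view your logged evaluation results in your Azure AI project.

The evaluator outputs results in a dictionary, which contains aggregate metrics and row-level data and metrics. Here's an example of an output: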

{'metrics': {'answer_length.value': 49.333333333333336,
             'groundedness.gpt_groundedness': 5.0, 'groundedness.groundedness': 5.0},
 'rows': [{'inputs.response': 'Paris is the capital of France.',
           'inputs.context': 'Paris has been the capital of France since '
                                  'the 10th century and is known for its '
                                  'cultural and historical landmarks.',
           'inputs.query': 'What is the capital of France?',
           'outputs.answer_length.value': 31,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'Albert Einstein developed the theory of '
                            'relativity.',
           'inputs.context': 'Albert Einstein developed the theory of '
                                  'relativity, with his special relativity '
                                  'published in 1905 and general relativity in '
                                  '1915.',
           'inputs.query': 'Who developed the theory of relativity?',
           'outputs.answer_length.value': 51,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'},
          {'inputs.response': 'The speed of light is approximately 299,792,458 '
                            'meters per second.',
           'inputs.context': 'The exact speed of light in a vacuum is '
                                  '299,792,458 meters per second, a constant '
                                  "used in physics to represent 'c'.",
           'inputs.query': 'What is the speed of light?',
           'outputs.answer_length.value': 66,
           'outputs.groundedness.groundedness': 5,
           'outputs.groundedness.gpt_groundedness': 5,
           'outputs.groundedness.groundedness_reason': 'The response to the query is supported by the context.'}],
 'traces': {}}
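The aggregate metrics can be cross-checked against the row-level data. The sketch below uses an abbreviated copy of the sample output and a hypothetical helper, `column_mean`; it assumes the aggregate is a plain arithmetic mean over rows, which is consistent with the numbers shown but is not stated by the source.

```python
def column_mean(rows, column):
    """Arithmetic mean of one row-level column, skipping rows without a value."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        raise ValueError(f"no values found for column: {column}")
    return sum(values) / len(values)


# Abbreviated copy of the sample output: only the numeric columns are kept.
result = {
    "metrics": {"answer_length.value": 49.333333333333336},
    "rows": [
        {"outputs.answer_length.value": 31},
        {"outputs.answer_length.value": 51},
        {"outputs.answer_length.value": 66},
    ],
}

recomputed = column_mean(result["rows"], "outputs.answer_length.value")
matches = abs(recomputed - result["metrics"]["answer_length.value"]) < 1e-9
```

Here (31 + 51 + 66) / 3 reproduces the aggregate answer_length.value reported under metrics.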

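Requirements for `evaluate()`

The evaluate() API has a few requirements for the data format that it accepts and for how it handles evaluator parameter key names, so that the charts of the evaluation results in your Azure AI project show up properly.

Data format

The evaluate() API accepts only data in JSONLines format. For all built-in evaluators, evaluate() requires data in the following format with the required input fields. See the previous section on the data requirements for built-in evaluators. A sample of one line can look like the following: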

{
  "query":"What is the capital of France?",
  "context":"France is in Europe",
  "response":"Paris is the capital of France.",
  "ground_truth": "Paris"
}
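Because JSON Lines expects one complete JSON object per line, a record that is pretty-printed across several lines (as in the sample above) has to be written on a single line in the actual file. The following sketch, using only the standard library, checks that each line parses and carries the fields you expect. The helper `parse_jsonl` and the choice of required fields are assumptions for this example.

```python
import json


def parse_jsonl(text, required=("query", "response")):
    """Parse JSON Lines text and verify that every row has the required fields."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not a complete JSON object: {exc}") from exc
        if not isinstance(row, dict):
            raise ValueError(f"line {line_number} is not a JSON object")
        missing = [field for field in required if field not in row]
        if missing:
            raise ValueError(f"line {line_number} is missing fields: {missing}")
        rows.append(row)
    return rows


record = {
    "query": "What is the capital of France?",
    "context": "France is in Europe",
    "response": "Paris is the capital of France.",
    "ground_truth": "Paris",
}

# json.dumps without indent produces the single-line form that JSON Lines needs.
one_line = json.dumps(record)
parsed = parse_jsonl(one_line + "\n")
```

Passing the pretty-printed form of the same record to `parse_jsonl` raises a ValueError on its first line, which makes the formatting problem visible before an evaluation run.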

Evaluator parameter format

When you pass in your built-in evaluators, it's important to specify the right keyword mapping in the evaluators parameter list. The following table shows the keyword mapping that's required for the results from your built-in evaluators to show up in the UI when they're logged to your Azure AI project.

Evaluator	Keyword parameter
`GroundednessEvaluator`	"groundedness"
`GroundednessProEvaluator`	"groundedness_pro"
`RetrievalEvaluator`	"retrieval"
`RelevanceEvaluator`	"relevance"
`CoherenceEvaluator`	"coherence"
`FluencyEvaluator`	"fluency"
`SimilarityEvaluator`	"similarity"
`F1ScoreEvaluator`	"f1_score"
`RougeScoreEvaluator`	"rouge"
`GleuScoreEvaluator`	"gleu"
`BleuScoreEvaluator`	"bleu"
`MeteorScoreEvaluator`	"meteor"
`ViolenceEvaluator`	"violence"
`SexualEvaluator`	"sexual"
`SelfHarmEvaluator`	"self_harm"
`HateUnfairnessEvaluator`	"hate_unfairness"
`IndirectAttackEvaluator`	"indirect_attack"
`ProtectedMaterialEvaluator`	"protected_material"
`CodeVulnerabilityEvaluator`	"code_vulnerability"
`UngroundedAttributesEvaluator`	"ungrounded_attributes"
`QAEvaluator`	"qa"
`ContentSafetyEvaluator`	"content_safety"

Here's an example of setting the evaluators parameter:

result = evaluate(
    data="data.jsonl",
    evaluators={
        "sexual": sexual_evaluator,
        "self_harm": self_harm_evaluator,
        "hate_unfairness": hate_unfairness_evaluator,
        "violence": violence_evaluator,
    },
)

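Local evaluation on a target

If you have a list of queries that you want to run and then evaluate, evaluate() also supports a target parameter. It sends the queries to an application to collect answers, and then runs your evaluators on the resulting queries and responses.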

A target can be any callable class in your directory. In this case, we have a Python script askwiki.py with a callable class askwiki() that we can set as our target. Given a dataset of queries that we can send to the simple askwiki app, we can evaluate the groundedness of the outputs. Make sure that you specify the proper column mapping for your data in "column_mapping". You can use "default" to specify the column mapping for all evaluators.

Here's the content of "data.jsonl":

{"query":"When was the United States founded?", "response":"1776"}
{"query":"What is the capital of France?", "response":"Paris"}
{"query":"Who is the best tennis player of all time?", "response":"Roger Federer"}

from azure.ai.evaluation import evaluate
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "groundedness": groundedness_eval
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                # "query" is read from the dataset; "context" and "response" are read from the target's output
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)
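To show how the pieces relate, here is a self-contained sketch with a stand-in target and a small resolver for the `${data.<column>}` and `${outputs.<column>}` references. Everything in it is hypothetical: `fake_askwiki` only mimics the shape of a target that returns a response and a context, and `resolve_mapping` illustrates the idea of the mapping, not how the SDK resolves it internally.

```python
import re


def fake_askwiki(query: str) -> dict:
    """Stand-in target: returns the two output columns that the mapping refers to."""
    return {
        "response": f"(stub) answer to: {query}",
        "context": f"(stub) reference text retrieved for: {query}",
    }


# Matches references such as ${data.query} or ${outputs.response}.
_REFERENCE = re.compile(r"^\$\{(data|outputs)\.(\w+)\}$")


def resolve_mapping(column_mapping, data_row, target_outputs):
    """Turn a column mapping into the keyword arguments an evaluator would receive."""
    sources = {"data": data_row, "outputs": target_outputs}
    resolved = {}
    for parameter, reference in column_mapping.items():
        match = _REFERENCE.match(reference)
        if match is None:
            raise ValueError(f"unsupported reference: {reference}")
        source, column = match.groups()
        resolved[parameter] = sources[source][column]
    return resolved


column_mapping = {
    "query": "${data.query}",
    "context": "${outputs.context}",
    "response": "${outputs.response}",
}

data_row = {"query": "What is the capital of France?", "response": "Paris"}
outputs = fake_askwiki(data_row["query"])
evaluator_inputs = resolve_mapping(column_mapping, data_row, outputs)
```

In this sketch the response passed to the evaluator comes from the target's output rather than from the response column already present in the dataset, because the mapping points at `${outputs.response}`.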