Evaluation sets (MLflow 2)

2025-06-11

Important

This page describes how to use Agent Evaluation version 0.22 with MLflow 2. Databricks recommends using MLflow 3, which is integrated with Agent Evaluation >1.0. In MLflow 3, the Agent Evaluation APIs are part of the mlflow package.

For more information on this topic, see "Building MLflow evaluation datasets".

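To measure the quality of an AI agent, you need to be able to define a representative set of requests along with criteria that characterize high-quality responses. You do this by providing an evaluation set. This article covers the options for an evaluation set and best practices for creating one.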

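Databricks recommends creating a human-labeled evaluation set that consists of representative questions and ground-truth answers. If your application includes a retrieval step, you can optionally provide the supporting documents on which the response is expected to be based. To help you get started, Databricks provides an SDK that generates high-quality synthetic questions and ground-truth answers, which can be used directly in Agent Evaluation or sent to subject-matter experts for review. See "Synthesize evaluation sets".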

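A good evaluation set has the following characteristics:

Representative: It should accurately reflect the range of requests the application will encounter in production.
Challenging: It should include difficult and diverse cases so that it effectively tests the full range of the application's capabilities.
Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.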

For the required schema of an evaluation set, see "Agent Evaluation input schema (MLflow 2)".
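Before sending an evaluation set for evaluation, it can help to run a quick local check for typos in field names. The following is a minimal sketch, not part of Agent Evaluation: the helper name `check_eval_set` is hypothetical, the field list is taken only from the samples on this page, and it assumes `request` is the one field every row needs. The schema reference remains authoritative.

```python
# Field names observed in the samples on this page (an assumption, not the
# full schema; consult the schema reference for the authoritative list).
KNOWN_FIELDS = {
    "request_id",
    "request",
    "response",
    "expected_response",
    "expected_facts",
    "expected_retrieved_context",
    "retrieved_context",
    "guidelines",
}


def check_eval_set(eval_set):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    for index, row in enumerate(eval_set):
        # Assumption: `request` is required in every row.
        if "request" not in row:
            problems.append(f"row {index}: missing 'request'")
        # Flag keys that appear in none of the samples, for example typos.
        for key in sorted(set(row) - KNOWN_FIELDS):
            problems.append(f"row {index}: unknown field '{key}'")
    return problems


candidate = [
    {"request": "What is the difference between reduceByKey and groupByKey in Spark?"},
    # Misspelled field name and no request: both are reported.
    {"expected_retrieved_content": [{"doc_uri": "doc_uri_1"}]},
]
for problem in check_eval_set(candidate):
    print(problem)
```

A check like this only catches structural slips; it says nothing about whether the questions themselves are representative or challenging.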

Sample evaluation sets

This section shows simple examples of evaluation sets.

Sample evaluation set with only `request`

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

Sample evaluation set with `request` and `expected_response`

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with `request`, `expected_response`, and `expected_retrieved_context`

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with only `request` and `response`

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]

Sample evaluation set with an arbitrarily formatted `request` and `response`

eval_set = [
    {
        "request": {"query": "Difference between", "item_a": "reduceByKey", "item_b": "groupByKey"},
        "response": {
            "differences": [
                "reduceByKey aggregates data before shuffling",
                "groupByKey shuffles all data",
                "reduceByKey is more efficient",
            ]
        }
    }
]

Sample evaluation set with `request`, `response`, and `guidelines`

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        # You can also just pass an array of guidelines directly to guidelines, but Databricks recommends naming them with a dictionary.
        "guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        }
    }
]
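The comment in the sample above notes that `guidelines` can be a bare list but that a named dictionary is recommended. As a rough sketch of how existing list-style guidelines might be converted to the named form, the helper below wraps a list under a single placeholder name. The function `normalize_guidelines` and the default name `"guidelines"` are hypothetical and chosen for illustration only.

```python
def normalize_guidelines(guidelines, default_name="guidelines"):
    """Return guidelines as a dict that maps a name to a list of guideline strings."""
    if isinstance(guidelines, dict):
        # Already named; copy so the caller's data is not mutated.
        return {name: list(items) for name, items in guidelines.items()}
    # A bare list is wrapped under one placeholder name (hypothetical default).
    return {default_name: list(guidelines)}


print(normalize_guidelines(["The response must be in English"]))
print(normalize_guidelines({"clarity": ["The response must be clear, coherent, and concise"]}))
```

In practice, a descriptive name per guideline group (such as `english` or `clarity` in the sample) is more useful than a single placeholder, because the name identifies which guideline a result refers to.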

Sample evaluation set with `request`, `response`, `guidelines`, and `expected_facts`

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "expected_facts": [
            "There's no significant difference.",
        ],
        # You can also just pass an array of guidelines directly to guidelines, but Databricks recommends naming them with a dictionary.
        "guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        }
    }
]

Sample evaluation set with `request`, `response`, and `retrieved_context`

eval_set = [
    {
        "request_id": "request-id", # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
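Because `request_id` is optional but useful for tracking, one possible approach is to fill in missing IDs deterministically before evaluation. The sketch below is an illustration under assumptions: the helper name `with_request_ids` and the `req-` prefix are hypothetical, and the ID is derived from a hash of the request so that the same request always receives the same ID across runs.

```python
import hashlib
import json


def with_request_ids(eval_set):
    """Return a copy of eval_set in which every row has a `request_id`."""
    rows = []
    for row in eval_set:
        row = dict(row)  # shallow copy; the input list is left untouched
        if "request_id" not in row:
            # Serialize with sorted keys so dict-style requests hash stably.
            payload = json.dumps(row["request"], sort_keys=True).encode("utf-8")
            row["request_id"] = "req-" + hashlib.sha256(payload).hexdigest()[:12]
        rows.append(row)
    return rows


rows = with_request_ids([
    {"request": "What is the difference between reduceByKey and groupByKey in Spark?"},
    {"request_id": "request-id", "request": "An already labeled request"},
])
for row in rows:
    print(row["request_id"])
```

Note that two rows with an identical request receive the same generated ID, so this approach fits evaluation sets in which each request appears once.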

Sample evaluation set with `request`, `response`, `retrieved_context`, and `expected_facts`

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_facts": [
            "There's no significant difference.",
        ],
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with `request`, `response`, `retrieved_context`, `expected_facts`, and `expected_retrieved_context`

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_facts": [
            "There's no significant difference.",
        ],
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
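When a row carries both `expected_retrieved_context` and `retrieved_context`, the two lists of `doc_uri` values can be compared. The sketch below is a simplified illustration of that comparison, not the metric code that Agent Evaluation runs; the helper name `doc_recall` is hypothetical. It computes the fraction of expected documents that were actually retrieved.

```python
def doc_recall(row):
    """Fraction of expected doc_uris found in retrieved_context, or None if none are expected."""
    expected = {doc["doc_uri"] for doc in row.get("expected_retrieved_context", [])}
    if not expected:
        return None  # recall is undefined without expected documents
    retrieved = {doc["doc_uri"] for doc in row.get("retrieved_context", [])}
    return len(expected & retrieved) / len(expected)


row = {
    "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    "expected_retrieved_context": [{"doc_uri": "doc_uri_2_1"}, {"doc_uri": "doc_uri_2_2"}],
    "retrieved_context": [{"doc_uri": "doc_uri_2_1"}, {"doc_uri": "doc_uri_6_extra"}],
}
print(doc_recall(row))  # 0.5: one of the two expected documents was retrieved
```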

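Best practices for developing an evaluation set

Treat each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing long contexts, multi-hop reasoning, and the ability to infer answers from indirect evidence.
Consider testing adversarial scenarios from malicious users.
There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.
Consider including examples that are very hard to answer, even for humans.
Whether you are building a general-purpose application or targeting a specific domain, your app is likely to receive a wide variety of questions, and the evaluation set should reflect that. For example, even if you are creating an application to handle specific HR questions, consider testing other domains (for example, operations) to ensure that the application does not give off-target or harmful responses.
High-quality, consistent human-generated labels are the best way to ensure that the ground truth you provide to the application accurately reflects the desired behavior. Steps to ensure high-quality human labels include the following:
- Aggregate responses (labels) from multiple human labelers for the same question.
- Ensure that the labeling instructions are clear and that the human labelers are consistent.
- Ensure that the conditions of the human-labeling process match the format of the requests submitted to the RAG application.
Human labelers are inherently noisy and inconsistent, for example because they interpret a question differently. This is an important part of the process. Human labeling can reveal interpretations of a question that you had not considered, which may provide insight into the behavior you observe in your application.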
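The advice above on aggregating labels from multiple labelers can be sketched as a simple majority vote. This is one possible approach under stated assumptions, not a prescribed method: it assumes the labels are categorical (for example, "correct" or "incorrect"), and the helper name `aggregate_labels` is hypothetical. Free-text expected responses would need a different aggregation method.

```python
from collections import Counter


def aggregate_labels(labels):
    """Return (majority_label, agreement) for the labels given to one question."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    # most_common(1) yields the label with the highest count; ties go to the
    # label that was seen first.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


label, agreement = aggregate_labels(["correct", "correct", "incorrect", "correct"])
print(label, agreement)  # correct 0.75
```

A low agreement value is a useful signal in itself: it marks questions that labelers interpret differently and that may deserve a closer look or clearer labeling instructions.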
