Evaluation sets (legacy)

Important

This page describes Agent Evaluation versions below 0.22 with MLflow 2.x. Databricks recommends using MLflow 3, which integrates with Agent Evaluation >1.0. Agent Evaluation SDK methods are now exposed through the mlflow SDK.

For information about this topic, see Build MLflow evaluation datasets.

To measure the quality of an AI agent, you need to be able to define a representative set of requests, along with the criteria that characterize a high-quality response. You do this by providing an evaluation set. This article covers the options you have for an evaluation set and some best practices for creating one.

Databricks recommends creating a human-labeled evaluation set that consists of representative questions and ground-truth answers. If your application includes a retrieval step, you can optionally provide the supporting documents on which you expect the response to be based. To help you get started building an evaluation set, Databricks provides an SDK that generates high-quality synthetic questions and ground-truth answers, which you can use directly in Agent Evaluation or send to subject-matter experts for review. See Synthesize evaluation sets.
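For context on how a set like this is consumed, the sketch below shows one plausible invocation of the legacy MLflow 2.x entry point; my_agent is a hypothetical stand-in for your application, and the exact callable contract may differ across versions.

import mlflow
import pandas as pd

def my_agent(request):
    # Hypothetical stand-in for the agent under test; replace with your application.
    return "reduceByKey aggregates data before shuffling, so it is usually more efficient."

eval_set = [{"request": "What is the difference between reduceByKey and groupByKey in Spark?"}]

# `model_type="databricks-agent"` selects Agent Evaluation's built-in LLM judges (legacy).
results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model=my_agent,
    model_type="databricks-agent",
)
print(results.metrics)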

A good evaluation set has the following characteristics:

  • Representative: It should accurately reflect the range of requests the application will encounter in production.
  • Challenging: It should include difficult and diverse cases to effectively test the full range of the application's capabilities.
  • Continuously updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.

For the required schema of an evaluation set, see Agent Evaluation input schema (legacy).

Example evaluation sets

This section includes simple examples of evaluation sets.

Example evaluation set with only request

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

Example evaluation set with request and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

Example evaluation set with request, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]

Example evaluation set with only request and response

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]
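Because this record already contains the application's response, you can score it without re-running the agent. The sketch below is a plausible invocation of the legacy MLflow 2.x entry point with no model argument, so the judges assess the stored response values; treat the exact signature as an assumption for your version.

import mlflow
import pandas as pd

# No `model` argument: Agent Evaluation scores the `response` values already in the data.
results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model_type="databricks-agent",
)
print(results.metrics)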

Example evaluation set where request and response can have arbitrary formats

eval_set = [
    {
        "request": {"query": "Difference between", "item_a": "reduceByKey", "item_b": "groupByKey"},
        "response": {
            "differences": [
                "reduceByKey aggregates data before shuffling",
                "groupByKey shuffles all data",
                "reduceByKey is more efficient",
            ]
        }
    }
]

Example evaluation set with request, response, and guidelines

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        # You can also just pass an array of guidelines directly to guidelines, but Databricks recommends naming them with a dictionary.
        "guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        }
    }
]
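As the comment in the example above notes, guidelines also accepts a plain list instead of a named dictionary. A minimal variant of the same record (the named form remains the Databricks recommendation):

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        # Unnamed form: a flat list of guidelines rather than a named dictionary.
        "guidelines": [
            "The response must be in English",
            "The response must be clear, coherent, and concise",
        ],
    }
]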

Example evaluation set with request, response, guidelines, and expected_facts

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "expected_facts": [
            "There's no significant difference.",
        ],
        # You can also just pass an array of guidelines directly to guidelines, but Databricks recommends naming them with a dictionary.
        "guidelines": {
            "english": ["The response must be in English"],
            "clarity": ["The response must be clear, coherent, and concise"],
        }
    }
]

Example evaluation set with request, response, and retrieved_context

eval_set = [
    {
        "request_id": "request-id", # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Example evaluation set with request, response, retrieved_context, and expected_facts

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_facts": [
            "There's no significant difference.",
        ],
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Example evaluation set with request, response, retrieved_context, expected_facts, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_facts": [
            "There's no significant difference.",
        ],
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Best practices for developing an evaluation set

  • Treat each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and the ability to infer answers from indirect evidence.
  • Consider testing adversarial scenarios from malicious users.
  • There is no firm guideline for the number of questions an evaluation set should contain, but a clear signal from high-quality data typically performs better than a noisy signal from weak data.
  • Consider including examples that are very challenging, even for humans to answer.
  • Whether you are building a general-purpose application or targeting a specific domain, your application is likely to encounter a wide variety of questions, and the evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still test other domains (such as operations) to make sure the application does not hallucinate or return harmful responses.
  • High-quality, consistent human-generated labels are the best way to ensure that the ground-truth values you provide to the application accurately reflect the desired behavior. Some steps for obtaining high-quality human labels are as follows:
    • Aggregate the responses (labels) of multiple human labelers for the same question (see the sketch after this list).
    • Make sure labeling instructions are clear and that labelers apply them consistently.
    • Make sure the conditions of the human-labeling process match the format of the requests submitted to the RAG application.
  • Human labelers are inherently noisy and inconsistent, for example because they interpret a question differently. This is an important part of the process. Using human labels can surface interpretations of a question that you had not considered, and it can provide insight into behavior you observe in your application.
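A minimal sketch of the aggregation step mentioned in the list above, assuming labels are collected as one row per (request_id, labeler) pair; a simple majority vote is one way to reconcile disagreements between labelers. The column names and data here are hypothetical.

import pandas as pd

# Hypothetical raw labels: one row per (request_id, labeler) pair.
labels = pd.DataFrame([
    {"request_id": "q1", "labeler": "a", "expected_response": "No significant difference."},
    {"request_id": "q1", "labeler": "b", "expected_response": "No significant difference."},
    {"request_id": "q1", "labeler": "c", "expected_response": "reduceByKey is more efficient."},
])

# Majority vote per question; ties fall to the first value that pandas `mode` returns.
ground_truth = (
    labels.groupby("request_id")["expected_response"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)
print(ground_truth)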