生成评估集

2025-06-03

本页介绍如何综合生成用于测量代理质量的高质量评估集。

手动生成评估集通常很耗时，并且很难确保它涵盖代理的所有功能。马赛克 AI 代理评估通过从文档中自动生成具有代表性的评估集来消除此障碍，使你能够快速评估具有良好测试用例覆盖率的代理。

生成评估集

若要合成使用文档检索的代理的评估，请使用 generate_evals_df Python 包的 databricks-agents 一部分的方法。有关 API 的详细信息，请参阅 Python SDK 参考。

此方法要求将文档作为 Pandas 数据帧或 Spark 数据帧提供。

输入数据帧必须具有以下列：

content：分析的文档内容作为字符串。
doc_uri：文档 URI。

可以使用三个附加参数来帮助控制生成：

num_evals：在所有文档中要生成的评估总数。该函数会尝试将生成的评估分布到所有文档上，并考虑其大小。如果 num_evals 小于文档数量，则评估集中不会包含所有文档。

有关num_evals如何用于在文档中分发评估的详细信息，请参阅num_evals的使用方法。
agent_description：代理的任务说明
question_guidelines：一组指南，可帮助指导合成问题生成。这是一个自由格式的字符串，用于提示生成。请参阅以下示例。

generate_evals_df的输出为数据框。数据帧中的列取决于使用的是 MLflow 3 还是 MLflow 2。

MLflow 3

request_id：唯一请求 ID。
inputs：聊天完成 API 中的合成输入
expectations：包含两个字段的字典：
- expected_facts：响应中预期事实的列表。此列的数据类型是列表[string]。
- expected_retrieved_context：此评估的合成上下文，包括文档内容和doc_uri。

MLflow 2

request_id：唯一请求 ID。
request：合成的请求。
expected_facts：响应中预期事实的列表。此列的数据类型是列表[string]。
expected_retrieved_context：此评估的合成上下文，包括文档内容和doc_uri。

示例：

以下示例用于 generate_evals_df 生成评估集，然后直接调用 mlflow.evaluate() 以测量 Meta Llama 3.1 在此评估集的性能。 Llama 3.1 模型从未见过你的文档，因此它可能会幻觉。即便如此，此实验仍是您自定义代理的良好基础。


%pip install databricks-agents
dbutils.library.restartPython()

import mlflow
from databricks.agents.evals import generate_evals_df
import pandas as pd
import math

# `docs` can be a Pandas DataFrame or a Spark DataFrame with two columns: 'content' and 'doc_uri'.
docs = pd.DataFrame.from_records(
    [
      {
        'content': f"""
            Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java,
            Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set
            of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas
            workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental
            computation and stream processing.
        """,
        'doc_uri': 'https://spark.apache.org/docs/3.5.2/'
      },
      {
        'content': f"""
            Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it DataFrame to be consistent with the data frame concept in Pandas and R.""",
        'doc_uri': 'https://spark.apache.org/docs/3.5.2/quick-start.html'
      }
    ]
)

agent_description = """
The Agent is a RAG chatbot that answers questions about using Spark on Databricks. The Agent has access to a corpus of Databricks documents, and its task is to answer the user's questions by retrieving the relevant docs from the corpus and synthesizing a helpful, accurate response. The corpus covers a lot of info, but the Agent is specifically designed to interact with Databricks users who have questions about Spark. So questions outside of this scope are considered irrelevant.
"""

question_guidelines = """
# User personas
- A developer who is new to the Databricks platform
- An experienced, highly technical Data Scientist or Data Engineer

# Example questions
- what API lets me parallelize operations over rows of a delta table?
- Which cluster settings will give me the best performance when using Spark?

# Additional Guidelines
- Questions should be succinct, and human-like
"""

num_evals = 10

evals = generate_evals_df(
    docs,
    # The total number of evals to generate. The method attempts to generate evals that have full coverage over the documents
    # provided. If this number is less than the number of documents, is less than the number of documents,
    # some documents will not have any evaluations generated. See "How num_evals is used" below for more details.
    num_evals=num_evals,
    # A set of guidelines that help guide the synthetic generation. These are free-form strings that will be used to prompt the generation.
    agent_description=agent_description,
    question_guidelines=question_guidelines
)

display(evals)

# Evaluate the model using the newly generated evaluation set. After the function call completes, click the UI link to see the results. You can use this as a baseline for your agent.
results = mlflow.evaluate(
  model="endpoints:/databricks-meta-llama-3-1-405b-instruct",
  data=evals,
  model_type="databricks-agent"
)

# Note: To use a different model serving endpoint, use the following snippet to define an agent_fn. Then, specify that function using the `model` argument.
# MODEL_SERVING_ENDPOINT_NAME = '...'
# def agent_fn(input):
#   client = mlflow.deployments.get_deploy_client("databricks")
#   return client.predict(endpoint=MODEL_SERVING_ENDPOINT_NAME, inputs=input)

示例输出如下所示。输出列取决于使用的是 MLflow 3 还是 MLflow 2。

MLflow 3

在以下示例输出中，这些列request_idexpectations.expected_retrieved_context未显示。

inputs.messages[0].content	expectations.expected_facts
Apache Spark 中使用的 Spark SQL 是什么？	Spark SQL 用于 Apache Spark 中的 SQL 处理。 Spark SQL 用于 Apache Spark 中的结构化数据处理。
Apache Spark 支持的一些高级工具有哪些，以及它们有哪些用途？	用于 SQL 和结构化数据处理的 Spark SQL。 Spark 上的 pandas API，用于处理 pandas 工作负载。用于机器学习的 MLlib。 GraphX 用于图形处理。结构化流式处理，用于增量计算和流处理。
Spark 中的主要抽象是什么以及如何在 Python 中表示数据集？	Spark 中的主要抽象是数据集。在 Python 中，Spark 的数据集称为数据帧。在 Python 中，数据集表示为 Dataset[Row]。
为什么 Python 中的所有数据集称为 Spark 中的数据帧？	Python 中的数据集称为 Spark 中的数据帧，以保持与数据帧概念的一致性。数据帧概念在 Pandas 和 R 中是标准的。

MLflow 2

在以下示例输出中，这些列request_idexpected_retrieved_context未显示。

请求	预期事实
Apache Spark 中使用的 Spark SQL 是什么？	Spark SQL 用于 Apache Spark 中的 SQL 处理。 Spark SQL 用于 Apache Spark 中的结构化数据处理。
Apache Spark 支持的一些高级工具有哪些，以及它们有哪些用途？	用于 SQL 和结构化数据处理的 Spark SQL。 Spark 上的 pandas API，用于处理 pandas 工作负载。用于机器学习的 MLlib。 GraphX 用于图形处理。结构化流式处理，用于增量计算和流处理。
Spark 中的主要抽象是什么以及如何在 Python 中表示数据集？	Spark 中的主要抽象是数据集。在 Python 中，Spark 的数据集称为数据帧。在 Python 中，数据集表示为 Dataset[Row]。
为什么 Python 中的所有数据集称为 Spark 中的数据帧？	Python 中的数据集称为 Spark 中的数据帧，以保持与数据帧概念的一致性。数据帧概念在 Pandas 和 R 中是标准的。

如何使用`num_evals`

num_evals 是为文档集生成的评估总数。该函数在考虑到文档大小差异时，将这些评估分布到各个文档中。也就是说，它尝试在文档集中每个页面保持大致相同的问题数量。

如果 num_evals 小于总文档数，则某些文档不会生成任何评估。函数返回的数据帧包含一个列，其中包含 source_doc_ids 用于生成计算的列。可以使用此列连接回您的原始数据帧，为跳过的文档生成评估。

为了帮助估算 num_evals 所需的覆盖范围，我们提供了 estimate_synthetic_num_evals 方法。


from databricks.agents.evals import estimate_synthetic_num_evals

num_evals = estimate_synthetic_num_evals(
  docs, # Same docs as before.
  eval_per_x_tokens = 1000 # Generate 1 eval for every x tokens to control the coverage level.
)

创建综合评估集 — 示例笔记本

有关创建综合评估集的示例代码，请参阅以下笔记本。

综合评估示例笔记本

获取笔记本

10 分钟演示以提升客服表现

以下示例笔记本演示如何提高代理的质量。它包括以下步骤：

生成综合评估数据集。
生成和评估基线代理。
比较多个配置（例如不同的提示）和基础模型之间的基线代理，以找到质量、成本和延迟的正确平衡。
将代理部署到 Web UI，以允许利益干系人测试和提供其他反馈。

使用综合数据笔记本提高代理性能

获取笔记本

有关为综合数据提供支持的模型的信息

综合数据可能使用第三方服务来评估生成 AI 应用程序，包括由Microsoft运营的 Azure OpenAI。
对于 Azure OpenAI，Databricks 已选择退出“滥用监视”，因此不会通过 Azure OpenAI 存储任何提示或响应。
对于欧盟（EU）工作区，合成数据使用欧盟中托管的模型。所有其他区域使用托管在美国的模型。
禁用 Azure AI 支持的 AI 辅助功能可防止合成数据服务调用 Azure AI 支持的模型。
发送到综合数据服务的数据不用于任何模型训练。
综合数据旨在帮助客户评估其代理应用程序，不应使用输出来训练、改进或微调 LLM。

通过

生成评估集