Create custom scorers

Overview

Custom scorers provide the ultimate flexibility to define precisely how your GenAI application's quality is measured. They let you define evaluation metrics tailored to your specific business use case, whether based on simple heuristics, advanced logic, or programmatic evaluations.

Use custom scorers for the following scenarios:

  1. Defining custom heuristic or code-based evaluation metrics
  2. Customizing how data from your app's traces is mapped to Databricks' research-backed LLM judges in the predefined LLM scorers
  3. Creating an LLM judge with custom prompt text using prompt-based LLM scorers
  4. Using your own LLM (rather than a Databricks-hosted LLM judge model) for evaluation
  5. Any other use case where you need more flexibility and control than the predefined abstractions provide

Note

For a detailed reference on the custom scorer interface, see the scorer concept page or the API documentation.

Usage overview

Custom scorers are written in Python and give you full control over evaluating any data in your app's traces. A single custom scorer can be used in the evaluate(...) harness for offline evaluation or passed to create_monitor(...) for production monitoring.

The following output types are supported; a short sketch of each return type follows this list:

  • Pass/fail strings: "yes" or "no" string values are rendered as "Pass" or "Fail" in the UI.
  • Numeric values: ordinal values such as integers or floats.
  • Boolean values: True or False.
  • Feedback objects: return a Feedback object with a score, rationale, and other metadata.
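
A minimal sketch of each return type, using the @scorer decorator described below and assuming an app whose outputs is a string (the scorer names and logic are illustrative only):

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def returns_pass_fail(outputs: str) -> str:
    # "yes"/"no" strings are rendered as Pass/Fail in the UI.
    return "yes" if outputs else "no"

@scorer
def returns_numeric(outputs: str) -> int:
    # Ordinal values such as integers or floats.
    return len(outputs)

@scorer
def returns_boolean(outputs: str) -> bool:
    return "error" not in outputs.lower()

@scorer
def returns_feedback(outputs: str) -> Feedback:
    # A Feedback object carries a value plus a rationale and other metadata.
    return Feedback(value="yes", rationale="The response is non-empty.")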

As input, custom scorers have access to:

  • The complete MLflow trace, including spans, attributes, and outputs. The trace is passed to the custom scorer as an instance of the mlflow.entities.Trace class.
  • The inputs dictionary, derived from the input dataset or MLflow's post-processing of your trace.
  • The outputs value, derived from the input dataset or the trace. If predict_fn is provided, outputs will be the return value of predict_fn.
  • The expectations dictionary, derived from the expectations field in the input dataset or from assessments associated with the trace.

Custom evaluation metrics are defined with the @scorer decorator and can be passed to mlflow.genai.evaluate() via the scorers argument, or to create_monitor(...).

The scorer function is called with named arguments according to the signature below. All named arguments are optional, so you can use any combination; for example, you can define a scorer that takes only inputs and trace as arguments and omits outputs and expectations (a minimal sketch of such a scorer follows the signature).

from mlflow.genai.scorers import scorer
from typing import Optional, Any
from mlflow.entities import Feedback, Trace

@scorer
def my_custom_scorer(
  *,  # evaluate(...) harness will always call your scorer with named arguments
  inputs: Optional[dict[str, Any]],  # The agent's raw input, parsed from the Trace or dataset, as a Python dict
  outputs: Optional[Any],  # The agent's raw output, parsed from the Trace or dataset
  expectations: Optional[dict[str, Any]],  # The expectations passed to evaluate(data=...), as a Python dict
  trace: Optional[Trace] # The app's resulting Trace containing spans and other metadata
) -> int | float | bool | str | Feedback | list[Feedback]
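
For instance, a minimal sketch of a scorer that declares only inputs and trace and omits the other arguments (the check itself is illustrative):

from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any, Optional

@scorer
def has_input_and_output(inputs: Optional[dict[str, Any]], trace: Optional[Trace]) -> bool:
    # Illustrative check: the app received a non-empty input and the first
    # recorded span in the trace produced some output.
    first_span = trace.data.spans[0]
    return bool(inputs) and first_span.outputs is not None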

Custom scorer development approach

When developing metrics, you need to iterate on them quickly without re-running your application every time you change the scorer. To do this, we recommend the following steps:

Step 1: Define your initial metric, app, and evaluation data

import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any

@mlflow.trace
def my_app(input_field_name: str):
    return {'output': input_field_name+'_output'}

@scorer
def my_metric() -> int:
    # placeholder return value
    return 1

eval_set = [{'inputs': {'input_field_name': 'test'}}]

Step 2: Generate traces from your app using evaluate()

eval_results = mlflow.genai.evaluate(
    data=eval_set,
    predict_fn=my_app,
    scorers=[my_metric]
)

Step 3: Query and store the generated traces

generated_traces = mlflow.search_traces(run_id=eval_results.run_id)

Step 4: Pass the generated traces as input to evaluate() as you iterate on your metric

The search_traces function returns a Pandas DataFrame of traces, which you can pass directly to evaluate() as the input dataset. This lets you iterate quickly on your metric without having to re-run your app.

@scorer
def my_metric(outputs: Any):
    # Implement the actual metric logic here.
    return outputs == "test_output"

# Note the lack of a predict_fn parameter
mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[my_metric],
)

Custom scorer examples

In this guide, we walk you through various approaches to building custom scorers.

Prerequisite: Create a sample application and get a local copy of its traces

In all of the approaches below, we use the following sample application and a copy of its traces (retrieved using the approach described above).

import mlflow
from openai import OpenAI
from typing import Any
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer

# Enable auto logging for OpenAI
mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    # 1. Prepare messages for the LLM
    messages_for_llm = [
        {"role": "system", "content": "You are a helpful assistant."},
        *messages,
    ]

    # 2. Call LLM to generate a response
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )
    return response.choices[0].message.content


# Create a list of messages for the LLM to generate a response
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
]


@scorer
def dummy_metric():
    # This scorer is just to help generate initial traces.
    return 1


# Generate initial traces by running the sample_app.
# The results, including traces, are logged to the MLflow experiment defined above.
initial_eval_results = mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=sample_app, scorers=[dummy_metric]
)

generated_traces = mlflow.search_traces(run_id=initial_eval_results.run_id)

After running the code above, you should have three traces in your experiment.

Generated sample traces

Example 1: Accessing data from the trace

Access the full MLflow Trace object to use a variety of details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculations.

Note

The generated_traces from the prerequisite section is used as the input data for these examples.

This scorer checks whether the LLM call's response time is within an acceptable limit (5 seconds).

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Search particular span type from the trace
    llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]

    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9 # second
    max_duration = 5.0
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)

Example 2: Wrapping a predefined LLM judge

Create a custom scorer that wraps one of MLflow's predefined LLM judges. Use it to preprocess the trace data for the judge or to post-process its Feedback.

This example demonstrates how to wrap the is_context_relevant judge, which evaluates whether a given context is relevant to a query, in order to assess whether the assistant's response is relevant to the user's query.

import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
    # The `inputs` field for `sample_app` is a dictionary like: {"messages": [{"role": ..., "content": ...}, ...]}
    # We need to extract the content of the last user message to pass to the relevance judge.

    last_user_message_content = None
    if "messages" in inputs and isinstance(inputs["messages"], list):
        for message in reversed(inputs["messages"]):
            if message.get("role") == "user" and "content" in message:
                last_user_message_content = message["content"]
                break

    if not last_user_message_content:
        raise Exception("Could not extract the last user message from inputs to evaluate relevance.")

    # Call the `is_context_relevant` judge. It will return a Feedback object.
    return is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )

# Evaluate the custom relevance scorer
custom_relevance_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[is_message_relevant]
)

Example 3: Using expectations

When calling mlflow.genai.evaluate() with a data argument that is a list of dictionaries or a Pandas DataFrame, each row can include an expectations key. The value associated with this key is passed directly to your custom scorer.

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
from typing import Any, List, Optional, Union

expectations_eval_dataset_list = [
    {
        "inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
        "expectations": {
            "expected_response": "2+2 equals 4.",
            "expected_keywords": ["4", "four", "equals"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
        "expectations": {
            "expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
            "expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
        "expectations": {
            "expected_response": "Hello there!",
            # No keywords needed for this one, but the field can be omitted or empty
        }
    }
]

Example 3.1: Exact match with expected response

This scorer checks whether the assistant's response exactly matches the expected_response provided in expectations.

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Scorer can return primitive value like bool, int, float, str, etc.
    return outputs == expectations["expected_response"]

exact_match_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[exact_match]
)

Example 3.2: Keyword presence check from expectations

This scorer checks whether all of the expected_keywords from expectations are present in the assistant's response.

@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    if not expected_keywords:
        # No keywords were expected for this row, so the check trivially passes.
        return Feedback(
            value="yes",
            rationale="No 'expected_keywords' provided in expectations."
        )

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)

Example 4: Returning multiple Feedback objects

A single scorer can return a list of Feedback objects, allowing one scorer to evaluate multiple quality aspects (for example, PII, sentiment, conciseness) at once. Ideally, each Feedback object should have a unique name, which becomes the metric name in the results; otherwise, auto-generated names that collide may overwrite each other. If a name is not provided, MLflow attempts to generate one based on the scorer function name and an index.

This example demonstrates a scorer that returns two distinct pieces of feedback for each trace:

  1. is_not_empty_check: a boolean indicating whether the response content is non-empty.
  2. response_char_length: a numeric value for the character length of the response.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
    feedbacks = []
    # 1. Check if the response is not empty
    feedbacks.append(
        Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
    )
    # 2. Calculate response character length
    char_length = len(outputs)
    feedbacks.append(Feedback(name="response_char_length", value=char_length))
    return feedbacks

multi_feedback_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[comprehensive_response_checker]
)

The results will contain two assessment columns: is_not_empty_check and response_char_length.

Multiple feedback results

Example 5: Using your own LLM as a judge

Integrate a custom or externally hosted LLM in a scorer. The scorer handles the API calls, input/output formatting, and generation of Feedback from your LLM's response, giving you full control over the judging process.

You can also set the source field of the Feedback object to indicate that the source of the assessment is an LLM judge.

import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional


# Assume `generated_traces` is available from the prerequisite code block.
# Assume `client` (OpenAI SDK client configured for Databricks) is available from the prerequisite block.
# client = OpenAI(...)

# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.

Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.

Original User Query:
```{user_query}```

AI's Response:
```{llm_response_from_app}```

Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""

@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
    user_query = inputs["messages"][-1]["content"]

    # Call the Judge LLM using the OpenAI SDK client.
    judge_llm_response_obj = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
        messages=[
            {"role": "system", "content": judge_system_prompt},
            {"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
        ],
        max_tokens=200,  # Max tokens for the judge's rationale
        temperature=0.0, # For more deterministic judging
    )
    judge_llm_output_text = judge_llm_response_obj.choices[0].message.content

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(judge_llm_output_text)
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="claude-3-7-sonnet",
        )
    )


# Evaluate the scorer using the pre-generated traces.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[answer_quality]
)

You can view the judge's assessment by opening the trace in the UI and clicking on "answer_quality". The assessment metadata includes the rationale, timestamp, judge model name, and more. If the assessment is incorrect, you can modify it by clicking the Edit button.

The new assessment supersedes the original judge assessment, but the edit history is retained for future reference.

Editing an LLM judge assessment
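
If you prefer to inspect assessments programmatically rather than in the UI, the sketch below shows one way to do it. It assumes MLflow 3's mlflow.get_trace API, a trace_id column in the search_traces result, and an assessments field on the trace info; the exact names may differ in your MLflow version.

import mlflow

# Fetch the traces evaluated in the run above and inspect the assessments
# attached to the first one. The "trace_id" column name assumes MLflow 3.
evaluated_traces = mlflow.search_traces(run_id=custom_llm_judge_eval_results.run_id)
trace = mlflow.get_trace(evaluated_traces.iloc[0]["trace_id"])

for assessment in trace.info.assessments:
    print(assessment.name, getattr(assessment, "value", None), getattr(assessment, "rationale", None))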

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.

  • Scorers - Dive deeper into how scorers work and their architecture
  • Evaluation harness - Learn how mlflow.genai.evaluate() uses your scorers
  • LLM judges - Understand the foundations of AI-powered evaluation