Overview
Custom scorers offer the ultimate flexibility to define exactly how your GenAI application's quality is measured. They let you define evaluation metrics tailored to your specific business use case, whether based on simple heuristics, advanced logic, or programmatic evaluations.
Use custom scorers for the following scenarios:
- Define custom heuristic or code-based evaluation metrics
- Customize how your application's trace data is mapped to the research-backed LLM judges from Databricks and the predefined LLM scorers
- Create an LLM judge with custom prompt text, as described in the prompt-based LLM scorers guide
- Use your own LLM model (rather than a Databricks-hosted LLM judge model) for evaluation
- Any other use case where you need more flexibility and control than the predefined abstractions provide
Note
For a detailed reference on the custom scorer interface, see the scorer concepts page or the API documentation.
Usage overview
Custom scorers are written in Python and give you full control over evaluating any data in your app's traces. A single custom scorer can be used in the evaluate(...) harness for offline evaluation or passed to create_monitor(...) for production monitoring.
The following output types are supported; each is illustrated in the short sketch after this list:
- Pass/fail strings: "yes" or "no" string values, which are rendered as "Pass" or "Fail" in the UI.
- Numeric values: ordinal values such as integers or floats.
- Boolean values: True or False.
- Feedback objects: return a Feedback object with a value, rationale, and other metadata.
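A minimal sketch showing one scorer per supported return type. The function names and metrics here are illustrative only, not part of the MLflow API:
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def returns_pass_fail(outputs) -> str:
    # "yes"/"no" strings render as Pass/Fail in the UI
    return "yes" if outputs else "no"

@scorer
def returns_numeric(outputs) -> int:
    # Any int or float is reported as a numeric metric
    return len(str(outputs))

@scorer
def returns_boolean(outputs) -> bool:
    return outputs is not None

@scorer
def returns_feedback(outputs) -> Feedback:
    # Feedback carries a value plus a rationale and optional metadata
    return Feedback(value="yes", rationale="The output is non-empty.")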
As input, custom scorers have access to:
- The complete MLflow trace, including spans, attributes, and outputs. The trace is passed to the custom scorer as an instantiated mlflow.entities.Trace object.
- The inputs dictionary, derived from the input dataset or from MLflow post-processing of the trace.
- The outputs value, derived from the input dataset or the trace. If predict_fn is provided, outputs is the return value of predict_fn.
- The expectations dictionary, derived from the expectations field in the input dataset or from assessments associated with the trace.
You define custom metrics with the @scorer decorator and pass them to mlflow.genai.evaluate() via the scorers argument, or to create_monitor(...).
The scorer function is invoked with named arguments according to the signature below. All named arguments are optional, so you can use any combination. For example, you can define a scorer that takes only inputs and trace as arguments and omits outputs and expectations:
import mlflow
from mlflow.genai.scorers import scorer
from typing import Optional, Any
from mlflow.entities import Feedback
@scorer
def my_custom_scorer(
*, # evaluate(...) harness will always call your scorer with named arguments
inputs: Optional[dict[str, Any]], # The agent's raw input, parsed from the Trace or dataset, as a Python dict
    outputs: Optional[Any],              # The agent's raw output, parsed from the Trace or the return value of predict_fn
expectations: Optional[dict[str, Any]], # The expectations passed to evaluate(data=...), as a Python dict
trace: Optional[mlflow.entities.Trace] # The app's resulting Trace containing spans and other metadata
) -> int | float | bool | str | Feedback | list[Feedback]:
    # Implement your scoring logic here and return one of the supported types
    ...
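As noted above, a scorer can declare only the arguments it needs. A minimal runnable sketch of a scorer that uses just inputs and trace (the metric and its name are illustrative):
from mlflow.entities import Trace, Feedback
from mlflow.genai.scorers import scorer
from typing import Any

@scorer
def trace_has_spans(inputs: dict[str, Any], trace: Trace) -> Feedback:
    # Uses only `inputs` and `trace`; `outputs` and `expectations` are omitted
    span_count = len(trace.data.spans)
    return Feedback(
        value=span_count > 0,
        rationale=f"Trace for inputs {inputs} contains {span_count} spans.",
    )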
Custom scorer development approach
When developing metrics, you need to iterate on them quickly without re-running your application every time you change the scorer. To do this, we recommend the following steps:
Step 1: Define your initial metric, app, and evaluation data
import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any
@mlflow.trace
def my_app(input_field_name: str):
return {'output': input_field_name+'_output'}
@scorer
def my_metric() -> int:
# placeholder return value
return 1
eval_set = [{'inputs': {'input_field_name': 'test'}}]
Step 2: Generate traces from your app using evaluate()
eval_results = mlflow.genai.evaluate(
data=eval_set,
predict_fn=my_app,
    scorers=[my_metric]
)
Step 3: Query and store the generated traces
generated_traces = mlflow.search_traces(run_id=eval_results.run_id)
Step 4: As you iterate on your metric, pass the generated traces as input to evaluate()
The search_traces function returns a Pandas DataFrame of traces that you can pass directly to evaluate() as the input dataset. This lets you quickly iterate on your metric without having to re-run your app.
@scorer
def my_metric(outputs: Any):
# Implement the actual metric logic here.
return outputs == "test_output"
# Note the lack of a predict_fn parameter
mlflow.genai.evaluate(
data=generated_traces,
scorers=[my_metric],
)
Custom scorer examples
This guide walks you through several approaches to building custom scorers.
Prerequisite: Create a sample application and get a local copy of the traces
Across all approaches, we use the following sample application and a copy of its traces (extracted using the approach described above).
import mlflow
from openai import OpenAI
from typing import Any
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
# Enable auto logging for OpenAI
mlflow.openai.autolog()
# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
api_key=mlflow_creds.token,
base_url=f"{mlflow_creds.host}/serving-endpoints"
)
@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
# 1. Prepare messages for the LLM
messages_for_llm = [
{"role": "system", "content": "You are a helpful assistant."},
*messages,
]
# 2. Call LLM to generate a response
response = client.chat.completions.create(
model="databricks-claude-3-7-sonnet", # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
messages=messages_for_llm,
)
return response.choices[0].message.content
# Create a list of messages for the LLM to generate a response
eval_dataset = [
{
"inputs": {
"messages": [
{"role": "user", "content": "How much does a microwave cost?"},
]
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "Can I return the microwave I bought 2 months ago?",
},
]
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "I'm having trouble with my account. I can't log in.",
},
{
"role": "assistant",
"content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
},
{"role": "user", "content": "Website"},
]
},
},
]
@scorer
def dummy_metric():
# This scorer is just to help generate initial traces.
return 1
# Generate initial traces by running the sample_app.
# The results, including traces, are logged to the MLflow experiment defined above.
initial_eval_results = mlflow.genai.evaluate(
data=eval_dataset, predict_fn=sample_app, scorers=[dummy_metric]
)
generated_traces = mlflow.search_traces(run_id=initial_eval_results.run_id)
After running the code above, you should have three traces in your experiment.
Example 1: Access data from the trace
Access the full MLflow Trace object to use various details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculations.
Note
The generated_traces from the prerequisite section are used as the input data for these examples.
This scorer checks whether the LLM call in the trace completed within an acceptable time range.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType
@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
# Search particular span type from the trace
llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]
    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9  # seconds
max_duration = 5.0
if response_time <= max_duration:
return Feedback(
value="yes",
rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
)
else:
return Feedback(
value="no",
rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
)
# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[llm_response_time_good]
)
Example 2: Wrap a predefined LLM judge
Create a custom scorer that wraps one of MLflow's predefined LLM judges. Use it to preprocess the trace data for the judge or to post-process its feedback (see the sketch after the example below).
This example shows how to wrap the is_context_relevant judge, which evaluates whether a given context is relevant to a query, to assess whether the assistant's response is relevant to the user's query.
import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any
# Assume `generated_traces` is available from the prerequisite code block.
@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
# The `inputs` field for `sample_app` is a dictionary like: {"messages": [{"role": ..., "content": ...}, ...]}
# We need to extract the content of the last user message to pass to the relevance judge.
last_user_message_content = None
if "messages" in inputs and isinstance(inputs["messages"], list):
for message in reversed(inputs["messages"]):
if message.get("role") == "user" and "content" in message:
last_user_message_content = message["content"]
break
if not last_user_message_content:
raise Exception("Could not extract the last user message from inputs to evaluate relevance.")
    # Call the `is_context_relevant` judge. It will return a Feedback object.
return is_context_relevant(
request=last_user_message_content,
context={"response": outputs},
)
# Evaluate the custom relevance scorer
custom_relevance_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[is_message_relevant]
)
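The example above preprocesses the inputs before calling the judge. A wrapper can also post-process the judge's feedback. Here is a minimal sketch that normalizes the judge's value into a boolean while keeping its rationale, assuming the judge returns a Feedback whose value is the string "yes" or "no" (the scorer name is illustrative):
from mlflow.entities import Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

@scorer
def response_relevance_as_bool(inputs: dict[str, Any], outputs: str) -> Feedback:
    # Simplification of the extraction above: take the last message's content directly
    last_user_message_content = inputs["messages"][-1]["content"]
    judge_feedback = is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )
    # Post-process: convert the judge's "yes"/"no" value to a boolean, keep its rationale
    return Feedback(
        value=judge_feedback.value == "yes",
        rationale=judge_feedback.rationale,
    )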
Example 3: Use expectations
When mlflow.genai.evaluate() is called with a data argument that is a list of dictionaries or a Pandas DataFrame, each row can contain an expectations key. The value associated with this key is passed directly to your custom scorer.
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
from typing import Any, List, Optional, Union
expectations_eval_dataset_list = [
{
"inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
"expectations": {
"expected_response": "2+2 equals 4.",
"expected_keywords": ["4", "four", "equals"],
}
},
{
"inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
"expectations": {
"expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
"expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
}
},
{
"inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
"expectations": {
"expected_response": "Hello there!",
# No keywords needed for this one, but the field can be omitted or empty
}
}
]
Example 3.1: Exact match with the expected response
This scorer checks whether the assistant's response exactly matches the expected_response provided in expectations.
@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
# Scorer can return primitive value like bool, int, float, str, etc.
return outputs == expectations["expected_response"]
exact_match_eval_results = mlflow.genai.evaluate(
data=expectations_eval_dataset_list,
predict_fn=sample_app, # sample_app is from the prerequisite section
scorers=[exact_match]
)
Example 3.2: Keyword presence check from expectations
This scorer checks whether all of the expected_keywords from expectations are present in the assistant's response.
@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
expected_keywords = expectations.get("expected_keywords")
if expected_keywords is None:
return Feedback(
            value=None,  # Undetermined, as no keywords were expected
rationale="No 'expected_keywords' provided in expectations."
)
missing_keywords = []
for keyword in expected_keywords:
if keyword.lower() not in outputs.lower():
missing_keywords.append(keyword)
if not missing_keywords:
return Feedback(value="yes", rationale="All expected keywords are present in the response.")
else:
return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")
keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
predict_fn=sample_app, # sample_app is from the prerequisite section
scorers=[keyword_presence_scorer]
)
Example 4: Return multiple Feedback objects
A single scorer can return a list of Feedback objects, allowing one scorer to assess multiple quality aspects at once (for example, PII, sentiment, conciseness). Ideally, each Feedback object should have a unique name, which becomes the metric name in the results; otherwise, auto-generated names that collide may overwrite each other. If a name is not provided, MLflow attempts to generate one based on the scorer function name and an index.
This example demonstrates a scorer that returns two distinct pieces of feedback for each trace:
- is_not_empty_check: a pass/fail check ("yes"/"no") indicating whether the response content is non-empty.
- response_char_length: a numeric value for the character length of the response.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional
# Assume `generated_traces` is available from the prerequisite code block.
@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
feedbacks = []
# 1. Check if the response is not empty
feedbacks.append(
Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
)
# 2. Calculate response character length
char_length = len(outputs)
feedbacks.append(Feedback(name="response_char_length", value=char_length))
return feedbacks
multi_feedback_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[comprehensive_response_checker]
)
The results will include two assessment columns: is_not_empty_check and response_char_length.
Example 5: Use your own LLM as the judge
Integrate a custom or externally hosted LLM within a scorer. The scorer handles the API calls and input/output formatting, and generates Feedback from your LLM's response, giving you full control over the judging process.
You can also set the source field on the Feedback object to indicate that the source of the assessment is an LLM judge.
import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional
# Assume `generated_traces` is available from the prerequisite code block.
# Assume `client` (OpenAI SDK client configured for Databricks) is available from the prerequisite block.
# client = OpenAI(...)
# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.
Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.
Original User Query:
```{user_query}```
AI's Response:
```{llm_response_from_app}```
Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""
@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
user_query = inputs["messages"][-1]["content"]
# Call the Judge LLM using the OpenAI SDK client.
judge_llm_response_obj = client.chat.completions.create(
model="databricks-claude-3-7-sonnet", # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
messages=[
{"role": "system", "content": judge_system_prompt},
{"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
],
max_tokens=200, # Max tokens for the judge's rationale
temperature=0.0, # For more deterministic judging
)
judge_llm_output_text = judge_llm_response_obj.choices[0].message.content
# Parse the Judge LLM's JSON output.
judge_eval_json = json.loads(judge_llm_output_text)
parsed_score = int(judge_eval_json["score"])
parsed_rationale = judge_eval_json["rationale"]
return Feedback(
value=parsed_score,
rationale=parsed_rationale,
# Set the source of the assessment to indicate the LLM judge used to generate the feedback
source=AssessmentSource(
source_type=AssessmentSourceType.LLM_JUDGE,
source_id="claude-3-7-sonnet",
)
)
# Evaluate the scorer using the pre-generated traces.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[answer_quality]
)
You can review the assessment by opening the trace in the UI and clicking on the "answer_quality" assessment to see the judge's metadata, such as the rationale, timestamp, and judge model name. If the assessment is incorrect, you can click the Edit button to override it.
The new assessment supersedes the original judge assessment, but the edit history is preserved for future reference.
Next steps
Continue your journey with these recommended actions and tutorials.
- Evaluate with custom LLM scorers - Create semantic evaluations using an LLM as a judge
- Run scorers in production - Deploy your scorers for continuous monitoring
- Build evaluation datasets - Create test data for your scorers
Reference guides
Explore detailed documentation for the concepts and features mentioned in this guide.