How to create guidelines-based LLM scorers

Overview

scorers.Guidelines() and scorers.ExpectationGuidelines() are scorers that wrap the judges.meets_guidelines() LLM judge provided by the Databricks SDK. They are designed to make it quick and easy to customize evaluation by defining natural-language criteria that specify pass/fail conditions, and they are well suited to checking compliance with rules, style guides, or the inclusion or exclusion of information.

Guidelines have the unique advantage of being easy to explain to business stakeholders ("we are evaluating whether the app follows this set of rules"), so they can often be written directly by domain experts.

You can use guidelines with the LLM judge in two ways:

  1. If your guidelines only consider your app's inputs and outputs, and your app's trace has simple inputs (for example, only the user query) and outputs (for example, only the app's response), use the prebuilt guidelines scorer.
  2. If your guidelines consider additional data (for example, retrieved documents or tool calls), or your trace has complex inputs/outputs that contain fields you want to exclude from evaluation (for example, user IDs), create a custom scorer that wraps the judges.meets_guidelines() API, as sketched below.
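
A minimal sketch of the two approaches follows. The guideline wording, the grounded_in_docs scorer, and the retrieved_docs output key are illustrative assumptions rather than part of the example app used later on this page; the real guidelines and apps are defined in the sections below.

import json
from mlflow.genai.scorers import Guidelines, scorer
from mlflow.genai.judges import meets_guidelines

# 1. Prebuilt scorer: the guideline only refers to "the request" and "the response",
#    which are parsed automatically from the app's inputs and outputs.
tone_scorer = Guidelines(
    name="tone",
    guidelines="The response must maintain a courteous, respectful tone throughout.",
)

# 2. Custom scorer: choose exactly which fields the judge sees (for example, retrieved
#    documents) and leave out fields such as user IDs.
@scorer
def grounded_in_docs(inputs, outputs):
    return meets_guidelines(
        name="grounded_in_docs",
        guidelines="The response must only use facts found in the retrieved_docs.",
        context={
            "request": json.dumps(inputs["user_messages"]),
            "response": outputs["message"],
            "retrieved_docs": json.dumps(outputs["retrieved_docs"]),
        },
    )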

Note

For more details on how the prebuilt guidelines scorer parses traces, see the guidelines prebuilt scorer concept page.

1. Use the prebuilt guidelines scorer

In this guide, you add custom evaluation criteria to the prebuilt scorer and use the resulting scorers to run an offline evaluation. The same scorers can then be scheduled to run in production to continuously monitor your application's quality.

Step 1: Create a sample app to evaluate

First, create a sample GenAI app that answers customer support questions. The app has a (fake) knob that controls the system prompt, so we can easily compare the guidelines judge's output for "good" and "bad" responses.

import os
import mlflow
from openai import OpenAI
from typing import List, Dict, Any

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# This is a global variable that will be used to toggle the behavior of the customer support agent to see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):

    # 1. Prepare messages for the LLM
    system_prompt_postfix = (
        "Be super rude and very verbose in your responses."
        if BE_RUDE_AND_VERBOSE
        else ""
    )
    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *messages,
    ]

    # 2. Call LLM to generate a response
    return client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )

result = customer_support_agent(
    messages=[
        {"role": "user", "content": "How much does a microwave cost?"},
    ]
)
print(result)

Step 2: Define your evaluation criteria

Typically, you will work with business stakeholders to define the guidelines. Here, we define a few example guidelines. When writing guidelines, refer to the app's inputs as "the request" and the app's outputs as "the response". See the prebuilt guidelines scorer section to understand how the inputs and outputs are parsed, so you know exactly which data is passed to the LLM judge.

tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structures responses logically.  It must avoids jargon or explains technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request.  Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."

Note

Guidelines can be as long or as short as you like. Conceptually, think of a guideline as a "mini prompt" that defines the pass condition. A guideline can optionally include markdown formatting, such as a bulleted list; see the example below.
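
For example, here is a purely illustrative guideline written as a small markdown "mini prompt". The refund_policy name and wording are assumptions and are not used by the example app on this page.

refund_policy = """The response must follow all of these rules:
- It must not promise a refund outside of the stated return window.
- It must direct the customer to the returns page for next steps.
- It must not quote internal policy documents verbatim."""

Such a guideline is passed to the scorer just like the single-sentence ones, for example Guidelines(name="refund_policy", guidelines=refund_policy).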

Step 3: Create a sample evaluation dataset

mlflow.genai.evaluate(...) passes each inputs dictionary to our app as keyword arguments (for example, messages=...).

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ]
        },
    },
]
print(eval_dataset)

Step 4: Evaluate your app using the scorers

Finally, we run the evaluation twice so you can compare the guidelines scorers' judgments for the rude/verbose version of the app (first screenshot) and the polite/concise version (second screenshot).

from mlflow.genai.scorers import Guidelines
import mlflow

# First, let's evaluate the app's responses against the guidelines when it is not rude and verbose
BE_RUDE_AND_VERBOSE = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)


# Next, let's evaluate the app's responses against the guidelines when it IS rude and verbose
BE_RUDE_AND_VERBOSE = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)

Evaluation of the rude and verbose app

Evaluation of the polite and concise app

2. Create a custom scorer that wraps the guidelines judge

In this guide, you create a custom scorer that wraps the judges.meets_guidelines() API and use the resulting scorers to run an offline evaluation. The same scorers can then be scheduled to run in production to continuously monitor your application's quality.

Step 1: Create a sample app to evaluate

First, create a sample GenAI app that answers customer support questions. The app has a couple of (fake) knobs that control the system prompt, so we can easily compare the guidelines judge's output for "good" and "bad" responses.

import os
import mlflow
from openai import OpenAI
from typing import List, Dict

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# This is a global variable that will be used to toggle whether the customer support agent follows the (fake) user-specific policies, to see how the guidelines scorers handle policy violations
FOLLOW_POLICIES = False

# This is a global variable that will be used to toggle the behavior of the customer support agent to see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(user_messages: List[Dict[str, str]], user_id: int):

    # 1. Fake policies to follow.
    @mlflow.trace
    def get_policies_for_user(user_id: int):
        if user_id == 1:
            return [
                "All returns must be processed within 30 days of purchase, with a valid receipt.",
            ]
        else:
            return [
                "All returns must be processed within 90 days of purchase, with a valid receipt.",
            ]

    policies_to_follow = get_policies_for_user(user_id)

    # 2. Prepare messages for the LLM
    # We will use this toggle later to see how the scorers handle rude and verbose responses
    system_prompt_postfix = (
        f"Follow the following policies: {policies_to_follow}.  Do not refer to the specific policies in your response.\n"
        if FOLLOW_POLICIES
        else ""
    )

    system_prompt_postfix = (
        f"{system_prompt_postfix}Be super rude and very verbose in your responses.\n"
        if BE_RUDE_AND_VERBOSE
        else system_prompt_postfix
    )
    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *user_messages,
    ]

    # 3. Call LLM to generate a response
    output = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )

    return {
        "message": output.choices[0].message.content,
        "policies_followed": policies_to_follow,
    }

result = customer_support_agent(
    user_messages=[
        {"role": "user", "content": "How much does a microwave cost?"},
    ],
    user_id=1
)
print(result)

Step 2: Define your evaluation criteria and wrap them as custom scorers

Typically, you will work with business stakeholders to define the guidelines. Here, we define a few example guidelines and use custom scorers to connect them to the app's input/output schema.

from mlflow.genai.scorers import scorer
from mlflow.genai.judges import meets_guidelines
import json
from typing import Dict, Any


tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structures responses logically.  It must avoids jargon or explains technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request.  Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."
# Note in this guideline how we refer to `provided_policies` - we will make the meets_guidelines LLM judge aware of this data.
follows_policies_guideline = "If the provided_policies is relevant to the request and response, the response must adhere to the provided_policies."

# Define a custom scorer that wraps the guidelines LLM judge to check if the response follows the policies
@scorer
def follows_policies(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # we directly return the Feedback object from the guidelines LLM judge, but we could have post-processed it before returning it.
    return meets_guidelines(
        name="follows_policies",
        guidelines=follows_policies_guideline,
        context={
            # Here we make meets_guidelines aware of the policies that were provided to the app
            "provided_policies": outputs["policies_followed"],
            "response": outputs["message"],
            "request": json.dumps(inputs["user_messages"]),
        },
    )


# Define a custom scorer that wraps the guidelines LLM judge to pass the custom keys from the inputs/outputs to the guidelines LLM judge
@scorer
def check_guidelines(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    feedbacks = []

    request = json.dumps(inputs["user_messages"])
    response = outputs["message"]

    feedbacks.append(
        meets_guidelines(
            name="tone",
            guidelines=tone,
            # Note: While we used request and response as keys, we could have used any key as long as our guideline referred to that key by name (e.g., if we had used output instead of response, we would have changed our guideline to be "The output must be polite")
            context={"response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="structure",
            guidelines=structure,
            context={"response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="banned_topics",
            guidelines=banned_topics,
            context={"request": request, "response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="relevance",
            guidelines=relevance,
            context={"request": request, "response": response},
        )
    )

    # A scorer can return a list of Feedback objects OR a single Feedback object.
    return feedbacks
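
The comment at the top of follows_policies notes that the Feedback returned by meets_guidelines could be post-processed before being returned. Here is a minimal sketch of what that might look like, assuming the mlflow.entities.Feedback object returned by the judge exposes a plain, mutable rationale string:

@scorer
def follows_policies_annotated(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    feedback = meets_guidelines(
        name="follows_policies_annotated",
        guidelines=follows_policies_guideline,
        context={
            "provided_policies": outputs["policies_followed"],
            "response": outputs["message"],
            "request": json.dumps(inputs["user_messages"]),
        },
    )
    # Illustrative post-processing (assumption: Feedback.rationale is a string
    # attribute): append the policies that were checked to the judge's rationale.
    feedback.rationale = f"{feedback.rationale}\nPolicies checked: {outputs['policies_followed']}"
    return feedback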

Note

Guidelines can be as long or as short as you like. Conceptually, think of a guideline as a "mini prompt" that defines the pass condition. A guideline can optionally include markdown formatting, such as a bulleted list.

Step 3: Create a sample evaluation dataset

mlflow.genai.evaluate(...) passes each inputs dictionary to our app as keyword arguments.

eval_dataset = [
    {
        "inputs": {
            # Note that these keys match the **kwargs of our application.
            "user_messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 1,  # the bot should say no if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 2,  # the bot should say yes if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
            "user_id": 1,
        },
    },
]

print(eval_dataset)

Step 4: Evaluate your app using the guidelines

Finally, we run the evaluation twice so you can compare the guidelines scorers' judgments for the rude/verbose version of the app that ignores the policies (first screenshot) and the polite/concise version that follows them (second screenshot).

import mlflow

# Now, let's evaluate the app's responses against the guidelines when it is NOT rude and verbose and DOES follow policies
BE_RUDE_AND_VERBOSE = False
FOLLOW_POLICIES = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)


# Now, let's evaluate the app's responses against the guidelines when it IS rude and verbose and does NOT follow policies
BE_RUDE_AND_VERBOSE = True
FOLLOW_POLICIES = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)

Evaluation of the rude and verbose app

Evaluation of the polite and concise app

Next steps