Prompt-based LLM scorers

Overview

judges.custom_prompt_judge() is designed to help you quickly and easily use LLM scorers when you need full control over the judge's prompt or need to return multiple output values beyond "pass"/"fail" (for example, "great", "ok", "bad").

You provide a prompt template that contains placeholders for specific fields from your app's trace and define the output choices the judge can select. The Databricks-hosted LLM judge model uses these inputs to select the best output choice and provides a rationale for its selection.
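
For example, a minimal judge might look like the following sketch. The tone judge, its choices, and the {{response}} placeholder are illustrative only; output choices are declared as [[choice]] markers in the prompt template, and {{placeholder}} fields are filled in when the judge is called. The full working example appears in the guide below.

from mlflow.genai.judges import custom_prompt_judge

# Illustrative sketch only - the judge name, choices, and placeholder are invented for this example
tone_judge = custom_prompt_judge(
    name="tone",
    prompt_template=(
        "Rate the tone of the agent's response.\n\n"
        "[[great]]: Friendly, professional, and helpful.\n"
        "[[ok]]: Neutral and adequate.\n"
        "[[bad]]: Rude, dismissive, or unhelpful.\n\n"
        "Response to evaluate: {{response}}"
    ),
)

feedback = tone_judge(response="Thanks for reaching out! Happy to help with that.")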

Note

We recommend starting with guidelines-based judges and using prompt-based judges only if you need more control or cannot write your evaluation criteria as pass/fail guidelines. Guidelines-based judges have the distinct advantage of being easy to explain to business stakeholders and can often be written directly by domain experts.
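
As a point of comparison, a guidelines-based scorer is just a plain-language pass/fail criterion. A minimal sketch, assuming the built-in Guidelines scorer from mlflow.genai.scorers:

from mlflow.genai.scorers import Guidelines

# Plain-language pass/fail criterion that a domain expert could author directly
polite_and_helpful = Guidelines(
    name="polite_and_helpful",
    guidelines="The response must be polite and must attempt to resolve the user's issue.",
)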

How to create a prompt-based judge scorer

Follow this guide to create scorers that wrap judges.custom_prompt_judge().

In this guide, you will create custom scorers that wrap the judges.custom_prompt_judge() API and run an offline evaluation with the resulting scorers. These same scorers can be scheduled to run in production to continuously monitor your application's quality.

Note

For more details on the interface and parameters, see the judges.custom_prompt_judge() concept page.

Step 1: Create a sample app to evaluate

First, create a sample GenAI app that responds to customer support questions. The app has a (fake) knob that controls the system prompt, so we can easily compare the judge's outputs for "good" and "bad" conversations.

import os
import mlflow
from openai import OpenAI
from mlflow.entities import Document
from typing import List, Dict, Any, cast

# Enable auto logging for OpenAI
mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)


# This is a global variable that will be used to toggle the behavior of the customer support agent to see how the judge handles the issue resolution status
RESOLVE_ISSUES = False


@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):

    # 2. Prepare messages for the LLM
    # We will use this toggle later to see how the judge handles the issue resolution status
    system_prompt_postfix = (
        f"Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\\n"
        if not RESOLVE_ISSUES
        else ""
    )

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *messages,
    ]

    # 3. Call LLM to generate a response
    output = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }
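
Optionally, you can sanity-check the app with a direct call before wiring up the judge (the question below is just an example input):

# Optional: quick smoke test of the sample app (example input only)
sample_response = customer_support_agent(
    [{"role": "user", "content": "How much does a microwave cost?"}]
)
print(sample_response["messages"][0]["content"])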

Step 2: Define the evaluation criteria and wrap them in a custom scorer

Here, we define a sample judge prompt and use a custom scorer to connect it to our app's input/output schema.

from mlflow.genai.scorers import scorer


# Judge prompt defining a 3-category issue resolution status
issue_resolution_prompt = """
Evaluate the entire conversation between a customer and an LLM-based agent.  Determine if the issue was resolved in the conversation.

You must choose one of the following categories.

[[fully_resolved]]: The response directly and comprehensively addresses the user's question or problem, providing a clear solution or answer. No further immediate action seems required from the user on the same core issue.
[[partially_resolved]]: The response offers some help or relevant information but doesn't completely solve the problem or answer the question. It might provide initial steps, require more information from the user, or address only a part of a multi-faceted query.
[[needs_follow_up]]: The response does not adequately address the user's query, misunderstands the core issue, provides unhelpful or incorrect information, or inappropriately deflects the question. The user will likely need to re-engage or seek further assistance.

Conversation to evaluate: {{conversation}}
"""

from mlflow.genai.judges import custom_prompt_judge
import json
from mlflow.entities import Feedback


# Define a custom scorer that wraps the prompt-based LLM judge to classify issue resolution
@scorer
def is_issue_resolved(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # We directly return the Feedback object from the prompt-based LLM judge, but we could post-process it before returning it.
    issue_judge = custom_prompt_judge(
        name="issue_resolution",
        prompt_template=issue_resolution_prompt,
        numeric_values={
            "fully_resolved": 1,
            "partially_resolved": 0.5,
            "needs_follow_up": 0,
        },
    )

    # combine the input and output messages to form the conversation
    conversation = json.dumps(inputs["messages"] + outputs["messages"])

    return issue_judge(conversation=conversation)
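
If you want to see what the judge produces before running a full evaluation, you can also call it directly on a hand-written conversation. The sketch below assumes the judge returns an mlflow.entities.Feedback object with value and rationale fields; the conversation itself is made up for illustration.

# Optional: invoke the judge directly (outside the scorer) to inspect its output.
# The conversation below is invented for illustration.
standalone_judge = custom_prompt_judge(
    name="issue_resolution",
    prompt_template=issue_resolution_prompt,
    numeric_values={
        "fully_resolved": 1,
        "partially_resolved": 0.5,
        "needs_follow_up": 0,
    },
)

example_conversation = json.dumps([
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security > Reset password and follow the emailed link."},
])

feedback = standalone_judge(conversation=example_conversation)
print(feedback.value)      # numeric score mapped from numeric_values, e.g., 1, 0.5, or 0
print(feedback.rationale)  # the judge's explanation for its choice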

Step 3: Create a sample evaluation dataset

mlflow.genai.evaluate(...) passes each inputs entry to our app.

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
    },
]

Step 4: Evaluate the app using the custom scorer

Finally, we run the evaluation twice so you can compare the judge's assessments of conversations where the agent tries to resolve issues versus conversations where it does not.

import mlflow

# Now, let's evaluate the app's responses against the judge when it does not resolve the issues
RESOLVE_ISSUES = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)


# Now, let's evaluate the app's responses against the judge when it DOES resolve the issues
RESOLVE_ISSUES = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)
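
Optionally, capture the return value of mlflow.genai.evaluate(...) if you want to inspect aggregate scores in code. This is a sketch that assumes the returned result object exposes aggregate metrics as a dictionary; you can also compare the two runs side by side in the MLflow UI.

# Optional sketch: capture the result object to inspect aggregate scores programmatically.
# Assumes the object returned by mlflow.genai.evaluate exposes aggregate metrics as a dict.
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)
print(results.metrics)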

Next steps