Prompt-based LLM scorers

2025-06-11

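Overview

judges.custom_prompt_judge() is designed to help you build LLM scorers quickly and easily when you need full control over the judge's prompt, or when the judge must return more output values than "pass" / "fail" (for example, "great", "ok", and "bad").

You provide a prompt template that contains placeholders for specific fields in your app's trace, and you define the output choices the judge can select from. The Databricks-hosted LLM judge model uses these inputs to select the best-fitting output choice and provides a rationale for its selection.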
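
The sketch below illustrates the two pieces a prompt template carries, using the same notation as the judge prompt later in this guide: `[[choice]]` marks an output choice and `{{placeholder}}` marks a field to fill in. The template text and the `describe_template` helper are hypothetical and written only for illustration; this is not how MLflow parses templates internally.

```python
import re

# Hypothetical template in the same style as the judge prompt used later in this guide.
template = """
Decide how well the answer addresses the question.

[[great]]: The answer is complete and correct.
[[ok]]: The answer is partly helpful.
[[bad]]: The answer is wrong or unhelpful.

Question: {{question}}
Answer: {{answer}}
"""


def describe_template(text: str) -> dict:
    """Return the output choices and placeholder names found in a template."""
    choices = re.findall(r"\[\[(\w+)\]\]", text)
    placeholders = re.findall(r"\{\{(\w+)\}\}", text)
    return {"choices": choices, "placeholders": placeholders}


info = describe_template(template)
print(info)
```

Listing the choices and placeholders this way can be a quick sanity check before handing a template to the judge.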

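Note

We recommend starting with guidelines-based judges and using prompt-based judges only when you need finer control or cannot express your evaluation criteria as pass/fail guidelines. Guidelines-based judges have the distinct advantage of being easy to explain to business stakeholders, and domain experts can often write them directly.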

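How to create a prompt-based judge scorer

Follow the guide below to create a scorer that wraps judges.custom_prompt_judge().

In this guide, you create a custom scorer that wraps the judges.custom_prompt_judge() API and then run an offline evaluation with the resulting scorer. You can schedule these same scorers to run in production to continuously monitor your application's quality.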

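Note

For details on the interface and parameters, see the judges.custom_prompt_judge() concept page.

Step 1: Create a sample app to evaluate

First, create a sample GenAI app that responds to customer support questions. The app has a (fake) knob that controls the system prompt, so you can easily compare the judge's outputs between "good" and "bad" conversations.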

import os
import mlflow
from openai import OpenAI
from mlflow.entities import Document
from typing import List, Dict, Any, cast

# Enable auto logging for OpenAI
mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)


# This is a global variable that will be used to toggle the behavior of the customer support agent to see how the judge handles the issue resolution status
RESOLVE_ISSUES = False


@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):

    # Prepare messages for the LLM.
    # The RESOLVE_ISSUES toggle is used later to see how the judge handles the issue resolution status.
    system_prompt_postfix = (
        "Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\n"
        if not RESOLVE_ISSUES
        else ""
    )

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *messages,
    ]

    # Call the LLM to generate a response
    output = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }
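
To see what the knob changes without calling a model, the message-building step can be isolated into a small function. This is a minimal sketch: `build_messages` is a hypothetical helper that mirrors the logic above with a shortened instruction string, and it takes the toggle as an argument instead of reading the global.

```python
from typing import Dict, List


def build_messages(
    messages: List[Dict[str, str]], resolve_issues: bool
) -> List[Dict[str, str]]:
    """Prepend a system prompt whose wording depends on the toggle."""
    postfix = "" if resolve_issues else "Do your best to NOT resolve the issue."
    system_content = f"You are a helpful customer support agent.  {postfix}".strip()
    return [{"role": "system", "content": system_content}, *messages]


user_turn = [{"role": "user", "content": "How much does a microwave cost?"}]
good = build_messages(user_turn, resolve_issues=True)
bad = build_messages(user_turn, resolve_issues=False)

print(good[0]["content"])
print(bad[0]["content"])
```

Only the system message differs between the two calls; the user messages pass through unchanged, which is what lets the judge's verdicts be compared on identical inputs.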

Step 2: Define your evaluation criteria and wrap them as a custom scorer

Here, you define a sample judge prompt and use a custom scorer to connect it to your app's input/output schema.

from mlflow.genai.scorers import scorer


# Prompt template for a 3-category issue resolution status
issue_resolution_prompt = """
Evaluate the entire conversation between a customer and an LLM-based agent.  Determine if the issue was resolved in the conversation.

You must choose one of the following categories.

[[fully_resolved]]: The response directly and comprehensively addresses the user's question or problem, providing a clear solution or answer. No further immediate action seems required from the user on the same core issue.
[[partially_resolved]]: The response offers some help or relevant information but doesn't completely solve the problem or answer the question. It might provide initial steps, require more information from the user, or address only a part of a multi-faceted query.
[[needs_follow_up]]: The response does not adequately address the user's query, misunderstands the core issue, provides unhelpful or incorrect information, or inappropriately deflects the question. The user will likely need to re-engage or seek further assistance.

Conversation to evaluate: {{conversation}}
"""

from mlflow.genai.judges import custom_prompt_judge
import json
from mlflow.entities import Feedback


# Define a custom scorer that wraps the prompt-based LLM judge to check whether the issue was resolved
@scorer
def is_issue_resolved(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # We return the Feedback object from the prompt-based judge directly, but we could post-process it before returning it.
    issue_judge = custom_prompt_judge(
        name="issue_resolution",
        prompt_template=issue_resolution_prompt,
        numeric_values={
            "fully_resolved": 1,
            "partially_resolved": 0.5,
            "needs_follow_up": 0,
        },
    )

    # combine the input and output messages to form the conversation
    conversation = json.dumps(inputs["messages"] + outputs["messages"])

    return issue_judge(conversation=conversation)
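
The scorer above does two things before calling the judge: it joins the input and output messages into one JSON string for the `{{conversation}}` placeholder, and it supplies a mapping from each choice to a number. The sketch below walks through both steps in plain Python. The assistant reply is written by hand for illustration, `to_numeric` is a hypothetical helper, and the assumption that `numeric_values` converts the selected choice into a numeric score should be checked against the judges.custom_prompt_judge() reference.

```python
import json
from typing import Dict, Union

numeric_values: Dict[str, Union[int, float]] = {
    "fully_resolved": 1,
    "partially_resolved": 0.5,
    "needs_follow_up": 0,
}

inputs = {
    "messages": [
        {"role": "user", "content": "Can I return the microwave I bought 2 months ago?"}
    ]
}
# Hypothetical app output, written by hand for this sketch.
outputs = {
    "messages": [
        {"role": "assistant", "content": "Let me check the return policy for you."}
    ]
}

# Same concatenation as in the scorer: the full conversation as one JSON string.
conversation = json.dumps(inputs["messages"] + outputs["messages"])


def to_numeric(choice: str) -> float:
    """Map a categorical choice to its numeric score, rejecting unknown labels."""
    if choice not in numeric_values:
        raise ValueError(f"unknown choice: {choice!r}")
    return float(numeric_values[choice])


print(conversation)
print(to_numeric("partially_resolved"))
```

Numeric scores make it possible to average results across a dataset, which a purely categorical label does not allow.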

Step 3: Create a sample evaluation dataset

Each inputs entry is passed to your app by mlflow.genai.evaluate(...).

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
    },
]
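
Because every record must match what the app expects, a quick shape check before running the evaluation can catch typos early. The following is a minimal sketch: `validate_eval_dataset` is a hypothetical helper, not part of MLflow, and the rules it enforces (a non-empty `messages` list, `role` and `content` keys only, last turn from the user) are assumptions drawn from the dataset above.

```python
from typing import Any, Dict, List


def validate_eval_dataset(dataset: List[Dict[str, Any]]) -> int:
    """Check each record's shape and return the number of records."""
    for index, record in enumerate(dataset):
        messages = record.get("inputs", {}).get("messages")
        if not messages:
            raise ValueError(f"record {index}: 'inputs.messages' is missing or empty")
        for message in messages:
            if set(message) != {"role", "content"}:
                raise ValueError(
                    f"record {index}: each message needs exactly 'role' and 'content'"
                )
        if messages[-1]["role"] != "user":
            raise ValueError(f"record {index}: the last message should come from the user")
    return len(dataset)


# One record copied from the dataset above, so the sketch runs on its own.
sample = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
    },
]

print(validate_eval_dataset(sample))
```

In the guide's flow, the same function could be called on `eval_dataset` just before the evaluation step.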

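Step 4: Evaluate your app using the custom scorer

Finally, run the evaluation twice to compare the judge's verdicts for conversations where the agent tries to resolve the issue against those where it does not.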

import mlflow

# Now, let's evaluate the app's responses against the judge when it does not resolve the issues
RESOLVE_ISSUES = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)


# Now, let's evaluate the app's responses against the judge when it DOES resolve the issues
RESOLVE_ISSUES = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[is_issue_resolved],
)
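
Once both runs have finished, their aggregate scores can be compared side by side. The sketch below assumes each run's aggregate metrics are available as a plain dictionary of metric name to value; how to obtain that dictionary from the value returned by mlflow.genai.evaluate(...) is not shown here and should be taken from the API reference. The `compare_metrics` helper, the metric name, and the numbers are all placeholders for illustration, not results from this guide.

```python
from typing import Dict


def compare_metrics(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Return after - before for every metric present in both runs."""
    shared = sorted(set(before) & set(after))
    return {name: after[name] - before[name] for name in shared}


# Placeholder values only; the metric name is hypothetical as well.
run_without_resolution = {"issue_resolution/mean": 0.25}
run_with_resolution = {"issue_resolution/mean": 0.75}

deltas = compare_metrics(run_without_resolution, run_with_resolution)
print(deltas)
```

A positive delta would indicate that the judge scored the second run higher, which is the expected direction when the agent is allowed to resolve issues.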

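Next steps

Create guidelines-based scorers - Start with simpler pass/fail criteria (recommended)
Run evaluations with your scorers - Use your custom prompt-based scorers in comprehensive evaluations
Prompt-based judge concept reference - Understand how prompt-based judges work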
