How to create guidelines-based LLM scorers

2025-06-21

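Overview

scorers.Guidelines() and scorers.ExpectationGuidelines() are scorers that wrap judges.meets_guidelines(), the Databricks-provided LLM judge SDK. They are designed to let you quickly and easily customize evaluation by defining natural-language criteria framed as pass/fail conditions. They are well suited to checking compliance with rules, style guides, or information inclusion/exclusion.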

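Guidelines have the distinct advantage of being easy to explain to business stakeholders ("we are evaluating whether the app delivers on this set of rules"), so they can often be written directly by domain experts.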

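You can use the guidelines LLM judge in two ways:

If your guidelines consider only the app's inputs and outputs, and your app's trace has simple inputs (for example, only the user query) and outputs (for example, only the app's response), use the prebuilt guidelines scorer.
If your guidelines consider additional data (such as retrieved documents or tool calls), or your trace has complex inputs/outputs containing fields you want to exclude from evaluation (such as a user_id), create a custom scorer that wraps the judges.meets_guidelines() API.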
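To make the second case concrete, here is a minimal sketch of selecting which fields the judge gets to see. The helper name build_judge_context is hypothetical, and the field names (user_messages, user_id, message, policies_followed) mirror the sample app used later in this guide; the sketch only builds the dictionary and does not call any judge.

```python
import json
from typing import Any, Dict


def build_judge_context(inputs: Dict[str, Any], outputs: Dict[str, Any]) -> Dict[str, Any]:
    """Pick only the fields the judge should see (hypothetical helper)."""
    return {
        # Serialize the conversation so the judge receives it as text.
        "request": json.dumps(inputs["user_messages"]),
        "response": outputs["message"],
        # Extra data that a guideline can refer to by name.
        "provided_policies": outputs["policies_followed"],
        # Note: inputs["user_id"] is deliberately left out.
    }


context = build_judge_context(
    inputs={
        "user_messages": [{"role": "user", "content": "Can I return my microwave?"}],
        "user_id": 1,
    },
    outputs={
        "message": "Yes, within 30 days.",
        "policies_followed": ["Returns within 30 days."],
    },
)
print(sorted(context))  # → ['provided_policies', 'request', 'response']
```

A dictionary like this is what a custom scorer would pass as the context argument of judges.meets_guidelines().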

Note

For details on how the prebuilt guidelines scorer parses your trace, see the prebuilt guidelines scorer concept page.

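1. Use the prebuilt guidelines scorer

In this guide, you add custom evaluation criteria to the prebuilt scorer and run an offline evaluation with the resulting scorers. You can schedule these same scorers to run in production to continuously monitor your application's quality.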

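Step 1: Create the sample app to evaluate

First, create a sample GenAI app that responds to customer support questions. The app has a few (fake) knobs that control the system prompt, so that you can easily compare the guidelines judge's outputs between "good" and "bad" responses.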

import os
import mlflow
from openai import OpenAI
from typing import List, Dict, Any

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# This is a global variable that will be used to toggle the behavior of the customer support agent to see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):

    # 1. Prepare messages for the LLM
    system_prompt_postfix = (
        "Be super rude and very verbose in your responses."
        if BE_RUDE_AND_VERBOSE
        else ""
    )
    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *messages,
    ]

    # 2. Call LLM to generate a response
    return client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )

result = customer_support_agent(
    messages=[
        {"role": "user", "content": "How much does a microwave cost?"},
    ]
)
print(result)

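Step 2: Define your evaluation criteria

Normally, you would work with your business stakeholders to define the guidelines. Here, we define a few sample guidelines. When writing guidelines, refer to the app's inputs as the request and the app's outputs as the response. To understand what data is passed to the LLM judge, see how inputs and outputs are parsed in the prebuilt guidelines scorer section.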

tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structure responses logically.  It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request.  Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."

Note

Guidelines can be as long as you need. Conceptually, you can think of a guideline as a "mini prompt" that defines the pass criteria. If needed, it can include markdown formatting (such as a bulleted list).
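As an illustration of a longer guideline that uses markdown formatting, the sketch below builds a multi-line guideline string with a bulleted list. The variable name escalation and its wording are hypothetical and are not part of the sample guidelines above.

```python
import textwrap

# Hypothetical longer guideline; the wording is illustrative only.
escalation = textwrap.dedent(
    """\
    The response must handle frustrated customers appropriately:

    - It must acknowledge the customer's frustration.
    - It must not blame the customer.
    - If the request is not clear, it must ask a clarifying question.
    """
).strip()

print(escalation)
```

A string like this could be passed as the guidelines argument in the same way as the single-sentence guidelines.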

Step 3: Create a sample evaluation dataset

Each inputs is passed to your app by mlflow.genai.evaluate(...).

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ]
        },
    },
]
print(eval_dataset)
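Because each inputs dictionary is handed to the app, its keys need to line up with the app's parameter names. The following is a rough sketch of a pre-flight check, assuming the keys are passed as keyword arguments; find_mismatched_rows and fake_agent are hypothetical names, and fake_agent stands in for the real app so that no LLM call is made.

```python
import inspect
from typing import Any, Callable, Dict, List


def find_mismatched_rows(predict_fn: Callable[..., Any], dataset: List[Dict[str, Any]]) -> List[int]:
    """Return indexes of rows whose "inputs" cannot be bound to predict_fn's parameters."""
    signature = inspect.signature(predict_fn)
    bad_rows = []
    for index, row in enumerate(dataset):
        try:
            # bind() raises TypeError on missing or unexpected keyword arguments.
            signature.bind(**row["inputs"])
        except TypeError:
            bad_rows.append(index)
    return bad_rows


# Stand-in with the same parameter name as the sample app (no LLM call).
def fake_agent(messages):
    return "ok"


sample_rows = [
    {"inputs": {"messages": [{"role": "user", "content": "Hi"}]}},
    {"inputs": {"message": "wrong key"}},
]
print(find_mismatched_rows(fake_agent, sample_rows))  # → [1]
```

Running a check like this before the evaluation can surface a misspelled key early, before any LLM calls are made.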

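Step 4: Evaluate your app using the custom scorers

Finally, run the evaluation twice so that you can compare the judges' verdicts between the rude/verbose (first screenshot) and polite/non-verbose (second screenshot) versions of the app.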

from mlflow.genai.scorers import Guidelines
import mlflow

# First, let's evaluate the app's responses against the guidelines when it is not rude and verbose
BE_RUDE_AND_VERBOSE = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)


# Next, let's evaluate the app's responses against the guidelines when it IS rude and verbose
BE_RUDE_AND_VERBOSE = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        Guidelines(name="tone", guidelines=tone),
        Guidelines(name="structure", guidelines=structure),
        Guidelines(name="banned_topics", guidelines=banned_topics),
        Guidelines(name="relevance", guidelines=relevance),
    ],
)

Screenshot: evaluation of the rude and verbose app

Screenshot: evaluation of the polite and non-verbose app

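2. Create a custom scorer that wraps the guidelines judge

In this guide, you create custom scorers that wrap the judges.meets_guidelines() API and run an offline evaluation with those scorers. You can schedule these same scorers to run in production to continuously monitor your application's quality.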

Step 1: Create the sample app to evaluate

import os
import mlflow
from openai import OpenAI
from typing import List, Dict

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# This is a global variable that will be used to toggle whether the customer support agent is told to follow the provided policies
FOLLOW_POLICIES = False

# This is a global variable that will be used to toggle the behavior of the customer support agent to see how the guidelines scorers handle rude and verbose responses
BE_RUDE_AND_VERBOSE = False

@mlflow.trace
def customer_support_agent(user_messages: List[Dict[str, str]], user_id: int):

    # 1. Fake policies to follow.
    @mlflow.trace
    def get_policies_for_user(user_id: int):
        if user_id == 1:
            return [
                "All returns must be processed within 30 days of purchase, with a valid receipt.",
            ]
        else:
            return [
                "All returns must be processed within 90 days of purchase, with a valid receipt.",
            ]

    policies_to_follow = get_policies_for_user(user_id)

    # 2. Prepare messages for the LLM
    # We will use these toggles later to see how the scorers handle policy violations and rude, verbose responses
    system_prompt_postfix = (
        f"Follow the following policies: {policies_to_follow}.  Do not refer to the specific policies in your response.\n"
        if FOLLOW_POLICIES
        else ""
    )

    system_prompt_postfix = (
        f"{system_prompt_postfix}Be super rude and very verbose in your responses.\n"
        if BE_RUDE_AND_VERBOSE
        else system_prompt_postfix
    )
    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
        },
        *user_messages,
    ]

    # 3. Call LLM to generate a response
    output = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude 3.7 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )

    return {
        "message": output.choices[0].message.content,
        "policies_followed": policies_to_follow,
    }

result = customer_support_agent(
    user_messages=[
        {"role": "user", "content": "How much does a microwave cost?"},
    ],
    user_id=1
)
print(result)
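One detail worth noting in the sample app is that the policy lookup compares user_id to the integer 1. The standalone sketch below reproduces only that branching, without tracing or LLM calls, to show that passing the string "1" would silently fall through to the 90-day policy.

```python
def get_policies_for_user(user_id):
    # Same branching as the sample app, without tracing.
    if user_id == 1:
        return ["All returns must be processed within 30 days of purchase, with a valid receipt."]
    return ["All returns must be processed within 90 days of purchase, with a valid receipt."]


print("30 days" in get_policies_for_user(1)[0])    # True: the integer 1 matches
print("30 days" in get_policies_for_user("1")[0])  # False: the string "1" does not equal 1
```

For this reason, the evaluation dataset for this app should pass user_id as an integer.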

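Step 2: Define your evaluation criteria and wrap them as custom scorers

Normally, you would work with your business stakeholders to define the guidelines. Here, we define a few sample guidelines and use custom scorers to wire them up to the app's input/output schema.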

from mlflow.genai.scorers import scorer
from mlflow.genai.judges import meets_guidelines
import json
from typing import Dict, Any


tone = "The response must maintain a courteous, respectful tone throughout.  It must show empathy for customer concerns."
structure = "The response must use clear, concise language and structure responses logically.  It must avoid jargon or explain technical terms when used."
banned_topics = "If the request is a question about product pricing, the response must politely decline to answer and refer the user to the pricing page."
relevance = "The response must be relevant to the user's request.  Only consider the relevance and nothing else. If the request is not clear, the response must ask for more information."
# Note in this guideline how we refer to `provided_policies` - we will make the meets_guidelines LLM judge aware of this data.
follows_policies_guideline = "If the provided_policies is relevant to the request and response, the response must adhere to the provided_policies."

# Define a custom scorer that wraps the guidelines LLM judge to check if the response follows the policies
@scorer
def follows_policies(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    # we directly return the Feedback object from the guidelines LLM judge, but we could have post-processed it before returning it.
    return meets_guidelines(
        name="follows_policies",
        guidelines=follows_policies_guideline,
        context={
            # Here we make meets_guidelines aware of the policies that the app retrieved for this user
            "provided_policies": outputs["policies_followed"],
            "response": outputs["message"],
            "request": json.dumps(inputs["user_messages"]),
        },
    )


# Define a custom scorer that wraps the guidelines LLM judge to pass the custom keys from the inputs/outputs to the guidelines LLM judge
@scorer
def check_guidelines(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    feedbacks = []

    request = json.dumps(inputs["user_messages"])
    response = outputs["message"]

    feedbacks.append(
        meets_guidelines(
            name="tone",
            guidelines=tone,
            # Note: While we used request and response as keys, we could have used any key as long as our guideline referred to that key by name (e.g., if we had used output instead of response, we would have changed our guideline to be "The output must be polite")
            context={"response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="structure",
            guidelines=structure,
            context={"response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="banned_topics",
            guidelines=banned_topics,
            context={"request": request, "response": response},
        )
    )

    feedbacks.append(
        meets_guidelines(
            name="relevance",
            guidelines=relevance,
            context={"request": request, "response": response},
        )
    )

    # A scorer can return a list of Feedback objects OR a single Feedback object.
    return feedbacks
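The repeated meets_guidelines calls in check_guidelines could also be driven from a table of guidelines. The sketch below shows one way to do that, under the assumption that the judge is called with name, guidelines, and context keyword arguments as in the code above. run_guidelines and fake_judge are hypothetical names; fake_judge is a stand-in so the sketch runs without an LLM, and in a real scorer you would pass meets_guidelines instead.

```python
from typing import Any, Callable, Dict, List, Tuple

# Each entry: scorer name -> (guideline text, context keys that guideline needs).
GuidelineSpec = Dict[str, Tuple[str, List[str]]]


def run_guidelines(judge: Callable[..., Any], specs: GuidelineSpec, available: Dict[str, Any]) -> List[Any]:
    """Call judge once per guideline, passing only the context keys it needs."""
    results = []
    for name, (text, keys) in specs.items():
        context = {key: available[key] for key in keys}
        results.append(judge(name=name, guidelines=text, context=context))
    return results


# Stand-in judge for illustration; it only echoes what it was given.
def fake_judge(name, guidelines, context):
    return {"name": name, "context_keys": sorted(context)}


specs = {
    "tone": ("The response must maintain a courteous, respectful tone throughout.", ["response"]),
    "relevance": ("The response must be relevant to the user's request.", ["request", "response"]),
}
feedbacks = run_guidelines(fake_judge, specs, {"request": "[...]", "response": "Happy to help!"})
print([f["name"] for f in feedbacks])  # → ['tone', 'relevance']
```

Keeping the required context keys next to each guideline makes it easier to see which data each guideline is allowed to refer to by name.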

Step 3: Create a sample evaluation dataset

Each inputs is passed to your app by mlflow.genai.evaluate(...).

eval_dataset = [
    {
        "inputs": {
            # Note that these keys match the **kwargs of our application.
            "user_messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 1,  # the bot should say no if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
            "user_id": 2,  # the bot should say yes if the policies are followed for this user
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
            "user_id": 3,
        },
    },
    {
        "inputs": {
            "user_messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
            "user_id": 1,
        },
    },
]

print(eval_dataset)

Step 4: Evaluate your app using the guidelines

import mlflow

# Now, let's evaluate the app's responses against the guidelines when it is NOT rude and verbose and DOES follow policies
BE_RUDE_AND_VERBOSE = False
FOLLOW_POLICIES = True

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)


# Now, let's evaluate the app's responses against the guidelines when it IS rude and verbose and does NOT follow policies
BE_RUDE_AND_VERBOSE = True
FOLLOW_POLICIES = False

mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[follows_policies, check_guidelines],
)

Screenshot: evaluation of the rude and verbose app

Screenshot: evaluation of the polite and non-verbose app

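Next steps

Create prompt-based scorers - Create more complex judges using custom prompts and multiple output choices
Run evaluations with your scorers - Use your custom guidelines scorers in comprehensive evaluations
Guidelines concept reference - Understand how the guidelines judge works under the hood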
