Create custom scorers

2025-06-11

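Overview

Custom scorers give you full flexibility to define exactly how the quality of your GenAI application is measured. You can define evaluation metrics tailored to your specific business use case, based on simple heuristics, advanced logic, or programmatic evaluations.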

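Use custom scorers in the following scenarios:

Defining a custom heuristic or code-based evaluation metric
Customizing how data from your app's trace is mapped to Databricks' research-backed LLM judges in the predefined LLM scorers
Creating an LLM judge with custom prompt text, as described in the prompt-based LLM scorers article
Using your own LLM model (rather than a Databricks-hosted LLM judge model) for evaluation
Any other use case that needs more flexibility and control than the predefined abstractions provide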

Note

For a detailed reference of the custom scorer interface, see the scorers concept page or the API documentation.

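Usage overview

Custom scorers are written in Python and give you full control to evaluate any data from your app's traces. A single custom scorer works both in the evaluate(...) harness for offline evaluation and when passed to create_monitor(...) for production monitoring.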

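The following output types are supported:

Pass/fail string: the string values "yes" and "no" are rendered as "Pass" and "Fail" in the UI.
Numeric value: an ordinal value, either an integer or a float.
Boolean value: True or False.
Feedback object: return a Feedback object with a score, a rationale, and additional metadata.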
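As a minimal sketch of the first three output types, the plain functions below return a pass/fail string, a numeric value, and a Boolean. The function names and the `max_words` threshold are hypothetical and chosen only for illustration. In real use each function would be decorated with `@scorer`; the decorator is omitted here so the sketch runs without MLflow installed.

```python
# Illustrative heuristics only; in practice, decorate each with @scorer
# (from mlflow.genai.scorers). Names and thresholds are hypothetical.

def is_concise(outputs: str, max_words: int = 50) -> str:
    """Pass/fail string: "yes" when the response has at most max_words words."""
    return "yes" if len(outputs.split()) <= max_words else "no"


def word_count(outputs: str) -> int:
    """Numeric value: number of whitespace-separated words in the response."""
    return len(outputs.split())


def mentions_refund(outputs: str) -> bool:
    """Boolean value: True when the response mentions a refund."""
    return "refund" in outputs.lower()


response = "You can request a refund within 30 days."
print(is_concise(response), word_count(response), mentions_refund(response))
```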

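Custom scorers have access to the following inputs:

The complete MLflow trace, including spans, attributes, and outputs. The trace is passed to the custom scorer as an instantiated mlflow.entities.Trace class.
The inputs dictionary, derived from the input dataset or from MLflow post-processing of the trace.
The outputs value, derived from the input dataset or the trace. If predict_fn is provided, the outputs value is the return value of predict_fn.
The expectations dictionary, derived from the expectations field of the input dataset or from the assessments associated with the trace.

The @scorer decorator lets you define custom evaluation metrics that can be passed to mlflow.genai.evaluate() through the scorers argument, or to create_monitor(...).

The scorer function is called with named arguments based on the signature below. All named arguments are optional, so you can use any combination. For example, you can define a scorer that takes only inputs and trace as arguments and omits outputs and expectations.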

import mlflow
from mlflow.genai.scorers import scorer
from typing import Optional, Any
from mlflow.entities import Feedback

@scorer
def my_custom_scorer(
  *,  # evaluate(...) harness will always call your scorer with named arguments
  inputs: Optional[dict[str, Any]],  # The agent's raw input, parsed from the Trace or dataset, as a Python dict
  outputs: Optional[Any],  # The agent's raw output, parsed from the Trace or dataset
  expectations: Optional[dict[str, Any]],  # The expectations passed to evaluate(data=...), as a Python dict
  trace: Optional[mlflow.entities.Trace]  # The app's resulting Trace containing spans and other metadata
) -> int | float | bool | str | Feedback | list[Feedback]:
  ...  # Scorer logic goes here
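To make the "any combination of named arguments" behavior concrete, here is a minimal sketch of how a harness could pass each function only the arguments its signature declares. This is an illustration built on `inspect.signature`, not MLflow's actual implementation; `call_with_declared_args` and the two example functions are hypothetical names.

```python
import inspect
from typing import Any, Callable


def call_with_declared_args(fn: Callable[..., Any], **available: Any) -> Any:
    """Call fn with only the keyword arguments that its signature declares."""
    declared = inspect.signature(fn).parameters
    kwargs = {name: value for name, value in available.items() if name in declared}
    return fn(**kwargs)


def inputs_only(inputs: dict[str, Any]) -> int:
    # Declares only `inputs`; outputs, expectations, and trace are not passed.
    return len(inputs)


def outputs_and_expectations(outputs: str, expectations: dict[str, Any]) -> bool:
    # Declares only `outputs` and `expectations`.
    return outputs == expectations["expected_response"]


row = {
    "inputs": {"question": "What is 2+2?"},
    "outputs": "4",
    "expectations": {"expected_response": "4"},
    "trace": None,
}
print(call_with_declared_args(inputs_only, **row))
print(call_with_declared_args(outputs_and_expectations, **row))
```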

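Custom scorer development approach

When you develop a metric, you need to iterate on it quickly without running your app every time you change the scorer. To do this, the following steps are recommended:

Step 1: Define the initial metric, app, and evaluation data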

import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any

@mlflow.trace
def my_app(input_field_name: str):
    return {'output': input_field_name+'_output'}

@scorer
def my_metric() -> int:
    # placeholder return value
    return 1

eval_set = [{'inputs': {'input_field_name': 'test'}}]

Step 2: Generate traces from your app using `evaluate()`

eval_results = mlflow.genai.evaluate(
    data=eval_set,
    predict_fn=my_app,
    scorers=[my_metric]  # the placeholder metric defined in step 1
)

Step 3: Query and store the resulting traces

generated_traces = mlflow.search_traces(run_id=eval_results.run_id)

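Step 4: Pass the resulting traces as input to `evaluate()` while you iterate on the metric

The search_traces function returns a Pandas DataFrame of traces, which you can pass directly to evaluate() as the input dataset. This lets you iterate on the metric quickly without re-running your app.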

@scorer
def my_metric(outputs: Any):
    # Implement the actual metric logic here.
    return outputs == "test_output"

# Note the lack of a predict_fn parameter
mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[my_metric],
)

Custom scorer examples

This guide walks through several approaches to building custom scorers.

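Prerequisites: Create a sample application and get a local copy of the traces

All of the approaches use the sample application below and a copy of its traces (extracted using the approach described above).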

import mlflow
from openai import OpenAI
from typing import Any
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer

# Enable auto logging for OpenAI
mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    # 1. Prepare messages for the LLM
    messages_for_llm = [
        {"role": "system", "content": "You are a helpful assistant."},
        *messages,
    ]

    # 2. Call LLM to generate a response
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )
    return response.choices[0].message.content


# Create a list of messages for the LLM to generate a response
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
]


@scorer
def dummy_metric():
    # This scorer is just to help generate initial traces.
    return 1


# Generate initial traces by running the sample_app.
# The results, including traces, are logged to the MLflow experiment defined above.
initial_eval_results = mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=sample_app, scorers=[dummy_metric]
)

generated_traces = mlflow.search_traces(run_id=initial_eval_results.run_id)

After you run the code above, your experiment should contain three traces.

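Example 1: Access data from the trace

Access the complete MLflow Trace object to use its details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculations.

Note

The generated_traces from the prerequisites section are used as the input data for these examples.

This scorer checks whether the response time of the LLM call in the trace is within an acceptable limit (5 seconds in this example).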

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Search particular span type from the trace
    llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]

    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9 # second
    max_duration = 5.0
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)
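The timing logic in this scorer can be exercised without a trace. The sketch below pulls the nanosecond arithmetic and the threshold check into plain functions so they can be tested with hand-written timestamps. `duration_seconds` and `within_limit` are hypothetical helper names, and the returned tuple stands in for the `value` and `rationale` of a Feedback object.

```python
def duration_seconds(start_time_ns: int, end_time_ns: int) -> float:
    """Convert a span's start/end timestamps (nanoseconds) to elapsed seconds."""
    return (end_time_ns - start_time_ns) / 1e9


def within_limit(
    start_time_ns: int, end_time_ns: int, max_duration: float = 5.0
) -> tuple[str, str]:
    """Return a (value, rationale) pair mirroring the Feedback built in the scorer."""
    elapsed = duration_seconds(start_time_ns, end_time_ns)
    value = "yes" if elapsed <= max_duration else "no"
    relation = "is within" if value == "yes" else "exceeds"
    rationale = f"LLM response time {elapsed:.2f}s {relation} the {max_duration}s limit."
    return value, rationale


# A span that started at t=0 and ended 2.5 seconds later.
print(within_limit(0, 2_500_000_000))
```

Keeping the arithmetic in a helper like this makes the unit conversion (nanoseconds to seconds) easy to verify before wiring it into a scorer.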

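Example 2: Wrap a predefined LLM judge

Create a custom scorer that wraps one of MLflow's predefined LLM judges. Use this to pre-process trace data for the judge or to post-process the feedback it returns.

This example shows how to wrap the is_context_relevant judge, which assesses whether a given context is relevant to a query, to assess whether the assistant's response is relevant to the user's query.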

import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
    # The `inputs` field for `sample_app` is a dictionary like: {"messages": [{"role": ..., "content": ...}, ...]}
    # We need to extract the content of the last user message to pass to the relevance judge.

    last_user_message_content = None
    if "messages" in inputs and isinstance(inputs["messages"], list):
        for message in reversed(inputs["messages"]):
            if message.get("role") == "user" and "content" in message:
                last_user_message_content = message["content"]
                break

    if not last_user_message_content:
        raise Exception("Could not extract the last user message from inputs to evaluate relevance.")

    # Call the `is_context_relevant` judge. It will return a Feedback object.
    return is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )

# Evaluate the custom relevance scorer
custom_relevance_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[is_message_relevant]
)
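The pre-processing step is the part of this scorer most likely to need iteration, so it can help to test it separately from the judge call. The following is a minimal sketch of that extraction as a standalone function; `last_user_message` is a hypothetical helper name, and the message shape is assumed to match the `inputs` of `sample_app`.

```python
from typing import Any, Optional


def last_user_message(inputs: dict[str, Any]) -> Optional[str]:
    """Return the content of the most recent user message, or None if absent."""
    messages = inputs.get("messages")
    if not isinstance(messages, list):
        return None
    for message in reversed(messages):
        if (
            isinstance(message, dict)
            and message.get("role") == "user"
            and message.get("content")
        ):
            return message["content"]
    return None


conversation = {
    "messages": [
        {"role": "user", "content": "I can't log in."},
        {"role": "assistant", "content": "Are you using our website or mobile app?"},
        {"role": "user", "content": "Website"},
    ]
}
print(last_user_message(conversation))
```

Returning None instead of raising lets the calling scorer decide how to report a row that has no user message.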

Example 3: Use `expectations`

When you call mlflow.genai.evaluate() with a data argument that is a list of dictionaries or a Pandas DataFrame, each row can include an expectations key. The value associated with this key is passed directly to your custom scorer.

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
from typing import Any, List, Optional, Union

expectations_eval_dataset_list = [
    {
        "inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
        "expectations": {
            "expected_response": "2+2 equals 4.",
            "expected_keywords": ["4", "four", "equals"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
        "expectations": {
            "expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
            "expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
        "expectations": {
            "expected_response": "Hello there!",
            # No keywords needed for this one, but the field can be omitted or empty
        }
    }
]

Example 3.1: Exact match with the expected response

This scorer checks whether the assistant's response exactly matches the expected_response provided in expectations.

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Scorer can return primitive value like bool, int, float, str, etc.
    return outputs == expectations["expected_response"]

exact_match_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[exact_match]
)

Example 3.2: Keyword presence check from expectations

This scorer checks whether all of the expected_keywords in expectations are present in the assistant's response.

@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    if expected_keywords is None:
        return Feedback(
            value=None, # Undetermined, as no keywords were expected
            rationale="No 'expected_keywords' provided in expectations."
        )

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,  # the dataset that includes `expectations`
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)
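The keyword check itself is simple enough to try out on literal strings. The sketch below isolates the case-insensitive matching in a plain function; `find_missing_keywords` is a hypothetical helper name, and the sample inputs are taken from the first row of the expectations dataset above.

```python
def find_missing_keywords(response: str, expected_keywords: list[str]) -> list[str]:
    """Return the expected keywords that do not appear in the response (case-insensitive)."""
    lowered = response.lower()
    return [keyword for keyword in expected_keywords if keyword.lower() not in lowered]


print(find_missing_keywords("2+2 equals 4.", ["4", "four", "equals"]))
```

Because the scorer requires every keyword to be present, a response of "2+2 equals 4." is reported as missing "four" even though it contains "4". If a list is meant to hold alternatives rather than required terms, an any-of check may be a better fit.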

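Example 4: Return multiple feedback objects

A single scorer can return a list of Feedback objects, which lets one scorer evaluate multiple quality facets (such as PII, sentiment, and conciseness) at the same time. Each Feedback object should have a unique name, which becomes the metric name in the results. Otherwise, if names are auto-generated and collide, the feedbacks may overwrite each other. If no name is specified, MLflow attempts to generate one based on the scorer function name and an index.

This example shows a scorer that returns two separate feedbacks for each trace:

is_not_empty_check: a pass/fail value ("yes" or "no") indicating whether the response content is non-empty.
response_char_length: a numeric value for the character length of the response.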

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
    feedbacks = []
    # 1. Check if the response is not empty
    feedbacks.append(
        Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
    )
    # 2. Calculate response character length
    char_length = len(outputs)
    feedbacks.append(Feedback(name="response_char_length", value=char_length))
    return feedbacks

multi_feedback_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[comprehensive_response_checker]
)

The results contain two assessment columns: is_not_empty_check and response_char_length.

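Example 5: Use your own LLM for the judge

Integrate a custom or externally hosted LLM inside a scorer. The scorer handles the API call and the input/output formatting, and generates Feedback from the LLM's response, giving you full control over the judging process.

You can also set the source field of the Feedback object to indicate that the source of the assessment is an LLM judge.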

import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional


# Assume `generated_traces` is available from the prerequisite code block.
# Assume `client` (OpenAI SDK client configured for Databricks) is available from the prerequisite block.
# client = OpenAI(...)

# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.

Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.

Original User Query:
```{user_query}```

AI's Response:
```{llm_response_from_app}```

Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""

@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
    user_query = inputs["messages"][-1]["content"]

    # Call the Judge LLM using the OpenAI SDK client.
    judge_llm_response_obj = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
        messages=[
            {"role": "system", "content": judge_system_prompt},
            {"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
        ],
        max_tokens=200,  # Max tokens for the judge's rationale
        temperature=0.0, # For more deterministic judging
    )
    judge_llm_output_text = judge_llm_response_obj.choices[0].message.content

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(judge_llm_output_text)
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="claude-3-7-sonnet",
        )
    )


# Evaluate the scorer using the pre-generated traces.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[answer_quality]
)
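The scorer above assumes the judge always returns clean JSON. LLMs sometimes wrap the JSON in extra text or return a score outside the requested range, so a more defensive parser may be worth considering. The following is a minimal sketch under that assumption; `parse_judge_output` is a hypothetical helper, and returning None for unusable output is one possible design choice among several.

```python
import json
import re
from typing import Optional


def parse_judge_output(text: str) -> Optional[tuple[int, str]]:
    """Return (score, rationale), or None if no usable JSON object is found."""
    # Take everything from the first "{" to the last "}" in case the judge
    # surrounded the JSON object with extra prose.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
        score = int(payload["score"])
        rationale = str(payload["rationale"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    # The judge prompt asks for an integer from 1 to 5.
    if not 1 <= score <= 5:
        return None
    return score, rationale


print(parse_judge_output('Here is my evaluation:\n{"score": 5, "rationale": "Great."}'))
print(parse_judge_output("I cannot evaluate this."))
```

Inside a scorer, a None result could be turned into an explicit error or a skipped row rather than an unhandled exception; which is appropriate depends on how you want failed judgments to show up in the results.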

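Open a trace in the UI and click the "answer_quality" assessment to see the judge's metadata, such as the rationale, timestamp, and judge model name. If the judge's assessment is incorrect, click the Edit button to override the score.

The new assessment takes precedence over the original judge assessment, but the edit history is retained for future reference.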

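Next steps

Continue your journey with these recommended actions and tutorials.

Evaluate with custom LLM scorers - Create semantic evaluations using LLMs
Run scorers in production - Deploy your scorers for continuous monitoring
Build evaluation datasets - Create test data for your scorers

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.

Scorers - Learn more about how scorers work and their architecture
Evaluation harness - Understand how mlflow.genai.evaluate() uses your scorers
LLM judges - Learn the foundations of AI-powered evaluation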
