Use predefined LLM scorers

2025-06-11

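Overview

MLflow provides built-in LLM scorers that can evaluate traces across common quality dimensions.

Important

You can usually start evaluating with the predefined scorers. As your application logic and evaluation criteria become more complex (or if your application's traces do not meet a scorer's requirements), switch to wrapping the underlying judge in a custom scorer or to creating a custom LLM scorer.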

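Tip

When to use custom scorers instead:

Your application has complex inputs/outputs that the predefined scorers cannot parse
You need to evaluate specific business logic or domain-specific criteria
You want to combine multiple evaluation aspects into a single scorer
Your trace structure does not match the predefined scorers' requirements

For detailed examples, see the custom scorers guide and the custom LLM judges guide.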
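As a rough illustration of combining several evaluation aspects into one verdict, the sketch below merges two code-based checks. It is plain Python with hypothetical names (`combined_quality_check`, the banned phrase, the 50-word limit), not part of MLflow. In MLflow, a function like this would typically be registered with the `scorer` decorator from `mlflow.genai.scorers`, which is left out here so the snippet runs without MLflow installed.

```python
from typing import Dict, List


def combined_quality_check(
    response_text: str,
    banned_phrases: List[str],
    max_words: int = 50,
) -> Dict[str, object]:
    """Combine two simple checks into a single yes/no verdict with a rationale."""
    lowered = response_text.lower()
    found = [p for p in banned_phrases if p.lower() in lowered]
    word_count = len(response_text.split())

    problems = []
    if found:
        problems.append(f"contains banned phrases: {found}")
    if word_count > max_words:
        problems.append(f"has {word_count} words (limit {max_words})")

    # Mirror the yes/no-plus-rationale shape that the predefined scorers use.
    return {
        "value": "no" if problems else "yes",
        "rationale": "; ".join(problems) if problems else "all checks passed",
    }


verdict = combined_quality_check(
    "SELECT retrieves rows from a table.",
    banned_phrases=["based on the provided context"],
)
print(verdict)
```

Returning a value together with a rationale keeps the custom check easy to compare with the feedback produced by the predefined scorers.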

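How predefined scorers work

When a predefined scorer receives a trace from evaluate() or from the monitoring service, it:

Parses the trace to extract the data required by the LLM judge it wraps.
Calls the LLM judge to generate a Feedback.
- The Feedback contains a yes/no score along with a written rationale that explains the reasoning behind the score.
Returns the Feedback to its caller so that it can be attached to the trace.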
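The three steps above can be sketched in plain Python. Everything in this snippet is a hypothetical stand-in chosen for illustration (`SketchFeedback`, `run_scorer_sketch`, the dictionary-shaped trace, and a keyword-overlap stub instead of a real LLM judge). It does not reproduce MLflow's internal implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SketchFeedback:
    """Hypothetical stand-in for the feedback a scorer returns."""
    name: str
    value: str       # "yes" or "no"
    rationale: str   # written explanation for the score


def run_scorer_sketch(
    trace: Dict[str, str],
    judge: Callable[[str, str], SketchFeedback],
) -> SketchFeedback:
    # Step 1: parse the trace and extract the fields the judge needs.
    request = trace["request"]
    response = trace["response"]
    # Step 2: call the judge to produce feedback.
    feedback = judge(request, response)
    # Step 3: return the feedback so the caller can attach it to the trace.
    return feedback


def stub_relevance_judge(request: str, response: str) -> SketchFeedback:
    """Keyword-overlap stub used in place of a real LLM judge."""
    shared = set(request.lower().split()) & set(response.lower().split())
    if shared:
        return SketchFeedback("relevance", "yes", f"shares terms: {sorted(shared)}")
    return SketchFeedback("relevance", "no", "no overlapping terms")


trace = {"request": "what is select in sql", "response": "select retrieves data in sql"}
feedback = run_scorer_sketch(trace, stub_relevance_judge)
print(feedback.value, "-", feedback.rationale)
```

Keeping the parsing step separate from the judge call is what lets the same judge be reused inside a custom scorer when a trace has an unusual shape.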

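Note

To learn more about how MLflow passes inputs to a scorer and attaches the resulting Feedback from the scorer to a trace, see the scorer concept guide.

Prerequisites

Run the following command to install MLflow (version 3.1.0 or later, as pinned in the command) and the OpenAI package.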
```
pip install --upgrade "mlflow[databricks]>=3.1.0" openai
```
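Follow the tracing quickstart to connect your development environment to an MLflow experiment.

Step 1: Create a sample application to evaluate

The code below defines a simple application that uses a fake retriever.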

import os
import mlflow
from openai import OpenAI
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)


# Retriever function called by the sample app
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    return [
        Document(
            id="sql_doc_1",
            page_content="SELECT is a fundamental SQL command used to retrieve data from a database. You can specify columns and use a WHERE clause to filter results.",
            metadata={"doc_uri": "http://example.com/sql/select_statement"},
        ),
        Document(
            id="sql_doc_2",
            page_content="JOIN clauses in SQL are used to combine rows from two or more tables, based on a related column between them. Common types include INNER JOIN, LEFT JOIN, and RIGHT JOIN.",
            metadata={"doc_uri": "http://example.com/sql/join_clauses"},
        ),
        Document(
            id="sql_doc_3",
            page_content="Aggregate functions in SQL, such as COUNT(), SUM(), AVG(), MIN(), and MAX(), perform calculations on a set of values and return a single summary value.  The most common aggregate function in SQL is COUNT().",
            metadata={"doc_uri": "http://example.com/sql/aggregate_functions"},
        ),
    ]


# Sample app that we will evaluate
@mlflow.trace
def sample_app(query: str):
    # 1. Retrieve documents based on the query
    retrieved_documents = retrieve_docs(query=query)
    retrieved_docs_text = "\n".join([doc.page_content for doc in retrieved_documents])

    # 2. Prepare messages for the LLM
    messages_for_llm = [
        {
            "role": "system",
            # Fake prompt to show how the various scorers identify quality issues.
            "content": f"Answer the user's question based on the following retrieved context: {retrieved_docs_text}.  Do not mention the fact that provided context exists in your answer.  If the context is not relevant to the question, generate the best response you can.",
        },
        {
            "role": "user",
            "content": query,
        },
    ]

    # 3. Call LLM to generate the response
    return client.chat.completions.create(
        # This example uses Databricks hosted Claude.  If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        model="databricks-claude-3-7-sonnet",
        messages=messages_for_llm,
    )
result = sample_app("what is select in sql?")
print(result)
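The sample app returns the whole chat completion object rather than a plain string. When only the answer text is needed, for example inside a custom scorer, it can be read from `choices[0].message.content`. The helper name `extract_answer_text` is hypothetical, and the sketch uses a stand-in object with the same attribute layout so that it runs without calling an LLM.

```python
from types import SimpleNamespace


def extract_answer_text(completion) -> str:
    """Return the assistant text from a chat-completion-shaped object."""
    if not completion.choices:
        return ""
    content = completion.choices[0].message.content
    # content can be None (for example, on tool-call responses), so fall back to "".
    return content or ""


# Stand-in object with the same attribute layout, so no LLM call is needed here.
fake_completion = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="SELECT retrieves data."))]
)
print(extract_answer_text(fake_completion))
```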

Step 2: Create a sample evaluation dataset

Note

expected_facts is required only if you use predefined scorers that need ground truth.

eval_dataset = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."],
        },
    },
    {
        "inputs": {"query": "How do I use MLflow?"},
        "expectations": {
            "expected_facts": [
                "MLflow is a tool for managing and tracking machine learning experiments."
            ],
        },
    },
]
print(eval_dataset)
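Before running scorers that need ground truth, it can help to check which rows actually carry `expected_facts`. The helper below is a minimal sketch with a hypothetical name (`rows_missing_expected_facts`). It assumes the same row layout as the dataset above and is not an MLflow API.

```python
from typing import Dict, List


def rows_missing_expected_facts(dataset: List[Dict]) -> List[int]:
    """Return indexes of rows without a non-empty expectations.expected_facts list."""
    missing = []
    for index, row in enumerate(dataset):
        facts = row.get("expectations", {}).get("expected_facts")
        if not facts:
            missing.append(index)
    return missing


sample_rows = [
    {
        "inputs": {"query": "What is the most common aggregate function in SQL?"},
        "expectations": {
            "expected_facts": ["Most common aggregate function in SQL is COUNT()."]
        },
    },
    {"inputs": {"query": "How do I use MLflow?"}},  # no ground truth on this row
]
print(rows_missing_expected_facts(sample_rows))
```

Rows reported by this check can still be evaluated with the scorers that do not need ground truth.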

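Step 3: Run evaluation with predefined scorers

Now run the evaluation with the predefined scorers, using the sample application and the dataset defined above.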

from mlflow.genai.scorers import (
    Correctness,
    Guidelines,
    RelevanceToQuery,
    RetrievalGroundedness,
    RetrievalRelevance,
    RetrievalSufficiency,
    Safety,
)


# Run predefined scorers that require ground truth
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[
        Correctness(),
        # RelevanceToQuery(),
        # RetrievalGroundedness(),
        # RetrievalRelevance(),
        RetrievalSufficiency(),
        # Safety(),
    ],
)


# Run predefined scorers that do NOT require ground truth
mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[
        # Correctness(),
        RelevanceToQuery(),
        RetrievalGroundedness(),
        RetrievalRelevance(),
        # RetrievalSufficiency(),
        Safety(),
        Guidelines(name="does_not_mention", guidelines="The response must not mention the fact that provided context exists."),
    ],
)

[Image: evaluation traces]

[Image: evaluation UI]

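Available scorers

| Scorer | What does it evaluate? | Requires ground truth? | Learn more |
| --- | --- | --- | --- |
| `RelevanceToQuery` | Does the app's response directly address the user's input? | No | Answer & context relevance guide |
| `Safety` | Does the app's response avoid harmful or toxic content? | No | Safety guide |
| `RetrievalGroundedness` | Is the app's response grounded in the retrieved information? | No | Groundedness guide |
| `RetrievalRelevance` | Are the retrieved documents relevant to the user's request? | No | Answer & context relevance guide |
| `Correctness` | Is the app's response correct compared to the ground truth? | Yes | Correctness guide |
| `RetrievalSufficiency` | Do the retrieved documents contain all the necessary information? | Yes | Context sufficiency guide |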
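The ground-truth column of the table can be encoded as a small lookup for deciding which scorers are usable with a given dataset. This is only a sketch: the dictionary and the function name `usable_scorer_names` are hypothetical, and it returns scorer names as strings rather than MLflow scorer instances.

```python
from typing import Dict, List

# Ground-truth requirements copied from the table above.
REQUIRES_GROUND_TRUTH: Dict[str, bool] = {
    "RelevanceToQuery": False,
    "Safety": False,
    "RetrievalGroundedness": False,
    "RetrievalRelevance": False,
    "Correctness": True,
    "RetrievalSufficiency": True,
}


def usable_scorer_names(has_ground_truth: bool) -> List[str]:
    """List the scorer names that can run given whether ground truth is available."""
    return [
        name
        for name, needs_truth in REQUIRES_GROUND_TRUTH.items()
        if has_ground_truth or not needs_truth
    ]


print(usable_scorer_names(has_ground_truth=False))
```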

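Next steps

Continue with these recommended actions and tutorials.

Create custom scorers - Build code-based metrics for your specific needs
Create custom LLM scorers - Design sophisticated evaluation criteria using LLMs
Evaluate your app - See the predefined scorers in action with a complete example

Reference guides

Explore detailed documentation for the concepts and features covered in this guide.

Prebuilt judges and scorers reference - A comprehensive overview of all available judges
Scorers - Understand how scorers work and their role in evaluation
LLM judges - Learn about the underlying judge architecture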
