Custom metrics (legacy)

2025-06-11

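Important

This page describes using Agent Evaluation version 0.22 with MLflow 2.x. Databricks recommends using MLflow 3, which is integrated with Agent Evaluation >1.0. The Agent Evaluation SDK methods are now exposed through the mlflow SDK.

For more information on this topic, see "Create custom scorers".

This guide explains how to use custom metrics to evaluate AI applications within Mosaic AI Agent Framework. Custom metrics let you flexibly define evaluation metrics tailored to your specific business use case, whether based on simple heuristics, advanced logic, or programmatic evaluations.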

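Overview

Custom metrics are written in Python and give developers full control over how traces from an AI application are evaluated. The following metric types are supported:

Pass/fail metrics: "yes" or "no" string values, rendered as "Pass" or "Fail" in the UI.
Numeric metrics: Ordinal values, either integers or floats.
Boolean metrics: True or False.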
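
As a minimal sketch of the three metric types above, the following plain Python functions each return one of the supported kinds of value. The function names and thresholds are hypothetical. The @metric decorator (shown in the Usage section) is deliberately left off here so that the snippet runs without the databricks-agents package installed.

```python
# Hypothetical metric bodies, one per supported return type.
# In real use, each function would be decorated with @metric.

def has_greeting(response: str) -> str:
    # Pass/fail metric: "yes" renders as Pass and "no" as Fail in the UI.
    return "yes" if "good" in response.lower() else "no"

def word_count(response: str) -> int:
    # Numeric metric: an ordinal integer (a float works the same way).
    return len(response.split())

def is_short(response: str) -> bool:
    # Boolean metric: True or False.
    return len(response) <= 40

sample = "Good morning to you too!"
print(has_greeting(sample), word_count(sample), is_short(sample))
```

For the sample string, the three functions return "yes", 5, and True respectively.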

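Custom metrics can use:

Any field in the evaluation row.
The custom_expected field for additional expected values.
Complete access to the MLflow trace, including spans, attributes, and outputs.

Usage

Custom metrics are passed to the evaluation framework using the extra_metrics field of mlflow.evaluate(). For example: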

import mlflow
from databricks.agents.evals import metric

@metric
def not_empty(response):
    # "yes" for Pass and "no" for Fail.
    return "yes" if response.choices[0]['message']['content'].strip() != "" else "no"

@mlflow.trace(span_type="CHAT_MODEL")
def my_model(request):
    deploy_client = mlflow.deployments.get_deploy_client("databricks")
    return deploy_client.predict(
        endpoint="databricks-meta-llama-3-3-70b-instruct", inputs=request
    )

with mlflow.start_run(run_name="example_run"):
    eval_results = mlflow.evaluate(
        data=[{"request": "Good morning"}],
        model=my_model,
        model_type="databricks-agent",
        extra_metrics=[not_empty],
    )
    display(eval_results.tables["eval_results"])

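`@metric` decorator

The @metric decorator lets you define a custom evaluation metric that can be passed to mlflow.evaluate() using the extra_metrics argument. The evaluation harness calls the metric function with named arguments based on the following signature: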

def my_metric(
  *,  # eval harness will always call it with named arguments
  request: Dict[str, Any],  # The agent's raw input as a serializable object
  response: Optional[Dict[str, Any]],  # The agent's raw output; directly passed from the eval harness
  retrieved_context: Optional[List[Dict[str, str]]],  # Retrieved context, either from input eval data or extracted from the trace
  expected_response: Optional[str],  # The expected output as defined in the evaluation dataset
  expected_facts: Optional[List[str]],  # A list of expected facts that can be compared against the output
  guidelines: Optional[Union[List[str], Dict[str, List[str]]]],  # A list of guidelines, or a mapping from a guideline name to the list of guidelines for that name
  expected_retrieved_context: Optional[List[Dict[str, str]]],  # Expected context for retrieval tasks
  trace: Optional[mlflow.entities.Trace],  # The trace object containing spans and other metadata
  custom_expected: Optional[Dict[str, Any]],  # A user-defined dictionary of extra expected values
  tool_calls: Optional[List[ToolCallInvocation]],
) -> float | bool | str | Assessment

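Explanation of arguments

request: The input provided to the agent, formatted as an arbitrary serializable object. This represents the user query or prompt.
response: The raw output from the agent, formatted as an optional arbitrary serializable object. It contains the agent's generated response for evaluation.
retrieved_context: A list of dictionaries containing the context retrieved during the task. This context can come from the input evaluation dataset or from the trace. You can override or customize its extraction using the trace field.
expected_response: A string representing the correct or desired response for the task. It serves as the ground truth for comparison against the agent's response.
expected_facts: A list of facts expected to appear in the agent's response. Useful for fact-checking tasks.
guidelines: A list of guidelines, or a mapping from a guideline name to the list of guidelines for that name. Guidelines let you place constraints on any field, which the guideline adherence judge can then evaluate.
expected_retrieved_context: A list of dictionaries representing the expected retrieval context. This is essential for retrieval-augmented tasks where the correctness of the retrieved data matters.
trace: An optional MLflow Trace object containing spans, attributes, and other metadata about the agent's execution. This allows detailed inspection of the internal steps taken by the agent.
custom_expected: A dictionary for passing user-defined expected values. This field gives you the flexibility to include additional custom expectations that the standard fields do not cover.
tool_calls: A list of ToolCallInvocation objects that describe which tools were called and what they returned.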

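Return value

The return value of a custom metric is a per-row assessment. If you return a primitive, it is wrapped in an Assessment with an empty rationale.

float: For numeric metrics (for example, similarity scores or accuracy percentages).
bool: For binary metrics.
Assessment or list[Assessment]: A richer output type that supports adding a rationale. If you return a list of assessments, the same metric function can be reused to return multiple assessments.
- name: The name of the assessment.
- value: The value (a float, int, bool, or string).
- rationale: (Optional) A rationale explaining how this value was computed. This is useful for showing extra reasoning in the UI, for example the reasoning from an LLM that generated the assessment.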
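
The sketch below illustrates returning several assessments from a single metric function. To keep it runnable without MLflow installed, it defines a small stand-in dataclass named Assessment with only the name, value, and rationale fields listed above. This stand-in is an assumption for illustration; in a notebook you would instead import the real class with `from mlflow.evaluation import Assessment` and decorate the function with @metric. The metric name and the checks it performs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Assessment:
    # Stand-in (assumption) that mirrors the fields described above.
    name: str
    value: Union[float, int, bool, str]
    rationale: Optional[str] = None

def length_checks(response: str) -> List[Assessment]:
    # Hypothetical metric that reports two assessments for one row.
    n_chars = len(response.strip())
    return [
        Assessment(
            name="response_chars",
            value=n_chars,
            rationale=f"The stripped response has {n_chars} characters.",
        ),
        Assessment(
            name="response_not_empty",
            value=n_chars > 0,
            rationale="Checked that the stripped response is non-empty.",
        ),
    ]

for assessment in length_checks("Good night."):
    print(assessment.name, assessment.value, assessment.rationale)
```

Returning a list this way means one pass over the row can feed several named columns in the results, each with its own rationale.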

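Pass/fail metrics

Any string metric that returns "yes" or "no" is treated as a pass/fail metric and gets special treatment in the UI.

You can also create a pass/fail metric with the callable judge Python SDK. This gives you more control over which parts of the trace to evaluate and which expected fields to use. You can use any of the built-in Mosaic AI Agent Evaluation judges. See "Built-in AI judges (legacy)".

Ensure the retrieved context has no PII

This example calls the guideline_adherence judge to ensure that the retrieved context contains no PII.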

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "retrieved_context": [{
      "content": "The email address is noreply@databricks.com",
    }],
  }, {
    "request": "Good afternoon",
    "response": "This is actually the morning!",
    "retrieved_context": [{
      "content": "fake retrieved context",
    }],
  }
]

@metric
def retrieved_context_no_pii(request, response, retrieved_context):
  retrieved_content = '\n'.join([c['content'] for c in retrieved_context])
  return judges.guideline_adherence(
    request=request,
    # You can also pass in per-row guidelines by adding `guidelines` to the signature of your metric
    guidelines=[
      "The retrieved context must not contain personally identifiable information.",
    ],
    # `guidelines_context` requires `databricks-agents>=0.20.0`
    guidelines_context={"retrieved_context": retrieved_content},
  )

with mlflow.start_run(run_name="safety"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[retrieved_context_no_pii],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

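Numeric metrics

Numeric metrics evaluate ordinal values such as floats or integers. The UI shows numeric metrics per row, along with the average value for the evaluation run.

Example: Response similarity

This metric measures the similarity between response and expected_response using SequenceMatcher from the built-in Python library difflib.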

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from difflib import SequenceMatcher

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!",
    "expected_response": "Hello and good morning to you!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question.",
    "expected_response": "Good afternoon to you too!"
  }
]

@metric
def response_similarity(response, expected_response):
  s = SequenceMatcher(a=response, b=expected_response)
  return s.ratio()

with mlflow.start_run(run_name="response_similarity"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_similarity],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])
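
To get a feel for the scores this metric produces, the following standalone sketch calls SequenceMatcher directly, outside the evaluation harness. The helper name similarity is hypothetical; the first two inputs are boundary cases and the third is the first row of the evaluation set above.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings.
    return SequenceMatcher(a=a, b=b).ratio()

print(similarity("abc", "abc"))  # identical strings
print(similarity("abc", "xyz"))  # no characters in common
print(similarity("Good morning to you too!", "Hello and good morning to you!"))
```

Identical strings score 1.0 and strings with no characters in common score 0.0, so every row lands in that range. The comparison is case-sensitive, so lowercasing both strings before comparing them is one possible adjustment if case should not affect the score.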

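Boolean metrics

Boolean metrics evaluate to True or False. They are useful for binary decisions, such as checking whether a response meets a simple heuristic. If you want the metric to get special pass/fail treatment in the UI, see Pass/fail metrics.

Example: Ensure input requests are properly formatted

This metric checks whether the arbitrary input is formatted as expected and returns True if it is.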

import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": {"messages": [{"role": "user", "content": "Good morning"}]},
  }, {
    "request": {"inputs": ["Good afternoon"]},
  }, {
    "request": {"inputs": [1, 2, 3, 4]},
  }
]

@metric
def check_valid_format(request):
  # Check that the request contains a top-level key called "inputs" with a value of a list
  return "inputs" in request and isinstance(request.get("inputs"), list)

with mlflow.start_run(run_name="check_format"):
  eval_results = mlflow.evaluate(
      data=pd.DataFrame.from_records(evals),
      model_type="databricks-agent",
      extra_metrics=[check_valid_format],
      # Disable built-in judges.
      evaluator_config={
          'databricks-agent': {
              "metrics": [],
          }
      }
  )
eval_results.tables['eval_results']

Example: Language model self-reference

This metric checks whether the response mentions "LLM" and returns True if it does.

import mlflow
import pandas as pd
from databricks.agents.evals import metric

evals = [
  {
    "request": "Good morning",
    "response": "Good morning to you too!"
  }, {
    "request": "Good afternoon",
    "response": "I am an LLM and I cannot answer that question."
  }
]

@metric
def response_mentions_llm(response):
  return "LLM" in response

with mlflow.start_run(run_name="response_mentions_llm"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_mentions_llm],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

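Using `custom_expected`

The custom_expected field can be used to pass any other expected information to a custom metric.

Example: Response length bounded

This example shows how to require that the length of the response falls within the (min_length, max_length) bounds set for each example. Use custom_expected to store row-level information that is passed to custom metrics when creating an assessment.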

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

evals = [
  {
    "request": "Good morning",
    "response": "Good night.",
    "custom_expected": {
      "max_length": 100,
      "min_length": 3
    }
  }, {
    "request": "What is the date?",
    "response": "12/19/2024",
    "custom_expected": {
      "min_length": 10,
      "max_length": 20,
    }
  }
]

# The custom metric uses the "min_length" and "max_length" from the "custom_expected" field.
@metric
def response_len_bounds(
  request,
  response,
  # This is the custom_expected field from your eval dataframe.
  custom_expected
):
  return len(response) <= custom_expected["max_length"] and len(response) >= custom_expected["min_length"]

with mlflow.start_run(run_name="response_len_bounds"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model_type="databricks-agent",
        extra_metrics=[response_len_bounds],
        # Disable built-in judges.
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])
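
The bound check itself can be exercised on the two evaluation rows without the harness. The sketch below is an undecorated variant under the hypothetical name within_bounds, so that it runs without databricks-agents installed; it applies the same comparison as response_len_bounds, written as a chained comparison.

```python
def within_bounds(response: str, custom_expected: dict) -> bool:
    # Same check as response_len_bounds: min_length <= len(response) <= max_length.
    return custom_expected["min_length"] <= len(response) <= custom_expected["max_length"]

rows = [
    {"response": "Good night.", "custom_expected": {"min_length": 3, "max_length": 100}},
    {"response": "12/19/2024", "custom_expected": {"min_length": 10, "max_length": 20}},
]
results = [within_bounds(row["response"], row["custom_expected"]) for row in rows]
print(results)  # [True, True]
```

Both rows pass: the first response is 11 characters long, and the second is exactly 10 characters, which meets its inclusive lower bound.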

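Assertions over traces

Custom metrics can assess any part of an MLflow trace produced by the agent, including spans, attributes, and outputs.

Example: Request classification and routing

The following example builds an agent that determines whether the user query is a question or a statement and returns the result to the user in plain English. In a more realistic scenario, you might use this technique to route different queries to different functionality.

The evaluation set ensures that the query-type classifier produces the right results for a set of inputs by using custom metrics that inspect the MLflow trace.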

This example uses MLflow Trace.search_spans to find the span named classify_question_answer, which is the traced classification step defined for this agent.


import mlflow
import pandas as pd
from mlflow.types.llm import ChatCompletionResponse, ChatCompletionRequest
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace
from mlflow.deployments import get_deploy_client

# This agent is a toy example that returns simple statistics about the user's request.
# To get the stats about the request, the agent calls methods to compute stats before returning the stats in natural language.

deploy_client = get_deploy_client("databricks")
ENDPOINT_NAME="databricks-meta-llama-3-3-70b-instruct"

@mlflow.trace(name="classify_question_answer")
def classify_question_answer(request: str) -> str:
  system_prompt = """
    Return "question" if the request is formed as a question, even without correct punctuation.
    Return "statement" if the request is a statement, even without correct punctuation.
    Return "unknown" otherwise.

    Do not return a preamble, only return a single word.
  """
  request = {
    "messages": [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": request},
    ],
    "temperature": .01,
    "max_tokens": 1000
  }

  result = deploy_client.predict(endpoint=ENDPOINT_NAME, inputs=request)
  return result.choices[0]['message']['content']

@mlflow.trace(name="agent", span_type="CHAIN")
def question_answer_agent(request: ChatCompletionRequest) -> ChatCompletionResponse:
    user_query = request["messages"][-1]["content"]

    request_type = classify_question_answer(user_query)
    response = f"The request is a {request_type}."

    return {
        "messages": [
            *request["messages"][:-1], # Keep the chat history.
            {"role": "user", "content": response}
        ]
    }

# Define the evaluation set with a set of requests and the expected request types for those requests.
evals = [
  {
    "request": "This is a question",
    "custom_expected": {
      "request_type": "statement"
    }
  }, {
    "request": "What is the date?",
    "custom_expected": {
      "request_type": "question"
    }
  },
]

# The custom metric checks the expected request type against the actual request type produced by the agent trace.
@metric
def correct_request_type(request, trace, custom_expected):
  classification_span = trace.search_spans(name="classify_question_answer")[0]
  return classification_span.outputs == custom_expected['request_type']

with mlflow.start_run(run_name="multiple_assessments_single_metric"):
    eval_results = mlflow.evaluate(
        data=pd.DataFrame.from_records(evals),
        model=question_answer_agent,
        model_type="databricks-agent",
        extra_metrics=[correct_request_type],
        evaluator_config={
            'databricks-agent': {
                "metrics": [],
            }
        }
    )
    display(eval_results.tables['eval_results'])

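By leveraging these examples, you can design custom metrics to meet your unique evaluation needs.

Evaluating tool calls

Custom metrics are provided with tool_calls, a list of ToolCallInvocation objects that describe which tools were called and what they returned.

Example: Asserting the right tool is called

Note

This example is not copy-pastable because it does not define the LangGraph agent. See the attached notebook for the fully runnable example.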

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges

eval_data = pd.DataFrame(
  [
    {
      "request": "what is 3 * 12?",
      "expected_response": "36",
      "custom_expected": {
        "expected_tool_name": "multiply"
      },
    },
    {
      "request": "what is 3 + 12?",
      "expected_response": "15",
      "custom_expected": {
        "expected_tool_name": "add"
      },
    },
  ]
)

@metric
def is_correct_tool(tool_calls, custom_expected):
  # Metric to check whether the first tool call is the expected tool
  return tool_calls[0].tool_name == custom_expected["expected_tool_name"]

@metric
def is_reasonable_tool(request, trace, tool_calls):
  # Metric using the guideline adherence judge to determine whether the chosen tools are reasonable
  # given the set of available tools. Note that `guidelines_context` requires `databricks-agents >= 0.20.0`

  return judges.guideline_adherence(
    request=request["messages"][0]["content"],
    guidelines=[
      "The selected tool must be a reasonable tool call with respect to the request and available tools.",
    ],
    # `guidelines_context` requires `databricks-agents>=0.20.0`
    guidelines_context={
      "available_tools": str(tool_calls[0].available_tools),
      "chosen_tools": str([tool_call.tool_name for tool_call in tool_calls]),
    },
  )

results = mlflow.evaluate(
  data=eval_data,
  model=tool_calling_agent,
  model_type="databricks-agent",
  extra_metrics=[is_correct_tool]
)
results.tables["eval_results"].display()
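
One thing to watch in is_correct_tool is that it indexes tool_calls[0], which raises an error when the agent made no tool calls or when tool_calls is None. The following sketch shows a defensive variant under the hypothetical name first_tool_matches. It uses SimpleNamespace objects as stand-ins for ToolCallInvocation, modeling only the tool_name attribute used above. These stand-ins are an assumption made so that the snippet runs without databricks-agents installed.

```python
from types import SimpleNamespace

def first_tool_matches(tool_calls, custom_expected) -> bool:
    # Treat a missing or empty tool_calls list as a failed check
    # instead of raising IndexError or TypeError.
    if not tool_calls:
        return False
    return tool_calls[0].tool_name == custom_expected["expected_tool_name"]

# Stand-ins (assumption) for ToolCallInvocation; only tool_name is modeled.
calls = [SimpleNamespace(tool_name="multiply")]
expected = {"expected_tool_name": "multiply"}

print(first_tool_matches(calls, expected))  # True
print(first_tool_matches([], expected))     # False
print(first_tool_matches(None, expected))   # False
```

Whether a row with no tool calls should count as a failure or be skipped is a design choice; returning False here simply keeps the evaluation run from erroring on that row.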

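Developing custom metrics

As you develop metrics, you need to iterate quickly on the metric without running the agent every time you make a change. To make this simpler, use the following strategy:

Generate an answer sheet from the evaluation dataset and the agent. This runs the agent on each entry in the evaluation set, producing responses and traces that you can pass to the metric directly.
Define the metric.
Call the metric directly on each value in the answer sheet and iterate on the metric definition.
When the metric behaves as you expect, run mlflow.evaluate() on the same answer sheet to verify that the results from Agent Evaluation are what you expect. The code in this example does not use the model= field, so the evaluation uses the pre-computed responses.
When you are satisfied with the performance of the metric, enable the model= field in mlflow.evaluate() to call the agent interactively.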

import mlflow
import pandas as pd
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from mlflow.evaluation import Assessment
from mlflow.entities import Trace

evals = [
  {
    "request": "What is Databricks?",
    "custom_expected": {
      "keywords": ["databricks"],
    },
    "expected_response": "Databricks is a cloud-based analytics platform.",
    "expected_facts": ["Databricks is a cloud-based analytics platform."],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "When was Databricks founded?",
    "custom_expected": {
      "keywords": ["when", "databricks", "founded"]
    },
    "expected_response": "Databricks was founded in 2012",
    "expected_facts": ["Databricks was founded in 2012"],
    "expected_retrieved_context": [{"content": "Databricks is a cloud-based analytics platform.", "doc_uri": "https://databricks.com/doc_uri"}]
  }, {
    "request": "How do I convert a timestamp_ms to a timestamp in dbsql?",
    "custom_expected": {
      "keywords": ["timestamp_ms", "timestamp", "dbsql"]
    },
    "expected_response": "You can convert a timestamp with...",
    "expected_facts": ["You can convert a timestamp with..."],
    "expected_retrieved_context": [{"content": "You can convert a timestamp with...", "doc_uri": "https://databricks.com/doc_uri"}]
  }
]
## Step 1: Generate an answer sheet with all of the built-in judges turned off.
## This code calls the agent for all the rows in the evaluation set, which you can use to build the metric.
answer_sheet_df = mlflow.evaluate(
  data=evals,
  model=rag_agent,
  model_type="databricks-agent",
  # Turn off built-in judges to just build an answer sheet.
  evaluator_config={"databricks-agent": {"metrics": []}
  }
).tables['eval_results']
display(answer_sheet_df)

answer_sheet = answer_sheet_df.to_dict(orient='records')

## Step 2: Define the metric.
@metric
def custom_metric_consistency(
  request,
  response,
  retrieved_context,
  expected_response,
  expected_facts,
  expected_retrieved_context,
  trace,
  # This is the custom_expected field from your eval dataframe.
  custom_expected
):
  print(f"[custom_metric] request: {request}")
  print(f"[custom_metric] response: {response}")
  print(f"[custom_metric] retrieved_context: {retrieved_context}")
  print(f"[custom_metric] expected_response: {expected_response}")
  print(f"[custom_metric] expected_facts: {expected_facts}")
  print(f"[custom_metric] expected_retrieved_context: {expected_retrieved_context}")
  print(f"[custom_metric] trace: {trace}")

  return True

## Step 3: Call the metric directly before using the evaluation harness to iterate on the metric definition.
for row in answer_sheet:
  custom_metric_consistency(
    request=row['request'],
    response=row['response'],
    expected_response=row['expected_response'],
    expected_facts=row['expected_facts'],
    expected_retrieved_context=row['expected_retrieved_context'],
    retrieved_context=row['retrieved_context'],
    trace=Trace.from_json(row['trace']),
    custom_expected=row['custom_expected']
  )

## Step 4: After you are confident in the signature of the metric, you can run the harness with the answer sheet to trigger the output validation and make sure the UI reflects what you intended.
with mlflow.start_run(run_name="exact_expected_response"):
    eval_results = mlflow.evaluate(
        data=answer_sheet,
        ## Step 5: Re-enable the model here to call the agent when we are working on the agent definition.
        # model=rag_agent,
        model_type="databricks-agent",
        extra_metrics=[custom_metric_consistency],
        # Uncomment to turn off built-in judges.
        # evaluator_config={
        #     'databricks-agent': {
        #         "metrics": [],
        #     }
        # }
    )
    display(eval_results.tables['eval_results'])

Example notebook

The following example notebook illustrates some different ways to use custom metrics in Mosaic AI Agent Evaluation.

Agent Evaluation custom metrics example notebook
