Creating custom scorers

Overview

Custom scorers offer the ultimate flexibility to define precisely how your GenAI application's quality is measured. Custom scorers provide flexibility to define evaluation metrics tailored to your specific business use case, whether based on simple heuristics, advanced logic, or programmatic evaluations.

Use custom scorers for the following scenarios:

  1. Defining a custom heuristic or code-based evaluation metric
  2. Customizing how data from your app's trace is mapped to Databricks' research-backed LLM judges in the predefined LLM scorers
  3. Creating an LLM judge with custom prompt text, as described in the prompt-based LLM scorers article
  4. Using your own LLM model (rather than a Databricks-hosted LLM judge model) for evaluation
  5. Any other use case where you need more flexibility and control than the predefined abstractions provide

Note

Refer to the scorer concept page or to the API docs for a detailed reference on the custom scorer interfaces.

Usage overview

Custom scorers are written in Python and give you full control to evaluate any data from your app's traces. A single custom scorer works in both the evaluate(...) harness for offline evaluation and, when passed to create_monitor(...), in production monitoring.

The following output types are supported (a short illustrative sketch follows the list):

  • Pass/fail string: "yes" or "no" string values render as "Pass" or "Fail" in the UI.
  • Numeric value: Integers or floats (treated as ordinal values).
  • Boolean value: True or False.
  • Feedback object: Return a Feedback object with a score, rationale, and additional metadata.
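
As a quick illustration, the following minimal sketch defines one scorer per supported output type. The scorer names and thresholds are illustrative, not part of the MLflow API:

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def short_enough(outputs: str) -> str:
    # Pass/fail string: renders as "Pass" or "Fail" in the UI
    return "yes" if len(outputs) <= 500 else "no"

@scorer
def response_length(outputs: str) -> int:
    # Numeric value
    return len(outputs)

@scorer
def is_non_empty(outputs: str) -> bool:
    # Boolean value
    return bool(outputs)

@scorer
def non_empty_feedback(outputs: str) -> Feedback:
    # Feedback object with a score (value) and a rationale
    value = "yes" if outputs else "no"
    return Feedback(value=value, rationale="Checks that the response is not empty.")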

As input, custom scorers have access to:

  • The complete MLflow trace, including spans, attributes, and outputs. The trace is passed to the custom scorer as an instance of the mlflow.entities.Trace class.
  • The inputs dictionary, derived either from the input dataset or post-processed by MLflow from your trace.
  • The outputs value, derived from either the input dataset or the trace. If predict_fn is provided, the outputs value is the return value of predict_fn.
  • The expectations dictionary, derived from the expectations field in the input dataset or from assessments associated with the trace.

The @scorer decorator allows users to define custom evaluation metrics that can be passed to mlflow.genai.evaluate() using the scorers argument, or to create_monitor(...).

The scorer function is invoked with named arguments based on the signature below. All named arguments are optional, so you can use any combination. For example, you could define a scorer that only has inputs and trace as arguments and omits outputs and expectations:

import mlflow
from mlflow.genai.scorers import scorer
from typing import Optional, Any
from mlflow.entities import Feedback

@scorer
def my_custom_scorer(
  *,  # evaluate(...) harness will always call your scorer with named arguments
  inputs: Optional[dict[str, Any]],  # The agent's raw input, parsed from the Trace or dataset, as a Python dict
  outputs: Optional[Any],  # The agent's raw output, parsed from the Trace or dataset
  expectations: Optional[dict[str, Any]],  # The expectations passed to evaluate(data=...), as a Python dict
  trace: Optional[mlflow.entities.Trace],  # The app's resulting Trace containing spans and other metadata
) -> int | float | bool | str | Feedback | list[Feedback]:
  ...
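
As a concrete illustration of the pattern above, here is a minimal sketch of a scorer that declares only inputs and trace. The scorer name and logic are illustrative:

import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any, Optional

@scorer
def has_trace_and_inputs(inputs: Optional[dict[str, Any]], trace: Optional[Trace]) -> bool:
    # The harness only supplies the named arguments declared in the signature;
    # `outputs` and `expectations` are simply omitted here.
    return bool(inputs) and trace is not None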

Custom scorer development approach

As you develop metrics, you need to quickly iterate on the metric without having to execute your app every time you make a change to the scorer. To do this, we recommend the following steps:

Step 1: Define your initial metric, app, and evaluation data

import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any

@mlflow.trace
def my_app(input_field_name: str):
    return {'output': input_field_name+'_output'}

@scorer
def my_metric() -> int:
    # placeholder return value
    return 1

eval_set = [{'inputs': {'input_field_name': 'test'}}]

Step 2: Generate traces from your app using evaluate()

eval_results = mlflow.genai.evaluate(
    data=eval_set,
    predict_fn=my_app,
    scorers=[my_metric]
)

Step 3: Query and store the resulting traces

generated_traces = mlflow.search_traces(run_id=eval_results.run_id)

Step 4: Pass the resulting traces as input to evaluate() as you iterate on your metric

The search_traces function returns a Pandas DataFrame of traces, which you can pass directly to evaluate() as an input dataset. This allows you to quickly iterate on your metric without having to re-run your app.

@scorer
def my_metric(outputs: Any):
    # Implement the actual metric logic here.
    # `outputs` is whatever my_app returned, e.g. {'output': 'test_output'}.
    return outputs["output"] == "test_output"

# Note the lack of a predict_fn parameter
mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[my_metric],
)

Custom scorer examples

In this guide, we will show you various approaches to building custom scorers.

Custom Scorer Development

Prerequisite: create a sample application and get a local copy of the traces

All of the approaches below use the following sample application and a local copy of its traces (extracted using the approach described above).

import mlflow
from openai import OpenAI
from typing import Any
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer

# Enable auto logging for OpenAI
mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    # 1. Prepare messages for the LLM
    messages_for_llm = [
        {"role": "system", "content": "You are a helpful assistant."},
        *messages,
    ]

    # 2. Call LLM to generate a response
    response = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        messages=messages_for_llm,
    )
    return response.choices[0].message.content


# Create a list of messages for the LLM to generate a response
eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
]


@scorer
def dummy_metric():
    # This scorer is just to help generate initial traces.
    return 1


# Generate initial traces by running the sample_app.
# The results, including traces, are logged to the MLflow experiment defined above.
initial_eval_results = mlflow.genai.evaluate(
    data=eval_dataset, predict_fn=sample_app, scorers=[dummy_metric]
)

generated_traces = mlflow.search_traces(run_id=initial_eval_results.run_id)

After running the above code, you should have three traces in your experiment.

Generated Sample Traces
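
Optionally, you can sanity-check the local copy of the traces before moving on. Because search_traces returns a Pandas DataFrame (see Step 4 of the development approach above), standard DataFrame inspection applies; exact column names may vary by MLflow version:

# Quick sanity check of the local trace copy (assumes the Pandas DataFrame
# returned by mlflow.search_traces; column names may differ across versions).
print(len(generated_traces))           # expect 3 rows, one per eval_dataset record
print(list(generated_traces.columns))  # e.g., includes a 'trace' column with Trace objects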

Example 1: Accessing data from the trace

Access the full MLflow Trace object to use various details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculation.

Note

The generated_traces from the prerequisite section will be used as input data for these examples.

This scorer checks if the total execution time of the trace is within an acceptable range.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Search particular span type from the trace
    llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]

    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9  # convert nanoseconds to seconds
    max_duration = 5.0
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)

Example 2: Wrapping a predefined LLM judge

Create a custom scorer that wraps MLflow's predefined LLM judges. Use this to pre-process trace data for the judge or post-process its feedback.

This example demonstrates how to wrap the is_context_relevant judge, which evaluates whether the given context is relevant to the query, in order to evaluate whether the assistant's response is relevant to the user's query.

import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
    # The `inputs` field for `sample_app` is a dictionary like: {"messages": [{"role": ..., "content": ...}, ...]}
    # We need to extract the content of the last user message to pass to the relevance judge.

    last_user_message_content = None
    if "messages" in inputs and isinstance(inputs["messages"], list):
        for message in reversed(inputs["messages"]):
            if message.get("role") == "user" and "content" in message:
                last_user_message_content = message["content"]
                break

    if not last_user_message_content:
        raise Exception("Could not extract the last user message from inputs to evaluate relevance.")

    # Call the `is_context_relevant` judge. It will return a Feedback object.
    return is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )

# Evaluate the custom relevance scorer
custom_relevance_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[is_message_relevant]
)

Example 3: Using expectations

When mlflow.genai.evaluate() is called with a data argument that is a list of dictionaries or a Pandas DataFrame, each row can contain an expectations key. The value associated with this key is passed directly to your custom scorer.

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
from typing import Any, List, Optional, Union

expectations_eval_dataset_list = [
    {
        "inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
        "expectations": {
            "expected_response": "2+2 equals 4.",
            "expected_keywords": ["4", "four", "equals"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
        "expectations": {
            "expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
            "expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
        "expectations": {
            "expected_response": "Hello there!",
            # No keywords needed for this one, but the field can be omitted or empty
        }
    }
]

Example 3.1: Exact Match with Expected Response

This scorer checks if the assistant's response exactly matches the expected_response provided in the expectations.

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Scorer can return primitive value like bool, int, float, str, etc.
    return outputs == expectations["expected_response"]

exact_match_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[exact_match]
)

Example 3.2: Keyword Presence Check from Expectations

This scorer checks if all expected_keywords from the expectations are present in the assistant's response.

@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    if expected_keywords is None:
        return Feedback(
            value=None,  # Undetermined, as no keywords were expected
            rationale="No 'expected_keywords' provided in expectations."
        )

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)

Example 4: Returning multiple feedback objects

A single scorer can return a list of Feedback objects, allowing one scorer to assess multiple quality facets (e.g., PII, sentiment, conciseness) simultaneously. Each Feedback object should ideally have a unique name (which becomes the metric name in the results); otherwise, they might overwrite each other if names are auto-generated and collide. If a name is not provided, MLflow will attempt to generate one based on the scorer function name and an index.

This example demonstrates a scorer that returns two distinct pieces of feedback for each trace:

  1. is_not_empty_check: A boolean indicating if the response content is non-empty.
  2. response_char_length: A numeric value for the character length of the response.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional

# Assume `generated_traces` is available from the prerequisite code block.

@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
    feedbacks = []
    # 1. Check if the response is not empty
    feedbacks.append(
        Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
    )
    # 2. Calculate response character length
    char_length = len(outputs)
    feedbacks.append(Feedback(name="response_char_length", value=char_length))
    return feedbacks

multi_feedback_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[comprehensive_response_checker]
)

The results will include is_not_empty_check and response_char_length as two separate assessment columns.

Multi-feedback results
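
If you prefer to inspect these assessments programmatically rather than in the UI, the following hedged sketch retrieves the evaluated traces and prints the attached assessments. It assumes the search_traces DataFrame exposes a 'trace' column of Trace objects and that assessments are available on trace.info; details may vary by MLflow version:

# Hedged sketch: read back the assessments produced by the scorer above.
# Assumes a 'trace' column of Trace objects and `trace.info.assessments`;
# both may differ depending on your MLflow version.
result_traces = mlflow.search_traces(run_id=multi_feedback_eval_results.run_id)
for trace in result_traces["trace"]:
    for assessment in trace.info.assessments:
        print(assessment.name, assessment.value)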

Example 5: Using your own LLM for a judge

Integrate a custom or externally hosted LLM within a scorer. The scorer handles API calls, input/output formatting, and generates Feedback from your LLM's response, giving full control over the judging process.

You can also set the source field in the Feedback object to indicate the source of the assessment is an LLM judge.

import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional


# Assume `generated_traces` is available from the prerequisite code block.
# Assume `client` (OpenAI SDK client configured for Databricks) is available from the prerequisite block.
# client = OpenAI(...)

# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.

Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.

Original User Query:
```{user_query}```

AI's Response:
```{llm_response_from_app}```

Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""

@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
    user_query = inputs["messages"][-1]["content"]

    # Call the Judge LLM using the OpenAI SDK client.
    judge_llm_response_obj = client.chat.completions.create(
        model="databricks-claude-3-7-sonnet",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
        messages=[
            {"role": "system", "content": judge_system_prompt},
            {"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
        ],
        max_tokens=200,  # Max tokens for the judge's rationale
        temperature=0.0, # For more deterministic judging
    )
    judge_llm_output_text = judge_llm_response_obj.choices[0].message.content

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(judge_llm_output_text)
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="claude-3-7-sonnet",
        )
    )


# Evaluate the scorer using the pre-generated traces.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[answer_quality]
)

By opening the trace in the UI and clicking on the "answer_quality" assessment, you can see the metadata for the judge, such as the rationale, timestamp, and judge model name. If the judge assessment is not correct, you can override the score by clicking the Edit button.

The new assessment supersedes the original judge assessment, but the edit history is preserved for future reference.

Edit LLM Judge Assessment

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for concepts and features mentioned in this guide.

  • Scorers - Deep dive into how scorers work and their architecture
  • Evaluation Harness - Understand how mlflow.genai.evaluate() uses your scorers
  • LLM judges - Learn the foundation for AI-powered evaluation