Overview
Custom scorers offer the ultimate flexibility to define precisely how your GenAI application's quality is measured. They let you define evaluation metrics tailored to your specific business use case, whether based on simple heuristics, advanced logic, or programmatic evaluations.
Use custom scorers for the following scenarios:
- Defining a custom heuristic or code-based evaluation metric
- Customizing how the data from your app's trace is mapped to Databricks' research-backed LLM judges in the predefined LLM scorers
- Creating an LLM judge with custom prompt text, as described in the prompt-based LLM scorers article
- Using your own LLM model (rather than a Databricks-hosted LLM judge model) for evaluation
- Any other use cases where you need more flexibility and control than provided by the predefined abstractions
Note
Refer to the scorer concept page or to the API docs for a detailed reference on the custom scorer interfaces.
Usage overview
Custom scorers are written in Python and give you full control to evaluate any data from your app's traces. A single custom scorer works both in the evaluate(...) harness for offline evaluation and, when passed to create_monitor(...), for production monitoring.
The following output types are supported (a short sketch illustrating each follows the list):
- Pass/fail string: "yes" or "no" string values render as "Pass" or "Fail" in the UI.
- Numeric value: ordinal values such as integers or floats.
- Boolean value: True or False.
- Feedback object: Return a Feedback object with a score, rationale, and additional metadata.
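As a quick illustration, here is a minimal sketch of scorers covering each return type. The scorer names and thresholds below are illustrative placeholders, not part of the MLflow API.
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

# Pass/fail string: rendered as "Pass" or "Fail" in the UI.
@scorer
def response_is_non_empty(outputs) -> str:
    return "yes" if outputs else "no"

# Numeric value: any integer or float, e.g., a simple length metric.
@scorer
def response_char_count(outputs) -> int:
    return len(str(outputs))

# Boolean value.
@scorer
def mentions_refund_policy(outputs) -> bool:
    return "refund" in str(outputs).lower()

# Feedback object: carries a value plus a rationale (and optional metadata).
@scorer
def response_is_concise(outputs) -> Feedback:
    word_count = len(str(outputs).split())
    return Feedback(
        value="yes" if word_count <= 150 else "no",
        rationale=f"Response contains {word_count} words.",
    )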
As input, custom scorers have access to:
- The complete MLflow trace, including spans, attributes, and outputs. The trace is passed to the custom scorer as an instantiated mlflow.entities.Trace object.
- The inputs dictionary, derived from either the input dataset or post-processed by MLflow from your trace.
- The outputs value, derived from either the input dataset or the trace. If predict_fn is provided, the outputs value is the return value of predict_fn.
- The expectations dictionary, derived from the expectations field in the input dataset, or from assessments associated with the trace.
The @scorer decorator allows users to define custom evaluation metrics that can be passed into mlflow.genai.evaluate() using the scorers argument, or to create_monitor(...).
The scorer function is invoked with named arguments based on the signature below. All named arguments are optional, so you can use any combination. For example, you could define a scorer that only has inputs and trace as arguments and omits outputs and expectations:
import mlflow
from mlflow.genai.scorers import scorer
from typing import Optional, Any
from mlflow.entities import Feedback

@scorer
def my_custom_scorer(
    *,  # evaluate(...) harness will always call your scorer with named arguments
    inputs: Optional[dict[str, Any]],        # The agent's raw input, parsed from the Trace or dataset, as a Python dict
    outputs: Optional[Any],                  # The agent's raw output, parsed from the Trace or returned by predict_fn
    expectations: Optional[dict[str, Any]],  # The expectations passed to evaluate(data=...), as a Python dict
    trace: Optional[mlflow.entities.Trace],  # The app's resulting Trace containing spans and other metadata
) -> int | float | bool | str | Feedback | list[Feedback]:
    ...
Custom scorer development approach
As you develop metrics, you need to quickly iterate on the metric without having to execute your app every time you make a change to the scorer. To do this, we recommend the following steps:
Step 1: Define your initial metric, app, and evaluation data
import mlflow
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
from typing import Any
@mlflow.trace
def my_app(input_field_name: str):
return {'output': input_field_name+'_output'}
@scorer
def my_metric() -> int:
# placeholder return value
return 1
eval_set = [{'inputs': {'input_field_name': 'test'}}]
Step 2: Generate traces from your app using evaluate()
eval_results = mlflow.genai.evaluate(
    data=eval_set,
    predict_fn=my_app,
    scorers=[my_metric]
)
Step 3: Query and store the resulting traces
generated_traces = mlflow.search_traces(run_id=eval_results.run_id)
Step 4: Pass the resulting traces as input to evaluate() as you iterate on your metric
The search_traces
function returns a Pandas DataFrame of traces, which you can pass directly to evaluate()
as an input dataset. This allows you to quickly iterate on your metric without having to re-run your app.
@scorer
def my_metric(outputs: Any):
    # Implement the actual metric logic here.
    # my_app returns {'output': ...}, so compare the value of the 'output' key.
    return outputs["output"] == "test_output"

# Note the lack of a predict_fn parameter
mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[my_metric],
)
Custom scorer examples
In this guide, we will show you various approaches to building custom scorers.
Prerequisite: create a sample application and get a local copy of the traces
All of the approaches use the sample application below and a copy of the traces (extracted using the approach described above).
import mlflow
from openai import OpenAI
from typing import Any
from mlflow.entities import Trace
from mlflow.genai.scorers import scorer
# Enable auto logging for OpenAI
mlflow.openai.autolog()
# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
api_key=mlflow_creds.token,
base_url=f"{mlflow_creds.host}/serving-endpoints"
)
@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
# 1. Prepare messages for the LLM
messages_for_llm = [
{"role": "system", "content": "You are a helpful assistant."},
*messages,
]
# 2. Call LLM to generate a response
response = client.chat.completions.create(
model="databricks-claude-3-7-sonnet", # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
messages=messages_for_llm,
)
return response.choices[0].message.content
# Create a list of messages for the LLM to generate a response
eval_dataset = [
{
"inputs": {
"messages": [
{"role": "user", "content": "How much does a microwave cost?"},
]
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "Can I return the microwave I bought 2 months ago?",
},
]
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "I'm having trouble with my account. I can't log in.",
},
{
"role": "assistant",
"content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
},
{"role": "user", "content": "Website"},
]
},
},
]
@scorer
def dummy_metric():
# This scorer is just to help generate initial traces.
return 1
# Generate initial traces by running the sample_app.
# The results, including traces, are logged to the MLflow experiment defined above.
initial_eval_results = mlflow.genai.evaluate(
data=eval_dataset, predict_fn=sample_app, scorers=[dummy_metric]
)
generated_traces = mlflow.search_traces(run_id=initial_eval_results.run_id)
After running the above code, you should have three traces in your experiment.
Example 1: Accessing data from the trace
Access the full MLflow Trace object to use various details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculation.
Note
The generated_traces
from the prerequisite section will be used as input data for these examples.
This scorer checks if the total execution time of the trace is within an acceptable range.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType
@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
# Search particular span type from the trace
llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]
response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9 # second
max_duration = 5.0
if response_time <= max_duration:
return Feedback(
value="yes",
rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
)
else:
return Feedback(
value="no",
rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
)
# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[llm_response_time_good]
)
Example 2: Wrapping a predefined LLM judge
Create a custom scorer that wraps MLflow's predefined LLM judges. Use this to pre-process trace data for the judge or post-process its feedback.
This example demonstrates how to wrap the is_context_relevant judge, which evaluates whether the given context is relevant to the query, to assess whether the assistant's response is relevant to the user's query.
import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any
# Assume `generated_traces` is available from the prerequisite code block.
@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
# The `inputs` field for `sample_app` is a dictionary like: {"messages": [{"role": ..., "content": ...}, ...]}
# We need to extract the content of the last user message to pass to the relevance judge.
last_user_message_content = None
if "messages" in inputs and isinstance(inputs["messages"], list):
for message in reversed(inputs["messages"]):
if message.get("role") == "user" and "content" in message:
last_user_message_content = message["content"]
break
if not last_user_message_content:
raise Exception("Could not extract the last user message from inputs to evaluate relevance.")
    # Call the `is_context_relevant` judge. It will return a Feedback object.
return is_context_relevant(
request=last_user_message_content,
context={"response": outputs},
)
# Evaluate the custom relevance scorer
custom_relevance_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[is_message_relevant]
)
Example 3: Using expectations
When mlflow.genai.evaluate()
is called with a data
argument that is a list of dictionaries or a Pandas DataFrame, each row can contain an expectations
key. The value associated with this key is passed directly to your custom scorer.
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer
from typing import Any, List, Optional, Union
expectations_eval_dataset_list = [
{
"inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
"expectations": {
"expected_response": "2+2 equals 4.",
"expected_keywords": ["4", "four", "equals"],
}
},
{
"inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
"expectations": {
"expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
"expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
}
},
{
"inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
"expectations": {
"expected_response": "Hello there!",
# No keywords needed for this one, but the field can be omitted or empty
}
}
]
Example 3.1: Exact Match with Expected Response
This scorer checks if the assistant's response exactly matches the expected_response provided in the expectations.
@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
# Scorer can return primitive value like bool, int, float, str, etc.
return outputs == expectations["expected_response"]
exact_match_eval_results = mlflow.genai.evaluate(
data=expectations_eval_dataset_list,
predict_fn=sample_app, # sample_app is from the prerequisite section
scorers=[exact_match]
)
Example 3.2: Keyword Presence Check from Expectations
This scorer checks if all expected_keywords
from the expectations
are present in the assistant's response.
@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    if expected_keywords is None:
        return Feedback(
            value=None,  # Undetermined, as no keywords were expected
            rationale="No 'expected_keywords' provided in expectations."
        )

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app,  # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)
Example 4: Returning multiple feedback objects
A single scorer can return a list of Feedback
objects, allowing one scorer to assess multiple quality facets (e.g., PII, sentiment, conciseness) simultaneously. Each Feedback
object should ideally have a unique name
(which becomes the metric name in the results); otherwise, they might overwrite each other if names are auto-generated and collide. If a name is not provided, MLflow will attempt to generate one based on the scorer function name and an index.
This example demonstrates a scorer that returns two distinct pieces of feedback for each trace:
- is_not_empty_check: A boolean indicating if the response content is non-empty.
- response_char_length: A numeric value for the character length of the response.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional
# Assume `generated_traces` is available from the prerequisite code block.
@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
feedbacks = []
# 1. Check if the response is not empty
feedbacks.append(
Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
)
# 2. Calculate response character length
char_length = len(outputs)
feedbacks.append(Feedback(name="response_char_length", value=char_length))
return feedbacks
multi_feedback_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[comprehensive_response_checker]
)
The result will have two assessment columns: is_not_empty_check and response_char_length.
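To spot-check these assessments outside the UI, you can query the traces produced by the evaluation run; the sketch below assumes the logged assessments surface alongside the trace data in the DataFrame returned by mlflow.search_traces.
# Fetch the traces (and their attached assessments) from the evaluation run above.
results_df = mlflow.search_traces(run_id=multi_feedback_eval_results.run_id)

# Inspect the available columns to locate the assessment results.
print(results_df.columns.tolist())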
Example 5: Using your own LLM for a judge
Integrate a custom or externally hosted LLM within a scorer. The scorer handles API calls, input/output formatting, and generates Feedback
from your LLM's response, giving full control over the judging process.
You can also set the source field in the Feedback object to indicate that the source of the assessment is an LLM judge.
import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional
# Assume `generated_traces` is available from the prerequisite code block.
# Assume `client` (OpenAI SDK client configured for Databricks) is available from the prerequisite block.
# client = OpenAI(...)
# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.
Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.
Original User Query:
```{user_query}```
AI's Response:
```{llm_response_from_app}```
Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""
@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
user_query = inputs["messages"][-1]["content"]
# Call the Judge LLM using the OpenAI SDK client.
judge_llm_response_obj = client.chat.completions.create(
model="databricks-claude-3-7-sonnet", # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
messages=[
{"role": "system", "content": judge_system_prompt},
{"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
],
max_tokens=200, # Max tokens for the judge's rationale
temperature=0.0, # For more deterministic judging
)
judge_llm_output_text = judge_llm_response_obj.choices[0].message.content
# Parse the Judge LLM's JSON output.
judge_eval_json = json.loads(judge_llm_output_text)
parsed_score = int(judge_eval_json["score"])
parsed_rationale = judge_eval_json["rationale"]
return Feedback(
value=parsed_score,
rationale=parsed_rationale,
# Set the source of the assessment to indicate the LLM judge used to generate the feedback
source=AssessmentSource(
source_type=AssessmentSourceType.LLM_JUDGE,
source_id="claude-3-7-sonnet",
)
)
# Evaluate the scorer using the pre-generated traces.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[answer_quality]
)
By opening the trace in the UI and clicking on the "answer_quality" assessment, you can see the metadata for the judge, such as the rationale, timestamp, and judge model name. If the judge assessment is not correct, you can override the score by clicking the Edit button.
The new assessment supersedes the original judge assessment, but the edit history is preserved for future reference.
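If you prefer to record a correction programmatically instead of through the UI, a minimal sketch is shown below. It assumes the mlflow.log_feedback assessment API is available in your MLflow version; the trace ID and reviewer ID are placeholders.
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach a human assessment to the same trace. Replace the placeholder
# trace_id and source_id with real values from your experiment.
mlflow.log_feedback(
    trace_id="<trace_id_from_the_evaluation_run>",
    name="answer_quality",
    value=2,
    rationale="The response misses key details requested by the user.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@example.com",
    ),
)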
Next steps
Continue your journey with these recommended actions and tutorials.
- Evaluate with custom LLM scorers - Create semantic evaluation using LLMs
- Run scorers in production - Deploy your scorers for continuous monitoring
- Build evaluation datasets - Create test data for your scorers
Reference guides
Explore detailed documentation for concepts and features mentioned in this guide.
- Scorers - Deep dive into how scorers work and their architecture
- Evaluation Harness - Understand how mlflow.genai.evaluate() uses your scorers
- LLM judges - Learn the foundation for AI-powered evaluation