Evaluation Harness

The mlflow.genai.evaluate() function systematically tests the quality of your GenAI app by running the app against test data (evaluation datasets) and applying scorers.

Quick reference

| Parameter | Type | Description |
| --- | --- | --- |
| data | MLflow EvaluationDataset, List[Dict], Pandas DataFrame, or Spark DataFrame | Test data |
| predict_fn | Callable | Your app (Mode 1 only) |
| scorers | List[Scorer] | Quality metrics |
| model_id | str | Optional version tracking |

How it works

  1. Runs your app on test inputs, capturing traces
  2. Applies scorers to assess quality, creating Feedback
  3. Stores results in an Evaluation Run

Prerequisites

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0" openai "databricks-connect>=16.1"
    
  2. Create an MLflow experiment by following the Set up your environment quickstart.
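
    A minimal setup sketch for step 2 (assumes you are authenticated to a Databricks workspace; the experiment path is a placeholder):

    import mlflow

    # Point MLflow at your Databricks workspace.
    mlflow.set_tracking_uri("databricks")

    # set_experiment creates the experiment if it does not exist yet.
    # Replace the path with any workspace path you can write to.
    mlflow.set_experiment("/Users/<your-username>/genai-eval-quickstart")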

Two evaluation modes

Mode 1: Direct evaluation (recommended)

MLflow calls your GenAI app directly to generate and evaluate traces. You can either pass your application's entry point wrapped in a Python function (predict_fn) or, if your app is deployed as a Databricks Model Serving endpoint, pass that endpoint wrapped in to_predict_fn.

Benefits:

  • Allows scorers to be easily reused between offline evaluation and production monitoring
  • Automatic parallelization of your app's execution for faster evaluation

By calling your app directly, this mode enables you to reuse the scorers defined for offline evaluation in production monitoring since the resulting traces will be identical.

How evaluate works with tracing

Step 1: Run evaluation

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your GenAI app with MLflow tracing
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
    # Your app logic here
    if "MLflow" in question:
        response = "MLflow is an open-source platform for managing ML and GenAI workflows."
    else:
        response = "I can help you with MLflow questions."

    return {"response": response}

# Evaluate your app
results = mlflow.genai.evaluate(
    data=[
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "How do I get started?"}}
    ],
    predict_fn=my_chatbot_app,
    scorers=[RelevanceToQuery(), Safety()]
)

Step 2: View results in the UI

Evaluation results

Mode 2: Answer sheet evaluation

Provide pre-computed outputs or existing traces for evaluation when you can't run your GenAI app directly.

Use cases:

  • Testing outputs from external systems
  • Evaluating historical traces
  • Comparing outputs across different platforms

Warning

If you use an answer sheet whose traces differ from those produced in your production environment, you may need to rewrite your scorer functions before reusing them for production monitoring.
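
For example, a custom scorer that relies only on the schema-level inputs and outputs fields (rather than trace internals such as span names or attributes) can be reused unchanged between answer sheets and production traces. This is a minimal sketch using the scorer decorator from mlflow.genai.scorers; the length check is purely illustrative:

import mlflow
from mlflow.genai.scorers import scorer

@scorer
def concise_response(inputs: dict, outputs: dict) -> bool:
    # Reads only the standard outputs field, so the same scorer works for
    # answer-sheet rows today and for live production traces later.
    return len(outputs.get("response", "")) < 500

results = mlflow.genai.evaluate(
    data=[
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": {"response": "MLflow is an open-source platform for the ML lifecycle."},
        }
    ],
    scorers=[concise_response],
)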

How evaluate works with answer sheet

Example (with inputs/outputs):

Step 1: Run evaluation

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Pre-computed results from your GenAI app
results_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open-source platform for managing machine learning workflows, including tracking experiments, packaging code, and deploying models."},
    },
    {
        "inputs": {"question": "How do I get started?"},
        "outputs": {"response": "To get started with MLflow, install it using 'pip install mlflow' and then run 'mlflow ui' to launch the web interface."},
    }
]

# Evaluate pre-computed outputs
evaluation = mlflow.genai.evaluate(
    data=results_data,
    scorers=[Safety(), RelevanceToQuery()]
)

Step 2: View results in the UI

Evaluation results

Example with existing traces:

import mlflow

# Retrieve traces from production
traces = mlflow.search_traces(
    filter_string="trace.status = 'OK'",
)

# Evaluate the retrieved production traces
evaluation = mlflow.genai.evaluate(
    data=traces,
    scorers=[Safety(), RelevanceToQuery()]
)

Key parameters

def mlflow.genai.evaluate(
    data: Union[pd.DataFrame, List[Dict], mlflow.genai.datasets.EvaluationDataset],
    scorers: list[mlflow.genai.scorers.Scorer],
    predict_fn: Optional[Callable[..., Any]] = None,
    model_id: Optional[str] = None,
) -> mlflow.models.evaluation.base.EvaluationResult:

data

Your evaluation dataset in one of these formats:

  • EvaluationDataset (recommended)
  • List of dictionaries, Pandas DataFrame, or Spark DataFrame

If the data argument is provided as a DataFrame or list of dictionaries, it must conform to the schema below, which is the same schema used by EvaluationDataset. We recommend using an EvaluationDataset because it enforces schema validation and tracks the lineage of each record. An example record for each mode follows the table.

| Field | Data type | Description | Required if passing your app via predict_fn (Mode 1)? | Required if providing an answer sheet (Mode 2)? |
| --- | --- | --- | --- | --- |
| inputs | dict[Any, Any] | A dict passed to your predict_fn as **kwargs. Must be JSON serializable; each key must correspond to a named argument of predict_fn. | Required | Either inputs + outputs or trace is required; cannot pass both. Derived from trace if not provided. |
| outputs | dict[Any, Any] | A dict with your GenAI app's outputs for the corresponding input. Must be JSON serializable. | Must NOT be provided; generated by MLflow from the trace. | Either inputs + outputs or trace is required; cannot pass both. Derived from trace if not provided. |
| expectations | dict[str, Any] | A dict with ground-truth labels for the input, used by scorers to check quality. Must be JSON serializable and each key must be a str. | Optional | Optional |
| trace | mlflow.entities.Trace | The trace object for the request. If a trace is provided, expectations can be attached to it as Assessments instead of being passed as a separate column. | Must NOT be provided; generated by MLflow. | Either inputs + outputs or trace is required; cannot pass both. |
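
For example, here is one record for each mode that follows this schema (values are illustrative):

# Mode 1 record: provide only inputs (plus optional expectations); MLflow
# generates outputs and the trace by calling predict_fn.
mode_1_record = {
    "inputs": {"question": "What is MLflow?"},
    "expectations": {"expected_facts": ["open-source platform"]},
}

# Mode 2 record: provide pre-computed inputs + outputs (or a trace instead,
# but never both).
mode_2_record = {
    "inputs": {"question": "What is MLflow?"},
    "outputs": {"response": "MLflow is an open-source platform for the ML lifecycle."},
    "expectations": {"expected_facts": ["open-source platform"]},
}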

predict_fn

Your GenAI app's entry point (Mode 1 only). Must:

  • Accept the keys from the inputs dictionary in data as keyword arguments
  • Return a JSON-serializable dictionary
  • Be instrumented with MLflow Tracing
  • Emit exactly one trace per call

scorers

List of quality metrics to apply. You can provide built-in scorers (such as Safety, RelevanceToQuery, or Correctness) or custom scorers created with the scorer decorator.

See Scorers for more details.

model_id

Optional model identifier to link results to your app version (e.g., "models:/my-app/1"). See Version Tracking for more details.
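
For example (the model URI below is the placeholder from above, standing in for one of your registered app versions):

import mlflow
from mlflow.genai.scorers import Safety

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_chatbot_app,    # the traced app from the Mode 1 example above
    scorers=[Safety()],
    model_id="models:/my-app/1",  # links this evaluation run to that app version
)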

Data formats

For direct evaluation (Mode 1)

| Field | Required | Description |
| --- | --- | --- |
| inputs | Required | Dictionary passed to your predict_fn |
| expectations | Optional | Ground truth for scorers |

For answer sheet evaluation (Mode 2)

Option A - Provide inputs and outputs:

| Field | Required | Description |
| --- | --- | --- |
| inputs | Required | Original inputs to your GenAI app |
| outputs | Required | Pre-computed outputs from your app |
| expectations | Optional | Ground truth for scorers |

Option B - Provide existing traces:

| Field | Required | Description |
| --- | --- | --- |
| trace | Required | MLflow Trace objects with inputs and outputs |
| expectations | Optional | Ground truth for scorers |

Common data input patterns

Evaluate with an MLflow Evaluation Dataset (recommended)

MLflow Evaluation Datasets provide versioning, lineage tracking, and Unity Catalog integration for production-ready evaluation.

import mlflow
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent  # Your GenAI app with tracing

# Load versioned evaluation dataset
dataset = mlflow.genai.datasets.get_dataset("catalog.schema.eval_dataset_name")

# Run evaluation
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=agent,
    scorers=[Correctness(), Safety()],
)

Use for:

  • Evaluation data that needs version control and lineage tracking
  • Easily converting traces to evaluation records

See Build evaluation datasets to create datasets from traces or from scratch.

Evaluate with a list of dictionaries

Use a simple list of dictionaries for quick prototyping without creating a formal evaluation dataset.

import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
from my_app import agent  # Your GenAI app with tracing

# Define test data as a list of dictionaries
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_facts": ["open-source platform", "ML lifecycle management"]}
    },
    {
        "inputs": {"question": "How do I track experiments?"},
        "expectations": {"expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]}
    },
    {
        "inputs": {"question": "What are MLflow's main components?"},
        "expectations": {"expected_facts": ["Tracking", "Projects", "Models", "Registry"]}
    }
]

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=agent,
    scorers=[Correctness(), RelevanceToQuery()],
)

Use for:

  • Quick prototyping
  • Small datasets (< 100 examples)
  • Ad-hoc development testing

For production, convert to an MLflow Evaluation Dataset.
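
A minimal sketch of that conversion, assuming a Unity Catalog table name of your choosing and the create_dataset / merge_records helpers from mlflow.genai.datasets (it reuses agent and the scorers imported in the example above):

import mlflow.genai.datasets

# Placeholder three-level Unity Catalog name; you need permission to create
# tables in this schema.
dataset = mlflow.genai.datasets.create_dataset("catalog.schema.eval_dataset_name")

# Add the ad-hoc records defined above; the dataset enforces the evaluation
# schema and tracks lineage for each record.
dataset.merge_records(eval_data)

# Future runs can load the versioned dataset instead of the inline list.
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=agent,
    scorers=[Correctness(), RelevanceToQuery()],
)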

Evaluate with a Pandas DataFrame

Use Pandas DataFrames for evaluation when working with CSV files or existing data science workflows.

import mlflow
import pandas as pd
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent  # Your GenAI app with tracing

# Create evaluation data as a Pandas DataFrame
eval_df = pd.DataFrame([
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open-source platform for ML lifecycle management"}
    },
    {
        "inputs": {"question": "How do I log metrics?"},
        "expectations": {"expected_response": "Use mlflow.log_metric() to log metrics"}
    }
])

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Correctness(), Safety()],
)

Use for:

  • Quick prototyping
  • Small datasets (< 100 examples)
  • Ad-hoc development testing

Evaluate with a Spark DataFrame

Use Spark DataFrames for large-scale evaluations or when data is already in Delta Lake/Unity Catalog.

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery
from my_app import agent  # Your GenAI app with tracing

# Load evaluation data from a Delta table in Unity Catalog
eval_df = spark.table("catalog.schema.evaluation_data")

# Or load from any Spark-compatible source
# eval_df = spark.read.parquet("path/to/evaluation/data")

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Safety(), RelevanceToQuery()],
)

Use for:

  • Data that already exists in Delta Lake or Unity Catalog
  • Filtering the records of an MLflow Evaluation Dataset before running evaluation

Note: DataFrame must comply with the evaluation dataset schema.
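
For example, a sketch of narrowing a Delta-backed dataset before evaluation (the table name is a placeholder, and only schema columns are kept):

# Sample a subset and keep only the evaluation-schema columns.
eval_subset = (
    spark.table("catalog.schema.evaluation_data")
    .select("inputs", "expectations")
    .limit(500)
)

results = mlflow.genai.evaluate(
    data=eval_subset,
    predict_fn=agent,  # the traced app imported in the example above
    scorers=[Safety(), RelevanceToQuery()],
)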

Common predict_fn patterns

Call your app directly

Pass your app directly as predict_fn when parameter names match your evaluation dataset keys.

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your GenAI app that accepts 'question' as a parameter
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
    # Your app logic here
    response = f"I can help you with: {question}"
    return {"response": response}

# Evaluation data with 'question' key matching the function parameter
eval_data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "How do I track experiments?"}}
]

# Pass your app directly since parameter names match
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_chatbot_app,  # Direct reference, no wrapper needed
    scorers=[RelevanceToQuery(), Safety()]
)

Use for:

  • Apps whose parameter names match the keys in your evaluation dataset's inputs

Wrap your app in a callable

Wrap your app when it expects different parameter names or data structures than your evaluation dataset's inputs.

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your existing GenAI app with different parameter names
@mlflow.trace
def customer_support_bot(user_message: str, chat_history: list = None) -> dict:
    # Your app logic here
    context = f"History: {chat_history}" if chat_history else "New conversation"
    return {
        "bot_response": f"Helping with: {user_message}. {context}",
        "confidence": 0.95
    }

# Wrapper function to translate evaluation data to your app's interface
def evaluate_support_bot(question: str, history: str = None) -> dict:
    # Convert evaluation dataset format to your app's expected format
    chat_history = history.split("|") if history else []

    # Call your app with the translated parameters
    result = customer_support_bot(
        user_message=question,
        chat_history=chat_history
    )

    # Translate output to standard format if needed
    return {
        "response": result["bot_response"],
        "confidence_score": result["confidence"]
    }

# Evaluation data with different key names
eval_data = [
    {"inputs": {"question": "Reset password", "history": "logged in|forgot email"}},
    {"inputs": {"question": "Track my order"}}
]

# Use the wrapper function for evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=evaluate_support_bot,  # Wrapper handles translation
    scorers=[RelevanceToQuery(), Safety()]
)

Use for:

  • Parameter name mismatches between your app's parameters and evaluation dataset input keys (e.g., user_input vs question)
  • Data format conversions (string to list, JSON parsing)

Evaluate a deployed endpoint

For Databricks Agent Framework or Model Serving endpoints, use to_predict_fn to create a compatible predict function.

import mlflow
from mlflow.genai.scorers import RelevanceToQuery

# Create predict function for your endpoint
predict_fn = mlflow.genai.to_predict_fn("endpoints:/my-chatbot-endpoint")

# Evaluate
results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "How does MLflow work?"}}],
    predict_fn=predict_fn,
    scorers=[RelevanceToQuery()]
)

Benefit: Automatically extracts traces from tracing-enabled endpoints for full observability.

Evaluate a logged model

Wrap logged MLflow models to translate between evaluation's named parameters and the model's single-parameter interface.

Most logged models (such as those using PyFunc or logging flavors like LangChain) accept a single input parameter (e.g., model_inputs for PyFunc), while predict_fn expects named parameters that correspond to the keys in your evaluation dataset.

import mlflow
from mlflow.genai.scorers import Safety

# Make sure to load your logged model outside of the predict_fn so MLflow only loads it once!
model = mlflow.pyfunc.load_model("models:/chatbot/staging")

def evaluate_model(question: str) -> dict:
    return model.predict({"question": question})

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "Tell me about MLflow"}}],
    predict_fn=evaluate_model,
    scorers=[Safety()]
)

Next Steps