Evaluation Harness

The mlflow.genai.evaluate() function systematically tests the quality of your GenAI app by running the app against test data (evaluation datasets) and applying scorers.

Quick reference

| Parameter | Type | Description |
| --- | --- | --- |
| data | MLflow EvaluationDataset, List[Dict], Pandas DataFrame, or Spark DataFrame | Test data |
| predict_fn | Callable | Your app (Mode 1 only) |
| scorers | List[Scorer] | Quality metrics |
| model_id | str | Optional version tracking |

How it works

  1. Runs your app on test inputs, capturing traces
  2. Applies scorers to assess quality, creating Feedback
  3. Stores results in an Evaluation Run

Prerequisites

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0" openai "databricks-connect>=16.1"
    
  2. Create an MLflow experiment by following the Set up your environment quickstart.
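
    A minimal setup sketch for step 2 (assumes you are authenticated to a Databricks workspace; the experiment path is a placeholder):

    import mlflow

    # Point MLflow at your Databricks workspace.
    mlflow.set_tracking_uri("databricks")

    # set_experiment creates the experiment if it does not exist yet.
    # Replace the path with any workspace path you can write to.
    mlflow.set_experiment("/Users/<your-username>/genai-eval-quickstart")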

Two evaluation modes

Mode 1: Direct evaluation (recommended)

MLflow calls your GenAI app directly to generate and evaluate traces. You can either pass your application's entry point wrapped in a Python function (predict_fn) or, if your app is deployed as a Databricks Model Serving endpoint, pass that endpoint wrapped in to_predict_fn.

Benefits:

  • Allows scorers to be easily reused between offline evaluation and production monitoring
  • Automatic parallelization of your app's execution for faster evaluation

By calling your app directly, this mode enables you to reuse the scorers defined for offline evaluation in production monitoring since the resulting traces will be identical.

How evaluate works with tracing

Step 1: Run evaluation

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your GenAI app with MLflow tracing
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
    # Your app logic here
    if "MLflow" in question:
        response = "MLflow is an open-source platform for managing ML and GenAI workflows."
    else:
        response = "I can help you with MLflow questions."

    return {"response": response}

# Evaluate your app
results = mlflow.genai.evaluate(
    data=[
        {"inputs": {"question": "What is MLflow?"}},
        {"inputs": {"question": "How do I get started?"}}
    ],
    predict_fn=my_chatbot_app,
    scorers=[RelevanceToQuery(), Safety()]
)

Step 2: View results in the UI

Evaluation results

Mode 2: Answer sheet evaluation

Provide pre-computed outputs or existing traces for evaluation when you can't run your GenAI app directly.

Use cases:

  • Testing outputs from external systems
  • Evaluating historical traces
  • Comparing outputs across different platforms

Warning

If you use an answer sheet whose traces differ from those produced in your production environment, you may need to rewrite your scorer functions before reusing them for production monitoring.
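
For example, a custom scorer that relies only on the schema-level inputs and outputs fields (rather than trace internals such as span names or attributes) can be reused unchanged between answer sheets and production traces. This is a minimal sketch using the scorer decorator from mlflow.genai.scorers; the length check is purely illustrative:

import mlflow
from mlflow.genai.scorers import scorer

@scorer
def concise_response(inputs: dict, outputs: dict) -> bool:
    # Reads only the standard outputs field, so the same scorer works for
    # answer-sheet rows today and for live production traces later.
    return len(outputs.get("response", "")) < 500

results = mlflow.genai.evaluate(
    data=[
        {
            "inputs": {"question": "What is MLflow?"},
            "outputs": {"response": "MLflow is an open-source platform for the ML lifecycle."},
        }
    ],
    scorers=[concise_response],
)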

How evaluate works with answer sheet

Example (with inputs/outputs):

Step 1: Run evaluation

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

# Pre-computed results from your GenAI app
results_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": {"response": "MLflow is an open-source platform for managing machine learning workflows, including tracking experiments, packaging code, and deploying models."},
    },
    {
        "inputs": {"question": "How do I get started?"},
        "outputs": {"response": "To get started with MLflow, install it using 'pip install mlflow' and then run 'mlflow ui' to launch the web interface."},
    }
]

# Evaluate pre-computed outputs
evaluation = mlflow.genai.evaluate(
    data=results_data,
    scorers=[Safety(), RelevanceToQuery()]
)

Step 2: View results in the UI

Evaluation results

Example with existing traces:

import mlflow

# Retrieve traces from production
traces = mlflow.search_traces(
    filter_string="trace.status = 'OK'",
)

# Evaluate the retrieved production traces
evaluation = mlflow.genai.evaluate(
    data=traces,
    scorers=[Safety(), RelevanceToQuery()]
)

Key parameters

def mlflow.genai.evaluate(
    data: Union[pd.DataFrame, List[Dict], mlflow.genai.datasets.EvaluationDataset],
    scorers: list[mlflow.genai.scorers.Scorer],
    predict_fn: Optional[Callable[..., Any]] = None,
    model_id: Optional[str] = None,
) -> mlflow.models.evaluation.base.EvaluationResult:

data

Your evaluation dataset in one of these formats:

  • EvaluationDataset (recommended)
  • List of dictionaries, Pandas DataFrame, or Spark DataFrame

If the data argument is provided as a DataFrame or list of dictionaries, it must conform to the schema below, which is the same schema used by EvaluationDataset. We recommend using an EvaluationDataset because it enforces schema validation and tracks the lineage of each record. An example record for each mode follows the table.

| Field | Data type | Description | Required if passing your app via predict_fn (Mode 1)? | Required if providing an answer sheet (Mode 2)? |
| --- | --- | --- | --- | --- |
| inputs | dict[Any, Any] | A dict passed to your predict_fn as **kwargs. Must be JSON serializable; each key must correspond to a named argument of predict_fn. | Required | Either inputs + outputs or trace is required; cannot pass both. Derived from trace if not provided. |
| outputs | dict[Any, Any] | A dict with your GenAI app's outputs for the corresponding input. Must be JSON serializable. | Must NOT be provided; generated by MLflow from the trace. | Either inputs + outputs or trace is required; cannot pass both. Derived from trace if not provided. |
| expectations | dict[str, Any] | A dict with ground-truth labels for the input, used by scorers to check quality. Must be JSON serializable and each key must be a str. | Optional | Optional |
| trace | mlflow.entities.Trace | The trace object for the request. If a trace is provided, expectations can be attached to it as Assessments instead of being passed as a separate column. | Must NOT be provided; generated by MLflow. | Either inputs + outputs or trace is required; cannot pass both. |
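
For example, here is one record for each mode that follows this schema (values are illustrative):

# Mode 1 record: provide only inputs (plus optional expectations); MLflow
# generates outputs and the trace by calling predict_fn.
mode_1_record = {
    "inputs": {"question": "What is MLflow?"},
    "expectations": {"expected_facts": ["open-source platform"]},
}

# Mode 2 record: provide pre-computed inputs + outputs (or a trace instead,
# but never both).
mode_2_record = {
    "inputs": {"question": "What is MLflow?"},
    "outputs": {"response": "MLflow is an open-source platform for the ML lifecycle."},
    "expectations": {"expected_facts": ["open-source platform"]},
}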

predict_fn

Your GenAI app's entry point (Mode 1 only). Must:

  • Accept the keys from the inputs dictionary in data as keyword arguments
  • Return a JSON-serializable dictionary
  • Be instrumented with MLflow Tracing
  • Emit exactly one trace per call

scorers

List of quality metrics to apply. You can provide built-in scorers (such as Safety, RelevanceToQuery, or Correctness) or custom scorers created with the scorer decorator.

See Scorers for more details.

model_id

Optional model identifier to link results to your app version (e.g., "models:/my-app/1"). See Version Tracking for more details.
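
For example (the model URI below is the placeholder from above, standing in for one of your registered app versions):

import mlflow
from mlflow.genai.scorers import Safety

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What is MLflow?"}}],
    predict_fn=my_chatbot_app,    # the traced app from the Mode 1 example above
    scorers=[Safety()],
    model_id="models:/my-app/1",  # links this evaluation run to that app version
)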

Data formats

For direct evaluation (Mode 1)

| Field | Required | Description |
| --- | --- | --- |
| inputs | Required | Dictionary passed to your predict_fn |
| expectations | Optional | Ground truth for scorers |

For answer sheet evaluation (Mode 2)

Option A - Provide inputs and outputs:

| Field | Required | Description |
| --- | --- | --- |
| inputs | Required | Original inputs to your GenAI app |
| outputs | Required | Pre-computed outputs from your app |
| expectations | Optional | Ground truth for scorers |

Option B - Provide existing traces:

| Field | Required | Description |
| --- | --- | --- |
| trace | Required | MLflow Trace objects with inputs and outputs |
| expectations | Optional | Ground truth for scorers |

Common data input patterns

Evaluate with an MLflow Evaluation Dataset (recommended)

MLflow Evaluation Datasets provide versioning, lineage tracking, and Unity Catalog integration for production-ready evaluation.

import mlflow
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent  # Your GenAI app with tracing

# Load versioned evaluation dataset
dataset = mlflow.genai.datasets.get_dataset("catalog.schema.eval_dataset_name")

# Run evaluation
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=agent,
    scorers=[Correctness(), Safety()],
)

Use for:

  • Evaluation data that needs version control and lineage tracking
  • Easily converting traces to evaluation records

See Build evaluation datasets to create datasets from traces or from scratch.

Evaluate with a list of dictionaries

Use a simple list of dictionaries for quick prototyping without creating a formal evaluation dataset.

import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
from my_app import agent  # Your GenAI app with tracing

# Define test data as a list of dictionaries
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_facts": ["open-source platform", "ML lifecycle management"]}
    },
    {
        "inputs": {"question": "How do I track experiments?"},
        "expectations": {"expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]}
    },
    {
        "inputs": {"question": "What are MLflow's main components?"},
        "expectations": {"expected_facts": ["Tracking", "Projects", "Models", "Registry"]}
    }
]

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=agent,
    scorers=[Correctness(), RelevanceToQuery()],
)

Use for:

  • Quick prototyping
  • Small datasets (< 100 examples)
  • Ad-hoc development testing

For production, convert to an MLflow Evaluation Dataset.
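
A minimal sketch of that conversion, assuming a Unity Catalog table name of your choosing and the create_dataset / merge_records helpers from mlflow.genai.datasets (it reuses agent and the scorers imported in the example above):

import mlflow.genai.datasets

# Placeholder three-level Unity Catalog name; you need permission to create
# tables in this schema.
dataset = mlflow.genai.datasets.create_dataset("catalog.schema.eval_dataset_name")

# Add the ad-hoc records defined above; the dataset enforces the evaluation
# schema and tracks lineage for each record.
dataset.merge_records(eval_data)

# Future runs can load the versioned dataset instead of the inline list.
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=agent,
    scorers=[Correctness(), RelevanceToQuery()],
)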

Evaluate with a Pandas DataFrame

Use Pandas DataFrames for evaluation when working with CSV files or existing data science workflows.

import mlflow
import pandas as pd
from mlflow.genai.scorers import Correctness, Safety
from my_app import agent  # Your GenAI app with tracing

# Create evaluation data as a Pandas DataFrame
eval_df = pd.DataFrame([
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open-source platform for ML lifecycle management"}
    },
    {
        "inputs": {"question": "How do I log metrics?"},
        "expectations": {"expected_response": "Use mlflow.log_metric() to log metrics"}
    }
])

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Correctness(), Safety()],
)

Use for:

  • Quick prototyping
  • Small datasets (< 100 examples)
  • Ad-hoc development testing

Evaluate with a Spark DataFrame

Use Spark DataFrames for large-scale evaluations or when data is already in Delta Lake/Unity Catalog.

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery
from my_app import agent  # Your GenAI app with tracing

# Load evaluation data from a Delta table in Unity Catalog
eval_df = spark.table("catalog.schema.evaluation_data")

# Or load from any Spark-compatible source
# eval_df = spark.read.parquet("path/to/evaluation/data")

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_df,
    predict_fn=agent,
    scorers=[Safety(), RelevanceToQuery()],
)

Use for:

  • Data that already exists in Delta Lake or Unity Catalog
  • Filtering the records of an MLflow Evaluation Dataset before running evaluation

Note: DataFrame must comply with the evaluation dataset schema.
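
For example, a sketch of narrowing a Delta-backed dataset before evaluation (the table name is a placeholder, and only schema columns are kept):

# Sample a subset and keep only the evaluation-schema columns.
eval_subset = (
    spark.table("catalog.schema.evaluation_data")
    .select("inputs", "expectations")
    .limit(500)
)

results = mlflow.genai.evaluate(
    data=eval_subset,
    predict_fn=agent,  # the traced app imported in the example above
    scorers=[Safety(), RelevanceToQuery()],
)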

Common predict_fn patterns

Call your app directly

Pass your app directly as predict_fn when parameter names match your evaluation dataset keys.

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your GenAI app that accepts 'question' as a parameter
@mlflow.trace
def my_chatbot_app(question: str) -> dict:
    # Your app logic here
    response = f"I can help you with: {question}"
    return {"response": response}

# Evaluation data with 'question' key matching the function parameter
eval_data = [
    {"inputs": {"question": "What is MLflow?"}},
    {"inputs": {"question": "How do I track experiments?"}}
]

# Pass your app directly since parameter names match
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_chatbot_app,  # Direct reference, no wrapper needed
    scorers=[RelevanceToQuery(), Safety()]
)

Use for:

  • Apps whose parameter names match the keys in your evaluation dataset's inputs

Wrap your app in a callable

Wrap your app when it expects different parameter names or data structures than your evaluation dataset's inputs.

import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

# Your existing GenAI app with different parameter names
@mlflow.trace
def customer_support_bot(user_message: str, chat_history: list = None) -> dict:
    # Your app logic here
    context = f"History: {chat_history}" if chat_history else "New conversation"
    return {
        "bot_response": f"Helping with: {user_message}. {context}",
        "confidence": 0.95
    }

# Wrapper function to translate evaluation data to your app's interface
def evaluate_support_bot(question: str, history: str = None) -> dict:
    # Convert evaluation dataset format to your app's expected format
    chat_history = history.split("|") if history else []

    # Call your app with the translated parameters
    result = customer_support_bot(
        user_message=question,
        chat_history=chat_history
    )

    # Translate output to standard format if needed
    return {
        "response": result["bot_response"],
        "confidence_score": result["confidence"]
    }

# Evaluation data with different key names
eval_data = [
    {"inputs": {"question": "Reset password", "history": "logged in|forgot email"}},
    {"inputs": {"question": "Track my order"}}
]

# Use the wrapper function for evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=evaluate_support_bot,  # Wrapper handles translation
    scorers=[RelevanceToQuery(), Safety()]
)

Use for:

  • Parameter name mismatches between your app's parameters and evaluation dataset input keys (e.g., user_input vs question)
  • Data format conversions (string to list, JSON parsing)

Evaluate a deployed endpoint

For Databricks Agent Framework or Model Serving endpoints, use to_predict_fn to create a compatible predict function.

import mlflow
from mlflow.genai.scorers import RelevanceToQuery

# Create predict function for your endpoint
predict_fn = mlflow.genai.to_predict_fn("endpoints:/my-chatbot-endpoint")

# Evaluate
results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "How does MLflow work?"}}],
    predict_fn=predict_fn,
    scorers=[RelevanceToQuery()]
)

Benefit: Automatically extracts traces from tracing-enabled endpoints for full observability.

Evaluate a logged model

Wrap logged MLflow models to translate between evaluation's named parameters and the model's single-parameter interface.

Most logged models (such as those using PyFunc or logging flavors like LangChain) accept a single input parameter (e.g., model_inputs for PyFunc), while predict_fn expects named parameters that correspond to the keys in your evaluation dataset.

import mlflow
from mlflow.genai.scorers import Safety

# Make sure to load your logged model outside of the predict_fn so MLflow only loads it once!
model = mlflow.pyfunc.load_model("models:/chatbot/staging")

def evaluate_model(question: str) -> dict:
    return model.predict({"question": question})

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "Tell me about MLflow"}}],
    predict_fn=evaluate_model,
    scorers=[Safety()]
)

Next Steps