This guide shows you the various ways to create evaluation datasets in order to systematically test and improve your GenAI application's quality. You'll learn multiple approaches to build datasets that enable consistent, repeatable evaluation as you iterate on your app.
Evaluation datasets help you:
- Fix known issues: Add problematic examples from production to repeatedly test fixes
- Prevent regressions: Create a "golden set" of examples that must always work correctly
- Compare versions: Test different prompts, models, or app logic against the same data
- Target specific features: Build specialized datasets for safety, ___domain knowledge, or edge cases
Start with a single well-curated dataset, then expand to multiple datasets as your testing needs grow.
What you'll learn:
- Create datasets from production traces to test real-world scenarios
- Build datasets from scratch for targeted testing of specific features
- Import existing evaluation data from CSV, JSON, or other formats
- Generate synthetic test data to expand coverage
- Add ground truth labels from ___domain expert feedback
Note
This guide shows you how to use MLflow-managed evaluation datasets, which provide version history and lineage tracking. For rapid prototyping, you can also provide your evaluation dataset as a Python dictionary or a Pandas or Spark DataFrame that follows the same schema as an MLflow-managed dataset. To learn more about the evaluation dataset schema, refer to the evaluation datasets reference page.
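For example, a quick in-memory dataset for prototyping might look like the following sketch. The field names inside inputs and expectations are illustrative; match them to your application's signature and to the schema described in the reference page.
import pandas as pd

# Minimal in-memory evaluation data for rapid prototyping (illustrative field names).
quick_eval_records = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"expected_response": "Guide the user through the self-service password reset flow."},
    },
    {
        "inputs": {"question": "How do I close my account?"},
        "expectations": {"expected_response": "Explain the account closure steps and the confirmation email."},
    },
]

# Either the list of dictionaries or an equivalent DataFrame can be passed
# directly as the data argument to mlflow.genai.evaluate().
quick_eval_df = pd.DataFrame(quick_eval_records)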
Prerequisites
Install MLflow and required packages
pip install --upgrade "mlflow[databricks]>=3.1.0"
Create an MLflow experiment by following the setup your environment quickstart.
Access to a Unity Catalog schema with CREATE TABLE permissions to create evaluation datasets.
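If you are working from a local environment rather than a Databricks notebook, a minimal setup might look like the sketch below; the experiment path is a placeholder.
import mlflow

# Point MLflow at your Databricks workspace (assumes DATABRICKS_HOST and
# DATABRICKS_TOKEN, or a Databricks CLI profile, are already configured).
mlflow.set_tracking_uri("databricks")

# Use (or create) an experiment to hold your evaluation runs.
# Replace the placeholder path with your own workspace path.
mlflow.set_experiment("/Shared/email-generation-eval")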
Approaches to building your dataset
MLflow offers several flexible ways to construct an evaluation dataset tailored to your needs:
- Creating a dataset from existing traces: Leverage the rich data already captured in your MLflow Traces.
- Importing a dataset or building a dataset from scratch: Manually define specific input examples and (optionally) expected outputs.
- Seeding an evaluation dataset with synthetic data: Generate diverse inputs automatically.
Choose the method or combination of methods that best suits your current data sources and evaluation goals.
Step 1: Create a dataset
Regardless of the method you choose, you must first create an MLflow-managed evaluation dataset. This allows you to track changes to the dataset over time and link individual evaluation results to it.
Using the UI
Follow the recording below to use the UI to create an evaluation dataset.
Using the SDK
Create an evaluation dataset programmatically by searching for traces and adding them to the dataset.
import mlflow
import mlflow.genai.datasets
from databricks.connect import DatabricksSession
# 0. If you are using a local development environment, connect to Serverless Spark, which powers MLflow's evaluation dataset service
spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()
# 1. Create an evaluation dataset
# Replace with a Unity Catalog schema where you have CREATE TABLE permission
uc_schema = "workspace.default"
# This table will be created in the above UC schema
evaluation_dataset_table_name = "email_generation_eval"
eval_dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name=f"{uc_schema}.{evaluation_dataset_table_name}",
)
print(f"Created evaluation dataset: {uc_schema}.{evaluation_dataset_table_name}")
Step 2: Add records to your dataset
Approach 1: Create from existing traces
One of the most effective ways to build a relevant evaluation dataset is by curating examples directly from your application's historical interactions captured by MLflow Tracing. You can create datasets from traces using either the MLflow Monitoring UI or the SDK.
Using the UI
Follow the recording below to use the UI to add existing production traces to the dataset.
Using the SDK
Programmatically search for traces and then add them to the dataset. Refer to the query traces reference page for details on how to use filters in search_traces().
import mlflow
# 2. Search for traces
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    order_by=["attributes.timestamp_ms DESC"],
    max_results=10,
)
print(f"Found {len(traces)} successful traces")
# 3. Add the traces to the evaluation dataset
eval_dataset.merge_records(traces)
print(f"Added {len(traces)} records to evaluation dataset")
# Preview the dataset
df = eval_dataset.to_df()
print(f"\nDataset preview:")
print(f"Total records: {len(df)}")
print("\nSample record:")
sample = df.iloc[0]
print(f"Inputs: {sample['inputs']}")
Approach 2: Create from ___domain expert labels
Leverage feedback from ___domain experts captured in MLflow Labeling Sessions to enrich your evaluation datasets with ground truth labels. Before doing these steps, follow the collect ___domain expert feedback guide to create a labeling session.
import mlflow.genai.labeling as labeling
# Get all labeling sessions
all_sessions = labeling.get_labeling_sessions()
print(f"Found {len(all_sessions)} sessions")
for session in all_sessions:
    print(f"- {session.name} (ID: {session.labeling_session_id})")
    print(f"  Assigned users: {session.assigned_users}")
# Sync from the labeling session to the dataset
all_sessions[0].sync_expectations(dataset_name=f"{uc_schema}.{evaluation_dataset_table_name}")
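To confirm that the expert labels landed in the dataset, you can reload its contents and inspect the expectations column; the column name here assumes the standard evaluation dataset schema.
# Reload the dataset contents and check which records now carry expectations.
df = eval_dataset.to_df()
labeled = df[df["expectations"].notna()]
print(f"Records with expectations: {len(labeled)} of {len(df)}")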
Approach 3: Build from scratch or import existing
You can import an existing dataset or curate examples from scratch. Your data must match (or be transformed to match) the evaluation dataset schema.
# Define comprehensive test cases
evaluation_examples = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_response": "MLflow is an open source platform for managing the end-to-end machine learning lifecycle.",
            "expected_facts": [
                "open source platform",
                "manages ML lifecycle",
                "experiment tracking",
                "model deployment",
            ],
        },
    },
]
eval_dataset.merge_records(evaluation_examples)
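If your test cases already live in a file, a sketch along these lines can be adapted; the CSV path and its question and expected_answer columns are hypothetical and must be mapped to the evaluation dataset schema.
import pandas as pd

# Hypothetical CSV with "question" and "expected_answer" columns.
existing_df = pd.read_csv("existing_eval_cases.csv")

# Transform each row into the evaluation dataset record format.
records = [
    {
        "inputs": {"question": row["question"]},
        "expectations": {"expected_response": row["expected_answer"]},
    }
    for _, row in existing_df.iterrows()
]

eval_dataset.merge_records(records)
print(f"Imported {len(records)} records")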
Approach 4: Seed using synthetic data
Generating synthetic data can expand your testing efforts by quickly creating diverse inputs and covering edge cases. To learn more, visit the synthesize evaluation datasets reference.
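As a rough illustration of the idea (the synthesis tooling in the reference above is the recommended path), you could also generate input variations yourself and merge them like any other records; the topics and tones below are made up.
# Illustrative only: hand-rolled synthetic inputs for an email-generation app.
topics = ["password reset", "billing dispute", "shipping delay"]
tones = ["neutral", "frustrated", "urgent"]

synthetic_records = [
    {"inputs": {"question": f"Write a reply to a {tone} customer email about a {topic}."}}
    for topic in topics
    for tone in tones
]

eval_dataset.merge_records(synthetic_records)
print(f"Added {len(synthetic_records)} synthetic records")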
Next steps
Continue your journey with these recommended actions and tutorials.
Evaluate your app - Use your newly created dataset for evaluation (a sketch follows below)
Create custom scorers - Build scorers to evaluate against ground truth
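As a preview of the Evaluate your app step, running an evaluation against the dataset you just built might look like the sketch below; my_app is a placeholder for your application's entry point, and Correctness is only one example of a built-in scorer.
import mlflow
from mlflow.genai.scorers import Correctness

# Placeholder predict function -- replace with a call into your real app.
# Its parameter names must match the keys of each record's "inputs" dict.
def my_app(question: str) -> str:
    return f"Draft email responding to: {question}"

results = mlflow.genai.evaluate(
    data=eval_dataset,        # the MLflow-managed dataset created above
    predict_fn=my_app,
    scorers=[Correctness()],  # judges outputs against the recorded expectations
)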
Reference guides
Explore detailed documentation for concepts and features mentioned in this guide.
- Evaluation Datasets - Deep dive into dataset structure and capabilities
- Evaluation Harness - Learn how mlflow.genai.evaluate() uses your datasets
- Tracing data model - Understand traces as a source for evaluation datasets