This guide shows you the various ways to create evaluation datasets in order to systematically test and improve your GenAI application's quality. You'll learn multiple approaches to build datasets that enable consistent, repeatable evaluation as you iterate on your app.
Evaluation datasets help you:
- Fix known issues: Add problematic examples from production to repeatedly test fixes
- Prevent regressions: Create a "golden set" of examples that must always work correctly
- Compare versions: Test different prompts, models, or app logic against the same data
- Target specific features: Build specialized datasets for safety, ___domain knowledge, or edge cases
Start with a single well-curated dataset, then expand to multiple datasets as your testing needs grow.
What you'll learn:
- Create datasets from production traces to test real-world scenarios
- Build datasets from scratch for targeted testing of specific features
- Import existing evaluation data from CSV, JSON, or other formats
- Generate synthetic test data to expand coverage
- Add ground truth labels from ___domain expert feedback
Note
This guide shows you how to use MLflow-managed evaluation datasets, which provide version history and lineage tracking. For rapid prototyping, you can also provide your evaluation dataset as a Python dictionary or a Pandas or Spark DataFrame that follows the same schema as an MLflow-managed dataset. To learn more about the evaluation dataset schema, refer to the evaluation datasets reference page.
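For example, a quick in-memory dataset for prototyping might look like the following sketch. The field names inside inputs and expectations are illustrative; match them to your application's signature and to the schema described in the reference page.
import pandas as pd

# Minimal in-memory evaluation data for rapid prototyping (illustrative field names).
quick_eval_records = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"expected_response": "Guide the user through the self-service password reset flow."},
    },
    {
        "inputs": {"question": "How do I close my account?"},
        "expectations": {"expected_response": "Explain the account closure steps and the confirmation email."},
    },
]

# Either the list of dictionaries or an equivalent DataFrame can be passed
# directly as the data argument to mlflow.genai.evaluate().
quick_eval_df = pd.DataFrame(quick_eval_records)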
Prerequisites
Install MLflow and required packages
pip install --upgrade "mlflow[databricks]>=3.1.0"
Create an MLflow experiment by following the setup your environment quickstart.
Access to a Unity Catalog schema with CREATE TABLE permissions to create evaluation datasets.
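If you are working from a local environment rather than a Databricks notebook, a minimal setup might look like the sketch below; the experiment path is a placeholder.
import mlflow

# Point MLflow at your Databricks workspace (assumes DATABRICKS_HOST and
# DATABRICKS_TOKEN, or a Databricks CLI profile, are already configured).
mlflow.set_tracking_uri("databricks")

# Use (or create) an experiment to hold your evaluation runs.
# Replace the placeholder path with your own workspace path.
mlflow.set_experiment("/Shared/email-generation-eval")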
Approaches to building your dataset
MLflow offers several flexible ways to construct an evaluation dataset tailored to your needs:
- Creating a dataset from existing traces: Leverage the rich data already captured in your MLflow Traces.
- Importing a dataset or building a dataset from scratch: Manually define specific input examples and (optionally) expected outputs.
- Seeding an evaluation dataset with synthetic data: Generate diverse inputs automatically.
Choose the method or combination of methods that best suits your current data sources and evaluation goals.
Step 1: Create a dataset
Regardless of the method you choose, you must first create an MLflow-managed evaluation dataset. This allows you to track changes to the dataset over time and link individual evaluation results to it.
Using the UI
Follow the recording below to use the UI to create an evaluation dataset.
Using the SDK
Create an evaluation dataset programmatically by searching for traces and adding them to the dataset.
import mlflow
import mlflow.genai.datasets
from databricks.connect import DatabricksSession
# 0. If you are using a local development environment, connect to Serverless Spark, which powers MLflow's evaluation dataset service
spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()
# 1. Create an evaluation dataset
# Replace with a Unity Catalog schema where you have CREATE TABLE permission
uc_schema = "workspace.default"
# This table will be created in the above UC schema
evaluation_dataset_table_name = "email_generation_eval"
eval_dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name=f"{uc_schema}.{evaluation_dataset_table_name}",
)
print(f"Created evaluation dataset: {uc_schema}.{evaluation_dataset_table_name}")
Step 2: Add records to your dataset
Approach 1: Create from existing traces
One of the most effective ways to build a relevant evaluation dataset is by curating examples directly from your application's historical interactions captured by MLflow Tracing. You can create datasets from traces using either the MLflow Monitoring UI or the SDK.
Using the UI
Follow the recording below to use the UI to add existing production traces to the dataset.
Using the SDK
Programmatically search for traces and then add them to the dataset. Refer to the query traces reference page for details on how to use filters in search_traces().
import mlflow
# 2. Search for traces
traces = mlflow.search_traces(
    filter_string="attributes.status = 'OK'",
    order_by=["attributes.timestamp_ms DESC"],
    max_results=10,
)
print(f"Found {len(traces)} successful traces")
# 3. Add the traces to the evaluation dataset
eval_dataset.merge_records(traces)
print(f"Added {len(traces)} records to evaluation dataset")
# Preview the dataset
df = eval_dataset.to_df()
print(f"\nDataset preview:")
print(f"Total records: {len(df)}")
print("\nSample record:")
sample = df.iloc[0]
print(f"Inputs: {sample['inputs']}")
Approach 2: Create from ___domain expert labels
Leverage feedback from ___domain experts captured in MLflow Labeling Sessions to enrich your evaluation datasets with ground truth labels. Before doing these steps, follow the collect ___domain expert feedback guide to create a labeling session.
import mlflow.genai.labeling as labeling
# Get all labeling sessions
all_sessions = labeling.get_labeling_sessions()
print(f"Found {len(all_sessions)} sessions")
for session in all_sessions:
    print(f"- {session.name} (ID: {session.labeling_session_id})")
    print(f"  Assigned users: {session.assigned_users}")
# Sync from the labeling session to the dataset
all_sessions[0].sync_expectations(dataset_name=f"{uc_schema}.{evaluation_dataset_table_name}")
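To confirm that the expert labels landed in the dataset, you can reload its contents and inspect the expectations column; the column name here assumes the standard evaluation dataset schema.
# Reload the dataset contents and check which records now carry expectations.
df = eval_dataset.to_df()
labeled = df[df["expectations"].notna()]
print(f"Records with expectations: {len(labeled)} of {len(df)}")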
Approach 3: Build from scratch or import existing
You can import an existing dataset or curate examples from scratch. Your data must match (or be transformed to match) the evaluation dataset schema.
# Define comprehensive test cases
evaluation_examples = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_response": "MLflow is an open source platform for managing the end-to-end machine learning lifecycle.",
            "expected_facts": [
                "open source platform",
                "manages ML lifecycle",
                "experiment tracking",
                "model deployment",
            ],
        },
    },
]
eval_dataset.merge_records(evaluation_examples)
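If your test cases already live in a file, a sketch along these lines can be adapted; the CSV path and its question and expected_answer columns are hypothetical and must be mapped to the evaluation dataset schema.
import pandas as pd

# Hypothetical CSV with "question" and "expected_answer" columns.
existing_df = pd.read_csv("existing_eval_cases.csv")

# Transform each row into the evaluation dataset record format.
records = [
    {
        "inputs": {"question": row["question"]},
        "expectations": {"expected_response": row["expected_answer"]},
    }
    for _, row in existing_df.iterrows()
]

eval_dataset.merge_records(records)
print(f"Imported {len(records)} records")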
Approach 4: Seed using synthetic data
Generating synthetic data can expand your testing efforts by quickly creating diverse inputs and covering edge cases. To learn more, visit the synthesize evaluation datasets reference.
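As a rough illustration of the idea (the synthesis tooling in the reference above is the recommended path), you could also generate input variations yourself and merge them like any other records; the topics and tones below are made up.
# Illustrative only: hand-rolled synthetic inputs for an email-generation app.
topics = ["password reset", "billing dispute", "shipping delay"]
tones = ["neutral", "frustrated", "urgent"]

synthetic_records = [
    {"inputs": {"question": f"Write a reply to a {tone} customer email about a {topic}."}}
    for topic in topics
    for tone in tones
]

eval_dataset.merge_records(synthetic_records)
print(f"Added {len(synthetic_records)} synthetic records")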
Next steps
Continue your journey with these recommended actions and tutorials.
Evaluate your app - Use your newly created dataset for evaluation (a sketch follows below)
Create custom scorers - Build scorers to evaluate against ground truth
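As a preview of the Evaluate your app step, running an evaluation against the dataset you just built might look like the sketch below; my_app is a placeholder for your application's entry point, and Correctness is only one example of a built-in scorer.
import mlflow
from mlflow.genai.scorers import Correctness

# Placeholder predict function -- replace with a call into your real app.
# Its parameter names must match the keys of each record's "inputs" dict.
def my_app(question: str) -> str:
    return f"Draft email responding to: {question}"

results = mlflow.genai.evaluate(
    data=eval_dataset,        # the MLflow-managed dataset created above
    predict_fn=my_app,
    scorers=[Correctness()],  # judges outputs against the recorded expectations
)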
Reference guides
Explore detailed documentation for concepts and features mentioned in this guide.
- Evaluation Datasets - Deep dive into dataset structure and capabilities
- Evaluation Harness - Learn how mlflow.genai.evaluate() uses your datasets
- Tracing data model - Understand traces as a source for evaluation datasets