Migrate to MLflow 3 from Agent Evaluation: Quick reference

This quick reference summarizes key changes for migrating from Agent Evaluation and MLflow 2 to the improved APIs in MLflow 3. See the full guide at Migrate to MLflow 3 from Agent Evaluation.

Import updates

### Old imports ###
from mlflow import evaluate
from databricks.agents.evals import metric
from databricks.agents.evals import judges

from databricks.agents import review_app

### New imports ###
from mlflow.genai import evaluate
from mlflow.genai.scorers import scorer
from mlflow.genai import judges
# For predefined scorers:
from mlflow.genai.scorers import (
    Correctness, Guidelines, ExpectationGuidelines,
    RelevanceToQuery, Safety, RetrievalGroundedness,
    RetrievalRelevance, RetrievalSufficiency
)

import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas

Evaluation function

| MLflow 2.x | MLflow 3.x |
|---|---|
| mlflow.evaluate() | mlflow.genai.evaluate() |
| model=my_agent | predict_fn=my_agent |
| model_type="databricks-agent" | (not needed) |
| extra_metrics=[...] | scorers=[...] |
| evaluator_config={...} | (configuration moves into scorers) |
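
For example, a minimal before-and-after sketch; my_agent and eval_data below are illustrative placeholders, not part of the migration guide:

# MLflow 2.x (old)
# results = mlflow.evaluate(
#     model=my_agent,
#     data=eval_data,
#     model_type="databricks-agent",
#     extra_metrics=[my_metric],
# )

# MLflow 3.x (new): predict_fn replaces model, scorers replace extra_metrics,
# and model_type is no longer needed
import mlflow.genai
from mlflow.genai.scorers import Safety

def my_agent(question: str) -> str:
    # Placeholder agent; substitute your real app or endpoint wrapper
    return f"You asked: {question}"

eval_data = [{"inputs": {"question": "What is MLflow?"}}]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_agent,
    scorers=[Safety()],
)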

Judge selection

| MLflow 2.x | MLflow 3.x |
|---|---|
| Automatically runs all applicable judges based on data | Must explicitly specify which scorers to use |
| Use evaluator_config to limit judges | Pass desired scorers in the scorers parameter |
| global_guidelines in config | Use the Guidelines() scorer |
| Judges selected based on available data fields | You control exactly which scorers run |
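
A sketch of the explicit selection, reusing the placeholder my_agent and eval_data from the sketch above:

from mlflow.genai.scorers import Guidelines, RelevanceToQuery, Safety

# MLflow 3 runs only the scorers you pass; nothing is selected implicitly
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_agent,
    scorers=[
        Safety(),
        RelevanceToQuery(),
        # global_guidelines from evaluator_config becomes a Guidelines scorer
        Guidelines(name="tone", guidelines="Responses must be polite and professional."),
    ],
)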

Data fields

| MLflow 2.x field | MLflow 3.x field | Description |
|---|---|---|
| request | inputs | Agent input |
| response | outputs | Agent output |
| expected_response | expectations | Ground truth |
| retrieved_context | Accessed via traces | Context from the trace |
| guidelines | Part of scorer config | Moved to the scorer level |
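
For example, the same evaluation row in each format (values are illustrative):

# MLflow 2.x row
old_row = {
    "request": "What is MLflow?",
    "response": "MLflow is an open source MLOps platform.",
    "expected_response": "MLflow is an open source platform for the ML lifecycle.",
}

# MLflow 3.x row: inputs is a dict keyed by your predict_fn's parameters,
# and ground truth moves into the expectations dict
new_row = {
    "inputs": {"question": "What is MLflow?"},
    "outputs": "MLflow is an open source MLOps platform.",
    "expectations": {
        "expected_response": "MLflow is an open source platform for the ML lifecycle."
    },
}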

Custom metrics and scorers

| MLflow 2.x | MLflow 3.x | Notes |
|---|---|---|
| @metric decorator | @scorer decorator | New name |
| def my_metric(request, response, ...) | def my_scorer(inputs, outputs, expectations, trace) | Simplified signature |
| Multiple expected_* params | Single expectations dict | Consolidated |
| custom_expected | Part of the expectations dict | Simplified |
| request parameter | inputs parameter | Consistent naming |
| response parameter | outputs parameter | Consistent naming |
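
A minimal before-and-after sketch of the same custom check (exact_match is a hypothetical metric, not from the guide):

from mlflow.genai.scorers import scorer

# MLflow 2.x (old)
# @metric
# def exact_match(request, response, expected_response):
#     return response == expected_response

# MLflow 3.x (new): consolidated parameters; ground truth is read
# from the single expectations dict
@scorer
def exact_match(inputs, outputs, expectations):
    return outputs == (expectations or {}).get("expected_response")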

Result access

| MLflow 2.x | MLflow 3.x |
|---|---|
| results.tables['eval_results'] | mlflow.search_traces(run_id=results.run_id) |
| Direct DataFrame access | Iterate through traces and their assessments |
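
A sketch of the new pattern; by default mlflow.search_traces() returns a pandas DataFrame with one row per trace, and scorer results are attached to each trace as assessments:

import mlflow

# results comes from the mlflow.genai.evaluate() call above
traces = mlflow.search_traces(run_id=results.run_id)
print(traces.columns)  # inspect the available columns, including per-trace assessments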

LLM judges

| Use case | MLflow 2.x | MLflow 3.x (recommended) |
|---|---|---|
| Basic correctness check | judges.correctness() in @metric | Correctness() scorer or judges.is_correct() judge |
| Safety evaluation | judges.safety() in @metric | Safety() scorer or judges.is_safe() judge |
| Global guidelines | judges.guideline_adherence() | Guidelines() scorer or judges.meets_guidelines() judge |
| Per-eval-set-row guidelines | judges.guideline_adherence() with expected_* | ExpectationGuidelines() scorer or judges.meets_guidelines() judge |
| Check for factual support | judges.groundedness() | judges.is_grounded() or RetrievalGroundedness() scorer |
| Check relevance of context | judges.relevance_to_query() | judges.is_context_relevant() or RelevanceToQuery() scorer |
| Check relevance of context chunks | judges.chunk_relevance() | judges.is_context_relevant() or RetrievalRelevance() scorer |
| Check completeness of context | judges.context_sufficiency() | judges.is_context_sufficient() or RetrievalSufficiency() scorer |
| Complex custom logic | Direct judge calls in @metric | Predefined scorers or @scorer with judge calls |
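
Judges can also be called directly; each returns a Feedback with a value and a rationale. A sketch assuming the MLflow 3 keyword arguments for judges.is_correct() (verify against your installed version):

from mlflow.genai import judges

feedback = judges.is_correct(
    request="What is MLflow?",
    response="MLflow is an open source MLOps platform.",
    expected_facts=["MLflow is an open source platform for the ML lifecycle."],
)
print(feedback.value, feedback.rationale)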

Human feedback

| MLflow 2.x | MLflow 3.x |
|---|---|
| databricks.agents.review_app | mlflow.genai.labeling |
| databricks.agents.datasets | mlflow.genai.datasets |
| review_app.label_schemas.* | mlflow.genai.label_schemas.* |
| app.create_labeling_session() | labeling.create_labeling_session() |
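
A sketch of the new entry points, using the labeling and schemas aliases from the imports above and assuming the built-in EXPECTED_FACTS schema constant:

# Old: app = review_app.get_review_app(); app.create_labeling_session(...)
# New: top-level functions in mlflow.genai.labeling
session = labeling.create_labeling_session(
    name="quality_review",
    label_schemas=[schemas.EXPECTED_FACTS],
)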

Common migration commands

# Find old evaluate calls
grep -r "mlflow.evaluate" . --include="*.py"

# Find old metric decorators
grep -r "@metric" . --include="*.py"

# Find old data fields
grep -r '"request":\|"response":\|"expected_response":' . --include="*.py"

# Find old imports
grep -r "databricks.agents" . --include="*.py"

Additional resources

For additional support during migration, consult the MLflow documentation or reach out to your Databricks support team.