This quick reference summarizes key changes for migrating from Agent Evaluation and MLflow 2 to the improved APIs in MLflow 3.
See the full guide at Migrate to MLflow 3 from Agent Evaluation.
## Import updates

```python
### Old imports ###
from mlflow import evaluate
from databricks.agents.evals import metric
from databricks.agents.evals import judges
from databricks.agents import review_app

### New imports ###
from mlflow.genai import evaluate
from mlflow.genai.scorers import scorer
from mlflow.genai import judges

# For predefined scorers:
from mlflow.genai.scorers import (
    Correctness, Guidelines, ExpectationGuidelines,
    RelevanceToQuery, Safety, RetrievalGroundedness,
    RetrievalRelevance, RetrievalSufficiency,
)

import mlflow.genai.labeling as labeling
import mlflow.genai.label_schemas as schemas
```
## Evaluation function

| MLflow 2.x | MLflow 3.x |
|---|---|
| `mlflow.evaluate()` | `mlflow.genai.evaluate()` |
| `model=my_agent` | `predict_fn=my_agent` |
| `model_type="databricks-agent"` | (not needed) |
| `extra_metrics=[...]` | `scorers=[...]` |
| `evaluator_config={...}` | (configuration in scorers) |
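As a hedged sketch of the parameter mapping above (the agent function, dataset, and scorer choices here are illustrative assumptions, not code from the original guide):

```python
# Illustrative agent and evaluation dataset; the evaluate() calls are shown as
# comments because they require a configured MLflow 3 / Databricks environment.
def my_agent(question: str) -> str:
    return f"Stub answer to: {question}"

eval_data = [{"inputs": {"question": "What is MLflow?"}}]

# MLflow 2.x (Agent Evaluation):
#   mlflow.evaluate(model=my_agent, data=eval_data,
#                   model_type="databricks-agent", extra_metrics=[my_metric])
#
# MLflow 3.x equivalent:
#   mlflow.genai.evaluate(predict_fn=my_agent, data=eval_data,
#                         scorers=[Safety(), Correctness()])
```

Note that `model_type="databricks-agent"` has no 3.x counterpart: passing a list of scorers replaces both the model-type switch and `evaluator_config`.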
## Judge selection

| MLflow 2.x | MLflow 3.x |
|---|---|
| Automatically runs all applicable judges based on data | Must explicitly specify which scorers to use |
| Use `evaluator_config` to limit judges | Pass desired scorers in the `scorers` parameter |
| `global_guidelines` in config | Use the `Guidelines()` scorer |
| Judges selected based on available data fields | You control exactly which scorers run |
## Data fields

| MLflow 2.x field | MLflow 3.x field | Description |
|---|---|---|
| `request` | `inputs` | Agent input |
| `response` | `outputs` | Agent output |
| `expected_response` | `expectations` | Ground truth |
| `retrieved_context` | Accessed via traces | Context from the trace |
| `guidelines` | Part of scorer config | Moved to the scorer level |
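One way to picture the field renames is to convert a single MLflow 2.x record into the 3.x shape; a minimal sketch (the nested key names such as `question` and `response` are assumptions for illustration, since the exact nesting depends on your agent's signature):

```python
# One evaluation record, before and after the field renames.
old_record = {
    "request": "What is MLflow?",
    "response": "MLflow is an open source MLOps platform.",
    "expected_response": "MLflow is an open source MLOps platform.",
}

new_record = {
    "inputs": {"question": old_record["request"]},
    "outputs": {"response": old_record["response"]},
    "expectations": {"expected_response": old_record["expected_response"]},
}
```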
## Custom metrics and scorers

| MLflow 2.x | MLflow 3.x | Notes |
|---|---|---|
| `@metric` decorator | `@scorer` decorator | New name |
| `def my_metric(request, response, ...)` | `def my_scorer(inputs, outputs, expectations, traces)` | Simplified signature |
| Multiple `expected_*` parameters | Single `expectations` dict | Consolidated |
| `custom_expected` | Part of the `expectations` dict | Simplified |
| `request` parameter | `inputs` parameter | Consistent naming |
| `response` parameter | `outputs` parameter | Consistent naming |
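The new signature can be sketched as a plain function; in real code you would decorate it with `@scorer` from `mlflow.genai.scorers` (omitted here so the snippet stands alone, and `exact_match` is a hypothetical name):

```python
# Sketch of an MLflow 3-style custom scorer: one consolidated `expectations`
# dict replaces the multiple expected_* parameters of MLflow 2.x.
def exact_match(inputs, outputs, expectations):
    """Return True when the agent output equals the expected response."""
    return outputs == expectations.get("expected_response")
```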
## Result access

| MLflow 2.x | MLflow 3.x |
|---|---|
| `results.tables['eval_results']` | `mlflow.search_traces(run_id=results.run_id)` |
| Direct DataFrame access | Iterate through traces and assessments |
## LLM judges

| Use case | MLflow 2.x | MLflow 3.x recommended |
|---|---|---|
| Basic correctness check | `judges.correctness()` in `@metric` | `Correctness()` scorer or `judges.is_correct()` judge |
| Safety evaluation | `judges.safety()` in `@metric` | `Safety()` scorer or `judges.is_safe()` judge |
| Global guidelines | `judges.guideline_adherence()` | `Guidelines()` scorer or `judges.meets_guidelines()` judge |
| Per-eval-set-row guidelines | `judges.guideline_adherence()` with `expected_*` | `ExpectationGuidelines()` scorer or `judges.meets_guidelines()` judge |
| Check for factual support | `judges.groundedness()` | `judges.is_grounded()` or `RetrievalGroundedness()` scorer |
| Check relevance of context | `judges.relevance_to_query()` | `judges.is_context_relevant()` or `RelevanceToQuery()` scorer |
| Check relevance of context chunks | `judges.chunk_relevance()` | `judges.is_context_relevant()` or `RetrievalRelevance()` scorer |
| Check completeness of context | `judges.context_sufficiency()` | `judges.is_context_sufficient()` or `RetrievalSufficiency()` scorer |
| Complex custom logic | Direct judge calls in `@metric` | Predefined scorers or `@scorer` with judge calls |
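For the complex-custom-logic row, a hedged sketch of wrapping a judge inside a custom scorer: the `judges.meets_guidelines()` call is left commented out because it invokes an LLM endpoint, and the word-count check below is a stand-in heuristic for this sketch, not part of the MLflow API. `concise_answer` is a hypothetical name; in real code it would carry the `@scorer` decorator.

```python
def concise_answer(inputs, outputs, expectations):
    # In a live environment you might delegate to a judge instead, e.g.:
    #   from mlflow.genai import judges
    #   return judges.meets_guidelines(
    #       guidelines="Answers must be under 50 words.",
    #       context={"request": str(inputs), "response": str(outputs)},
    #   )
    return len(str(outputs).split()) < 50  # stand-in heuristic for this sketch
```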
## Human feedback

| MLflow 2.x | MLflow 3.x |
|---|---|
| `databricks.agents.review_app` | `mlflow.genai.labeling` |
| `databricks.agents.datasets` | `mlflow.genai.datasets` |
| `review_app.label_schemas.*` | `mlflow.genai.label_schemas.*` |
| `app.create_labeling_session()` | `labeling.create_labeling_session()` |
## Common migration commands

```shell
# Find old evaluate calls
grep -r "mlflow.evaluate" . --include="*.py"

# Find old metric decorators
grep -r "@metric" . --include="*.py"

# Find old data fields
grep -r '"request":\|"response":\|"expected_response":' . --include="*.py"

# Find old imports
grep -r "databricks.agents" . --include="*.py"
```
## Additional resources

For additional support during migration, consult the MLflow documentation or contact your Databricks support team.