Evaluation runs are MLflow runs that organize and store the results of evaluating your GenAI app.
What are evaluation runs?
An evaluation run is a special type of MLflow run that contains:
- Traces: One trace for each input in your evaluation dataset
- Feedback: Quality assessments from scorers attached to each trace
- Metrics: Aggregate statistics across all evaluated examples
- Metadata: Information about the evaluation configuration
Think of it as a test report that captures everything about how your app performed on a specific dataset.
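If you want to pull those pieces back out of an existing run, a minimal sketch (using a placeholder run ID) could look like this; each evaluated input becomes one trace, with the scorers' feedback attached to it:

```python
import mlflow

run_id = "<your-evaluation-run-id>"  # placeholder: use the ID of one of your runs

# One row per evaluated input; scorer feedback is attached to each trace
traces = mlflow.search_traces(run_id=run_id)
print(traces.head())
```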
Structure of an evaluation run
```
Evaluation Run
├── Run Info
│   ├── run_id: unique identifier
│   ├── experiment_id: which experiment it belongs to
│   ├── start_time: when evaluation began
│   └── status: success/failed
├── Traces (one per dataset row)
│   ├── Trace 1
│   │   ├── inputs: {"question": "What is MLflow?"}
│   │   ├── outputs: {"response": "MLflow is..."}
│   │   └── feedbacks: [correctness: 0.8, relevance: 1.0]
│   ├── Trace 2
│   └── ...
├── Aggregate Metrics
│   ├── correctness_mean: 0.85
│   ├── relevance_mean: 0.92
│   └── safety_pass_rate: 1.0
└── Parameters
    ├── model_version: "v2.1"
    ├── dataset_name: "qa_test_v1"
    └── scorers: ["correctness", "relevance", "safety"]
```
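Everything in this layout is reachable through the standard run APIs. As a rough sketch (again with a placeholder run ID), `mlflow.get_run` exposes the run info, aggregate metrics, and parameters:

```python
import mlflow

run = mlflow.get_run("<your-evaluation-run-id>")  # placeholder run ID

# Run info: experiment, start time, and final status
print(run.info.experiment_id, run.info.start_time, run.info.status)

# Aggregate metrics, e.g. correctness_mean, relevance_mean, safety_pass_rate
print(run.data.metrics)

# Evaluation configuration recorded as parameters
print(run.data.params)
```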
Creating evaluation runs
Evaluation runs are created automatically when you call `mlflow.genai.evaluate()`:
```python
import mlflow

# Log the evaluation run to a dedicated experiment
mlflow.set_experiment("my_app_evaluations")

# This creates an evaluation run
results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[correctness_scorer, safety_scorer],
)

# Access the run ID
print(f"Evaluation run ID: {results.run_id}")
```
Next Steps
- Evaluate your app - Create your first evaluation run
- Build evaluation datasets - Prepare data for consistent evaluation runs
- Compare evaluation runs - Learn to analyze and compare run results
- Evaluation Datasets - See what data goes into evaluation runs