An evaluation run is an MLflow run used to organize and store the results of evaluating a GenAI application.
What is an evaluation run?
An evaluation run is a special type of MLflow run that contains:
- Traces: one trace for each input in the evaluation dataset
- Feedback: quality assessments from scorers, attached to each trace (a minimal scorer sketch follows below)
- Metrics: statistics aggregated across all evaluation examples
- Metadata: information about the evaluation configuration
Think of it as a test report that captures everything about how your app performed on a specific dataset.
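To make the feedback item concrete, here is a minimal sketch of a custom scorer, assuming MLflow 3's @scorer decorator from mlflow.genai.scorers; the function name and the "expected_answer" key are hypothetical:

from mlflow.genai.scorers import scorer

@scorer
def correctness_scorer(inputs, outputs, expectations):
    # Hypothetical check: the value returned here is recorded as
    # feedback on the trace for this dataset row.
    return 1.0 if expectations["expected_answer"].lower() in str(outputs).lower() else 0.0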
Structure of an evaluation run
Evaluation Run
├── Run Info
│ ├── run_id: unique identifier
│ ├── experiment_id: which experiment it belongs to
│ ├── start_time: when evaluation began
│ └── status: success/failed
├── Traces (one per dataset row)
│ ├── Trace 1
│ │ ├── inputs: {"question": "What is MLflow?"}
│ │ ├── outputs: {"response": "MLflow is..."}
│ │ └── feedbacks: [correctness: 0.8, relevance: 1.0]
│ ├── Trace 2
│ └── ...
├── Aggregate Metrics
│ ├── correctness_mean: 0.85
│ ├── relevance_mean: 0.92
│ └── safety_pass_rate: 1.0
└── Parameters
├── model_version: "v2.1"
├── dataset_name: "qa_test_v1"
└── scorers: ["correctness", "relevance", "safety"]
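Because an evaluation run is a regular MLflow run underneath, each layer above can be read back with the standard APIs. A minimal sketch, where the "<run_id>" placeholder stands in for a real run ID and mlflow.search_traces is assumed to return a pandas DataFrame with one row per trace:

import mlflow

run = mlflow.get_run("<run_id>")
print(run.data.metrics)  # aggregate metrics, e.g. correctness_mean
print(run.data.params)   # parameters, e.g. model_version, dataset_name

# Traces logged during the evaluation are linked to the run
traces = mlflow.search_traces(run_id="<run_id>")
print(traces.head())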
Creating an evaluation run
An evaluation run is created automatically when you call mlflow.genai.evaluate():
import mlflow

# Evaluation runs are logged to the active experiment
mlflow.set_experiment("my_app_evaluations")

# This creates an evaluation run
results = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[correctness_scorer, safety_scorer],
)

# Access the run ID
print(f"Evaluation run ID: {results.run_id}")