MLflow's GenAI evaluation concepts: scorers, judges, evaluation datasets, and the systems that use them.

## Quick reference
| Concept | Purpose | Usage |
|---|---|---|
| Scorers | Evaluate trace quality | `@scorer` decorator or `Scorer` class |
| Judges | LLM-based assessment | Wrap in a scorer to use |
| Evaluation harness | Runs offline evaluation | `mlflow.genai.evaluate()` |
| Evaluation datasets | Manage test data | `mlflow.genai.datasets` |
| Evaluation runs | Store evaluation results | Created by the harness |
| Production monitoring | Track quality live | `mlflow.genai.create_monitor()` |
## Common patterns
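The patterns below pass an application to the evaluation APIs as `predict_fn` (referred to as `my_app`), which is not defined on this page. A minimal sketch of such an app, assuming `predict_fn` receives the keys of each evaluation record's `inputs` dict as keyword arguments and using MLflow's `@mlflow.trace` decorator to capture traces:

```python
import mlflow

# Hypothetical app used by the examples below: it takes the fields of an
# evaluation record's `inputs` dict as keyword arguments and returns the
# outputs that scorers inspect (here, a dict with a "response" field).
@mlflow.trace
def my_app(question: str) -> dict:
    # Placeholder logic - call your real LLM / retrieval pipeline here
    return {"response": f"You asked: {question}"}
```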
### Combining multiple scorers
```python
import mlflow
from mlflow.genai.scorers import scorer, Safety, RelevanceToQuery
from mlflow.entities import Feedback

# Combine predefined and custom scorers
@scorer
def custom_business_scorer(outputs):
    response = outputs.get("response", "")
    # Your business logic
    if "company_name" not in response:
        return Feedback(value=False, rationale="Missing company branding")
    return Feedback(value=True, rationale="Meets business criteria")

# Use same scorers everywhere
scorers = [Safety(), RelevanceToQuery(), custom_business_scorer]

# Offline evaluation
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=scorers
)

# Production monitoring - same scorers!
monitor = mlflow.genai.create_monitor(
    endpoint="my-production-endpoint",
    scorers=scorers,
    sampling_rate=0.1
)
```
### Chaining evaluation results
```python
import mlflow
import pandas as pd
from mlflow.genai.scorers import Safety, Correctness

# Run initial evaluation
results1 = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[Safety(), Correctness()]
)

# Use results to create refined dataset
traces = mlflow.search_traces(run_id=results1.run_id)

# Filter to problematic traces
safety_failures = traces[traces['assessments'].apply(
    lambda x: any(a.name == 'Safety' and a.value == 'no' for a in x)
)]

# Re-evaluate with different scorers or updated app
from mlflow.genai.scorers import Guidelines

results2 = mlflow.genai.evaluate(
    data=safety_failures,
    predict_fn=updated_app,
    scorers=[
        Safety(),
        Guidelines(
            name="content_policy",
            guidelines="Response must follow our content policy"
        )
    ]
)
```
### Error handling in evaluation
```python
import mlflow
from mlflow.genai.scorers import scorer, Safety
from mlflow.entities import Feedback, AssessmentError

@scorer
def resilient_scorer(outputs, trace=None):
    try:
        response = outputs.get("response")
        if not response:
            return Feedback(
                value=None,
                error=AssessmentError(
                    error_code="MISSING_RESPONSE",
                    error_message="No response field in outputs"
                )
            )
        # Your evaluation logic
        return Feedback(value=True, rationale="Valid response")
    except Exception:
        # Let MLflow handle the error gracefully
        raise

# Use in evaluation - continues even if some scorers fail
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_app,
    scorers=[resilient_scorer, Safety()]
)
```
## Concepts

### Scorers: `mlflow.genai.scorers`
```python
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
from typing import Optional, Dict, Any, List

@scorer
def my_custom_scorer(
    *,  # MLflow calls your scorer with named arguments
    inputs: Optional[Dict[Any, Any]],        # App's input from trace
    outputs: Optional[Dict[Any, Any]],       # App's output from trace
    expectations: Optional[Dict[str, Any]],  # Ground truth (offline only)
    trace: Optional[mlflow.entities.Trace]   # Complete trace
) -> int | float | bool | str | Feedback | List[Feedback]:
    # Your evaluation logic
    return Feedback(value=True, rationale="Explanation")
```
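As the signature above shows, a scorer can also return a plain value instead of a `Feedback` object. A minimal sketch, using a hypothetical word-count metric:

```python
from mlflow.genai.scorers import scorer

# Hypothetical numeric scorer: returns a float instead of a Feedback object
@scorer
def response_word_count(outputs):
    response = outputs.get("response", "")
    return float(len(response.split()))
```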
### Judges: `mlflow.genai.judges`

LLM-based quality assessors that must be wrapped in scorers for use in evaluation.
```python
from mlflow.genai.judges import is_safe
from mlflow.genai.scorers import scorer

# Direct usage
feedback = is_safe(content="Hello world")

# Wrapped in scorer
@scorer
def safety_scorer(outputs):
    return is_safe(content=outputs["response"])
```
### Evaluation harness: `mlflow.genai.evaluate(...)`

Orchestrates offline evaluation during development.
```python
import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

results = mlflow.genai.evaluate(
    data=eval_dataset,                       # Test data
    predict_fn=my_app,                       # Your app
    scorers=[Safety(), RelevanceToQuery()],  # Quality metrics
    model_id="models:/my-app/1"              # Optional version tracking
)
```
### Evaluation datasets: `mlflow.genai.datasets.EvaluationDataset`

Versioned test data with optional ground truth.
```python
import mlflow.genai.datasets

# Create from production traces
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_data"
)

# Add traces
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")
dataset.insert(traces)

# Use in evaluation
results = mlflow.genai.evaluate(data=dataset, ...)
```
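If you are not working with a Unity Catalog-backed dataset, evaluation data can also be passed to `mlflow.genai.evaluate()` directly as a list of records with `inputs` and optional `expectations`. A minimal sketch under that assumption (the questions, answers, and the `expected_response` expectation key are illustrative; check the predefined scorer docs for the expectation fields your scorers read):

```python
import mlflow
from mlflow.genai.scorers import Correctness

# Inline evaluation data: each record supplies `inputs` for predict_fn and
# optional `expectations` (ground truth) for scorers such as Correctness.
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_response": "MLflow is an open source MLOps platform."},
    },
    {
        "inputs": {"question": "How do I capture a trace?"},
        "expectations": {"expected_response": "Decorate the function with @mlflow.trace."},
    },
]

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_app,
    scorers=[Correctness()],
)
```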
### Evaluation runs: `mlflow.entities.Run`

Evaluation results, containing traces with attached feedback.
```python
# Access evaluation results
traces = mlflow.search_traces(run_id=results.run_id)

# Filter by feedback
good_traces = traces[traces['assessments'].apply(
    lambda x: all(a.value == 'yes' for a in x if a.name == 'Safety')
)]
```
### Production monitoring: `mlflow.genai.create_monitor(...)`

Continuously evaluates deployed applications.
```python
import mlflow
from mlflow.genai.scorers import Safety

# custom_scorer is a user-defined @scorer-decorated function (see "Combining multiple scorers" above)
monitor = mlflow.genai.create_monitor(
    name="chatbot_monitor",
    endpoint="endpoints:/my-chatbot-prod",
    scorers=[Safety(), custom_scorer],
    sampling_rate=0.1  # 10% of traffic
)
```
## Workflows

### Online monitoring (production)

```
# Production app with tracing → Monitor applies scorers → Feedback on traces → Dashboards
```
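The monitor writes its scorer feedback back onto the sampled traces. As a sketch of how you might inspect that feedback, reusing the `search_traces` / `assessments` pattern shown earlier (the filter and the "Safety" scorer name are assumptions to adapt to your setup):

```python
import mlflow

# Pull recently completed production traces that the monitor has sampled and scored
traces = mlflow.search_traces(
    filter_string="trace.status = 'OK'",
    max_results=100
)

# Each row's `assessments` column holds the feedback attached by the monitor's scorers
flagged = traces[traces["assessments"].apply(
    lambda assessments: any(a.name == "Safety" and a.value == "no" for a in assessments)
)]
print(f"{len(flagged)} of {len(traces)} sampled traces failed the Safety check")
```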
### Offline evaluation (development)

```
# Test data → Evaluation harness runs app → Scorers evaluate traces → Results stored
```
## Next steps

Continue your journey with these recommended actions and tutorials.

- Evaluate an app - apply these concepts in a hands-on tutorial
- Use predefined LLM scorers - start with the built-in quality metrics
- Create custom scorers - build scorers for your specific needs
## Reference guides

Explore detailed documentation on related concepts.