Evaluation Concepts Overview

MLflow's GenAI evaluation concepts: Scorers, Judges, Evaluation Datasets, and the systems that use them.

Quick reference

| Concept | Purpose | Usage |
| --- | --- | --- |
| Scorers | Evaluate trace quality | @scorer decorator or Scorer class |
| Judges | LLM-based quality assessment | Wrapped in scorers for use |
| Evaluation Harness | Runs offline evaluation | mlflow.genai.evaluate() |
| Evaluation Datasets | Test data management | mlflow.genai.datasets |
| Evaluation Runs | Store evaluation results | Created by the harness |
| Production Monitoring | Live quality tracking | mlflow.genai.create_monitor() |

Common patterns

Combining multiple scorers

import mlflow
from mlflow.genai.scorers import scorer, Safety, RelevanceToQuery
from mlflow.entities import Feedback

# Combine predefined and custom scorers
@scorer
def custom_business_scorer(outputs):
    response = outputs.get("response", "")
    # Your business logic
    if "company_name" not in response:
        return Feedback(value=False, rationale="Missing company branding")
    return Feedback(value=True, rationale="Meets business criteria")

# Use same scorers everywhere
scorers = [Safety(), RelevanceToQuery(), custom_business_scorer]

# Offline evaluation
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=my_app,
    scorers=scorers
)

# Production monitoring - same scorers!
monitor = mlflow.genai.create_monitor(
    endpoint="my-production-endpoint",
    scorers=scorers,
    sampling_rate=0.1
)

Chaining evaluation results

import mlflow
import pandas as pd
from mlflow.genai.scorers import Safety, Correctness

# Run initial evaluation
results1 = mlflow.genai.evaluate(
    data=test_dataset,
    predict_fn=my_app,
    scorers=[Safety(), Correctness()]
)

# Use results to create refined dataset
traces = mlflow.search_traces(run_id=results1.run_id)

# Filter to problematic traces
safety_failures = traces[traces['assessments'].apply(
    lambda x: any(a.name == 'Safety' and a.value == 'no' for a in x)
)]

# Re-evaluate with different scorers or updated app
from mlflow.genai.scorers import Guidelines

results2 = mlflow.genai.evaluate(
    data=safety_failures,
    predict_fn=updated_app,
    scorers=[
        Safety(),
        Guidelines(
            name="content_policy",
            guidelines="Response must follow our content policy"
        )
    ]
)

Error handling in evaluation

import mlflow
from mlflow.genai.scorers import scorer, Safety
from mlflow.entities import Feedback, AssessmentError

@scorer
def resilient_scorer(outputs, trace=None):
    try:
        response = outputs.get("response")
        if not response:
            return Feedback(
                value=None,
                error=AssessmentError(
                    error_code="MISSING_RESPONSE",
                    error_message="No response field in outputs"
                )
            )
        # Your evaluation logic
        return Feedback(value=True, rationale="Valid response")
    except Exception:
        # Re-raise: MLflow records the scorer error and continues the evaluation
        raise

# Use in evaluation - continues even if some scorers fail
results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=my_app,
    scorers=[resilient_scorer, Safety()]
)

Concepts

Scorers: mlflow.genai.scorers

Functions that evaluate traces and return feedback.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback
from typing import Optional, Dict, Any, List

@scorer
def my_custom_scorer(
    *,  # MLflow calls your scorer with named arguments
    inputs: Optional[Dict[Any, Any]],  # App's input from trace
    outputs: Optional[Dict[Any, Any]],  # App's output from trace
    expectations: Optional[Dict[str, Any]],  # Ground truth (offline only)
    trace: Optional[mlflow.entities.Trace]  # Complete trace
) -> int | float | bool | str | Feedback | List[Feedback]:
    # Your evaluation logic
    return Feedback(value=True, rationale="Explanation")

Learn more about Scorers »

Judges: mlflow.genai.judges

LLM-based quality assessors; wrap them in scorers to use them with the evaluation harness or monitors.

from mlflow.genai.judges import is_safe, is_relevant
from mlflow.genai.scorers import scorer

# Direct usage
feedback = is_safe(content="Hello world")

# Wrapped in scorer
@scorer
def safety_scorer(outputs):
    return is_safe(content=outputs["response"])

Learn more about Judges »

Evaluation Harness: mlflow.genai.evaluate(...)

Orchestrates offline evaluation during development.

import mlflow
from mlflow.genai.scorers import Safety, RelevanceToQuery

results = mlflow.genai.evaluate(
    data=eval_dataset,  # Test data
    predict_fn=my_app,  # Your app
    scorers=[Safety(), RelevanceToQuery()],  # Quality metrics
    model_id="models:/my-app/1"  # Optional version tracking
)

Learn more about the Evaluation Harness »

Evaluation Datasets: mlflow.genai.datasets.EvaluationDataset

Versioned test data with optional ground truth.

import mlflow.genai.datasets

# Create from production traces
dataset = mlflow.genai.datasets.create_dataset(
    uc_table_name="catalog.schema.eval_data"
)

# Add traces
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")
dataset.insert(traces)

# Use in evaluation
results = mlflow.genai.evaluate(data=dataset, ...)
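
Ground truth travels with each record as expectations. Below is a minimal sketch of records passed directly to the harness, assuming the expected_facts key consumed by the Correctness scorer; my_app is the same placeholder app used throughout this page:

import mlflow
from mlflow.genai.scorers import Correctness

# "expectations" holds ground truth and is only available in offline evaluation
eval_records = [
    {
        "inputs": {"query": "What is MLflow?"},
        "expectations": {"expected_facts": ["MLflow is an open source MLOps platform"]},
    },
]

results = mlflow.genai.evaluate(
    data=eval_records,
    predict_fn=my_app,  # your app, as in the examples above
    scorers=[Correctness()],  # compares outputs against the expectations
)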

Learn more about Evaluation Datasets »

Evaluation Runs: mlflow.entities.Run

Evaluation results containing traces with feedback attached.

# Access evaluation results
traces = mlflow.search_traces(run_id=results.run_id)

# Filter by feedback
good_traces = traces[traces['assessments'].apply(
    lambda x: all(a.value for a in x if a.name == 'Safety')
)]

Learn more about Evaluation Runs »

Production Monitoring: mlflow.genai.create_monitor(...)

Continuously evaluates deployed applications.

import mlflow
from mlflow.genai.scorers import scorer, Safety
from mlflow.entities import Feedback

# Any user-defined @scorer function can be reused in monitoring
@scorer
def custom_scorer(outputs):
    return Feedback(value=True, rationale="Example check")

monitor = mlflow.genai.create_monitor(
    name="chatbot_monitor",
    endpoint="endpoints:/my-chatbot-prod",
    scorers=[Safety(), custom_scorer],
    sampling_rate=0.1  # 10% of traffic
)

Learn more about Production Monitoring »

Workflows

Online monitoring (production)

# Production app with tracing → Monitor applies scorers → Feedback on traces → Dashboards
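A minimal sketch of inspecting monitor results, assuming the feedback the monitor attaches shows up in the assessments column of searched traces, as in the Evaluation Runs example above:

import mlflow

# Pull production traces; monitor feedback appears alongside each trace
traces = mlflow.search_traces(filter_string="trace.status = 'OK'")

flagged = traces[traces['assessments'].apply(
    lambda assessments: any(a.name == 'Safety' and a.value == 'no' for a in assessments)
)]
print(f"{len(flagged)} traces flagged by the Safety scorer")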


Offline evaluation (development)

# Test data → Evaluation harness runs app → Scorers evaluate traces → Results stored
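Putting the flow together, a minimal end-to-end sketch (the single test record and the my_app stub are placeholders):

import mlflow
from mlflow.genai.scorers import Safety

def my_app(query: str) -> dict:
    # Your GenAI app; the returned dict is what scorers receive as `outputs`
    return {"response": f"Answer to: {query}"}

# Test data → harness runs the app → scorers evaluate traces → results stored
results = mlflow.genai.evaluate(
    data=[{"inputs": {"query": "What is MLflow?"}}],
    predict_fn=my_app,
    scorers=[Safety()],
)

# Results live in an MLflow run; each trace carries the scorer feedback
traces = mlflow.search_traces(run_id=results.run_id)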


Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Browse detailed documentation on related concepts.