Guidelines-based LLM scorers

2025-06-11

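Overview

Guidelines-based judges and scorers evaluate GenAI outputs against pass/fail criteria written in natural language. They are well suited to evaluating:

Compliance: “Must not include pricing information”
Style/tone: “Maintain a professional, empathetic tone”
Requirements: “Must include specific disclaimers”
Accuracy: “Use only facts from the provided context”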

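Benefits

Business-friendly: ___domain experts write criteria without coding
Flexible: update criteria without changing code
Interpretable: clear pass/fail conditions
Fast iteration: quickly test new criteria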

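Three ways to use guidelines

MLflow provides three ways to use guidelines-based judges:

Prebuilt Guidelines() scorer: applies global guidelines uniformly to all rows. Evaluates app inputs/outputs only. Works for both offline evaluation and production monitoring.
Prebuilt ExpectationsGuidelines() scorer: applies per-row guidelines that ___domain experts have labeled in an evaluation dataset. Evaluates app inputs/outputs only. Offline evaluation only.
judges.meets_guidelines() SDK: applies guidelines to any data in a trace (tool calls, retrieved context, and so on). Must be wrapped in a custom scorer before it can be used for evaluation or monitoring.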

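How guidelines work

Guidelines-based judges use a specially tuned LLM to evaluate whether text meets the criteria you specify. The judge:

Receives context: any JSON dictionary containing the data to evaluate (for example request, response, retrieved_documents, user_preferences). You can reference these keys directly by name in your guidelines. See the detailed examples.
Applies guidelines: natural language rules that define the pass/fail conditions
Makes a judgment: returns a binary pass/fail score with a detailed rationale

Learn more about the models that power LLM judges.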

Prerequisites for running the examples

Install MLflow and the required packages

pip install --upgrade "mlflow[databricks]>=3.1.0"

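Follow the environment setup quickstart to create an MLflow experiment.

1. Prebuilt `Guidelines()` scorer: global guidelines

The Guidelines scorer applies the same guidelines to every row in an evaluation or to every trace in production monitoring. It automatically extracts the request/response data from the trace and evaluates it against your guidelines.

Important

For an end-to-end tutorial on this approach, see the guide on using the prebuilt guidelines scorer.

When to use

Use the prebuilt scorer when:

Your guidelines need only the app's inputs and outputs
Your traces use a standard input/output format
You want a quick setup with no custom code

Examples

In your guidelines, refer to the app's input as request and the app's output as response.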

from mlflow.genai.scorers import Guidelines
import mlflow

# Example data
data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": {"response": "The capital of France is Paris."}
    },
    {
        "inputs": {"question": "What is the capital of Germany?"},
        "outputs": {"response": "The capital of Germany is Berlin."}
    }
]

# Create scorers with global guidelines
english = Guidelines(
    name="english",
    guidelines=["The response must be in English"]
)

clarity = Guidelines(
    name="clarity",
    guidelines=["The response must be clear, coherent, and concise"]
)

# Evaluate with global guidelines
results = mlflow.genai.evaluate(
    data=data,
    scorers=[english, clarity]
)

from mlflow.genai.scorers import Guidelines
import mlflow


# Create evaluation dataset with pre-computed outputs
eval_dataset = [
    {
        "inputs": {
            "messages": [{"role": "user", "content": "My order hasn't arrived yet"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I understand your concern about the delayed order. Let me help you track it right away."
                }
            }]
        },
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "How do I reset my password?"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "To reset your password, click 'Forgot Password' on the login page. You'll receive an email within 5 minutes."
                }
            }]
        },
    }
]

# Run evaluation on existing outputs
results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[Guidelines(name="tone", guidelines="The response must maintain a courteous, respectful tone throughout. It must show empathy for customer concerns in the request"),]
)

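Parameters

Parameter	Type	Required	Description
`name`	`str`	Yes	Name of the scorer, shown in the evaluation results
`guidelines`	`str | list[str]`	Yes	Guidelines applied uniformly to all rows

Advanced example with tone and accuracy

How prebuilt scorers parse your app's inputs/outputs

The scorer automatically extracts data from the trace to build the guidelines context, using the keys request and response.

Request

The request field is extracted from your inputs.

If your inputs contain a messages key holding an array of OpenAI-format chat messages:
- If there is a single message, request is that message's content
- If there is more than one message, request is the entire message array serialized as a JSON string
Otherwise, request is the entire inputs dictionary serialized as a JSON string.

Examples

Single-message input: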

# Input
inputs = {
    "messages": [
        {"role": "user", "content": "How can I reset my password?"}
    ]
}

# Parsed request
"How can I reset my password?"

Multi-turn conversation:

# Input
inputs = {
    "messages": [
        {"role": "user", "content": "What is MLflow?"},
        {"role": "assistant", "content": "MLflow is an open source platform..."},
        {"role": "user", "content": "Tell me more about tracing"}
    ]
}

# Parsed request (JSON string)
'[{"role": "user", "content": "What is MLflow?"}, {"role": "assistant", "content": "MLflow is an open source platform..."}, {"role": "user", "content": "Tell me more about tracing"}]'

Arbitrary dictionary:

# Input
inputs = {"key1": "Explain MLflow evaluation", "key2": "something else"}

# Parsed request
'{"key1": "Explain MLflow evaluation", "key2": "something else"}'
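The request rules above can be sketched in plain Python. This is only an illustration of the behavior described in this section, not MLflow's actual implementation: `parse_request` is a hypothetical name, and treating any array of two or more messages as a conversation is an assumption drawn from the multi-turn example.

```python
import json
from typing import Any


def parse_request(inputs: dict[str, Any]) -> str:
    """Approximate how the `request` string is derived from `inputs`.

    Hypothetical helper that mirrors the documented rules; it is not part of MLflow.
    """
    messages = inputs.get("messages")
    # Treat `messages` as a chat array only if it is a non-empty list of dicts with content.
    is_chat = (
        isinstance(messages, list)
        and len(messages) > 0
        and all(isinstance(m, dict) and "content" in m for m in messages)
    )
    if is_chat and len(messages) == 1:
        # Single message: use its content directly.
        return messages[0]["content"]
    if is_chat:
        # Conversation: serialize the whole message array.
        return json.dumps(messages)
    # Anything else: serialize the entire inputs dictionary.
    return json.dumps(inputs)


single = parse_request(
    {"messages": [{"role": "user", "content": "How can I reset my password?"}]}
)
arbitrary = parse_request(
    {"key1": "Explain MLflow evaluation", "key2": "something else"}
)
print(single)
print(arbitrary)
```

A helper like this can be handy for previewing what the judge will see as request before running a full evaluation.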

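Response

The response field is extracted from your outputs.

If your outputs contain an OpenAI-format ChatCompletions object:
- response is the content of the first choice
If your outputs contain a messages key holding an array of OpenAI-format chat messages:
- response is the content of the last message
Otherwise, response is your outputs serialized as a JSON string.

Examples

ChatCompletion output: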

# Output (simplified)
outputs = {
    "choices": [{
        "message": {
            "content": "MLflow evaluation helps measure GenAI quality..."
        }
    }]
}

# Parsed response
"MLflow evaluation helps measure GenAI quality..."

Messages-format output:

# Output
outputs = {
    "messages": [
        {"role": "user", "content": "What are the ..."},
        {"role": "assistant", "content": "Here are the key features..."}
    ]
}

# Parsed response
"Here are the key features..."

Arbitrary dictionary:

# Output
outputs = {"key1": "Explain MLflow evaluation", "key2": "something else"}

# Parsed response
'{"key1": "Explain MLflow evaluation", "key2": "something else"}'
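The response rules can be sketched the same way. Again, this is a rough illustration of the documented behavior under stated assumptions rather than MLflow's own code; `parse_response` is a hypothetical name, and the sketch only handles dictionary outputs.

```python
import json
from typing import Any


def parse_response(outputs: dict[str, Any]) -> str:
    """Approximate how the `response` string is derived from `outputs`.

    Hypothetical helper that mirrors the documented rules; it is not part of MLflow.
    """
    # ChatCompletions-style output: take the content of the first choice.
    choices = outputs.get("choices")
    if isinstance(choices, list) and choices and isinstance(choices[0], dict):
        message = choices[0].get("message")
        if isinstance(message, dict) and "content" in message:
            return message["content"]

    # Messages-style output: take the content of the last message.
    messages = outputs.get("messages")
    if isinstance(messages, list) and messages:
        last = messages[-1]
        if isinstance(last, dict) and "content" in last:
            return last["content"]

    # Anything else: serialize the whole outputs dictionary.
    return json.dumps(outputs)


chat_completion = parse_response(
    {"choices": [{"message": {"content": "MLflow evaluation helps measure GenAI quality..."}}]}
)
last_message = parse_response(
    {
        "messages": [
            {"role": "user", "content": "What are the ..."},
            {"role": "assistant", "content": "Here are the key features..."},
        ]
    }
)
print(chat_completion)
print(last_message)
```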

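2. Prebuilt `ExpectationsGuidelines()` scorer: per-row guidelines

The ExpectationsGuidelines scorer evaluates each row against the row-specific guidelines that ___domain experts have written for it. This lets every example in the dataset have its own evaluation criteria.

When to use

Use this scorer when:

You have ___domain experts who have labeled specific examples with custom guidelines
Different rows need different evaluation criteria

Example:

In your guidelines, refer to the app's input as request and the app's output as response.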

from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow

# Dataset with per-row guidelines
data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {
            "guidelines": ["The response must be factual and concise"]
        }
    },
    {
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
        "expectations": {
            "guidelines": ["The response must be helpful and encouraging"]
        }
    }
]

# Evaluate with per-row guidelines
results = mlflow.genai.evaluate(
    data=data,
    scorers=[ExpectationsGuidelines()]
)

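Parameters

No parameters are required. The scorer automatically reads guidelines from the expectations/guidelines field of your dataset. All guidelines for a single row are combined and evaluated in one judge call.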
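Because the guidelines are read from each row, it can help to confirm that every row actually carries them before starting an evaluation. The following is a minimal sketch of such a pre-check; `rows_missing_guidelines` is a hypothetical helper, not an MLflow API, and it assumes the dataset is a list of dictionaries shaped like the example above.

```python
from typing import Any


def rows_missing_guidelines(rows: list[dict[str, Any]]) -> list[int]:
    """Return the indexes of rows that have no guidelines under `expectations`.

    Hypothetical pre-check; assumes rows are plain dictionaries.
    """
    missing = []
    for index, row in enumerate(rows):
        expectations = row.get("expectations") or {}
        # An absent key, None, or an empty list all count as missing.
        if not expectations.get("guidelines"):
            missing.append(index)
    return missing


data = [
    {
        "inputs": {"question": "What is the capital of France?"},
        "outputs": "The capital of France is Paris.",
        "expectations": {"guidelines": ["The response must be factual and concise"]},
    },
    {
        "inputs": {"question": "How to learn Python?"},
        "outputs": "You can read a book or take a course.",
        "expectations": {},  # no guidelines labeled for this row
    },
]

missing_rows = rows_missing_guidelines(data)
print(missing_rows)
```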

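How prebuilt scorers parse your app's inputs/outputs

Parsing is identical to that of the Guidelines() scorer.

3. `judges.meets_guidelines()` SDK

For complex evaluation scenarios, call the judges.meets_guidelines() API directly inside a custom scorer. This gives you maximum flexibility to evaluate any aspect of your GenAI application.

Important

For an end-to-end tutorial on this approach, follow the guide on creating a custom scorer that wraps the guidelines judge API.

When to use

Use the API when you need to:

Evaluate data beyond inputs/outputs (retrieved documents, tool calls, metadata)
Exclude certain fields from evaluation (user IDs, timestamps)
Apply different guidelines to different parts of the data
Combine multiple guideline evaluations with custom logic

Example: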

from mlflow.genai.judges import meets_guidelines

feedback = meets_guidelines(
    name="factual_accuracy",
    guidelines="The response must only use facts from retrieved_documents. It is OK if the response references concepts from the request even if those concepts are not in the retrieved_documents.",
    context={
        "request": "What products did the customer Acme Co purchase?",
        "response": "Acme Co purchased laptop and mice",
        "retrieved_documents": ["laptop", "mouse"]
    }
)

print(feedback.value)
print(feedback.rationale)

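Parameters

Parameter	Type	Required	Description
`guidelines`	`str | list[str]`	Yes	A single guideline or a list of guidelines to evaluate. Each guideline should define a clear pass/fail condition.
`context`	`dict[str, Any]`	Yes	Dictionary containing the data to evaluate. Common keys include `request`, `response`, `retrieved_documents`, and so on. Any key can be referenced directly in the guidelines.
`name`	`str | None`	No	Optional custom name for the feedback, shown in the MLflow UI to identify the judge's output. Defaults to an automatically generated name.

Custom scorer example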

from mlflow.genai.scorers import scorer
from mlflow.genai.judges import meets_guidelines
import mlflow

@scorer
def tone_and_accuracy(inputs, outputs):
    """Custom scorer that evaluates tone and accuracy using guidelines"""
    # Note: In production, you'd extract retrieved_documents from the trace.
    # For this example, we're passing it directly in outputs for simplicity.
    retrieved_docs = outputs.get("retrieved_documents", [])

    # Evaluate multiple aspects
    feedbacks = []

    # Check tone
    tone_feedback = meets_guidelines(
        name="professional_tone",
        guidelines="The response must maintain a professional, helpful tone",
        context={
            "request": inputs.get("question"),
            "response": outputs.get("answer")
        }
    )
    feedbacks.append(tone_feedback)

    # Check accuracy against retrieved documents
    if retrieved_docs:
        accuracy_feedback = meets_guidelines(
            name="factual_accuracy",
            guidelines="The response must only use facts from retrieved_documents",
            context={
                "request": inputs.get("question"),
                "response": outputs.get("answer"),
                "retrieved_documents": retrieved_docs
            }
        )
        feedbacks.append(accuracy_feedback)

    return feedbacks

# Example evaluation dataset
eval_data = [
    {
        "inputs": {
            "question": "What is MLflow's primary purpose?"
        },
        "outputs": {
            "answer": "MLflow is an open-source platform designed to manage the ML lifecycle, including experimentation, reproducibility, and deployment.",
            "retrieved_documents": [
                "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
                "It provides tools for tracking experiments, packaging code, and deploying models."
            ]
        }
    }
]

# Run evaluation with the custom scorer
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[tone_and_accuracy]
)

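Return value

All approaches return an mlflow.entities.Feedback object containing:

value: either "yes" (meets the guidelines) or "no" (does not meet the guidelines)
rationale: a detailed explanation of why the content passed or failed
name: the assessment name (provided or automatically generated)
error: error details if the evaluation failed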
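When a custom scorer returns several feedbacks, it is often useful to tally them. The sketch below shows one way to do that using only the four fields listed above. To keep it runnable without MLflow, it uses `SimpleNamespace` objects as stand-ins for Feedback objects, which is an assumption made for illustration; `summarize` is a hypothetical helper.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(feedbacks):
    """Count verdicts and collect the rationale for each failed assessment.

    Works on any objects exposing name, value, rationale and error attributes.
    """
    counts = Counter()
    failures = {}
    for fb in feedbacks:
        if getattr(fb, "error", None) is not None:
            # The judge call itself failed, so there is no verdict to count.
            counts["error"] += 1
        elif fb.value == "yes":
            counts["yes"] += 1
        else:
            counts["no"] += 1
            failures[fb.name] = fb.rationale
    return dict(counts), failures


# Stand-in objects for illustration only; real code would pass Feedback objects.
sample = [
    SimpleNamespace(name="professional_tone", value="yes",
                    rationale="Polite throughout.", error=None),
    SimpleNamespace(name="factual_accuracy", value="no",
                    rationale="Mentions a fact not in retrieved_documents.", error=None),
]

counts, failures = summarize(sample)
print(counts)
print(failures)
```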

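Writing effective guidelines

Well-written guidelines are essential for accurate evaluation. Follow these best practices:

Reference context variables

Refer to any key in the context dictionary directly in your guidelines: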

# Example 1: Validate against retrieved documents
context = {
    "request": "What is the refund policy?",
    "response": "You can return items within 30 days for a full refund.",
    "retrieved_documents": ["Policy: Returns accepted within 30 days", "Policy: No refunds after 30 days"]
}
guideline = "The response must only include information from retrieved_documents"

# Example 2: Check user preferences
context = {
    "request": "Recommend a restaurant",
    "response": "I suggest trying the new steakhouse downtown",
    "user_preferences": {"dietary_restrictions": "vegetarian", "cuisine": "Italian"}
}
guideline = "The response must respect user_preferences when making recommendations"

# Example 3: Enforce business rules
context = {
    "request": "Can you apply a discount?",
    "response": "I've applied a 15% discount to your order",
    "max_allowed_discount": 10,
    "user_tier": "silver"
}
guideline = "The response must not exceed max_allowed_discount for the user_tier"

# Example 4: Multiple constraints
context = {
    "request": "Tell me about product features",
    "response": "This product includes features A, B, and C",
    "approved_features": ["A", "B", "C", "D"],
    "deprecated_features": ["X", "Y", "Z"]
}
guideline = """The response must:
- Only mention approved_features
- Not include deprecated_features"""
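A guideline that names a key missing from the context gives the judge nothing to check against. The following is a rough heuristic for catching that kind of typo before calling the judge. It assumes context keys are written in snake_case with at least one underscore, so keys such as request and response are not checked; `unknown_references` is a hypothetical helper, not an MLflow API.

```python
import re

# Matches snake_case identifiers that contain at least one underscore.
SNAKE_CASE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


def unknown_references(guideline: str, context: dict) -> list[str]:
    """Return snake_case names used in the guideline that are not context keys."""
    referenced = SNAKE_CASE.findall(guideline)
    return sorted({name for name in referenced if name not in context})


context = {
    "request": "What is the refund policy?",
    "response": "You can return items within 30 days for a full refund.",
    "retrieved_documents": ["Policy: Returns accepted within 30 days"],
}

ok = unknown_references(
    "The response must only include information from retrieved_documents", context
)
typo = unknown_references(
    "The response must only use facts present in retrieved_context", context
)
print(ok)
print(typo)
```

The check is deliberately loose: it can flag ordinary snake_case words that were never meant as context keys, so treat its output as a prompt for review rather than a hard failure.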

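Best practices

Be specific and measurable
✅ “The response must not include specific pricing amounts or percentages”
❌ “Don't talk about money”

Use clear pass/fail conditions
✅ “If asked about pricing, the response must direct the user to the pricing page”
❌ “Handle pricing questions appropriately”

Reference context explicitly
✅ “The response must only use facts present in retrieved_context”
❌ “Be factual”

Structure complex requirements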

guideline = """The response must:
- Include a greeting if first message
- Address the user's specific question
- End with an offer to help further
- Not exceed 150 words"""

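Real-world examples

Customer service chatbot

The following practical guidelines evaluate a customer service chatbot across different scenarios:

Global guidelines for all interactions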

from mlflow.genai.scorers import Guidelines
import mlflow

# Define global standards for all customer interactions
tone_guidelines = Guidelines(
    name="customer_service_tone",
    guidelines="""The response must maintain our brand voice which is:
    - Professional yet warm and conversational (avoid corporate jargon)
    - Empathetic, acknowledging emotional context before jumping to solutions
    - Proactive in offering help without being pushy

    Specifically:
    - If the customer expresses frustration, anger, or disappointment, the first sentence must acknowledge their emotion
    - The response must use "I" statements to take ownership (e.g., "I understand" not "We understand")
    - The response must avoid phrases that minimize concerns like "simply", "just", or "obviously"
    - The response must end with a specific next step or open-ended offer to help, not generic closings"""
)

compliance_guidelines = Guidelines(
    name="policy_compliance",
    guidelines=[
        """Refund and return policies:
        - The response must not promise refunds beyond the 30-day return window
        - The response must mention restocking fees for electronics (15%) if discussing returns
        - For items over $500, the response must note that manager approval is required
        - The response must not waive shipping fees without checking customer loyalty tier""",

        """Data privacy and security:
        - The response must never ask for full credit card numbers, SSN, or passwords
        - The response must not reference other customers' orders or information
        - When discussing order details, the response must only reference the last 4 digits of payment methods
        - The response must direct customers to secure portal for updating payment information""",

        """Commitment limitations:
        - The response must not guarantee specific delivery dates without system verification
        - The response must include disclaimer about "estimated" times when discussing shipping
        - The response must not promise price matches without mentioning the verification process
        - For out-of-stock items, the response must not commit to restock dates"""
    ]
)

# Example customer service data
customer_interactions = [
    {
        "inputs": {
            "messages": [{"role": "user", "content": "I've been waiting 2 hours on hold! This is ridiculous!"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I completely understand your frustration about the long wait time. That's absolutely not the experience we want you to have. I'm here to help you right now - what can I assist you with today?"
                }
            }]
        }
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "Can I return my laptop? I bought it 45 days ago."}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I'd be happy to look into return options for you. Since our standard return window is 30 days and your purchase was 45 days ago, I'll need to check if there are any exceptions we can make. For electronics like laptops, we do have a 15% restocking fee. May I have your order number to review your specific situation?"
                }
            }]
        }
    }
]

# Evaluate customer service interactions
results = mlflow.genai.evaluate(
    data=customer_interactions,
    scorers=[tone_guidelines, compliance_guidelines]
)

Per-row guidelines for specific scenarios

from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow

# Dataset with scenario-specific guidelines
customer_service_data = [
    {
        "inputs": {
            "messages": [{"role": "user", "content": "My package is 3 weeks late and I need it for an event tomorrow!"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I can absolutely understand how stressful this must be, especially with your event tomorrow. I'm so sorry your package has been delayed for 3 weeks - that's completely unacceptable. Let me help you immediately. I'm checking our local store inventory right now for same-day pickup options. I'm also processing a full refund of your shipping costs and applying a 25% discount to your order for this inconvenience. Would you prefer overnight delivery to arrive by 10 AM tomorrow, or would same-day pickup from our downtown location work better for you? I'm also connecting you with our escalation team - you'll receive a call within the next hour at the number on your account."
                }
            }]
        },
        "expectations": {
            "guidelines": [
                """The response must handle this urgent situation with exceptional care:
                - First acknowledge both the delay AND the urgent tomorrow deadline
                - Express genuine empathy using phrases like "I can absolutely understand how stressful this must be"
                - Must NOT make excuses or blame shipping partners""",

                """The response must provide immediate actionable solutions:
                - Offer to check local store availability for same-day pickup
                - If available, offer overnight shipping at no charge given the delay
                - Provide direct escalation path: manager callback within 2 hours
                - Include the executive customer service email for formal complaints""",

                """The response must include compensation options without being asked:
                - Mention automatic refund of original shipping costs
                - Offer 20-30% discount on current order due to inconvenience
                - Provide credit for future purchase if customer prefers"""
            ]
        }
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "How do I cancel my subscription?"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I can help you cancel your subscription right away. Here's how: 1) Click 'Account Settings' in the top right corner, 2) Select 'Subscription' from the menu, 3) Click 'Cancel Subscription' and confirm. Your access will continue until March 15, 2024 (the end of your current billing period), and you won't be charged again. All your saved data and preferences will be kept for 30 days if you decide to return. You're entitled to a prorated refund of $12.50 for the unused portion of this month. Alternatively, if you'd prefer, I can offer you a 50% discount for the next 3 months or help you switch to our Basic plan which is $4.99/month."
                }
            }]
        },
        "expectations": {
            "guidelines": [
                """The response must respect the customer's decision to cancel:
                - No guilt-inducing language or excessive retention attempts
                - No hiding the cancellation process behind multiple steps
                - Must not require phone calls if account was created online
                - Acknowledge their right to cancel immediately without questions""",

                """The response must provide complete cancellation information:
                - State the exact steps (maximum 3) to cancel online
                - Clarify the last billing date and when access ends
                - Explain what happens to their data, saved preferences, and history
                - Mention the 30-day reactivation window where data is preserved
                - Note any partial month refunds they're entitled to""",

                """The response may include ONE gentle retention offer:
                - Can mention a pause option instead of cancellation
                - Can offer a one-time discount or downgrade option
                - Must present this as optional, after the cancellation instructions
                - Must use soft language like 'alternatively' or 'if you'd prefer'"""
            ]
        }
    },
    {
        "inputs": {
            "messages": [{"role": "user", "content": "I was charged twice for my order #12345"}]
        },
        "outputs": {
            "choices": [{
                "message": {
                    "content": "I see the duplicate charge on your order #12345, and I'll resolve this for you right away. I've initiated the refund for the duplicate charge of $157.99. For credit cards, you'll see this reflected in 3-5 business days, and you should see a pending reversal on your account within 24 hours. I'm sending you a confirmation email right now with the refund transaction ID (REF-789012) and all the details. Your case reference number is CS-456789 if you need to follow up. Since this occurred near month-end, if this causes any overdraft fees, please let us know - we'll reimburse up to $35 in bank fees. Our billing team's direct line is 1-800-555-0123 ext 2 if you need immediate assistance. This won't affect your credit or any future orders with us, and we're investigating our payment system to prevent this from happening again."
                }
            }]
        },
        "expectations": {
            "guidelines": [
                """The response must immediately validate the customer's concern:
                - Acknowledge the duplicate charge without skepticism
                - Must not ask for proof or screenshots initially
                - Express understanding of the inconvenience and potential financial impact
                - Take ownership with phrases like 'I'll resolve this for you right away'""",

                """The response must provide specific resolution details:
                - State exact refund timeline (e.g., '3-5 business days for credit cards, 5-7 for debit')
                - Mention that they'll see a pending reversal within 24 hours
                - Offer to send detailed confirmation email with transaction IDs
                - Provide a reference number for this billing dispute
                - Include the direct billing department contact for follow-up""",

                """The response must address potential concerns proactively:
                - If near month-end, acknowledge potential impact on their budget
                - Offer to provide a letter for their bank if overdraft fees occurred
                - Mention our overdraft reimbursement policy (up to $35)
                - Assure that this won't affect their credit or future orders
                - Note that we're investigating to prevent future occurrences"""
            ]
        }
    }
]

results = mlflow.genai.evaluate(
    data=customer_service_data,
    scorers=[ExpectationsGuidelines()]
)

Custom scorer for complex evaluation

from mlflow.genai.scorers import scorer
from mlflow.genai.judges import meets_guidelines
import mlflow

@scorer
def customer_service_quality(inputs, outputs, trace=None):
    """Comprehensive customer service quality evaluation"""
    feedbacks = []

    # Extract context from trace if available
    # Note: In production, customer_history and policies would come from your trace
    # For this example, we're simulating with data from outputs
    customer_history = outputs.get("customer_history", {})
    retrieved_policies = outputs.get("retrieved_policies", [])

    # Check response appropriateness based on customer tier
    if customer_history.get("tier") == "premium":
        premium_feedback = meets_guidelines(
            name="premium_service_standards",
            guidelines="""Premium customer service requirements:
            - The response must acknowledge their premium/VIP status within the first two sentences
            - Must offer direct line or priority queue access (wait time <2 minutes)
            - Must mention their dedicated account manager by name if assigned
            - If customer lifetime value exceeds $10,000, must offer executive escalation

            Based on customer_tenure:
            - If tenure > 5 years: mention their loyalty and years with us
            - If tenure > 10 years: must offer founding member benefits

            Resolution authority:
            - Can offer up to 50% discount without manager approval
            - Can expedite shipping to overnight at no cost
            - Can waive fees up to $200 per incident
            - Must not use any templates that contain [CUSTOMER_NAME] placeholders""",
            context={
                "request": inputs.get("question"),
                "response": outputs.get("answer"),
                "customer_tier": customer_history.get("tier"),
                "customer_tenure": customer_history.get("years_active", 0),
                "lifetime_value": customer_history.get("ltv", 0),
                "account_manager": customer_history.get("account_manager", "")
            }
        )
        feedbacks.append(premium_feedback)

    # Verify policy accuracy
    if retrieved_policies:
        policy_feedback = meets_guidelines(
            name="policy_accuracy",
            guidelines="The response must only reference policies present in retrieved_policies and must not contradict any policy statements",
            context={
                "request": inputs.get("question"),
                "response": outputs.get("answer"),
                "retrieved_policies": retrieved_policies
            }
        )
        feedbacks.append(policy_feedback)

    # Check for appropriate escalation
    if any(word in inputs.get("question", "").lower() for word in ["lawyer", "legal", "sue", "unacceptable"]):
        escalation_feedback = meets_guidelines(
            name="escalation_handling",
            guidelines=[
                "The response must acknowledge the seriousness of the concern",
                "The response must offer to connect with a senior representative",
                "The response must not be defensive or dismissive"
            ],
            context={
                "request": inputs.get("question"),
                "response": outputs.get("answer")
            }
        )
        feedbacks.append(escalation_feedback)

    return feedbacks

# Example evaluation with custom scorer
customer_eval_data = [
    {
        "inputs": {
            "question": "Why haven't I received my refund yet? I'm thinking about calling my lawyer!"
        },
        "outputs": {
            "answer": "I understand how serious this situation is for you, and I sincerely apologize for the delay with your refund. I'd like to connect you immediately with our senior customer resolution specialist who has the authority to expedite your case. They will call you within the next 30 minutes at your registered phone number. In the meantime, I'm escalating your refund to our priority queue for immediate processing.",
            "customer_history": {
                "tier": "premium",
                "years_active": 7,
                "ltv": 15000,
                "account_manager": "Sarah Johnson"
            },
            "retrieved_policies": [
                "Refunds are typically processed within 5-7 business days",
                "Premium customers are eligible for expedited refund processing",
                "Legal escalations should be directed to legal@company.com"
            ]
        }
    },
    {
        "inputs": {
            "question": "How do I update my payment method?"
        },
        "outputs": {
            "answer": "I'll help you update your payment method. For security purposes, please visit our secure payment portal at secure.company.com/payment. You'll need to log in with your account credentials, then navigate to 'Payment Methods' where you can add or update your card information. The system uses bank-level encryption to protect your data.",
            "customer_history": {
                "tier": "standard",
                "years_active": 2
            },
            "retrieved_policies": [
                "Payment information must only be updated through secure channels",
                "Never collect payment information via chat or email"
            ]
        }
    }
]

# Run evaluation
results = mlflow.genai.evaluate(
    data=customer_eval_data,
    scorers=[customer_service_quality]
)
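One detail in the scorer above is worth a closer look: the escalation check uses substring matching, so a short term such as "sue" also matches inside "issue". A word-boundary match is one possible alternative. The sketch below is a suggested variation, not part of the original example, and `needs_escalation_check` is a hypothetical helper name.

```python
import re

ESCALATION_TERMS = ["lawyer", "legal", "sue", "unacceptable"]

# Match the terms only as whole words, ignoring case.
ESCALATION_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ESCALATION_TERMS)) + r")\b",
    re.IGNORECASE,
)


def needs_escalation_check(question: str) -> bool:
    """Return True if the question contains an escalation term as a whole word."""
    return ESCALATION_PATTERN.search(question or "") is not None


def substring_match(question: str) -> bool:
    """The substring approach, kept here only for comparison."""
    return any(term in question.lower() for term in ESCALATION_TERMS)


benign = "I have an issue with my order"
serious = "Why haven't I received my refund yet? I'm thinking about calling my lawyer!"

print(substring_match(benign), needs_escalation_check(benign))
print(substring_match(serious), needs_escalation_check(serious))
```

The trade-off is that whole-word matching misses inflected forms such as "lawyers" or "suing" unless they are added to the term list.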

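Document extraction application

The following practical guidelines evaluate a document extraction application:

Global guidelines for extraction quality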

from mlflow.genai.scorers import Guidelines
import mlflow

# Define extraction accuracy standards
extraction_accuracy = Guidelines(
    name="extraction_accuracy",
    guidelines=[
        """Field extraction completeness and accuracy:
        - The response must extract ALL requested fields, using exact values from source
        - For ambiguous data, the response must extract the most likely value and include a confidence score
        - When multiple values exist for one field (e.g., multiple addresses), extract all and label them
        - Preserve original formatting for IDs, reference numbers, and codes (including leading zeros)
        - For missing fields, use null with reason: {"field": null, "reason": "not_found"} """,

        """Numerical and financial data handling:
        - Currency values must preserve exact decimal places as shown in source
        - Must differentiate between currencies if multiple are present (USD, EUR, etc.)
        - Percentage values must clarify if they're decimals (0.15) or percentages (15%)
        - For calculated fields (totals, tax), must match source exactly - no recalculation
        - Negative values must be preserved with proper notation (-$100 or ($100))""",

        """Entity recognition and validation:
        - Company names must be extracted exactly as written (including suffixes like Inc., LLC)
        - Person names must preserve original order and formatting
        - Must not merge similar entities (e.g., "John Smith" and "J. Smith" are kept separate)
        - Email addresses and phone numbers must be validated for basic format
        - Physical addresses must include all components present in source"""
    ]
)

format_compliance = Guidelines(
    name="output_format",
    guidelines="""Output structure must meet these enterprise data standards:

    JSON Structure Requirements:
    - Must be valid JSON that passes strict parsing
    - All field names must use snake_case consistently
    - Nested objects must maintain hierarchy from source document
    - Arrays must be used for multiple values, never concatenated strings

    Data Type Standards:
    - Dates: ISO 8601 format (YYYY-MM-DD) with timezone if available
    - Timestamps: ISO 8601 with time (YYYY-MM-DDTHH:MM:SSZ)
    - Currency: {"amount": 1234.56, "currency": "USD", "formatted": "$1,234.56"}
    - Phone: {"number": "+14155551234", "formatted": "(415) 555-1234", "type": "mobile"}
    - Boolean: true/false (not "yes"/"no" or 1/0)

    Metadata Requirements:
    - Include extraction_timestamp in UTC
    - Include source_page for multi-page documents
    - Include confidence_score (0-1) for each ML-extracted field
    - Include validation_flags array for any data quality issues detected"""
)

# Example document extraction data
extraction_tasks = [
    {
        "inputs": {
            "document_text": "Invoice #INV-2024-001\nDate: 2024-01-15\nBill To: Acme Corp\n123 Main St, Suite 100\nAnytown, CA 94000\n\nItems:\n- Widget Pro (SKU: WP-100) - Qty: 10 x $50.00 = $500.00\n- Service Fee - $100.00\n\nSubtotal: $600.00\nTax (8.75%): $52.50\nTotal: $652.50\n\nDue Date: 2024-02-15\nPayment Terms: Net 30",
            "fields_to_extract": ["invoice_number", "customer", "total_amount", "due_date", "line_items"]
        },
        "outputs": {
            "invoice_number": "INV-2024-001",
            "customer": {
                "name": "Acme Corp",
                "address": {
                    "street": "123 Main St, Suite 100",
                    "city": "Anytown",
                    "state": "CA",
                    "zip": "94000"
                }
            },
            "total_amount": {
                "amount": 652.50,
                "currency": "USD",
                "formatted": "$652.50"
            },
            "due_date": "2024-02-15",
            "line_items": [
                {
                    "description": "Widget Pro",
                    "sku": "WP-100",
                    "quantity": 10,
                    "unit_price": 50.00,
                    "total": 500.00
                },
                {
                    "description": "Service Fee",
                    "quantity": 1,
                    "unit_price": 100.00,
                    "total": 100.00
                }
            ],
            "extraction_timestamp": "2024-01-20T10:30:00Z",
            "source_page": 1,
            "confidence_score": 0.95
        }
    },
    {
        "inputs": {
            "document_text": "Contract between TechStart Inc. and CloudProvider LLC\nEffective Date: January 1, 2024\nContract ID: C-2024-789\n\nThis agreement outlines cloud hosting services...\nMonthly Fee: €5,000\nContract Term: 24 months\nCancellation: 90 days written notice required",
            "fields_to_extract": ["contract_id", "parties", "monthly_fee", "term_length"]
        },
        "outputs": {
            "contract_id": "C-2024-789",
            "parties": [
                {"name": "TechStart Inc.", "role": "customer"},
                {"name": "CloudProvider LLC", "role": "provider"}
            ],
            "monthly_fee": {
                "amount": 5000.00,
                "currency": "EUR",
                "formatted": "€5,000"
            },
            "term_length": {
                "duration": 24,
                "unit": "months"
            },
            "cancellation_notice": {
                "days": 90,
                "type": "written"
            },
            "extraction_timestamp": "2024-01-20T10:35:00Z",
            "confidence_score": 0.92
        }
    }
]

# Evaluate document extractions
results = mlflow.genai.evaluate(
    data=extraction_tasks,
    scorers=[extraction_accuracy, format_compliance]
)

Per-row guidelines by document type

from mlflow.genai.scorers import ExpectationsGuidelines
import mlflow

# Dataset with document-type specific guidelines
document_extraction_data = [
    {
        "inputs": {
            "document_type": "invoice",
            "document_text": "Invoice #INV-2024-001\nBill To: Acme Corp\nAmount: $1,234.56\nDue Date: 2024-03-15"
        },
        "outputs": {
            "invoice_number": "INV-2024-001",
            "customer": "Acme Corp",
            "total_amount": 1234.56,
            "due_date": "2024-03-15"
        },
        "expectations": {
            "guidelines": [
                """Invoice identification and classification:
                - Must extract invoice_number preserving exact format including prefixes/suffixes
                - Must identify invoice type (standard, credit memo, proforma) if specified
                - Must extract both invoice date and due date, calculating days until due
                - Must identify if this is a partial, final, or supplementary invoice
                - For recurring invoices, must extract frequency and period covered""",

                """Financial data extraction and validation:
                - Line items must be extracted as array with: description, quantity, unit_price, total
                - Must identify and separate: subtotal, tax amounts (with rates), shipping, discounts
                - Currency must be identified explicitly, not assumed to be USD
                - For discounts, must specify if percentage or fixed amount and what it applies to
                - Payment terms must be extracted (e.g., "Net 30", "2/10 Net 30")
                - Must flag any mathematical inconsistencies between line items and totals""",

                """Vendor and customer information:
                - Must extract complete billing and shipping addresses as separate objects
                - Company names must include any DBA ("doing business as") variations
                - Must extract tax IDs, business registration numbers if present
                - Contact information must be categorized (billing contact vs. delivery contact)
                - Must preserve any customer account numbers or reference codes"""
            ]
        }
    },
    {
        "inputs": {
            "document_type": "contract",
            "document_text": "This agreement between Party A and Party B commences on January 1, 2024..."
        },
        "outputs": {
            "parties": ["Party A", "Party B"],
            "effective_date": "2024-01-01",
            "term_length": "Not specified"
        },
        "expectations": {
            "guidelines": [
                """Party identification and roles:
                - Must extract all parties with their full legal names and entity types (Inc., LLC, etc.)
                - Must identify party roles (buyer/seller, licensee/licensor, employer/employee)
                - Must extract any parent company relationships or guarantors mentioned
                - Must capture all representatives, their titles, and authority to sign
                - Must identify jurisdiction for each party if specified""",

                """Critical dates and terms extraction:
                - Must differentiate between: execution date, effective date, and expiration date
                - Must extract notice periods for termination (e.g., "30 days written notice")
                - Must identify any automatic renewal clauses and their conditions
                - Must extract all milestone dates and deliverable deadlines
                - For amendments, must note which version/date of original contract is modified""",

                """Obligations and risk analysis:
                - Must extract all payment terms, amounts, and schedules
                - Must identify liability caps, indemnification clauses, and insurance requirements
                - Must flag any non-standard clauses that deviate from typical contracts
                - Must extract all conditions precedent and subsequent
                - Must identify dispute resolution mechanism (arbitration, litigation, jurisdiction)
                - Must extract any non-compete, non-solicitation, or confidentiality periods"""
            ]
        }
    },
    {
        "inputs": {
            "document_type": "medical_record",
            "document_text": "Patient: John Doe\nDOB: 1985-06-15\nDiagnosis: Type 2 Diabetes\nMedications: Metformin 500mg"
        },
        "outputs": {
            "patient_name": "John Doe",
            "date_of_birth": "1985-06-15",
            "diagnoses": ["Type 2 Diabetes"],
            "medications": [{"name": "Metformin", "dosage": "500mg"}]
        },
        "expectations": {
            "guidelines": [
                """HIPAA compliance and privacy protection:
                - Must never extract full SSN (only last 4 digits if needed for matching)
                - Must never include full insurance policy numbers or member IDs
                - Must redact or generalize sensitive mental health or substance abuse information
                - For minors, must flag records requiring additional consent for sharing
                - Must not extract genetic testing results without explicit permission flag""",

                """Clinical data extraction standards:
                - Diagnoses must use ICD-10 codes when available, with lay descriptions
                - Medications must include: generic name, brand name, dosage, frequency, route, start date
                - Must differentiate between active medications and discontinued/past medications
                - Allergies must specify type (drug, food, environmental) and reaction severity
                - Lab results must include: value, unit, reference range, abnormal flags
                - Vital signs must include measurement date/time and measurement conditions""",

                """Data quality and medical accuracy:
                - Must flag any potentially dangerous drug interactions if multiple meds listed
                - Must identify if vaccination records are up-to-date based on CDC guidelines
                - Must extract both chief complaint and final diagnosis separately
                - For chronic conditions, must note date of first diagnosis vs. most recent visit
                - Must preserve clinical abbreviations but also provide expansions
                - Must extract provider name, credentials, and NPI number if available"""
            ]
        }
    }
]

results = mlflow.genai.evaluate(
    data=document_extraction_data,
    scorers=[ExpectationsGuidelines()]
)
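
Before running the evaluation, it can help to confirm that every row actually carries its guidelines under expectations["guidelines"], as in the dataset above. The following is a minimal sketch in plain Python. It assumes guidelines are supplied as a non-empty list of strings; the helper name rows_missing_guidelines is hypothetical and not part of MLflow.

```python
def rows_missing_guidelines(rows):
    """Return the indexes of rows that lack a non-empty list of guideline strings."""
    missing = []
    for index, row in enumerate(rows):
        guidelines = row.get("expectations", {}).get("guidelines")
        is_valid = (
            isinstance(guidelines, list)
            and len(guidelines) > 0
            and all(isinstance(g, str) and g.strip() for g in guidelines)
        )
        if not is_valid:
            missing.append(index)
    return missing


# Hypothetical rows with the same shape as the dataset above
sample_rows = [
    {"inputs": {}, "outputs": {}, "expectations": {"guidelines": ["Must extract invoice_number"]}},
    {"inputs": {}, "outputs": {}, "expectations": {}},  # no guidelines key
    {"inputs": {}, "outputs": {}},                      # no expectations at all
]

print(rows_missing_guidelines(sample_rows))
```

Rows reported by a check like this can be fixed or filtered out before they are passed to mlflow.genai.evaluate().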

Custom scorer for validation

from mlflow.genai.scorers import scorer
from mlflow.genai.judges import meets_guidelines
import mlflow

@scorer
def document_extraction_validator(inputs, outputs, trace=None):
    """Validate document extraction with source verification"""
    feedbacks = []

    # Get source document and extraction schema
    # Note: In production, extraction_schema would come from your trace
    # For this example, we're using data from inputs/outputs
    source_document = inputs.get("document_text", "")
    required_fields = outputs.get("extraction_schema", {})
    document_type = inputs.get("document_type", "general")

    # Validate completeness
    completeness_feedback = meets_guidelines(
        name="extraction_completeness",
        guidelines=[
            "The extracted_data must include all fields specified in required_fields",
            "The extracted_data must not have empty values for critical fields (marked as required)",
            "The extracted_data must indicate confidence scores for uncertain extractions"
        ],
        context={
            "source_document": source_document,
            "extracted_data": outputs,
            "required_fields": list(required_fields.keys()) if required_fields else [],
            "document_type": document_type
        }
    )
    feedbacks.append(completeness_feedback)

    # Validate accuracy against source
    accuracy_feedback = meets_guidelines(
        name="source_fidelity",
        guidelines="""The extracted_data must:
        - Only contain values that can be found in source_document
        - Preserve original capitalization for proper nouns and IDs
        - Not infer or calculate values not explicitly stated
        - Match the exact format of reference numbers in the source""",
        context={
            "source_document": source_document,
            "extracted_data": outputs
        }
    )
    feedbacks.append(accuracy_feedback)

    # Document-type specific validation
    if document_type == "financial":
        financial_feedback = meets_guidelines(
            name="financial_compliance",
            guidelines="""Financial document extraction compliance:

            Calculation Validation:
            - All line items must sum to match the subtotal within 0.01 tolerance
            - Tax calculations must equal (subtotal * tax_rate) within 0.01 tolerance
            - Total must equal (subtotal + tax + shipping - discounts) exactly
            - For multi-currency documents, must validate each currency separately
            - Must flag if any percentage exceeds 100% or is negative (except discounts)

            Security and Privacy:
            - Full account numbers must be masked, showing only last 4 digits
            - Routing numbers can be shown for verification but must be validated (9 digits)
            - Credit card numbers must show only first 6 (BIN) and last 4 digits
            - SSN/EIN must be partially masked (XXX-XX-1234 format)

            Regulatory Compliance:
            - Must extract and validate any compliance numbers (SOX, Basel III references)
            - For transactions over $10,000, must flag for AML review
            - Must identify if document requires SOC2 or PCI compliance handling
            - International transactions must include SWIFT/IBAN validation

            Anomaly Detection:
            - Flag unusual patterns: round numbers for all items, sequential invoice numbers
            - Flag if tax rate doesn't match known rates for the jurisdiction
            - Flag if payment terms exceed standard (Net 90+ is unusual)
            - Flag if discount percentage exceeds 50% without authorization code""",
            context={
                "source_document": source_document,
                "extracted_data": outputs,
                "document_type": document_type,
                "jurisdiction": outputs.get("jurisdiction", "US"),
                "known_tax_rates": outputs.get("tax_rates", {"US": 0.0875, "CA": 0.0725})
            }
        )
        feedbacks.append(financial_feedback)

    elif document_type == "legal":
        legal_feedback = meets_guidelines(
            name="legal_extraction_standards",
            guidelines=[
                "The extracted_data must preserve exact legal language for clauses",
                "The extracted_data must maintain hierarchical structure of sections",
                "The extracted_data must not paraphrase legal definitions",
                "The extracted_data must extract all cross-references to other sections"
            ],
            context={
                "source_document": source_document,
                "extracted_data": outputs,
                "document_type": document_type
            }
        )
        feedbacks.append(legal_feedback)

    return feedbacks

# Example financial document extraction for validation
financial_doc_data = [
    {
        "inputs": {
            "document_type": "financial",
            "document_text": """INVOICE #2024-FIN-001
            Date: January 15, 2024

            Bill To:
            GlobalTech Solutions Inc.
            Tax ID: 12-3456789
            500 Enterprise Way
            San Francisco, CA 94105

            Description              Qty    Unit Price    Amount
            Cloud Services Plan       1      $5,000.00    $5,000.00
            API Usage (per 1M)       50     $   20.00    $1,000.00
            Premium Support          1      $  500.00    $  500.00

            Subtotal:                                     $6,500.00
            Tax (8.75%):                                  $  568.75
            Total Due:                                    $7,068.75

            Payment Terms: Net 30
            Account: ****1234
            Please remit payment to routing number 123456789"""
        },
        "outputs": {
            "invoice_number": "2024-FIN-001",
            "bill_to": {
                "company": "GlobalTech Solutions Inc.",
                "tax_id": "12-3XXXXX89",  # Masked
                "address": "500 Enterprise Way, San Francisco, CA 94105"
            },
            "line_items": [
                {"description": "Cloud Services Plan", "quantity": 1, "unit_price": 5000.00, "amount": 5000.00},
                {"description": "API Usage (per 1M)", "quantity": 50, "unit_price": 20.00, "amount": 1000.00},
                {"description": "Premium Support", "quantity": 1, "unit_price": 500.00, "amount": 500.00}
            ],
            "financial_summary": {
                "subtotal": 6500.00,
                "tax_rate": 0.0875,
                "tax_amount": 568.75,
                "total": 7068.75
            },
            "payment_info": {
                "terms": "Net 30",
                "account_last_four": "1234",
                "routing_number": "123456789"  # Full routing OK for validation
            },
            "extraction_schema": {
                "invoice_number": "required",
                "bill_to": "required",
                "line_items": "required",
                "total": "required"
            },
            "jurisdiction": "CA",
            "tax_rates": {"CA": 0.0875, "US": 0.10}
        }
    },
    {
        "inputs": {
            "document_type": "legal",
            "document_text": """SOFTWARE LICENSE AGREEMENT

            This Agreement is entered into as of January 1, 2024 ("Effective Date")

            BETWEEN:
            TechCorp Inc., a Delaware corporation ("Licensor")
            123 Tech Plaza, Wilmington, DE 19801

            AND:
            StartupCo LLC, a California limited liability company ("Licensee")
            456 Innovation Drive, Palo Alto, CA 94301

            TERMS:
            1. License Grant: Non-exclusive, worldwide license
            2. Term: 36 months from Effective Date
            3. Fees: $10,000 monthly, due on 1st of each month
            4. Termination: Either party may terminate with 60 days written notice
            5. Liability Cap: Limited to fees paid in prior 12 months
            6. Dispute Resolution: Binding arbitration in Delaware
            7. Confidentiality Period: 5 years from termination"""
        },
        "outputs": {
            "agreement_type": "SOFTWARE LICENSE AGREEMENT",
            "parties": [
                {
                    "name": "TechCorp Inc.",
                    "entity_type": "Delaware corporation",
                    "role": "Licensor",
                    "address": "123 Tech Plaza, Wilmington, DE 19801"
                },
                {
                    "name": "StartupCo LLC",
                    "entity_type": "California limited liability company",
                    "role": "Licensee",
                    "address": "456 Innovation Drive, Palo Alto, CA 94301"
                }
            ],
            "key_dates": {
                "effective_date": "2024-01-01",
                "term_months": 36,
                "expiration_date": "2027-01-01"
            },
            "financial_terms": {
                "monthly_fee": 10000.00,
                "payment_due": "1st of each month",
                "total_contract_value": 360000.00
            },
            "termination": {
                "notice_period": "60 days",
                "notice_type": "written"
            },
            "risk_provisions": {
                "liability_cap": "fees paid in prior 12 months",
                "dispute_resolution": "Binding arbitration",
                "jurisdiction": "Delaware"
            },
            "confidentiality_period": "5 years from termination",
            "extraction_schema": {
                "parties": "required",
                "financial_terms": "required",
                "termination": "required"
            }
        }
    }
]

# Use in evaluation
results = mlflow.genai.evaluate(
    data=financial_doc_data,
    scorers=[document_extraction_validator]
)
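
The calculation rules in the financial_compliance guidelines are plain arithmetic, so they can also be cross-checked in ordinary Python alongside the judge. This is a minimal sketch, assuming the output layout used in the example above (line_items with an amount field and a financial_summary object, with no shipping or discounts). The helper check_invoice_math is hypothetical and not part of MLflow.

```python
def check_invoice_math(extracted, tolerance=0.01):
    """Return a list of arithmetic problems found in an extracted invoice."""
    problems = []
    summary = extracted["financial_summary"]

    # Line items should sum to the subtotal within the tolerance
    line_total = sum(item["amount"] for item in extracted["line_items"])
    if abs(line_total - summary["subtotal"]) > tolerance:
        problems.append("line items do not sum to subtotal")

    # Tax should equal subtotal * tax_rate within the tolerance
    expected_tax = summary["subtotal"] * summary["tax_rate"]
    if abs(expected_tax - summary["tax_amount"]) > tolerance:
        problems.append("tax amount does not match subtotal * tax_rate")

    # Total should equal subtotal + tax, compared at cent precision
    expected_total = summary["subtotal"] + summary["tax_amount"]
    if round(expected_total, 2) != round(summary["total"], 2):
        problems.append("total does not equal subtotal + tax")

    return problems


# Values copied from the financial invoice example above
invoice = {
    "line_items": [
        {"description": "Cloud Services Plan", "amount": 5000.00},
        {"description": "API Usage (per 1M)", "amount": 1000.00},
        {"description": "Premium Support", "amount": 500.00},
    ],
    "financial_summary": {
        "subtotal": 6500.00,
        "tax_rate": 0.0875,
        "tax_amount": 568.75,
        "total": 7068.75,
    },
}

print(check_invoice_math(invoice))
```

A deterministic check like this complements the judge rather than replacing it: the arithmetic is verified exactly, while the LLM judge covers the rules that need interpretation, such as masking and anomaly detection.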

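Next steps

Create guidelines-based scorers: a step-by-step guide to implementing guidelines judges
Use predefined scorers: apply the prebuilt guidelines scorers and other scorers
Evaluation concepts overview: understand how judges fit into the evaluation framework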
