Prompt-based judges

Overview

Prompt-based judges support multi-level quality assessment through customizable choice categories (for example, excellent/good/poor) with optional numeric scoring. Unlike guidelines-based judges, which provide binary pass/fail assessments, prompt-based judges offer:

  • Graded rating levels with numeric mappings for tracking improvement
  • Full prompt control for complex, multi-dimensional evaluation criteria
  • Domain-specific categories tailored to your use case
  • Aggregatable metrics for measuring quality trends over time

When to use

Choose prompt-based judges when you need:

  • Multi-level quality assessment beyond simple pass/fail
  • Numeric scores for quantitative analysis and version comparison
  • Complex evaluation criteria that require custom categories
  • Aggregated metrics across datasets

Choose guidelines-based judges when you need:

  • Simple pass/fail compliance evaluation
  • Business stakeholders to write and update criteria without coding
  • Rapid iteration on evaluation rules

Important

While prompt-based judges can be used as a standalone API/SDK, they must be wrapped in scorers for use with the Evaluation Harness and the production monitoring service.
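
As a quick orientation, wrapping a prompt-based judge in a scorer looks roughly like the sketch below (a minimal sketch assuming a judge object such as response_quality_judge created with custom_prompt_judge; a complete, runnable example appears in the Customer service quality section later in this page):

from mlflow.genai.scorers import scorer

# Assumes response_quality_judge was created with custom_prompt_judge
@scorer
def response_quality_scorer(inputs, outputs, trace):
    # Delegate to the prompt-based judge; the returned Feedback is logged
    # under the judge's name when used with mlflow.genai.evaluate()
    return response_quality_judge(
        request=inputs.get("request", ""),
        response=outputs.get("response", "")
    )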

Prerequisites for running the examples

  1. Install MLflow and the required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0"
    
  2. Follow the Set up your environment quickstart to create an MLflow experiment.
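
     Outside a Databricks notebook, that setup typically involves configuring authentication and selecting an experiment, roughly like the following (a minimal sketch; the workspace URL, token, and experiment path are placeholders, not values from this guide):

     import os
     import mlflow

     # Authenticate to your Databricks workspace (placeholder values)
     os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"
     os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"

     # Point MLflow at Databricks and select (or create) an experiment
     mlflow.set_tracking_uri("databricks")
     mlflow.set_experiment("/Shared/prompt-judge-examples")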

Here is a simple example that shows the power of prompt-based judges:

from mlflow.genai.judges import custom_prompt_judge

# Create a multi-level quality judge
response_quality_judge = custom_prompt_judge(
    name="response_quality",
    prompt_template="""Evaluate the quality of this customer service response:

<request>{{request}}</request>
<response>{{response}}</response>

Choose the most appropriate rating:

[[excellent]]: Empathetic, complete solution, proactive help offered
[[good]]: Addresses the issue adequately with professional tone
[[poor]]: Incomplete, unprofessional, or misses key concerns""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "poor": 0.0
    }
)

# Direct usage
feedback = response_quality_judge(
    request="My order arrived damaged!",
    response="I'm so sorry to hear that. I've initiated a replacement order that will arrive tomorrow, and issued a full refund. Is there anything else I can help with?"
)

print(feedback.value)         # 1.0
print(feedback.metadata)      # {"string_value": "excellent"}
print(feedback.rationale)     # Detailed explanation of the rating

Core concepts

SDK overview

The custom_prompt_judge function creates a custom LLM judge that evaluates inputs against your prompt template:

from mlflow.genai.judges import custom_prompt_judge

judge = custom_prompt_judge(
    name="formality",
    prompt_template="...",  # Your custom prompt with {{variables}} and [[choices]]
    numeric_values={"formal": 1.0, "informal": 0.0}  # Optional numeric mapping
)

# Returns an mlflow.entities.Feedback object
feedback = judge(request="Hello", response="Hey there!")

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Name of the assessment, shown in the MLflow UI to identify the judge's output |
| prompt_template | str | Yes | Template string containing {{variables}} placeholders for dynamic content and [[choices]] that define the options the judge must select from |
| numeric_values | dict[str, float] \| None | No | Maps choice names to numeric scores (a 0-1 scale is recommended). If omitted, the judge returns the string choice value; if provided, it returns the numeric score and stores the string choice in metadata |
| model | str \| None | No | Specific judge model to use (defaults to MLflow's optimized judge model) |
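
Putting the parameters together (a minimal sketch; the model URI is an illustrative assumption and the parameter can be omitted to use MLflow's default judge model):

from mlflow.genai.judges import custom_prompt_judge

# All four parameters in a single call; model is optional
tone_judge = custom_prompt_judge(
    name="tone",
    prompt_template="""Rate the tone of this reply:

<response>{{response}}</response>

[[professional]]: Courteous, polished language
[[neutral]]: Acceptable but plain
[[unprofessional]]: Rude or careless""",
    numeric_values={"professional": 1.0, "neutral": 0.5, "unprofessional": 0.0},
    model="openai:/gpt-4o"  # illustrative URI; omit to use the default judge model
)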

Why use numeric mappings?

When there are multiple choice labels (for example, "excellent", "good", "poor"), string values make it hard to track quality improvements across evaluation runs.

Numeric mappings enable:

  • Quantitative comparison: see whether average quality improved from 0.6 to 0.8
  • Aggregate metrics: compute average scores across a dataset
  • Version comparison: track whether a change improved or degraded quality
  • Threshold-based monitoring: set alerts when quality drops below an acceptable level

Without numeric values, you only see the label distribution (for example, 40% "good", 60% "poor"), which makes it hard to measure overall improvement.
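
For example, aggregating numeric feedback values into an average with a simple alert threshold might look like this (a minimal sketch reusing response_quality_judge from above; the eval_pairs list and the 0.7 cutoff are illustrative assumptions):

# Run the judge over a small batch and aggregate the numeric scores
eval_pairs = [
    {"request": "Where is my order?", "response": "It shipped yesterday and arrives tomorrow."},
    {"request": "My order arrived damaged!", "response": "Sorry about that."},
]

scores = [
    response_quality_judge(request=p["request"], response=p["response"]).value
    for p in eval_pairs
]

avg_score = sum(scores) / len(scores)
print(f"Average quality: {avg_score:.2f}")

# Threshold-based check (0.7 is an illustrative cutoff)
if avg_score < 0.7:
    print("Quality below acceptable level - investigate recent changes")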

Return value

The function returns a callable object that:

  • Accepts keyword arguments matching the {{variables}} in your prompt template
  • Returns an mlflow.entities.Feedback object with:
    • value: the selected choice (string) or the numeric score (if numeric_values is provided)
    • rationale: the LLM's explanation for its choice
    • metadata: additional information, including the string choice when numeric values are used
    • name: the name you provided
    • error: error details if the evaluation fails
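
In code, you can branch on these fields (a minimal sketch reusing the response_quality_judge defined above):

feedback = response_quality_judge(
    request="Where is my package?",
    response="Let me check the tracking for you right away."
)

if feedback.error:
    # The judge call failed; inspect the error details instead of the score
    print(f"Judge '{feedback.name}' failed: {feedback.error}")
else:
    print(f"{feedback.name}: {feedback.value} ({feedback.metadata.get('string_value')})")
    print(feedback.rationale)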

Prompt template requirements

Choice definition format

Choices must be defined using double square brackets [[choice_name]]:

prompt_template = """Evaluate the response formality:

<request>{{request}}</request>
<response>{{response}}</response>

Select one category:

[[formal]]: Professional language, proper grammar, no contractions
[[semi_formal]]: Mix of professional and conversational elements
[[informal]]: Casual language, contractions, colloquialisms"""

Variable placeholders

Use double curly braces {{variable}} for dynamic content:

prompt_template = """Assess if the response uses appropriate sources:

Question: {{question}}
Response: {{response}}
Available Sources: {{retrieved_documents}}
Citation Policy: {{citation_policy}}

Choose one:

[[well_cited]]: All claims properly cite available sources
[[partially_cited]]: Some claims cite sources, others do not
[[poorly_cited]]: Claims lack proper citations"""

Common evaluation patterns

Likert scale pattern

Create standard 5-point or 7-point satisfaction scales:

satisfaction_judge = custom_prompt_judge(
    name="customer_satisfaction",
    prompt_template="""Based on this interaction, rate the likely customer satisfaction:

Customer Request: {{request}}
Agent Response: {{response}}

Select satisfaction level:

[[very_satisfied]]: Response exceeds expectations with exceptional service
[[satisfied]]: Response meets expectations adequately
[[neutral]]: Response is acceptable but unremarkable
[[dissatisfied]]: Response fails to meet basic expectations
[[very_dissatisfied]]: Response is unhelpful or problematic""",
    numeric_values={
        "very_satisfied": 1.0,
        "satisfied": 0.75,
        "neutral": 0.5,
        "dissatisfied": 0.25,
        "very_dissatisfied": 0.0
    }
)

Rubric-based scoring

Implement detailed scoring rubrics with explicit criteria:

code_review_rubric = custom_prompt_judge(
    name="code_review_rubric",
    prompt_template="""Evaluate this code review using our quality rubric:

Original Code: {{original_code}}
Review Comments: {{review_comments}}
Code Type: {{code_type}}

Score the review quality:

[[comprehensive]]: Identifies all issues including edge cases, security concerns, performance implications, and suggests specific improvements with examples
[[thorough]]: Catches major issues and most minor ones, provides good suggestions but may miss some edge cases
[[adequate]]: Identifies obvious issues and provides basic feedback, misses subtle problems
[[superficial]]: Only catches surface-level issues, feedback is vague or generic
[[inadequate]]: Misses critical issues or provides incorrect feedback""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.8,
        "adequate": 0.6,
        "superficial": 0.3,
        "inadequate": 0.0
    }
)

Real-world examples

Customer service quality

from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer
import mlflow

# Issue resolution status judge
resolution_judge = custom_prompt_judge(
    name="issue_resolution",
    prompt_template="""Evaluate if the customer's issue was resolved:

Customer Message: {{customer_message}}
Agent Response: {{agent_response}}
Issue Type: {{issue_type}}

Rate the resolution status:

[[fully_resolved]]: Issue completely addressed with clear solution provided
[[partially_resolved]]: Some progress made but follow-up needed
[[unresolved]]: Issue not addressed or solution unclear
[[escalated]]: Appropriately escalated to higher support tier""",
    numeric_values={
        "fully_resolved": 1.0,
        "partially_resolved": 0.5,
        "unresolved": 0.0,
        "escalated": 0.7  # Positive score for appropriate escalation
    }
)

# Empathy and tone judge
empathy_judge = custom_prompt_judge(
    name="empathy_score",
    prompt_template="""Assess the emotional intelligence of the response:

Customer Emotion: {{customer_emotion}}
Agent Response: {{agent_response}}

Rate the empathy shown:

[[exceptional]]: Acknowledges emotions, validates concerns, shows genuine care
[[good]]: Shows understanding and appropriate concern
[[adequate]]: Professional but somewhat impersonal
[[poor]]: Cold, dismissive, or inappropriate emotional response""",
    numeric_values={
        "exceptional": 1.0,
        "good": 0.75,
        "adequate": 0.5,
        "poor": 0.0
    }
)

# Create a comprehensive customer service scorer
@scorer
def customer_service_quality(inputs, outputs, trace):
    """Comprehensive customer service evaluation"""
    feedbacks = []

    # Evaluate resolution status
    feedbacks.append(resolution_judge(
        customer_message=inputs.get("message", ""),
        agent_response=outputs.get("response", ""),
        issue_type=inputs.get("issue_type", "general")
    ))

    # Evaluate empathy if customer shows emotion
    customer_emotion = inputs.get("detected_emotion", "neutral")
    if customer_emotion in ["frustrated", "angry", "upset", "worried"]:
        feedbacks.append(empathy_judge(
            customer_emotion=customer_emotion,
            agent_response=outputs.get("response", "")
        ))

    return feedbacks

# Example evaluation
eval_data = [
    {
        "inputs": {
            "message": "I've been waiting 3 weeks for my refund! This is unacceptable!",
            "issue_type": "refund",
            "detected_emotion": "angry"
        },
        "outputs": {
            "response": "I completely understand your frustration - 3 weeks is far too long to wait for a refund. I'm escalating this to our finance team immediately. You'll receive your refund within 24 hours, plus a $50 credit for the inconvenience. I'm also sending you my direct email so you can reach me if there are any other delays."
        }
    }
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[customer_service_quality]
)

Content quality assessment

# Technical documentation quality judge
doc_quality_judge = custom_prompt_judge(
    name="documentation_quality",
    prompt_template="""Evaluate this technical documentation:

Content: {{content}}
Target Audience: {{audience}}
Documentation Type: {{doc_type}}

Rate the documentation quality:

[[excellent]]: Clear, complete, well-structured with examples, appropriate depth
[[good]]: Covers topic well, mostly clear, could use minor improvements
[[fair]]: Basic coverage, some unclear sections, missing important details
[[poor]]: Confusing, incomplete, or significantly flawed""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "fair": 0.4,
        "poor": 0.0
    }
)

# Marketing copy effectiveness
marketing_judge = custom_prompt_judge(
    name="marketing_effectiveness",
    prompt_template="""Rate this marketing copy's effectiveness:

Copy: {{copy}}
Product: {{product}}
Target Demographic: {{target_demographic}}
Call to Action: {{cta}}

Evaluate effectiveness:

[[highly_effective]]: Compelling, clear value prop, strong CTA, perfect for audience
[[effective]]: Good messaging, decent CTA, reasonably targeted
[[moderately_effective]]: Some good elements but lacks impact or clarity
[[ineffective]]: Weak messaging, unclear value, poor audience fit""",
    numeric_values={
        "highly_effective": 1.0,
        "effective": 0.7,
        "moderately_effective": 0.4,
        "ineffective": 0.0
    }
)

Code review quality

# Security review judge
security_review_judge = custom_prompt_judge(
    name="security_review_quality",
    prompt_template="""Evaluate the security aspects of this code review:

Original Code: {{code}}
Review Comments: {{review_comments}}
Security Vulnerabilities Found: {{vulnerabilities_mentioned}}

Rate the security review quality:

[[comprehensive]]: Identifies all security issues, explains risks, suggests secure alternatives
[[thorough]]: Catches major security flaws, good explanations
[[basic]]: Identifies obvious security issues only
[[insufficient]]: Misses critical security vulnerabilities""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.75,
        "basic": 0.4,
        "insufficient": 0.0
    }
)

# Code clarity feedback judge
code_clarity_judge = custom_prompt_judge(
    name="code_clarity_feedback",
    prompt_template="""Assess the code review's feedback on readability:

Original Code Complexity: {{complexity_score}}
Review Feedback: {{review_comments}}
Readability Issues Identified: {{readability_issues}}

Rate the clarity feedback:

[[excellent]]: Identifies all clarity issues, suggests specific improvements, considers maintainability
[[good]]: Points out main clarity problems with helpful suggestions
[[adequate]]: Basic feedback on obvious readability issues
[[minimal]]: Superficial or missing important clarity feedback""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "adequate": 0.4,
        "minimal": 0.0
    }
)

Healthcare communication

# Patient communication appropriateness
patient_comm_judge = custom_prompt_judge(
    name="patient_communication",
    prompt_template="""Evaluate this healthcare provider's response to a patient:

Patient Question: {{patient_question}}
Provider Response: {{provider_response}}
Patient Health Literacy Level: {{health_literacy}}
Sensitive Topics: {{sensitive_topics}}

Rate communication appropriateness:

[[excellent]]: Clear, compassionate, appropriate language level, addresses concerns fully
[[good]]: Generally clear and caring, minor room for improvement
[[acceptable]]: Adequate but could be clearer or more empathetic
[[poor]]: Unclear, uses too much jargon, or lacks appropriate empathy""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "acceptable": 0.5,
        "poor": 0.0
    }
)

# Clinical note quality
clinical_note_judge = custom_prompt_judge(
    name="clinical_note_quality",
    prompt_template="""Assess this clinical note's quality:

Note Content: {{note_content}}
Note Type: {{note_type}}
Required Elements: {{required_elements}}

Rate the clinical documentation:

[[comprehensive]]: All required elements present, clear, follows standards, actionable
[[complete]]: Most elements present, generally clear, minor gaps
[[incomplete]]: Missing important elements or lacks clarity
[[deficient]]: Significant gaps, unclear, or doesn't meet documentation standards""",
    numeric_values={
        "comprehensive": 1.0,
        "complete": 0.7,
        "incomplete": 0.3,
        "deficient": 0.0
    }
)

Pairwise response comparison

Use a prompt-based judge to compare two responses and determine which one is better. This is useful for A/B testing, model comparison, or preference learning.

Note

Pairwise comparison judges cannot be used with mlflow.evaluate() or as scorers, because they evaluate two responses simultaneously rather than a single response. Use them directly for comparative analysis.

from mlflow.genai.judges import custom_prompt_judge

# Response preference judge
preference_judge = custom_prompt_judge(
    name="response_preference",
    prompt_template="""Compare these two responses to the same question and determine which is better:

Question: {{question}}

Response A: {{response_a}}

Response B: {{response_b}}

Evaluation Criteria:
1. Accuracy and completeness of information
2. Clarity and ease of understanding
3. Helpfulness and actionability
4. Appropriate tone for the context

Choose your preference:

[[strongly_prefer_a]]: Response A is significantly better across most criteria
[[slightly_prefer_a]]: Response A is marginally better overall
[[equal]]: Both responses are equally good (or equally poor)
[[slightly_prefer_b]]: Response B is marginally better overall
[[strongly_prefer_b]]: Response B is significantly better across most criteria""",
    numeric_values={
        "strongly_prefer_a": -1.0,
        "slightly_prefer_a": -0.5,
        "equal": 0.0,
        "slightly_prefer_b": 0.5,
        "strongly_prefer_b": 1.0
    }
)

# Example usage for model comparison
question = "How do I improve my GenAI app's response quality?"

response_model_v1 = """To improve response quality, you should:
1. Add more training data
2. Fine-tune your model
3. Use better prompts"""

response_model_v2 = """To improve your GenAI app's response quality, consider these strategies:

1. **Enhance your prompts**: Use clear, specific instructions with examples
2. **Implement evaluation**: Use MLflow's LLM judges to measure quality systematically
3. **Collect feedback**: Gather user feedback to identify improvement areas
4. **Iterate on weak areas**: Focus on responses that score poorly
5. **A/B test changes**: Compare versions to ensure improvements

Start with evaluation to establish a baseline, then iterate based on data."""

# Compare responses
feedback = preference_judge(
    question=question,
    response_a=response_model_v1,
    response_b=response_model_v2
)

print(f"Preference: {feedback.metadata['string_value']}")  # "strongly_prefer_b"
print(f"Score: {feedback.value}")  # 1.0
print(f"Rationale: {feedback.rationale}")

Specialized comparison judges

# Technical accuracy comparison for documentation
tech_comparison_judge = custom_prompt_judge(
    name="technical_comparison",
    prompt_template="""Compare these two technical explanations:

Topic: {{topic}}
Target Audience: {{audience}}

Explanation A: {{explanation_a}}

Explanation B: {{explanation_b}}

Focus on:
- Technical accuracy and precision
- Appropriate depth for the audience
- Use of examples and analogies
- Completeness without overwhelming detail

Which explanation is better?

[[a_much_better]]: A is significantly more accurate and appropriate
[[a_slightly_better]]: A is marginally better in accuracy or clarity
[[equivalent]]: Both are equally good technically
[[b_slightly_better]]: B is marginally better in accuracy or clarity
[[b_much_better]]: B is significantly more accurate and appropriate""",
    numeric_values={
        "a_much_better": -1.0,
        "a_slightly_better": -0.5,
        "equivalent": 0.0,
        "b_slightly_better": 0.5,
        "b_much_better": 1.0
    }
)

# Empathy comparison for customer service
empathy_comparison_judge = custom_prompt_judge(
    name="empathy_comparison",
    prompt_template="""Compare the emotional intelligence of these customer service responses:

Customer Situation: {{situation}}
Customer Emotion: {{emotion}}

Agent Response A: {{response_a}}

Agent Response B: {{response_b}}

Evaluate which response better:
- Acknowledges the customer's emotions
- Shows genuine understanding and care
- Offers appropriate emotional support
- Maintains professional boundaries

Which response shows better emotional intelligence?

[[a_far_superior]]: A shows much better emotional intelligence
[[a_better]]: A is somewhat more empathetic
[[both_good]]: Both show good emotional intelligence
[[b_better]]: B is somewhat more empathetic
[[b_far_superior]]: B shows much better emotional intelligence""",
    numeric_values={
        "a_far_superior": -1.0,
        "a_better": -0.5,
        "both_good": 0.0,
        "b_better": 0.5,
        "b_far_superior": 1.0
    }
)

Practical comparison workflows

# Compare outputs from different prompt versions
def compare_prompt_versions(test_cases, prompt_v1, prompt_v2, model_client):
    """Compare two prompt versions across multiple test cases"""
    results = []

    for test_case in test_cases:
        # Generate responses with each prompt
        response_v1 = model_client.generate(prompt_v1.format(**test_case))
        response_v2 = model_client.generate(prompt_v2.format(**test_case))

        # Compare responses
        feedback = preference_judge(
            question=test_case["question"],
            response_a=response_v1,
            response_b=response_v2
        )

        results.append({
            "question": test_case["question"],
            "preference": feedback.metadata["string_value"],
            "score": feedback.value,
            "rationale": feedback.rationale
        })

    # Analyze results
    avg_score = sum(r["score"] for r in results) / len(results)

    if avg_score < -0.2:
        print(f"Prompt V1 is preferred (avg score: {avg_score:.2f})")
    elif avg_score > 0.2:
        print(f"Prompt V2 is preferred (avg score: {avg_score:.2f})")
    else:
        print(f"Prompts perform similarly (avg score: {avg_score:.2f})")

    return results

# Compare different model outputs
def compare_models(questions, model_a, model_b, comparison_judge):
    """Compare two models across a set of questions"""
    win_counts = {"model_a": 0, "model_b": 0, "tie": 0}

    for question in questions:
        response_a = model_a.generate(question)
        response_b = model_b.generate(question)

        feedback = comparison_judge(
            question=question,
            response_a=response_a,
            response_b=response_b
        )

        # Count wins based on preference strength
        if feedback.value <= -0.5:
            win_counts["model_a"] += 1
        elif feedback.value >= 0.5:
            win_counts["model_b"] += 1
        else:
            win_counts["tie"] += 1

    print(f"Model comparison results: {win_counts}")
    return win_counts

Advanced usage patterns

Conditional scoring

Apply different evaluation criteria based on context:

@scorer
def adaptive_quality_scorer(inputs, outputs, trace):
    """Applies different judges based on context"""

    # Determine which judge to use based on input characteristics
    query_type = inputs.get("query_type", "general")

    if query_type == "technical":
        judge = custom_prompt_judge(
            name="technical_response",
            prompt_template="""Rate this technical response:

Question: {{question}}
Response: {{response}}
Required Depth: {{depth_level}}

[[expert]]: Demonstrates deep expertise, includes advanced concepts
[[proficient]]: Good technical accuracy, appropriate depth
[[basic]]: Correct but lacks depth or nuance
[[incorrect]]: Contains technical errors or misconceptions""",
            numeric_values={
                "expert": 1.0,
                "proficient": 0.75,
                "basic": 0.5,
                "incorrect": 0.0
            }
        )

        return judge(
            question=inputs["question"],
            response=outputs["response"],
            depth_level=inputs.get("required_depth", "intermediate")
        )

    elif query_type == "support":
        judge = custom_prompt_judge(
            name="support_response",
            prompt_template="""Rate this support response:

Issue: {{issue}}
Response: {{response}}
Customer Status: {{customer_status}}

[[excellent]]: Solves issue completely, proactive, appropriate for customer status
[[good]]: Addresses issue well, professional
[[fair]]: Partially helpful but incomplete
[[poor]]: Unhelpful or inappropriate""",
            numeric_values={
                "excellent": 1.0,
                "good": 0.7,
                "fair": 0.4,
                "poor": 0.0
            }
        )

        return judge(
            issue=inputs["question"],
            response=outputs["response"],
            customer_status=inputs.get("customer_status", "standard")
        )

Score aggregation strategies

Combine scores from multiple judges intelligently:

@scorer
def weighted_quality_scorer(inputs, outputs, trace):
    """Combines multiple judges with weighted scoring"""

    # Define judges and their weights
    judges_config = [
        {
            "judge": custom_prompt_judge(
                name="accuracy",
                prompt_template="...",  # Your accuracy template
                numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0}
            ),
            "weight": 0.4,
            "args": {"question": inputs["question"], "response": outputs["response"]}
        },
        {
            "judge": custom_prompt_judge(
                name="completeness",
                prompt_template="...",  # Your completeness template
                numeric_values={"complete": 1.0, "partial": 0.5, "incomplete": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"], "requirements": inputs.get("requirements", [])}
        },
        {
            "judge": custom_prompt_judge(
                name="clarity",
                prompt_template="...",  # Your clarity template
                numeric_values={"clear": 1.0, "adequate": 0.6, "unclear": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"]}
        }
    ]

    # Collect all feedbacks
    feedbacks = []
    weighted_score = 0.0

    for config in judges_config:
        feedback = config["judge"](**config["args"])
        feedbacks.append(feedback)

        # Add to weighted score if numeric
        if isinstance(feedback.value, (int, float)):
            weighted_score += feedback.value * config["weight"]

    # Add composite score as additional feedback
    from mlflow.entities import Feedback
    composite_feedback = Feedback(
        name="weighted_quality_score",
        value=weighted_score,
        rationale=f"Weighted combination of {len(judges_config)} quality dimensions"
    )
    feedbacks.append(composite_feedback)

    return feedbacks

Best practices

Design effective choices

1. Make choices mutually exclusive and exhaustive

# Good - clear distinctions, covers all cases
"""[[approved]]: Meets all requirements, ready for production
[[needs_revision]]: Has issues that must be fixed before approval
[[rejected]]: Fundamental flaws, requires complete rework"""

# Bad - overlapping and ambiguous
"""[[good]]: The response is good
[[okay]]: The response is okay
[[fine]]: The response is fine"""

2. Provide specific criteria for each choice

# Good - specific, measurable criteria
"""[[secure]]: No vulnerabilities, follows all security best practices, includes input validation
[[mostly_secure]]: Minor security concerns that should be addressed but aren't critical
[[insecure]]: Contains vulnerabilities that could be exploited"""

# Bad - vague criteria
"""[[secure]]: Looks secure
[[not_secure]]: Has problems"""

3. Order choices logically (best to worst)

# Good - clear progression
numeric_values = {
    "exceptional": 1.0,
    "good": 0.75,
    "satisfactory": 0.5,
    "needs_improvement": 0.25,
    "unacceptable": 0.0
}

Numeric scale design

1. Use consistent scales across judges

# All judges use 0-1 scale
quality_judge = custom_prompt_judge(..., numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0})
accuracy_judge = custom_prompt_judge(..., numeric_values={"accurate": 1.0, "partial": 0.5, "wrong": 0.0})

2. Leave gaps for future refinement

# Allows adding intermediate levels later
numeric_values = {
    "excellent": 1.0,
    "good": 0.7,    # Gap allows for "very_good" at 0.85
    "fair": 0.4,    # Gap allows for "satisfactory" at 0.55
    "poor": 0.0
}

3. Consider ___domain-specific scales

# Academic grading scale
academic_scale = {
    "A": 4.0,
    "B": 3.0,
    "C": 2.0,
    "D": 1.0,
    "F": 0.0
}

# Net Promoter Score scale
nps_scale = {
    "promoter": 1.0,      # 9-10
    "passive": 0.0,       # 7-8
    "detractor": -1.0     # 0-6
}

Prompt design tips

1. Structure prompts clearly

prompt_template = """[Clear Task Description]
Evaluate the technical accuracy of this response.

[Context Section]
Question: {{question}}
Response: {{response}}
Technical Domain: {{___domain}}

[Evaluation Criteria]
Consider: factual accuracy, appropriate depth, correct terminology

[Choice Definitions]
[[accurate]]: All technical facts correct, appropriate level of detail
[[mostly_accurate]]: Minor inaccuracies that don't affect core understanding
[[inaccurate]]: Contains significant errors or misconceptions"""

2. Include examples when helpful

prompt_template = """Assess the urgency level of this support ticket.

Ticket: {{ticket_content}}

Examples of each level:
- Critical: System down, data loss, security breach
- High: Major feature broken, blocking work
- Medium: Performance issues, non-critical bugs
- Low: Feature requests, minor UI issues

Choose urgency level:
[[critical]]: Immediate attention required, business impact
[[high]]: Urgent, significant user impact
[[medium]]: Important but not urgent
[[low]]: Can be addressed in normal workflow"""

Comparison with guidelines-based judges

| Aspect | Guidelines-based | Prompt-based |
| --- | --- | --- |
| Evaluation type | Binary pass/fail | Multi-level categories |
| Scoring | "yes" or "no" | Custom choices with optional numeric values |
| Best for | Compliance, policy adherence | Quality assessment, satisfaction ratings |
| Iteration speed | Very fast - just update the guideline text | Moderate - may require adjusting choices |
| Business-user friendly | ✅ High - natural-language rules | ⚠️ Medium - requires understanding choices and the full prompt |
| Aggregation | Pass/fail rates | Averages, trend tracking |

Validation and error handling

Choice validation

The judge validates that:

  • Choices are properly defined using the [[choice_name]] format
  • Choice names are alphanumeric (underscores are allowed)
  • At least one choice is defined in the template

# This will raise an error - no choices defined
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="Rate the response: {{response}}"
)
# ValueError: Prompt template must include choices denoted with [[CHOICE_NAME]]

Numeric value validation

When using numeric_values, every choice must be mapped:

# This will raise an error - missing choice in numeric_values
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="""Choose:
    [[option_a]]: First option
    [[option_b]]: Second option""",
    numeric_values={"option_a": 1.0}  # Missing option_b
)
# ValueError: numeric_values keys must match the choices

Template variable validation

Missing template variables raise an error during execution:

judge = custom_prompt_judge(
    name="test",
    prompt_template="{{request}} {{response}} [[good]]: Good"
)

# This will raise an error - missing 'response' variable
judge(request="Hello")
# KeyError: Template variable 'response' not found

后续步骤