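Overview
Prompt-based judges support multi-level quality assessment through customizable choice categories (for example, excellent/good/poor) and optional numeric scores. Unlike guidelines-based judges, which provide binary pass/fail evaluation, prompt-based judges offer:
- Graded scoring levels with a numeric mapping, for tracking improvement
- Full prompt control for complex, multi-dimensional evaluation criteria
- Domain-specific categories tailored to your use case
- Aggregatable metrics for measuring quality trends over time
When to use
Choose prompt-based judges when you need:
- Multi-level quality assessment beyond simple pass/fail
- Numeric scores for quantitative analysis and version comparison
- Complex evaluation criteria that require custom categories
- Aggregated metrics across datasets
Choose guidelines-based judges when you need:
- Simple pass/fail compliance evaluation
- Criteria that business stakeholders can write or update without coding
- Rapid iteration on evaluation rules
Prerequisites for running the examples
Install MLflow and the required packages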
pip install --upgrade "mlflow[databricks]>=3.1.0"
Follow the environment setup quickstart to create an MLflow experiment.
Example
Here is a simple example of a prompt-based judge with three quality levels:
from mlflow.genai.judges import custom_prompt_judge
# Create a multi-level quality judge
response_quality_judge = custom_prompt_judge(
name="response_quality",
prompt_template="""Evaluate the quality of this customer service response:
<request>{{request}}</request>
<response>{{response}}</response>
Choose the most appropriate rating:
[[excellent]]: Empathetic, complete solution, proactive help offered
[[good]]: Addresses the issue adequately with professional tone
[[poor]]: Incomplete, unprofessional, or misses key concerns""",
numeric_values={
"excellent": 1.0,
"good": 0.7,
"poor": 0.0
}
)
# Direct usage
feedback = response_quality_judge(
request="My order arrived damaged!",
response="I'm so sorry to hear that. I've initiated a replacement order that will arrive tomorrow, and issued a full refund. Is there anything else I can help with?"
)
print(feedback.value) # 1.0
print(feedback.metadata) # {"string_value": "excellent"}
print(feedback.rationale) # Detailed explanation of the rating
Core concepts
SDK overview
The custom_prompt_judge function creates a custom LLM judge that evaluates inputs against a prompt template:
from mlflow.genai.judges import custom_prompt_judge
judge = custom_prompt_judge(
name="formality",
prompt_template="...", # Your custom prompt with {{variables}} and [[choices]]
numeric_values={"formal": 1.0, "informal": 0.0} # Optional numeric mapping
)
# Returns an mlflow.entities.Feedback object
feedback = judge(request="Hello", response="Hey there!")
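Parameters
Parameter | Type | Required | Description |
---|---|---|---|
name | str | Yes | Name of the assessment, shown in the MLflow UI and used to identify the judge's output |
prompt_template | str | Yes | Template string containing {{variable}} placeholders and [[choice_name]] choice definitions (see Prompt template requirements) |
numeric_values | dict[str, float] \| None | No | Maps choice names to numeric scores (a 0 to 1 scale is recommended) |
model | str \| None | No | Specific judge model to use (defaults to MLflow's optimized judge model) |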
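Why use a numeric mapping?
With several choice labels (for example, "excellent", "good", "poor"), string values make it hard to track quality improvements across evaluation runs.
A numeric mapping enables:
- Quantitative comparison: see whether average quality improved from 0.6 to 0.8
- Aggregated metrics: compute the average score across a dataset
- Version comparison: track whether a change improves or degrades quality
- Threshold-based monitoring: set alerts when quality drops below an acceptable level
Without numeric values, you only see the label distribution (for example, 40% "good", 60% "poor"), which makes overall improvement hard to measure.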
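To make the difference concrete, here is a minimal sketch in plain Python (no MLflow calls) that compares two evaluation runs. The label lists and the mapping are invented for illustration; in a real run the labels and scores would come from the judge's Feedback objects.

```python
from collections import Counter
from statistics import mean

# Hypothetical mapping and judge outputs, for illustration only.
numeric_values = {"excellent": 1.0, "good": 0.7, "poor": 0.0}
run_v1 = ["good", "poor", "poor", "good", "poor"]
run_v2 = ["excellent", "good", "poor", "good", "good"]

def label_distribution(labels):
    """Share of each label: descriptive, but awkward to compare across runs."""
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in counts}

def average_score(labels, mapping):
    """One number per run, usable for trend tracking and thresholds."""
    return mean(mapping[label] for label in labels)

avg_v1 = average_score(run_v1, numeric_values)
avg_v2 = average_score(run_v2, numeric_values)

print(label_distribution(run_v1))
print(f"v1 average: {avg_v1:.2f}, v2 average: {avg_v2:.2f}")
```

The distribution answers "how many of each label", while the average gives a single value that can be compared between versions or checked against an alert threshold.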
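Return value
The function returns a callable that:
- Accepts keyword arguments matching the {{variables}} in the prompt template
- Returns an mlflow.entities.Feedback object containing:
  - value: the selected choice (string), or the numeric score if numeric_values is provided
  - rationale: the LLM's explanation of its choice
  - metadata: additional information, including the string choice when numeric values are used
  - name: the name you provided
  - error: error details if the evaluation failed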
Prompt template requirements
Choice definition format
Choices must be defined with double square brackets, [[choice_name]]:
prompt_template = """Evaluate the response formality:
<request>{{request}}</request>
<response>{{response}}</response>
Select one category:
[[formal]]: Professional language, proper grammar, no contractions
[[semi_formal]]: Mix of professional and conversational elements
[[informal]]: Casual language, contractions, colloquialisms"""
Variable placeholders
Use double curly braces, {{variable}}, for dynamic content:
prompt_template = """Assess if the response uses appropriate sources:
Question: {{question}}
Response: {{response}}
Available Sources: {{retrieved_documents}}
Citation Policy: {{citation_policy}}
Choose one:
[[well_cited]]: All claims properly cite available sources
[[partially_cited]]: Some claims cite sources, others do not
[[poorly_cited]]: Claims lack proper citations"""
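When drafting a template, it can help to list the choices and variables it actually contains before handing it to custom_prompt_judge. The sketch below does this with two regular expressions. The patterns and the inspect_template helper are assumptions made for illustration; MLflow's own parsing may differ in details such as whitespace handling.

```python
import re

# Illustrative patterns only; not MLflow's internal parser.
CHOICE_PATTERN = re.compile(r"\[\[(\w+)\]\]")
VARIABLE_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def inspect_template(template: str) -> dict:
    """Return the choice and variable names found, in order, without duplicates."""
    choices = list(dict.fromkeys(CHOICE_PATTERN.findall(template)))
    variables = list(dict.fromkeys(VARIABLE_PATTERN.findall(template)))
    return {"choices": choices, "variables": variables}

template = """Evaluate the response formality:
<request>{{request}}</request>
<response>{{response}}</response>
Select one category:
[[formal]]: Professional language, proper grammar, no contractions
[[informal]]: Casual language, contractions, colloquialisms"""

summary = inspect_template(template)
print(summary)
```

The variable names returned here are the keyword arguments the judge will expect when it is called.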
Common evaluation patterns
Likert scale pattern
Create a standard 5-point or 7-point satisfaction scale:
satisfaction_judge = custom_prompt_judge(
name="customer_satisfaction",
prompt_template="""Based on this interaction, rate the likely customer satisfaction:
Customer Request: {{request}}
Agent Response: {{response}}
Select satisfaction level:
[[very_satisfied]]: Response exceeds expectations with exceptional service
[[satisfied]]: Response meets expectations adequately
[[neutral]]: Response is acceptable but unremarkable
[[dissatisfied]]: Response fails to meet basic expectations
[[very_dissatisfied]]: Response is unhelpful or problematic""",
numeric_values={
"very_satisfied": 1.0,
"satisfied": 0.75,
"neutral": 0.5,
"dissatisfied": 0.25,
"very_dissatisfied": 0.0
}
)
Rubric-based scoring
Implement a detailed scoring rubric with explicit criteria:
code_review_rubric = custom_prompt_judge(
name="code_review_rubric",
prompt_template="""Evaluate this code review using our quality rubric:
Original Code: {{original_code}}
Review Comments: {{review_comments}}
Code Type: {{code_type}}
Score the review quality:
[[comprehensive]]: Identifies all issues including edge cases, security concerns, performance implications, and suggests specific improvements with examples
[[thorough]]: Catches major issues and most minor ones, provides good suggestions but may miss some edge cases
[[adequate]]: Identifies obvious issues and provides basic feedback, misses subtle problems
[[superficial]]: Only catches surface-level issues, feedback is vague or generic
[[inadequate]]: Misses critical issues or provides incorrect feedback""",
numeric_values={
"comprehensive": 1.0,
"thorough": 0.8,
"adequate": 0.6,
"superficial": 0.3,
"inadequate": 0.0
}
)
Real-world examples
Customer service quality
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer
import mlflow
# Issue resolution status judge
resolution_judge = custom_prompt_judge(
name="issue_resolution",
prompt_template="""Evaluate if the customer's issue was resolved:
Customer Message: {{customer_message}}
Agent Response: {{agent_response}}
Issue Type: {{issue_type}}
Rate the resolution status:
[[fully_resolved]]: Issue completely addressed with clear solution provided
[[partially_resolved]]: Some progress made but follow-up needed
[[unresolved]]: Issue not addressed or solution unclear
[[escalated]]: Appropriately escalated to higher support tier""",
numeric_values={
"fully_resolved": 1.0,
"partially_resolved": 0.5,
"unresolved": 0.0,
"escalated": 0.7 # Positive score for appropriate escalation
}
)
# Empathy and tone judge
empathy_judge = custom_prompt_judge(
name="empathy_score",
prompt_template="""Assess the emotional intelligence of the response:
Customer Emotion: {{customer_emotion}}
Agent Response: {{agent_response}}
Rate the empathy shown:
[[exceptional]]: Acknowledges emotions, validates concerns, shows genuine care
[[good]]: Shows understanding and appropriate concern
[[adequate]]: Professional but somewhat impersonal
[[poor]]: Cold, dismissive, or inappropriate emotional response""",
numeric_values={
"exceptional": 1.0,
"good": 0.75,
"adequate": 0.5,
"poor": 0.0
}
)
# Create a comprehensive customer service scorer
@scorer
def customer_service_quality(inputs, outputs, trace):
    """Comprehensive customer service evaluation"""
    feedbacks = []
    # Evaluate resolution status
    feedbacks.append(resolution_judge(
        customer_message=inputs.get("message", ""),
        agent_response=outputs.get("response", ""),
        issue_type=inputs.get("issue_type", "general")
    ))
    # Evaluate empathy if customer shows emotion
    customer_emotion = inputs.get("detected_emotion", "neutral")
    if customer_emotion in ["frustrated", "angry", "upset", "worried"]:
        feedbacks.append(empathy_judge(
            customer_emotion=customer_emotion,
            agent_response=outputs.get("response", "")
        ))
    return feedbacks
# Example evaluation
eval_data = [
{
"inputs": {
"message": "I've been waiting 3 weeks for my refund! This is unacceptable!",
"issue_type": "refund",
"detected_emotion": "angry"
},
"outputs": {
"response": "I completely understand your frustration - 3 weeks is far too long to wait for a refund. I'm escalating this to our finance team immediately. You'll receive your refund within 24 hours, plus a $50 credit for the inconvenience. I'm also sending you my direct email so you can reach me if there are any other delays."
}
}
]
results = mlflow.genai.evaluate(
data=eval_data,
scorers=[customer_service_quality]
)
Content quality assessment
# Technical documentation quality judge
doc_quality_judge = custom_prompt_judge(
name="documentation_quality",
prompt_template="""Evaluate this technical documentation:
Content: {{content}}
Target Audience: {{audience}}
Documentation Type: {{doc_type}}
Rate the documentation quality:
[[excellent]]: Clear, complete, well-structured with examples, appropriate depth
[[good]]: Covers topic well, mostly clear, could use minor improvements
[[fair]]: Basic coverage, some unclear sections, missing important details
[[poor]]: Confusing, incomplete, or significantly flawed""",
numeric_values={
"excellent": 1.0,
"good": 0.75,
"fair": 0.4,
"poor": 0.0
}
)
# Marketing copy effectiveness
marketing_judge = custom_prompt_judge(
name="marketing_effectiveness",
prompt_template="""Rate this marketing copy's effectiveness:
Copy: {{copy}}
Product: {{product}}
Target Demographic: {{target_demographic}}
Call to Action: {{cta}}
Evaluate effectiveness:
[[highly_effective]]: Compelling, clear value prop, strong CTA, perfect for audience
[[effective]]: Good messaging, decent CTA, reasonably targeted
[[moderately_effective]]: Some good elements but lacks impact or clarity
[[ineffective]]: Weak messaging, unclear value, poor audience fit""",
numeric_values={
"highly_effective": 1.0,
"effective": 0.7,
"moderately_effective": 0.4,
"ineffective": 0.0
}
)
Code review quality
# Security review judge
security_review_judge = custom_prompt_judge(
name="security_review_quality",
prompt_template="""Evaluate the security aspects of this code review:
Original Code: {{code}}
Review Comments: {{review_comments}}
Security Vulnerabilities Found: {{vulnerabilities_mentioned}}
Rate the security review quality:
[[comprehensive]]: Identifies all security issues, explains risks, suggests secure alternatives
[[thorough]]: Catches major security flaws, good explanations
[[basic]]: Identifies obvious security issues only
[[insufficient]]: Misses critical security vulnerabilities""",
numeric_values={
"comprehensive": 1.0,
"thorough": 0.75,
"basic": 0.4,
"insufficient": 0.0
}
)
# Code clarity feedback judge
code_clarity_judge = custom_prompt_judge(
name="code_clarity_feedback",
prompt_template="""Assess the code review's feedback on readability:
Original Code Complexity: {{complexity_score}}
Review Feedback: {{review_comments}}
Readability Issues Identified: {{readability_issues}}
Rate the clarity feedback:
[[excellent]]: Identifies all clarity issues, suggests specific improvements, considers maintainability
[[good]]: Points out main clarity problems with helpful suggestions
[[adequate]]: Basic feedback on obvious readability issues
[[minimal]]: Superficial or missing important clarity feedback""",
numeric_values={
"excellent": 1.0,
"good": 0.7,
"adequate": 0.4,
"minimal": 0.0
}
)
Healthcare communication
# Patient communication appropriateness
patient_comm_judge = custom_prompt_judge(
name="patient_communication",
prompt_template="""Evaluate this healthcare provider's response to a patient:
Patient Question: {{patient_question}}
Provider Response: {{provider_response}}
Patient Health Literacy Level: {{health_literacy}}
Sensitive Topics: {{sensitive_topics}}
Rate communication appropriateness:
[[excellent]]: Clear, compassionate, appropriate language level, addresses concerns fully
[[good]]: Generally clear and caring, minor room for improvement
[[acceptable]]: Adequate but could be clearer or more empathetic
[[poor]]: Unclear, uses too much jargon, or lacks appropriate empathy""",
numeric_values={
"excellent": 1.0,
"good": 0.75,
"acceptable": 0.5,
"poor": 0.0
}
)
# Clinical note quality
clinical_note_judge = custom_prompt_judge(
name="clinical_note_quality",
prompt_template="""Assess this clinical note's quality:
Note Content: {{note_content}}
Note Type: {{note_type}}
Required Elements: {{required_elements}}
Rate the clinical documentation:
[[comprehensive]]: All required elements present, clear, follows standards, actionable
[[complete]]: Most elements present, generally clear, minor gaps
[[incomplete]]: Missing important elements or lacks clarity
[[deficient]]: Significant gaps, unclear, or doesn't meet documentation standards""",
numeric_values={
"comprehensive": 1.0,
"complete": 0.7,
"incomplete": 0.3,
"deficient": 0.0
}
)
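Pairwise response comparison
Use a prompt-based judge to compare two responses and determine which one is better. This is useful for A/B testing, model comparison, or preference learning.
Note
Pairwise comparison judges cannot be used with mlflow.evaluate() or as scorers, because they evaluate two responses at once rather than a single response. Use them directly for comparative analysis.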
from mlflow.genai.judges import custom_prompt_judge
# Response preference judge
preference_judge = custom_prompt_judge(
name="response_preference",
prompt_template="""Compare these two responses to the same question and determine which is better:
Question: {{question}}
Response A: {{response_a}}
Response B: {{response_b}}
Evaluation Criteria:
1. Accuracy and completeness of information
2. Clarity and ease of understanding
3. Helpfulness and actionability
4. Appropriate tone for the context
Choose your preference:
[[strongly_prefer_a]]: Response A is significantly better across most criteria
[[slightly_prefer_a]]: Response A is marginally better overall
[[equal]]: Both responses are equally good (or equally poor)
[[slightly_prefer_b]]: Response B is marginally better overall
[[strongly_prefer_b]]: Response B is significantly better across most criteria""",
numeric_values={
"strongly_prefer_a": -1.0,
"slightly_prefer_a": -0.5,
"equal": 0.0,
"slightly_prefer_b": 0.5,
"strongly_prefer_b": 1.0
}
)
# Example usage for model comparison
question = "How do I improve my GenAI app's response quality?"
response_model_v1 = """To improve response quality, you should:
1. Add more training data
2. Fine-tune your model
3. Use better prompts"""
response_model_v2 = """To improve your GenAI app's response quality, consider these strategies:
1. **Enhance your prompts**: Use clear, specific instructions with examples
2. **Implement evaluation**: Use MLflow's LLM judges to measure quality systematically
3. **Collect feedback**: Gather user feedback to identify improvement areas
4. **Iterate on weak areas**: Focus on responses that score poorly
5. **A/B test changes**: Compare versions to ensure improvements
Start with evaluation to establish a baseline, then iterate based on data."""
# Compare responses
feedback = preference_judge(
question=question,
response_a=response_model_v1,
response_b=response_model_v2
)
print(f"Preference: {feedback.metadata['string_value']}") # "strongly_prefer_b"
print(f"Score: {feedback.value}") # 1.0
print(f"Rationale: {feedback.rationale}")
Specialized comparison judges
# Technical accuracy comparison for documentation
tech_comparison_judge = custom_prompt_judge(
name="technical_comparison",
prompt_template="""Compare these two technical explanations:
Topic: {{topic}}
Target Audience: {{audience}}
Explanation A: {{explanation_a}}
Explanation B: {{explanation_b}}
Focus on:
- Technical accuracy and precision
- Appropriate depth for the audience
- Use of examples and analogies
- Completeness without overwhelming detail
Which explanation is better?
[[a_much_better]]: A is significantly more accurate and appropriate
[[a_slightly_better]]: A is marginally better in accuracy or clarity
[[equivalent]]: Both are equally good technically
[[b_slightly_better]]: B is marginally better in accuracy or clarity
[[b_much_better]]: B is significantly more accurate and appropriate""",
numeric_values={
"a_much_better": -1.0,
"a_slightly_better": -0.5,
"equivalent": 0.0,
"b_slightly_better": 0.5,
"b_much_better": 1.0
}
)
# Empathy comparison for customer service
empathy_comparison_judge = custom_prompt_judge(
name="empathy_comparison",
prompt_template="""Compare the emotional intelligence of these customer service responses:
Customer Situation: {{situation}}
Customer Emotion: {{emotion}}
Agent Response A: {{response_a}}
Agent Response B: {{response_b}}
Evaluate which response better:
- Acknowledges the customer's emotions
- Shows genuine understanding and care
- Offers appropriate emotional support
- Maintains professional boundaries
Which response shows better emotional intelligence?
[[a_far_superior]]: A shows much better emotional intelligence
[[a_better]]: A is somewhat more empathetic
[[both_good]]: Both show good emotional intelligence
[[b_better]]: B is somewhat more empathetic
[[b_far_superior]]: B shows much better emotional intelligence""",
numeric_values={
"a_far_superior": -1.0,
"a_better": -0.5,
"both_good": 0.0,
"b_better": 0.5,
"b_far_superior": 1.0
}
)
Practical comparison workflows
# Compare outputs from different prompt versions
def compare_prompt_versions(test_cases, prompt_v1, prompt_v2, model_client):
    """Compare two prompt versions across multiple test cases"""
    results = []
    for test_case in test_cases:
        # Generate responses with each prompt
        response_v1 = model_client.generate(prompt_v1.format(**test_case))
        response_v2 = model_client.generate(prompt_v2.format(**test_case))
        # Compare responses
        feedback = preference_judge(
            question=test_case["question"],
            response_a=response_v1,
            response_b=response_v2
        )
        results.append({
            "question": test_case["question"],
            "preference": feedback.metadata["string_value"],
            "score": feedback.value,
            "rationale": feedback.rationale
        })
    # Analyze results
    avg_score = sum(r["score"] for r in results) / len(results)
    if avg_score < -0.2:
        print(f"Prompt V1 is preferred (avg score: {avg_score:.2f})")
    elif avg_score > 0.2:
        print(f"Prompt V2 is preferred (avg score: {avg_score:.2f})")
    else:
        print(f"Prompts perform similarly (avg score: {avg_score:.2f})")
    return results
# Compare different model outputs
def compare_models(questions, model_a, model_b, comparison_judge):
    """Compare two models across a set of questions"""
    win_counts = {"model_a": 0, "model_b": 0, "tie": 0}
    for question in questions:
        response_a = model_a.generate(question)
        response_b = model_b.generate(question)
        feedback = comparison_judge(
            question=question,
            response_a=response_a,
            response_b=response_b
        )
        # Count wins based on preference strength
        if feedback.value <= -0.5:
            win_counts["model_a"] += 1
        elif feedback.value >= 0.5:
            win_counts["model_b"] += 1
        else:
            win_counts["tie"] += 1
    print(f"Model comparison results: {win_counts}")
    return win_counts
Advanced usage patterns
Conditional scoring
Apply different evaluation criteria depending on context:
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer

@scorer
def adaptive_quality_scorer(inputs, outputs, trace):
    """Applies different judges based on context"""
    # Determine which judge to use based on input characteristics
    query_type = inputs.get("query_type", "general")
    if query_type == "technical":
        judge = custom_prompt_judge(
            name="technical_response",
            prompt_template="""Rate this technical response:
Question: {{question}}
Response: {{response}}
Required Depth: {{depth_level}}
[[expert]]: Demonstrates deep expertise, includes advanced concepts
[[proficient]]: Good technical accuracy, appropriate depth
[[basic]]: Correct but lacks depth or nuance
[[incorrect]]: Contains technical errors or misconceptions""",
            numeric_values={
                "expert": 1.0,
                "proficient": 0.75,
                "basic": 0.5,
                "incorrect": 0.0
            }
        )
        return judge(
            question=inputs["question"],
            response=outputs["response"],
            depth_level=inputs.get("required_depth", "intermediate")
        )
    elif query_type == "support":
        judge = custom_prompt_judge(
            name="support_response",
            prompt_template="""Rate this support response:
Issue: {{issue}}
Response: {{response}}
Customer Status: {{customer_status}}
[[excellent]]: Solves issue completely, proactive, appropriate for customer status
[[good]]: Addresses issue well, professional
[[fair]]: Partially helpful but incomplete
[[poor]]: Unhelpful or inappropriate""",
            numeric_values={
                "excellent": 1.0,
                "good": 0.7,
                "fair": 0.4,
                "poor": 0.0
            }
        )
        return judge(
            issue=inputs["question"],
            response=outputs["response"],
            customer_status=inputs.get("customer_status", "standard")
        )
    # Any other query_type falls through, so the function returns None implicitly
Score aggregation strategies
Combine the scores from multiple judges:
from mlflow.entities import Feedback
from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer

@scorer
def weighted_quality_scorer(inputs, outputs, trace):
    """Combines multiple judges with weighted scoring"""
    # Define judges and their weights
    judges_config = [
        {
            "judge": custom_prompt_judge(
                name="accuracy",
                prompt_template="...",  # Your accuracy template
                numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0}
            ),
            "weight": 0.4,
            "args": {"question": inputs["question"], "response": outputs["response"]}
        },
        {
            "judge": custom_prompt_judge(
                name="completeness",
                prompt_template="...",  # Your completeness template
                numeric_values={"complete": 1.0, "partial": 0.5, "incomplete": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"], "requirements": inputs.get("requirements", [])}
        },
        {
            "judge": custom_prompt_judge(
                name="clarity",
                prompt_template="...",  # Your clarity template
                numeric_values={"clear": 1.0, "adequate": 0.6, "unclear": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"]}
        }
    ]
    # Collect all feedbacks
    feedbacks = []
    weighted_score = 0.0
    for config in judges_config:
        feedback = config["judge"](**config["args"])
        feedbacks.append(feedback)
        # Add to weighted score if numeric
        if isinstance(feedback.value, (int, float)):
            weighted_score += feedback.value * config["weight"]
    # Add composite score as additional feedback
    composite_feedback = Feedback(
        name="weighted_quality_score",
        value=weighted_score,
        rationale=f"Weighted combination of {len(judges_config)} quality dimensions"
    )
    feedbacks.append(composite_feedback)
    return feedbacks
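The scorer above adds value * weight only for judges that returned a number, so a judge that fails lowers the composite as if it had scored 0. One possible alternative, sketched below in plain Python, divides by the total weight of the judges that actually produced a score. The weighted_average helper and the sample scores are hypothetical and not part of MLflow.

```python
def weighted_average(scores, weights):
    """Weighted mean over judges that returned a numeric value.

    scores:  dict of judge name -> numeric score, or None if the judge failed
    weights: dict of judge name -> weight
    Returns None when no judge produced a usable score.
    """
    usable = {
        name: value
        for name, value in scores.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    total_weight = sum(weights[name] for name in usable)
    if total_weight == 0:
        return None
    return sum(usable[name] * weights[name] for name in usable) / total_weight

weights = {"accuracy": 0.4, "completeness": 0.3, "clarity": 0.3}

all_ok = weighted_average(
    {"accuracy": 1.0, "completeness": 0.5, "clarity": 0.6}, weights
)
one_failed = weighted_average(
    {"accuracy": 1.0, "completeness": None, "clarity": 0.6}, weights
)
print(f"all judges: {all_ok:.3f}, one failed: {one_failed:.3f}")
```

Whether to renormalize or to treat a failed judge as a zero depends on how the composite is used; renormalizing keeps the score on the same scale, but it hides the failure unless the failure is reported separately.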
Best practices
Designing effective choices
1. Make choices mutually exclusive and exhaustive
# Good - clear distinctions, covers all cases
"""[[approved]]: Meets all requirements, ready for production
[[needs_revision]]: Has issues that must be fixed before approval
[[rejected]]: Fundamental flaws, requires complete rework"""
# Bad - overlapping and ambiguous
"""[[good]]: The response is good
[[okay]]: The response is okay
[[fine]]: The response is fine"""
2. Provide specific criteria for each choice
# Good - specific, measurable criteria
"""[[secure]]: No vulnerabilities, follows all security best practices, includes input validation
[[mostly_secure]]: Minor security concerns that should be addressed but aren't critical
[[insecure]]: Contains vulnerabilities that could be exploited"""
# Bad - vague criteria
"""[[secure]]: Looks secure
[[not_secure]]: Has problems"""
3. Order choices logically (best to worst)
# Good - clear progression
numeric_values = {
"exceptional": 1.0,
"good": 0.75,
"satisfactory": 0.5,
"needs_improvement": 0.25,
"unacceptable": 0.0
}
Numeric scale design
1. Use a consistent scale across judges
# All judges use 0-1 scale
quality_judge = custom_prompt_judge(..., numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0})
accuracy_judge = custom_prompt_judge(..., numeric_values={"accurate": 1.0, "partial": 0.5, "wrong": 0.0})
2. Leave gaps for future refinement
# Allows adding intermediate levels later
numeric_values = {
"excellent": 1.0,
"good": 0.7, # Gap allows for "very_good" at 0.85
"fair": 0.4, # Gap allows for "satisfactory" at 0.55
"poor": 0.0
}
3. Consider domain-specific scales
# Academic grading scale
academic_scale = {
"A": 4.0,
"B": 3.0,
"C": 2.0,
"D": 1.0,
"F": 0.0
}
# Net Promoter Score scale
nps_scale = {
"promoter": 1.0, # 9-10
"passive": 0.0, # 7-8
"detractor": -1.0 # 0-6
}
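If a domain-specific scale has to be averaged alongside judges that use 0 to 1, one option is to rescale it first. The sketch below applies min-max normalization to the two scales above. normalize_scale is a hypothetical helper, not an MLflow function, and whether a linear rescaling is appropriate for a given scale (NPS categories, for example) remains a judgment call.

```python
def normalize_scale(scale):
    """Linearly rescale a label -> score mapping so values span 0.0 to 1.0."""
    lowest = min(scale.values())
    highest = max(scale.values())
    if highest == lowest:
        raise ValueError("scale needs at least two distinct values")
    span = highest - lowest
    return {label: (value - lowest) / span for label, value in scale.items()}

academic_scale = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}
nps_scale = {"promoter": 1.0, "passive": 0.0, "detractor": -1.0}

academic_unit = normalize_scale(academic_scale)
nps_unit = normalize_scale(nps_scale)
print(academic_unit)
print(nps_unit)
```

The normalized mapping could be passed as numeric_values, with the original scale kept for reporting to domain experts.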
Prompt design tips
1. Structure the prompt clearly
prompt_template = """[Clear Task Description]
Evaluate the technical accuracy of this response.
[Context Section]
Question: {{question}}
Response: {{response}}
Technical Domain: {{domain}}
[Evaluation Criteria]
Consider: factual accuracy, appropriate depth, correct terminology
[Choice Definitions]
[[accurate]]: All technical facts correct, appropriate level of detail
[[mostly_accurate]]: Minor inaccuracies that don't affect core understanding
[[inaccurate]]: Contains significant errors or misconceptions"""
2. Include examples when helpful
prompt_template = """Assess the urgency level of this support ticket.
Ticket: {{ticket_content}}
Examples of each level:
- Critical: System down, data loss, security breach
- High: Major feature broken, blocking work
- Medium: Performance issues, non-critical bugs
- Low: Feature requests, minor UI issues
Choose urgency level:
[[critical]]: Immediate attention required, business impact
[[high]]: Urgent, significant user impact
[[medium]]: Important but not urgent
[[low]]: Can be addressed in normal workflow"""
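Comparison with guidelines-based judges
Aspect | Guidelines-based | Prompt-based |
---|---|---|
Evaluation type | Binary pass/fail | Multi-level categories |
Scoring | "yes" or "no" | Custom choices with optional numeric values |
Best for | Compliance, policy adherence | Quality assessment, satisfaction ratings |
Iteration speed | Very fast - just update the guideline text | Moderate - choices may need adjusting |
Business-user friendly | ✅ High - natural-language rules | ⚠️ Medium - requires understanding the choices and the full prompt |
Aggregation | Count pass/fail rates | Compute averages, track trends |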
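Validation and error handling
Choice validation
The judge validates that:
- Choices are correctly defined using the [[choice_name]] format
- Choice names are alphanumeric (underscores are allowed)
- At least one choice is defined in the template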
# This will raise an error - no choices defined
invalid_judge = custom_prompt_judge(
name="invalid",
prompt_template="Rate the response: {{response}}"
)
# ValueError: Prompt template must include choices denoted with [[CHOICE_NAME]]
Numeric validation
When numeric_values is used, every choice must be mapped:
# This will raise an error - missing choice in numeric_values
invalid_judge = custom_prompt_judge(
name="invalid",
prompt_template="""Choose:
[[option_a]]: First option
[[option_b]]: Second option""",
numeric_values={"option_a": 1.0} # Missing option_b
)
# ValueError: numeric_values keys must match the choices
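A mismatch like this can be caught before the judge is constructed, for example in a unit test for your templates. The following sketch compares the choices found in a template against the keys of a mapping. check_numeric_values and its regular expression are assumptions for illustration and do not reproduce MLflow's own validation.

```python
import re

def check_numeric_values(prompt_template, numeric_values):
    """Return (missing, extra): choices without a score, and scores without a choice."""
    choices = set(re.findall(r"\[\[(\w+)\]\]", prompt_template))
    keys = set(numeric_values)
    return sorted(choices - keys), sorted(keys - choices)

template = """Choose:
[[option_a]]: First option
[[option_b]]: Second option"""

missing, extra = check_numeric_values(template, {"option_a": 1.0})
print(f"missing: {missing}, extra: {extra}")
```

Both lists being empty means every choice has a score and no score refers to an undefined choice.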
Template variable validation
Missing template variables raise an error when the judge is called:
judge = custom_prompt_judge(
name="test",
prompt_template="{{request}} {{response}} [[good]]: Good"
)
# This will raise an error - missing 'response' variable
judge(request="Hello")
# KeyError: Template variable 'response' not found
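In a scorer that builds the judge's arguments from inputs and outputs, it may be more convenient to detect a missing variable up front than to handle the error afterwards. This is a minimal sketch of such a check; missing_variables is a hypothetical helper and the placeholder pattern is an assumption about the {{variable}} syntax described above.

```python
import re

def missing_variables(prompt_template, **kwargs):
    """List template variables that were not supplied as keyword arguments."""
    required = dict.fromkeys(re.findall(r"\{\{\s*(\w+)\s*\}\}", prompt_template))
    return [name for name in required if name not in kwargs]

template = "{{request}} {{response}} [[good]]: Good"

not_supplied = missing_variables(template, request="Hello")
print(not_supplied)
```

An empty list means the call supplies every variable the template references.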
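Next steps
- Create prompt-based scorers - a step-by-step guide to implementing prompt-based judges
- Guidelines-based judges - a simpler pass/fail alternative
- Custom scorers overview - learn how to wrap judges in custom scorers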