Prompt-based judges

2025-06-11

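Overview

Prompt-based judges enable multi-level quality assessment with customizable choice categories (for example, excellent/good/poor) and optional numeric scoring. Unlike guidelines-based judges, which provide binary pass/fail evaluation, prompt-based judges offer:

- Graduated scoring levels with numeric mapping for tracking improvement
- Full prompt control for complex, multi-dimensional evaluation criteria
- Domain-specific categories tailored to your use case
- Aggregatable metrics for measuring quality trends over time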

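When to use

Choose prompt-based judges when you need:

- Multi-level quality assessment beyond pass/fail
- Numeric scores for quantitative analysis and version comparison
- Complex evaluation criteria that require custom categories
- Metrics that aggregate across datasets

Choose guidelines-based judges when you need:

- Simple pass/fail compliance evaluation
- Business stakeholders to write and update criteria without coding
- Rapid iteration on evaluation rules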

Important

Prompt-based judges can be used as a standalone API/SDK, but they must be wrapped in a scorer before the evaluation harness and the production monitoring service can use them.

Prerequisites for running the examples

Install MLflow and the required packages:
```
pip install --upgrade "mlflow[databricks]>=3.1.0"
```
Create an MLflow experiment by following the set up your environment quickstart.

Example

The following simple example shows what a prompt-based judge can do:

from mlflow.genai.judges import custom_prompt_judge

# Create a multi-level quality judge
response_quality_judge = custom_prompt_judge(
    name="response_quality",
    prompt_template="""Evaluate the quality of this customer service response:

<request>{{request}}</request>
<response>{{response}}</response>

Choose the most appropriate rating:

[[excellent]]: Empathetic, complete solution, proactive help offered
[[good]]: Addresses the issue adequately with professional tone
[[poor]]: Incomplete, unprofessional, or misses key concerns""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "poor": 0.0
    }
)

# Direct usage
feedback = response_quality_judge(
    request="My order arrived damaged!",
    response="I'm so sorry to hear that. I've initiated a replacement order that will arrive tomorrow, and issued a full refund. Is there anything else I can help with?"
)

print(feedback.value)         # 1.0
print(feedback.metadata)      # {"string_value": "excellent"}
print(feedback.rationale)     # Detailed explanation of the rating

Core concepts

SDK overview

The custom_prompt_judge function creates a custom LLM judge that evaluates inputs based on a prompt template.

from mlflow.genai.judges import custom_prompt_judge

judge = custom_prompt_judge(
    name="formality",
    prompt_template="...",  # Your custom prompt with {{variables}} and [[choices]]
    numeric_values={"formal": 1.0, "informal": 0.0}  # Optional numeric mapping
)

# Returns an mlflow.entities.Feedback object
feedback = judge(request="Hello", response="Hey there!")

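Parameters

Parameter	Type	Required	Description
`name`	`str`	Yes	Name of the assessment, shown in the MLflow UI and used to identify the judge's output
`prompt_template`	`str`	Yes	Template string containing `{{variables}}` (placeholders for dynamic content) and `[[choices]]` (the choice definitions the judge must select from)
`numeric_values`	`dict[str, float] | None`	No	Maps choice names to numeric scores (a 0 to 1 scale is recommended). Without it, the judge returns the string choice value. With it, the judge returns the numeric score and stores the string choice in the metadata
`model`	`str | None`	No	Specific judge model to use (defaults to MLflow's optimized judge model)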

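Why use numeric mapping?

When a judge has several choice labels (for example, "excellent", "good", "poor"), string values alone make it hard to track quality improvement across evaluation runs.

Numeric mapping enables:

- Quantitative comparison: see whether average quality improved from 0.6 to 0.8
- Metric aggregation: calculate the average score across an entire dataset
- Version comparison: track whether a change improved or degraded quality
- Threshold-based monitoring: set alerts when quality drops below an acceptable level

Without numeric values, you can only see the label distribution (for example, 40% "good" and 60% "poor"), which makes overall improvement hard to measure.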
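
As a rough illustration of the difference, the sketch below summarizes one batch of judge results both ways. The label list and the 0.6 alert threshold are made-up values chosen for illustration, not output from a real evaluation run.

```python
from collections import Counter
from statistics import mean

# Hypothetical labels returned by a judge over five evaluation rows.
labels = ["good", "poor", "excellent", "good", "poor"]

# Same mapping style as the response_quality example.
numeric_values = {"excellent": 1.0, "good": 0.7, "poor": 0.0}

# Without numeric values: only a label distribution is available.
distribution = Counter(labels)

# With numeric values: a single average that can be compared across runs.
average_score = mean(numeric_values[label] for label in labels)

# Threshold-based monitoring (0.6 is an arbitrary example threshold).
QUALITY_THRESHOLD = 0.6
needs_alert = average_score < QUALITY_THRESHOLD

print(dict(distribution))
print(round(average_score, 2), needs_alert)  # 0.48 True
```

The distribution alone does not say whether this run is better than the previous one; the single average does, and it is also the value a threshold check can act on.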

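Return value

The function returns a callable that:

- Accepts keyword arguments matching the {{variables}} in the prompt template
- Returns an mlflow.entities.Feedback object containing:
  - value: the selected choice (string), or the numeric score if numeric_values is provided
  - rationale: the LLM's explanation of why it made that choice
  - metadata: additional information, including the string choice when numeric values are used
  - name: the name you provided
  - error: error details if the evaluation failed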
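
Because value holds either a label or a number depending on whether numeric_values is set, downstream code often needs to recover the label in both cases. The sketch below shows one way to do that. The `choice_label` helper is hypothetical, and `SimpleNamespace` objects stand in for mlflow.entities.Feedback only so that the snippet runs without MLflow installed. The "string_value" metadata key is the one shown in the example output earlier.

```python
from types import SimpleNamespace


def choice_label(feedback):
    """Return the chosen category name from a Feedback-like object."""
    # With numeric_values, value is a number and the label lives in metadata.
    if isinstance(feedback.value, (int, float)):
        return (feedback.metadata or {}).get("string_value")
    # Without numeric_values, value already holds the label string.
    return feedback.value


# Stand-ins for Feedback objects (assumption: same attribute names).
with_numeric = SimpleNamespace(value=1.0, metadata={"string_value": "excellent"})
without_numeric = SimpleNamespace(value="excellent", metadata=None)

print(choice_label(with_numeric))     # excellent
print(choice_label(without_numeric))  # excellent
```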

Prompt template requirements

Choice definition format

Choices must be defined using double square brackets: [[choice_name]].

prompt_template = """Evaluate the response formality:

<request>{{request}}</request>
<response>{{response}}</response>

Select one category:

[[formal]]: Professional language, proper grammar, no contractions
[[semi_formal]]: Mix of professional and conversational elements
[[informal]]: Casual language, contractions, colloquialisms"""
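
To see how the double-bracket convention can be read mechanically, here is a minimal sketch that pulls choice names out of a template with a regular expression. The `extract_choices` helper and its pattern are assumptions made for illustration; MLflow's own parsing may differ.

```python
import re

# Assumed pattern: letters, digits, and underscores between double square brackets.
CHOICE_PATTERN = re.compile(r"\[\[(\w+)\]\]")


def extract_choices(prompt_template):
    """Return choice names in the order they first appear in the template."""
    seen = []
    for name in CHOICE_PATTERN.findall(prompt_template):
        if name not in seen:
            seen.append(name)
    return seen


template = """Evaluate the response formality:

<response>{{response}}</response>

[[formal]]: Professional language, proper grammar, no contractions
[[semi_formal]]: Mix of professional and conversational elements
[[informal]]: Casual language, contractions, colloquialisms"""

print(extract_choices(template))  # ['formal', 'semi_formal', 'informal']
```

A quick check like this can be useful when reviewing a template by hand, for example to confirm that a renamed choice was updated everywhere.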

Variable placeholders

Use double curly braces, {{variable}}, for dynamic content:

prompt_template = """Assess if the response uses appropriate sources:

Question: {{question}}
Response: {{response}}
Available Sources: {{retrieved_documents}}
Citation Policy: {{citation_policy}}

Choose one:

[[well_cited]]: All claims properly cite available sources
[[partially_cited]]: Some claims cite sources, others do not
[[poorly_cited]]: Claims lack proper citations"""

Common evaluation patterns

Likert scale pattern

Create a standard 5-point or 7-point satisfaction scale:

satisfaction_judge = custom_prompt_judge(
    name="customer_satisfaction",
    prompt_template="""Based on this interaction, rate the likely customer satisfaction:

Customer Request: {{request}}
Agent Response: {{response}}

Select satisfaction level:

[[very_satisfied]]: Response exceeds expectations with exceptional service
[[satisfied]]: Response meets expectations adequately
[[neutral]]: Response is acceptable but unremarkable
[[dissatisfied]]: Response fails to meet basic expectations
[[very_dissatisfied]]: Response is unhelpful or problematic""",
    numeric_values={
        "very_satisfied": 1.0,
        "satisfied": 0.75,
        "neutral": 0.5,
        "dissatisfied": 0.25,
        "very_dissatisfied": 0.0
    }
)
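
For longer scales such as a 7-point Likert scale, writing the mapping by hand is error-prone. The following sketch generates evenly spaced values between 1.0 and 0.0 from an ordered list of choice names. The `evenly_spaced_values` helper is hypothetical and not part of MLflow; its result is a plain dict that could be passed as numeric_values.

```python
def evenly_spaced_values(choices_best_to_worst):
    """Map ordered choice names to evenly spaced scores from 1.0 down to 0.0."""
    n = len(choices_best_to_worst)
    if n < 2:
        raise ValueError("need at least two choices")
    step = 1.0 / (n - 1)
    # Round to avoid floating-point noise such as 0.30000000000000004.
    return {
        name: round(1.0 - i * step, 4)
        for i, name in enumerate(choices_best_to_worst)
    }


five_point = evenly_spaced_values(
    ["very_satisfied", "satisfied", "neutral", "dissatisfied", "very_dissatisfied"]
)
print(five_point)
# {'very_satisfied': 1.0, 'satisfied': 0.75, 'neutral': 0.5,
#  'dissatisfied': 0.25, 'very_dissatisfied': 0.0}
```

Even spacing is only a default. When the categories are not equally far apart in practice, a hand-tuned mapping like the rubric examples in this article may fit better.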

Rubric-based scoring

Implement a detailed scoring rubric with explicit criteria:

code_review_rubric = custom_prompt_judge(
    name="code_review_rubric",
    prompt_template="""Evaluate this code review using our quality rubric:

Original Code: {{original_code}}
Review Comments: {{review_comments}}
Code Type: {{code_type}}

Score the review quality:

[[comprehensive]]: Identifies all issues including edge cases, security concerns, performance implications, and suggests specific improvements with examples
[[thorough]]: Catches major issues and most minor ones, provides good suggestions but may miss some edge cases
[[adequate]]: Identifies obvious issues and provides basic feedback, misses subtle problems
[[superficial]]: Only catches surface-level issues, feedback is vague or generic
[[inadequate]]: Misses critical issues or provides incorrect feedback""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.8,
        "adequate": 0.6,
        "superficial": 0.3,
        "inadequate": 0.0
    }
)

Real-world examples

Customer service quality

from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer
import mlflow

# Issue resolution status judge
resolution_judge = custom_prompt_judge(
    name="issue_resolution",
    prompt_template="""Evaluate if the customer's issue was resolved:

Customer Message: {{customer_message}}
Agent Response: {{agent_response}}
Issue Type: {{issue_type}}

Rate the resolution status:

[[fully_resolved]]: Issue completely addressed with clear solution provided
[[partially_resolved]]: Some progress made but follow-up needed
[[unresolved]]: Issue not addressed or solution unclear
[[escalated]]: Appropriately escalated to higher support tier""",
    numeric_values={
        "fully_resolved": 1.0,
        "partially_resolved": 0.5,
        "unresolved": 0.0,
        "escalated": 0.7  # Positive score for appropriate escalation
    }
)

# Empathy and tone judge
empathy_judge = custom_prompt_judge(
    name="empathy_score",
    prompt_template="""Assess the emotional intelligence of the response:

Customer Emotion: {{customer_emotion}}
Agent Response: {{agent_response}}

Rate the empathy shown:

[[exceptional]]: Acknowledges emotions, validates concerns, shows genuine care
[[good]]: Shows understanding and appropriate concern
[[adequate]]: Professional but somewhat impersonal
[[poor]]: Cold, dismissive, or inappropriate emotional response""",
    numeric_values={
        "exceptional": 1.0,
        "good": 0.75,
        "adequate": 0.5,
        "poor": 0.0
    }
)

# Create a comprehensive customer service scorer
@scorer
def customer_service_quality(inputs, outputs, trace):
    """Comprehensive customer service evaluation"""
    feedbacks = []

    # Evaluate resolution status
    feedbacks.append(resolution_judge(
        customer_message=inputs.get("message", ""),
        agent_response=outputs.get("response", ""),
        issue_type=inputs.get("issue_type", "general")
    ))

    # Evaluate empathy if customer shows emotion
    customer_emotion = inputs.get("detected_emotion", "neutral")
    if customer_emotion in ["frustrated", "angry", "upset", "worried"]:
        feedbacks.append(empathy_judge(
            customer_emotion=customer_emotion,
            agent_response=outputs.get("response", "")
        ))

    return feedbacks

# Example evaluation
eval_data = [
    {
        "inputs": {
            "message": "I've been waiting 3 weeks for my refund! This is unacceptable!",
            "issue_type": "refund",
            "detected_emotion": "angry"
        },
        "outputs": {
            "response": "I completely understand your frustration - 3 weeks is far too long to wait for a refund. I'm escalating this to our finance team immediately. You'll receive your refund within 24 hours, plus a $50 credit for the inconvenience. I'm also sending you my direct email so you can reach me if there are any other delays."
        }
    }
]

results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[customer_service_quality]
)

Content quality assessment

# Technical documentation quality judge
doc_quality_judge = custom_prompt_judge(
    name="documentation_quality",
    prompt_template="""Evaluate this technical documentation:

Content: {{content}}
Target Audience: {{audience}}
Documentation Type: {{doc_type}}

Rate the documentation quality:

[[excellent]]: Clear, complete, well-structured with examples, appropriate depth
[[good]]: Covers topic well, mostly clear, could use minor improvements
[[fair]]: Basic coverage, some unclear sections, missing important details
[[poor]]: Confusing, incomplete, or significantly flawed""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "fair": 0.4,
        "poor": 0.0
    }
)

# Marketing copy effectiveness
marketing_judge = custom_prompt_judge(
    name="marketing_effectiveness",
    prompt_template="""Rate this marketing copy's effectiveness:

Copy: {{copy}}
Product: {{product}}
Target Demographic: {{target_demographic}}
Call to Action: {{cta}}

Evaluate effectiveness:

[[highly_effective]]: Compelling, clear value prop, strong CTA, perfect for audience
[[effective]]: Good messaging, decent CTA, reasonably targeted
[[moderately_effective]]: Some good elements but lacks impact or clarity
[[ineffective]]: Weak messaging, unclear value, poor audience fit""",
    numeric_values={
        "highly_effective": 1.0,
        "effective": 0.7,
        "moderately_effective": 0.4,
        "ineffective": 0.0
    }
)

Code review quality

# Security review judge
security_review_judge = custom_prompt_judge(
    name="security_review_quality",
    prompt_template="""Evaluate the security aspects of this code review:

Original Code: {{code}}
Review Comments: {{review_comments}}
Security Vulnerabilities Found: {{vulnerabilities_mentioned}}

Rate the security review quality:

[[comprehensive]]: Identifies all security issues, explains risks, suggests secure alternatives
[[thorough]]: Catches major security flaws, good explanations
[[basic]]: Identifies obvious security issues only
[[insufficient]]: Misses critical security vulnerabilities""",
    numeric_values={
        "comprehensive": 1.0,
        "thorough": 0.75,
        "basic": 0.4,
        "insufficient": 0.0
    }
)

# Code clarity feedback judge
code_clarity_judge = custom_prompt_judge(
    name="code_clarity_feedback",
    prompt_template="""Assess the code review's feedback on readability:

Original Code Complexity: {{complexity_score}}
Review Feedback: {{review_comments}}
Readability Issues Identified: {{readability_issues}}

Rate the clarity feedback:

[[excellent]]: Identifies all clarity issues, suggests specific improvements, considers maintainability
[[good]]: Points out main clarity problems with helpful suggestions
[[adequate]]: Basic feedback on obvious readability issues
[[minimal]]: Superficial or missing important clarity feedback""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.7,
        "adequate": 0.4,
        "minimal": 0.0
    }
)

Healthcare communication

# Patient communication appropriateness
patient_comm_judge = custom_prompt_judge(
    name="patient_communication",
    prompt_template="""Evaluate this healthcare provider's response to a patient:

Patient Question: {{patient_question}}
Provider Response: {{provider_response}}
Patient Health Literacy Level: {{health_literacy}}
Sensitive Topics: {{sensitive_topics}}

Rate communication appropriateness:

[[excellent]]: Clear, compassionate, appropriate language level, addresses concerns fully
[[good]]: Generally clear and caring, minor room for improvement
[[acceptable]]: Adequate but could be clearer or more empathetic
[[poor]]: Unclear, uses too much jargon, or lacks appropriate empathy""",
    numeric_values={
        "excellent": 1.0,
        "good": 0.75,
        "acceptable": 0.5,
        "poor": 0.0
    }
)

# Clinical note quality
clinical_note_judge = custom_prompt_judge(
    name="clinical_note_quality",
    prompt_template="""Assess this clinical note's quality:

Note Content: {{note_content}}
Note Type: {{note_type}}
Required Elements: {{required_elements}}

Rate the clinical documentation:

[[comprehensive]]: All required elements present, clear, follows standards, actionable
[[complete]]: Most elements present, generally clear, minor gaps
[[incomplete]]: Missing important elements or lacks clarity
[[deficient]]: Significant gaps, unclear, or doesn't meet documentation standards""",
    numeric_values={
        "comprehensive": 1.0,
        "complete": 0.7,
        "incomplete": 0.3,
        "deficient": 0.0
    }
)

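Pairwise response comparison

Use prompt-based judges to compare two responses and determine which is better. This is useful for A/B testing, model comparison, or preference learning.

Note

Pairwise comparison judges evaluate two responses at once rather than a single response, so they cannot be used with mlflow.genai.evaluate() or as scorers. Use them directly for comparative analysis.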

from mlflow.genai.judges import custom_prompt_judge

# Response preference judge
preference_judge = custom_prompt_judge(
    name="response_preference",
    prompt_template="""Compare these two responses to the same question and determine which is better:

Question: {{question}}

Response A: {{response_a}}

Response B: {{response_b}}

Evaluation Criteria:
1. Accuracy and completeness of information
2. Clarity and ease of understanding
3. Helpfulness and actionability
4. Appropriate tone for the context

Choose your preference:

[[strongly_prefer_a]]: Response A is significantly better across most criteria
[[slightly_prefer_a]]: Response A is marginally better overall
[[equal]]: Both responses are equally good (or equally poor)
[[slightly_prefer_b]]: Response B is marginally better overall
[[strongly_prefer_b]]: Response B is significantly better across most criteria""",
    numeric_values={
        "strongly_prefer_a": -1.0,
        "slightly_prefer_a": -0.5,
        "equal": 0.0,
        "slightly_prefer_b": 0.5,
        "strongly_prefer_b": 1.0
    }
)

# Example usage for model comparison
question = "How do I improve my GenAI app's response quality?"

response_model_v1 = """To improve response quality, you should:
1. Add more training data
2. Fine-tune your model
3. Use better prompts"""

response_model_v2 = """To improve your GenAI app's response quality, consider these strategies:

1. **Enhance your prompts**: Use clear, specific instructions with examples
2. **Implement evaluation**: Use MLflow's LLM judges to measure quality systematically
3. **Collect feedback**: Gather user feedback to identify improvement areas
4. **Iterate on weak areas**: Focus on responses that score poorly
5. **A/B test changes**: Compare versions to ensure improvements

Start with evaluation to establish a baseline, then iterate based on data."""

# Compare responses
feedback = preference_judge(
    question=question,
    response_a=response_model_v1,
    response_b=response_model_v2
)

print(f"Preference: {feedback.metadata['string_value']}")  # "strongly_prefer_b"
print(f"Score: {feedback.value}")  # 1.0
print(f"Rationale: {feedback.rationale}")

Specialized comparison judges

# Technical accuracy comparison for documentation
tech_comparison_judge = custom_prompt_judge(
    name="technical_comparison",
    prompt_template="""Compare these two technical explanations:

Topic: {{topic}}
Target Audience: {{audience}}

Explanation A: {{explanation_a}}

Explanation B: {{explanation_b}}

Focus on:
- Technical accuracy and precision
- Appropriate depth for the audience
- Use of examples and analogies
- Completeness without overwhelming detail

Which explanation is better?

[[a_much_better]]: A is significantly more accurate and appropriate
[[a_slightly_better]]: A is marginally better in accuracy or clarity
[[equivalent]]: Both are equally good technically
[[b_slightly_better]]: B is marginally better in accuracy or clarity
[[b_much_better]]: B is significantly more accurate and appropriate""",
    numeric_values={
        "a_much_better": -1.0,
        "a_slightly_better": -0.5,
        "equivalent": 0.0,
        "b_slightly_better": 0.5,
        "b_much_better": 1.0
    }
)

# Empathy comparison for customer service
empathy_comparison_judge = custom_prompt_judge(
    name="empathy_comparison",
    prompt_template="""Compare the emotional intelligence of these customer service responses:

Customer Situation: {{situation}}
Customer Emotion: {{emotion}}

Agent Response A: {{response_a}}

Agent Response B: {{response_b}}

Evaluate which response better:
- Acknowledges the customer's emotions
- Shows genuine understanding and care
- Offers appropriate emotional support
- Maintains professional boundaries

Which response shows better emotional intelligence?

[[a_far_superior]]: A shows much better emotional intelligence
[[a_better]]: A is somewhat more empathetic
[[both_good]]: Both show good emotional intelligence
[[b_better]]: B is somewhat more empathetic
[[b_far_superior]]: B shows much better emotional intelligence""",
    numeric_values={
        "a_far_superior": -1.0,
        "a_better": -0.5,
        "both_good": 0.0,
        "b_better": 0.5,
        "b_far_superior": 1.0
    }
)

Practical comparison workflow

# Compare outputs from different prompt versions
def compare_prompt_versions(test_cases, prompt_v1, prompt_v2, model_client):
    """Compare two prompt versions across multiple test cases"""
    results = []

    for test_case in test_cases:
        # Generate responses with each prompt
        response_v1 = model_client.generate(prompt_v1.format(**test_case))
        response_v2 = model_client.generate(prompt_v2.format(**test_case))

        # Compare responses
        feedback = preference_judge(
            question=test_case["question"],
            response_a=response_v1,
            response_b=response_v2
        )

        results.append({
            "question": test_case["question"],
            "preference": feedback.metadata["string_value"],
            "score": feedback.value,
            "rationale": feedback.rationale
        })

    # Analyze results
    avg_score = sum(r["score"] for r in results) / len(results)

    if avg_score < -0.2:
        print(f"Prompt V1 is preferred (avg score: {avg_score:.2f})")
    elif avg_score > 0.2:
        print(f"Prompt V2 is preferred (avg score: {avg_score:.2f})")
    else:
        print(f"Prompts perform similarly (avg score: {avg_score:.2f})")

    return results

# Compare different model outputs
def compare_models(questions, model_a, model_b, comparison_judge):
    """Compare two models across a set of questions"""
    win_counts = {"model_a": 0, "model_b": 0, "tie": 0}

    for question in questions:
        response_a = model_a.generate(question)
        response_b = model_b.generate(question)

        feedback = comparison_judge(
            question=question,
            response_a=response_a,
            response_b=response_b
        )

        # Count wins based on preference strength
        if feedback.value <= -0.5:
            win_counts["model_a"] += 1
        elif feedback.value >= 0.5:
            win_counts["model_b"] += 1
        else:
            win_counts["tie"] += 1

    print(f"Model comparison results: {win_counts}")
    return win_counts

Advanced usage patterns

Conditional scoring

Apply different evaluation criteria depending on context:

from mlflow.genai.judges import custom_prompt_judge
from mlflow.genai.scorers import scorer

@scorer
def adaptive_quality_scorer(inputs, outputs, trace):
    """Applies different judges based on context"""

    # Determine which judge to use based on input characteristics
    query_type = inputs.get("query_type", "general")

    if query_type == "technical":
        judge = custom_prompt_judge(
            name="technical_response",
            prompt_template="""Rate this technical response:

Question: {{question}}
Response: {{response}}
Required Depth: {{depth_level}}

[[expert]]: Demonstrates deep expertise, includes advanced concepts
[[proficient]]: Good technical accuracy, appropriate depth
[[basic]]: Correct but lacks depth or nuance
[[incorrect]]: Contains technical errors or misconceptions""",
            numeric_values={
                "expert": 1.0,
                "proficient": 0.75,
                "basic": 0.5,
                "incorrect": 0.0
            }
        )

        return judge(
            question=inputs["question"],
            response=outputs["response"],
            depth_level=inputs.get("required_depth", "intermediate")
        )

    elif query_type == "support":
        judge = custom_prompt_judge(
            name="support_response",
            prompt_template="""Rate this support response:

Issue: {{issue}}
Response: {{response}}
Customer Status: {{customer_status}}

[[excellent]]: Solves issue completely, proactive, appropriate for customer status
[[good]]: Addresses issue well, professional
[[fair]]: Partially helpful but incomplete
[[poor]]: Unhelpful or inappropriate""",
            numeric_values={
                "excellent": 1.0,
                "good": 0.7,
                "fair": 0.4,
                "poor": 0.0
            }
        )

        return judge(
            issue=inputs["question"],
            response=outputs["response"],
            customer_status=inputs.get("customer_status", "standard")
        )

Score aggregation strategies

Combine multiple judge scores intelligently:

@scorer
def weighted_quality_scorer(inputs, outputs, trace):
    """Combines multiple judges with weighted scoring"""

    # Define judges and their weights
    judges_config = [
        {
            "judge": custom_prompt_judge(
                name="accuracy",
                prompt_template="...",  # Your accuracy template
                numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0}
            ),
            "weight": 0.4,
            "args": {"question": inputs["question"], "response": outputs["response"]}
        },
        {
            "judge": custom_prompt_judge(
                name="completeness",
                prompt_template="...",  # Your completeness template
                numeric_values={"complete": 1.0, "partial": 0.5, "incomplete": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"], "requirements": inputs.get("requirements", [])}
        },
        {
            "judge": custom_prompt_judge(
                name="clarity",
                prompt_template="...",  # Your clarity template
                numeric_values={"clear": 1.0, "adequate": 0.6, "unclear": 0.0}
            ),
            "weight": 0.3,
            "args": {"response": outputs["response"]}
        }
    ]

    # Collect all feedbacks
    feedbacks = []
    weighted_score = 0.0

    for config in judges_config:
        feedback = config["judge"](**config["args"])
        feedbacks.append(feedback)

        # Add to weighted score if numeric
        if isinstance(feedback.value, (int, float)):
            weighted_score += feedback.value * config["weight"]

    # Add composite score as additional feedback
    from mlflow.entities import Feedback
    composite_feedback = Feedback(
        name="weighted_quality_score",
        value=weighted_score,
        rationale=f"Weighted combination of {len(judges_config)} quality dimensions"
    )
    feedbacks.append(composite_feedback)

    return feedbacks
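
The arithmetic behind the composite score can be checked in isolation. The sketch below reuses the weights from the scorer above (0.4, 0.3, 0.3) with made-up judge results. The `weighted_score` helper and its `renormalize` option are hypothetical additions: the scorer above simply skips non-numeric values, which lowers the composite when a judge fails, whereas renormalizing divides by the weight that was actually used.

```python
def weighted_score(scored, renormalize=False):
    """Combine (value, weight) pairs; value may be None when a judge failed."""
    used = [(v, w) for v, w in scored if isinstance(v, (int, float))]
    total = sum(v * w for v, w in used)
    if renormalize:
        weight_used = sum(w for _, w in used)
        return total / weight_used if weight_used else None
    return total


# Hypothetical results: accuracy "high", completeness "partial", clarity "adequate".
all_ok = [(1.0, 0.4), (0.5, 0.3), (0.6, 0.3)]
print(round(weighted_score(all_ok), 2))  # 0.73

# Same run, but the clarity judge failed and produced no numeric value.
one_failed = [(1.0, 0.4), (0.5, 0.3), (None, 0.3)]
print(round(weighted_score(one_failed), 2))                    # 0.55
print(round(weighted_score(one_failed, renormalize=True), 2))  # 0.79
```

Which behavior is preferable depends on whether a failed judge should count against the response or be ignored.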

Best practices

Designing effective choices

1. Make choices mutually exclusive and exhaustive

# Good - clear distinctions, covers all cases
"""[[approved]]: Meets all requirements, ready for production
[[needs_revision]]: Has issues that must be fixed before approval
[[rejected]]: Fundamental flaws, requires complete rework"""

# Bad - overlapping and ambiguous
"""[[good]]: The response is good
[[okay]]: The response is okay
[[fine]]: The response is fine"""

2. Provide specific criteria for each choice

# Good - specific, measurable criteria
"""[[secure]]: No vulnerabilities, follows all security best practices, includes input validation
[[mostly_secure]]: Minor security concerns that should be addressed but aren't critical
[[insecure]]: Contains vulnerabilities that could be exploited"""

# Bad - vague criteria
"""[[secure]]: Looks secure
[[not_secure]]: Has problems"""

3. Order choices logically (best to worst)

# Good - clear progression
numeric_values = {
    "exceptional": 1.0,
    "good": 0.75,
    "satisfactory": 0.5,
    "needs_improvement": 0.25,
    "unacceptable": 0.0
}

Designing numeric scales

1. Use consistent scales across judges

# All judges use 0-1 scale
quality_judge = custom_prompt_judge(..., numeric_values={"high": 1.0, "medium": 0.5, "low": 0.0})
accuracy_judge = custom_prompt_judge(..., numeric_values={"accurate": 1.0, "partial": 0.5, "wrong": 0.0})

2. Leave gaps for future refinement

# Allows adding intermediate levels later
numeric_values = {
    "excellent": 1.0,
    "good": 0.7,    # Gap allows for "very_good" at 0.85
    "fair": 0.4,    # Gap allows for "satisfactory" at 0.55
    "poor": 0.0
}

3. Consider ___domain-specific scales

# Academic grading scale
academic_scale = {
    "A": 4.0,
    "B": 3.0,
    "C": 2.0,
    "D": 1.0,
    "F": 0.0
}

# Net Promoter Score scale
nps_scale = {
    "promoter": 1.0,      # 9-10
    "passive": 0.0,       # 7-8
    "detractor": -1.0     # 0-6
}
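
Domain-specific scales and the advice to keep scales consistent across judges can pull in different directions. One way to reconcile them, sketched below, is to keep the ___domain scale for reporting and derive a 0-1 version for aggregation with min-max normalization. The `normalize_scale` helper is hypothetical and not part of MLflow.

```python
def normalize_scale(scale):
    """Rescale a choice-to-score mapping linearly so it spans 0.0 to 1.0."""
    low, high = min(scale.values()), max(scale.values())
    if high == low:
        raise ValueError("scale needs at least two distinct values")
    return {name: (value - low) / (high - low) for name, value in scale.items()}


academic_scale = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}
nps_scale = {"promoter": 1.0, "passive": 0.0, "detractor": -1.0}

print(normalize_scale(academic_scale))
# {'A': 1.0, 'B': 0.75, 'C': 0.5, 'D': 0.25, 'F': 0.0}
print(normalize_scale(nps_scale))
# {'promoter': 1.0, 'passive': 0.5, 'detractor': 0.0}
```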

Prompt design tips

1. Structure your prompt clearly

prompt_template = """[Clear Task Description]
Evaluate the technical accuracy of this response.

[Context Section]
Question: {{question}}
Response: {{response}}
Technical Domain: {{___domain}}

[Evaluation Criteria]
Consider: factual accuracy, appropriate depth, correct terminology

[Choice Definitions]
[[accurate]]: All technical facts correct, appropriate level of detail
[[mostly_accurate]]: Minor inaccuracies that don't affect core understanding
[[inaccurate]]: Contains significant errors or misconceptions"""

2. Include examples when helpful

prompt_template = """Assess the urgency level of this support ticket.

Ticket: {{ticket_content}}

Examples of each level:
- Critical: System down, data loss, security breach
- High: Major feature broken, blocking work
- Medium: Performance issues, non-critical bugs
- Low: Feature requests, minor UI issues

Choose urgency level:
[[critical]]: Immediate attention required, business impact
[[high]]: Urgent, significant user impact
[[medium]]: Important but not urgent
[[low]]: Can be addressed in normal workflow"""

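Comparison with guidelines-based judges

Aspect	Guidelines-based	Prompt-based
Evaluation type	Binary pass/fail	Multi-level categories
Scoring	"yes" or "no"	Custom choices with optional numeric values
Best for	Compliance, policy adherence	Quality assessment, satisfaction ratings
Iteration speed	Very fast: just update the guideline text	Moderate: choices may need adjusting
Business-user friendly	✅ High: natural-language rules	⚠️ Medium: requires understanding the full prompt and its choices
Aggregation	Count pass/fail rates	Calculate averages, track trends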

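Validation and error handling

Choice validation

The judge validates that:

- Choices are properly defined in [[choice_name]] format
- Choice names are alphanumeric (underscores are allowed)
- At least one choice is defined in the template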

# This will raise an error - no choices defined
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="Rate the response: {{response}}"
)
# ValueError: Prompt template must include choices denoted with [[CHOICE_NAME]]

Numeric value validation

When you use numeric_values, every choice must be mapped:

# This will raise an error - missing choice in numeric_values
invalid_judge = custom_prompt_judge(
    name="invalid",
    prompt_template="""Choose:
    [[option_a]]: First option
    [[option_b]]: Second option""",
    numeric_values={"option_a": 1.0}  # Missing option_b
)
# ValueError: numeric_values keys must match the choices
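
When a template has many choices, it can help to find the mismatch before constructing the judge. The sketch below compares the choice names found in a template with the keys of a mapping and reports what is missing or unexpected. The `check_numeric_values` helper and its regular expression are assumptions for illustration and do not reproduce MLflow's internal validation.

```python
import re


def check_numeric_values(prompt_template, numeric_values):
    """Return (missing, extra) choice names relative to the template."""
    # Assumed choice syntax: word characters inside double square brackets.
    choices = set(re.findall(r"\[\[(\w+)\]\]", prompt_template))
    keys = set(numeric_values)
    missing = sorted(choices - keys)  # choices without a numeric score
    extra = sorted(keys - choices)    # scores for choices not in the template
    return missing, extra


template = """Choose:
[[option_a]]: First option
[[option_b]]: Second option"""

print(check_numeric_values(template, {"option_a": 1.0}))
# (['option_b'], [])
print(check_numeric_values(template, {"option_a": 1.0, "option_b": 0.0}))
# ([], [])
```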

Template variable validation

Missing template variables raise an error at call time:

judge = custom_prompt_judge(
    name="test",
    prompt_template="{{request}} {{response}} [[good]]: Good"
)

# This will raise an error - missing 'response' variable
judge(request="Hello")
# KeyError: Template variable 'response' not found
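
Inside a scorer, a pre-flight check can turn this error into a clearer message or a skipped row. The following sketch lists the template variables that a set of keyword arguments does not cover. The `missing_template_variables` helper and its pattern are hypothetical; the judge's own substitution logic may treat whitespace or naming differently.

```python
import re


def missing_template_variables(prompt_template, **kwargs):
    """Return template variable names that were not supplied as keyword arguments."""
    # Assumed placeholder syntax: a word inside double curly braces.
    variables = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", prompt_template))
    return sorted(variables - set(kwargs))


template = "{{request}} {{response}} [[good]]: Good"

print(missing_template_variables(template, request="Hello"))
# ['response']
print(missing_template_variables(template, request="Hello", response="Hi"))
# []
```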

Next steps

- Create prompt-based scorers: a step-by-step guide to implementing prompt-based judges
- Guidelines-based judges: a simpler alternative for pass/fail criteria
- Custom scorers overview: learn how to wrap judges in custom scorers