General-purpose evaluators

2025-05-19

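AI systems can generate text responses that are incoherent, or that lack the general writing quality you want beyond minimal grammatical correctness. To address these issues, use the coherence and fluency evaluators.

If you have a question-answering (QA) scenario with context and ground truth data in addition to query and response data, you can also use QAEvaluator, a composite evaluator that combines the relevant evaluators to reach a judgment.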

Model configuration for AI-assisted evaluators

For reference in the code snippets that follow, the AI-assisted evaluators use a model configuration like this:

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv

# Load variables from a local .env file into the process environment.
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
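The configuration reads four environment variables, and a missing one is easier to diagnose if it is reported before the configuration is built. The following is a minimal sketch of such a check. The helper name missing_settings is hypothetical and is not part of azure.ai.evaluation; the variable names are the ones used in the snippet above.

```python
import os
from typing import List, Mapping

# Variable names taken from the model configuration snippet.
REQUIRED_SETTINGS = (
    "AZURE_ENDPOINT",
    "AZURE_API_KEY",
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_API_VERSION",
)

def missing_settings(env: Mapping[str, str], required=REQUIRED_SETTINGS) -> List[str]:
    """Return the required names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]

if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
```

Passing the mapping in as an argument, rather than reading os.environ inside the helper, keeps the check easy to exercise with a plain dictionary.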

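Tip

We recommend using o3-mini for a balance of reasoning capability and cost efficiency.

Coherence

CoherenceEvaluator measures how logically and in what order ideas are presented in a response, so that the reader can easily follow and understand the writer's train of thought. A coherent response addresses the question directly, with clear connections between sentences and paragraphs, appropriate transitions, and a logical sequence of ideas. A higher score means better coherence.

Coherence example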

from azure.ai.evaluation import CoherenceEvaluator

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
coherence(
    query="Is Marie Curie is born in Paris?", 
    response="No, Marie Curie is born in Warsaw."
)

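Coherence output

The output is a numerical score on a Likert scale (an integer from 1 to 5), where a higher score is better. Given a numerical threshold (the default is 3), the evaluator also outputs "pass" if the score >= threshold, and "fail" otherwise. The reason field helps you understand why the score is high or low.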

{
    "coherence": 4.0,
    "gpt_coherence": 4.0,
    "coherence_reason": "The RESPONSE is coherent and directly answers the QUERY with relevant information, making it easy to follow and understand.",
    "coherence_result": "pass",
    "coherence_threshold": 3
}
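The pass/fail rule described for the threshold can be sketched in a few lines. This only illustrates the rule as stated (score >= threshold); it is not the evaluator's own implementation, and the function name to_result is hypothetical.

```python
def to_result(score: float, threshold: float = 3) -> str:
    """Map a numeric score to "pass" or "fail" using score >= threshold."""
    return "pass" if score >= threshold else "fail"

print(to_result(4.0))  # pass
print(to_result(2.0))  # fail
```

Because the comparison is inclusive, a score exactly equal to the threshold counts as a pass.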

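Fluency

FluencyEvaluator measures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.

Fluency example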

from azure.ai.evaluation import FluencyEvaluator

fluency = FluencyEvaluator(model_config=model_config, threshold=3)
fluency(
    response="No, Marie Curie is born in Warsaw."
)

Fluency output

{
    "fluency": 3.0,
    "gpt_fluency": 3.0,
    "fluency_reason": "The response is clear and grammatically correct, but it lacks complexity and variety in sentence structure, which is why it fits the \"Competent Fluency\" level.",
    "fluency_result": "pass",
    "fluency_threshold": 3
}
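When several responses need scoring, the single call shown in the example can be wrapped in a small loop. The sketch below assumes only that the evaluator is a callable that accepts a response keyword argument and returns a dictionary, as the fluency instance does in the example. The helper name evaluate_responses is hypothetical and is not part of the SDK.

```python
from typing import Callable, Dict, Iterable, List

def evaluate_responses(evaluator: Callable[..., Dict], responses: Iterable[str]) -> List[Dict]:
    """Call the evaluator once per response and attach the input text to each result."""
    results = []
    for text in responses:
        # Copy the returned dictionary so the evaluator's own object is not mutated.
        result = dict(evaluator(response=text))
        result["response"] = text
        results.append(result)
    return results
```

For example, evaluate_responses(fluency, ["No, Marie Curie is born in Warsaw."]) would return a list with one result dictionary, extended with the response text it was computed from.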

Question-answering composite evaluator

QAEvaluator comprehensively measures several aspects of a question-answering scenario:

Relevance
Groundedness
Fluency
Coherence
Similarity
F1 score
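Of the aspects listed, the F1 score is the only one that does not rely on a model judgment. A common formulation is the harmonic mean of token precision and token recall between the response and the ground truth. The sketch below follows that formulation; the normalization steps (lowercasing, stripping punctuation and English articles) are an assumption about typical practice, not a statement of how QAEvaluator is implemented, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred, gold = normalize(response), normalize(ground_truth)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

score = token_f1(
    "According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    "Marie Curie was born in Warsaw.",
)
print(round(score, 4))  # 0.6316
```

For the response and ground truth used in the QA example in this section, the sketch gives 6 shared tokens out of 13 response tokens and 6 ground truth tokens, so precision is 6/13, recall is 1, and the F1 score is 12/19, about 0.6316.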

QA example

from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config=model_config, threshold=3)
qa_eval(
    query="Where was Marie Curie born?", 
    context="Background: 1. Marie Curie was a chemist. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist.",
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

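QA output

The F1 score is output as a floating-point value from 0 to 1. The other evaluators output a numerical score on a Likert scale (an integer from 1 to 5), where a higher score is better. Given a numerical threshold (the default is 3), each evaluator also outputs "pass" if the score >= threshold, and "fail" otherwise. The reason fields help you understand why a score is high or low.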

{
    "f1_score": 0.631578947368421,
    "f1_result": "pass",
    "f1_threshold": 3,
    "similarity": 4.0,
    "gpt_similarity": 4.0,
    "similarity_result": "pass",
    "similarity_threshold": 3,
    "fluency": 3.0,
    "gpt_fluency": 3.0,
    "fluency_reason": "The input Data should get a Score of 3 because it clearly conveys an idea with correct grammar and adequate vocabulary, but it lacks complexity and variety in sentence structure.",
    "fluency_result": "pass",
    "fluency_threshold": 3,
    "relevance": 3.0,
    "gpt_relevance": 3.0,
    "relevance_reason": "The RESPONSE does not fully answer the QUERY because it fails to explicitly state that Marie Curie was born in Warsaw, which is the key detail needed for a complete understanding. Instead, it only negates Paris, which does not fully address the question.",
    "relevance_result": "pass",
    "relevance_threshold": 3,
    "coherence": 2.0,
    "gpt_coherence": 2.0,
    "coherence_reason": "The RESPONSE provides some relevant information but lacks a clear and logical structure, making it difficult to follow. It does not directly answer the question in a coherent manner, which is why it falls into the \"Poorly Coherent Response\" category.",
    "coherence_result": "fail",
    "coherence_threshold": 3,
    "groundedness": 3.0,
    "gpt_groundedness": 3.0,
    "groundedness_reason": "The response attempts to answer the query about Marie Curie's birthplace but includes incorrect information by stating she was not born in Paris, which is irrelevant. It does provide the correct birthplace (Warsaw), but the misleading nature of the response affects its overall groundedness. Therefore, it deserves a score of 3.",
    "groundedness_result": "pass",
    "groundedness_threshold": 3
}
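A composite result like the one above has many keys, so it can help to pull out only the metrics that failed. The following is a minimal sketch that relies on the "<metric>_result" naming visible in the sample output. The helper name failing_metrics is hypothetical, and the sample dictionary is an abbreviated copy of that output.

```python
def failing_metrics(result: dict) -> list:
    """Return metric names whose '<metric>_result' entry is 'fail', sorted by name."""
    suffix = "_result"
    return sorted(
        key[: -len(suffix)]
        for key, value in result.items()
        if key.endswith(suffix) and value == "fail"
    )

# Abbreviated copy of the QA output: only the *_result keys are kept.
sample = {
    "f1_result": "pass",
    "similarity_result": "pass",
    "fluency_result": "pass",
    "relevance_result": "pass",
    "coherence_result": "fail",
    "groundedness_result": "pass",
}
print(failing_metrics(sample))  # ['coherence']
```

The matching "<metric>_reason" field, where present, can then be read for each failing metric to see why it scored low.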