General purpose evaluators

AI systems might generate textual responses that are incoherent, or that lack the general writing quality you might desire beyond minimum grammatical correctness. To address these issues, use Coherence and Fluency.

If you have a question-and-answer (QA) scenario with context and ground truth in addition to query and response data, you can also use our QAEvaluator, a composite evaluator that uses the relevant evaluators for judgment.

Model configuration for AI-assisted evaluators

For reference in the following code snippets, the AI-assisted evaluators use a model configuration as follows:

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
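
The snippet reads its settings from environment variables, which load_dotenv() pulls from a local .env file. As a minimal sketch, such a file might look like the following, with placeholder values only (fill in your own resource details):

AZURE_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_API_KEY=<your-api-key>
AZURE_DEPLOYMENT_NAME=<your-deployment-name>
AZURE_API_VERSION=<api-version>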

Tip

We recommend using o3-mini for a balance of reasoning capability and cost efficiency.

Coherence

CoherenceEvaluator measures the logical and orderly presentation of ideas in a response, allowing the reader to easily follow and understand the writer's train of thought. A coherent response directly addresses the question with clear connections between sentences and paragraphs, using appropriate transitions and a logical sequencing of ideas. Higher scores mean better coherence.

Coherence example

from azure.ai.evaluation import CoherenceEvaluator

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
coherence(
    query="Is Marie Curie is born in Paris?", 
    response="No, Marie Curie is born in Warsaw."
)

Coherence output

The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

{
    "coherence": 4.0,
    "gpt_coherence": 4.0,
    "coherence_reason": "The RESPONSE is coherent and directly answers the QUERY with relevant information, making it easy to follow and understand.",
    "coherence_result": "pass",
    "coherence_threshold": 3
}
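
Because the evaluator returns a plain dictionary, the pass/fail verdict and the reason can be consumed programmatically. A minimal sketch, reusing the model_config defined earlier:

from azure.ai.evaluation import CoherenceEvaluator

coherence = CoherenceEvaluator(model_config=model_config, threshold=3)
result = coherence(
    query="Was Marie Curie born in Paris?",
    response="No, Marie Curie was born in Warsaw.",
)

# Branch on the *_result field and surface the *_reason field for debugging.
if result["coherence_result"] == "fail":
    print(f"Coherence below threshold: {result['coherence_reason']}")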

Fluency

FluencyEvaluator measures the effectiveness and clarity of written communication, focusing on grammatical accuracy, vocabulary range, sentence complexity, coherence, and overall readability. It assesses how smoothly ideas are conveyed and how easily the reader can understand the text.

Fluency example

from azure.ai.evaluation import FluencyEvaluator

fluency = FluencyEvaluator(model_config=model_config, threshold=3)
fluency(
    response="No, Marie Curie is born in Warsaw."
)

Fluency output

The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.

{
    "fluency": 3.0,
    "gpt_fluency": 3.0,
    "fluency_reason": "The response is clear and grammatically correct, but it lacks complexity and variety in sentence structure, which is why it fits the \"Competent Fluency\" level.",
    "fluency_result": "pass",
    "fluency_threshold": 3
}

Question answering composite evaluator

QAEvaluator measures various aspects comprehensively in a question-and-answer scenario:

  • Relevance
  • Groundedness
  • Fluency
  • Coherence
  • Similarity
  • F1 score

QA example

from azure.ai.evaluation import QAEvaluator

qa_eval = QAEvaluator(model_config=model_config, threshold=3)
qa_eval(
    query="Where was Marie Curie born?", 
    context="Background: 1. Marie Curie was a chemist. 2. Marie Curie was born on November 7, 1867. 3. Marie Curie is a French scientist.",
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)

QA output

While F1 score outputs a numerical score on a 0-1 float scale, the other evaluators output numerical scores on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason fields can help you understand why the scores are high or low.

{
    "f1_score": 0.631578947368421,
    "f1_result": "pass",
    "f1_threshold": 3,
    "similarity": 4.0,
    "gpt_similarity": 4.0,
    "similarity_result": "pass",
    "similarity_threshold": 3,
    "fluency": 3.0,
    "gpt_fluency": 3.0,
    "fluency_reason": "The input Data should get a Score of 3 because it clearly conveys an idea with correct grammar and adequate vocabulary, but it lacks complexity and variety in sentence structure.",
    "fluency_result": "pass",
    "fluency_threshold": 3,
    "relevance": 3.0,
    "gpt_relevance": 3.0,
    "relevance_reason": "The RESPONSE does not fully answer the QUERY because it fails to explicitly state that Marie Curie was born in Warsaw, which is the key detail needed for a complete understanding. Instead, it only negates Paris, which does not fully address the question.",
    "relevance_result": "pass",
    "relevance_threshold": 3,
    "coherence": 2.0,
    "gpt_coherence": 2.0,
    "coherence_reason": "The RESPONSE provides some relevant information but lacks a clear and logical structure, making it difficult to follow. It does not directly answer the question in a coherent manner, which is why it falls into the \"Poorly Coherent Response\" category.",
    "coherence_result": "fail",
    "coherence_threshold": 3,
    "groundedness": 3.0,
    "gpt_groundedness": 3.0,
    "groundedness_reason": "The response attempts to answer the query about Marie Curie's birthplace but includes incorrect information by stating she was not born in Paris, which is irrelevant. It does provide the correct birthplace (Warsaw), but the misleading nature of the response affects its overall groundedness. Therefore, it deserves a score of 3.",
    "groundedness_result": "pass",
    "groundedness_threshold": 3
}
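
To score an entire dataset rather than a single row, the composite evaluator can be passed to the SDK's evaluate API. A minimal sketch, assuming a local data.jsonl (hypothetical file name) whose rows each contain query, context, response, and ground_truth fields:

from azure.ai.evaluation import evaluate, QAEvaluator

qa_eval = QAEvaluator(model_config=model_config, threshold=3)

# Run the composite evaluator over every row of the JSONL dataset;
# each line must supply the query/context/response/ground_truth inputs.
results = evaluate(
    data="data.jsonl",
    evaluators={"qa": qa_eval},
)

# The result includes per-row scores and aggregated metrics.
print(results["metrics"])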