Custom evaluators

Built-in evaluators are great out of the box for evaluating your application's generations. However, you might want to build your own code-based or prompt-based evaluator to meet your specific evaluation needs.

Code-based evaluators

Sometimes a large language model isn't needed for certain evaluation metrics. Code-based evaluators give you the flexibility to define metrics based on functions or callable classes. You can build your own code-based evaluator, for example, by creating a simple Python class that calculates the length of an answer in answer_length.py under the directory answer_len/:

Code-based evaluator example: answer length

class AnswerLengthEvaluator:
    def __init__(self):
        pass
    # A class is made callable by implementing the special method __call__
    def __call__(self, *, answer: str, **kwargs):
        return {"answer_length": len(answer)}

Then run the evaluator on a row of data by importing the callable class:

from answer_len.answer_length import AnswerLengthEvaluator

answer_length_evaluator = AnswerLengthEvaluator()
answer_length = answer_length_evaluator(answer="What is the speed of light?")

Code-based evaluator output: answer length

{"answer_length":27}
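A plain function works as a code-based evaluator as well, as long as it accepts named inputs and returns a dictionary of metric values. The following is a minimal sketch; the word-count metric and function name are hypothetical examples for illustration, not part of any SDK:

```python
def word_count_evaluator(*, answer: str, **kwargs):
    # A hypothetical function-based evaluator: split the answer on
    # whitespace and report the number of tokens as a metric.
    return {"word_count": len(answer.split())}

result = word_count_evaluator(answer="What is the speed of light?")
print(result)  # {'word_count': 6}
```

Whether you use a class or a function, the evaluator's contract is the same: keyword inputs in, a dictionary of named metrics out.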

Prompt-based evaluators

To build your own prompt-based large language model evaluator or AI-assisted annotator, you can create a custom evaluator based on a Prompty file. Prompty is a file with the .prompty extension for developing prompt templates. A Prompty asset is a markdown file with a modified front matter. The front matter is in YAML format and contains a number of metadata fields that define the model configuration and the expected inputs of the Prompty. Let's create a custom evaluator FriendlinessEvaluator to measure the friendliness of a response.
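To make the front-matter structure concrete, here is a minimal sketch of splitting a Prompty-style file into its YAML front matter and markdown prompt body. This is a simplified illustration only, not the actual Prompty loader, which also parses the YAML into a model configuration:

```python
def split_front_matter(text: str):
    # A Prompty asset opens with a YAML front matter delimited by
    # "---" lines, followed by the markdown prompt body.
    # split("---", 2) yields: text before the first delimiter (empty),
    # the front matter, and the remaining body.
    _, front_matter, body = text.split("---", 2)
    return front_matter.strip(), body.strip()

sample = """---
name: Friendliness Evaluator
inputs:
  response:
    type: string
---
system:
Rate the friendliness of the response."""

front, body = split_front_matter(sample)
print(front.splitlines()[0])  # name: Friendliness Evaluator
print(body.splitlines()[0])   # system:
```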

Prompt-based evaluator example: friendliness evaluator

First, create a friendliness.prompty file that describes the definition of the friendliness metric and its grading rubrics:

---
name: Friendliness Evaluator
description: Friendliness Evaluator to measure warmth and approachability of answers.
model:
  api: chat
  configuration:
    type: azure_openai
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
    azure_deployment: gpt-4o-mini
  parameters:
    temperature: 0.1
inputs:
  response:
    type: string
outputs:
  score:
    type: int
  explanation:
    type: string
---

system:
Friendliness assesses the warmth and approachability of the answer. Rate the friendliness of the response between one to five stars using the following scale:

One star: the answer is unfriendly or hostile

Two stars: the answer is mostly unfriendly

Three stars: the answer is neutral

Four stars: the answer is mostly friendly

Five stars: the answer is very friendly

Please assign a rating between 1 and 5 based on the tone and demeanor of the response.

**Example 1**
generated_query: I just don't feel like helping you! Your questions are getting very annoying.
output:
{"score": 1, "reason": "The response is not warm and resists providing helpful information."}
**Example 2**
generated_query: I'm sorry this watch is not working for you. Very happy to assist you with a replacement.
output:
{"score": 5, "reason": "The response is warm and empathetic, offering a resolution with care."}


**Here is the actual conversation to be scored:**
generated_query: {{response}}
output:

Then create a class FriendlinessEvaluator to load the Prompty file and process the output in JSON format:

import os
import json
from promptflow.client import load_flow


class FriendlinessEvaluator:
    def __init__(self, model_config):
        current_dir = os.path.dirname(__file__)
        prompty_path = os.path.join(current_dir, "friendliness.prompty")
        self._flow = load_flow(source=prompty_path, model={"configuration": model_config})

    def __call__(self, *, response: str, **kwargs):
        llm_response = self._flow(response=response)
        try:
            response = json.loads(llm_response)
        except json.JSONDecodeError:
            # Fall back to the raw string if the model did not return valid JSON
            response = llm_response
        return response
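The try/except in the class above implements a parse-or-fallback pattern: well-formed JSON from the model is converted to a Python dictionary, while anything else is passed through as the raw string. A standalone sketch of the same pattern, using only the standard library:

```python
import json

def parse_llm_output(llm_response: str):
    # Return a dict when the model emitted valid JSON;
    # otherwise pass the raw string through unchanged.
    try:
        return json.loads(llm_response)
    except json.JSONDecodeError:
        return llm_response

print(parse_llm_output('{"score": 5, "reason": "warm"}'))
# {'score': 5, 'reason': 'warm'}
print(parse_llm_output("Sure! Here is my rating..."))
# Sure! Here is my rating...
```

Passing the raw string through rather than raising keeps the evaluator usable even when the model occasionally ignores the output format, at the cost of downstream code needing to handle both shapes.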

Now you can create your own Prompty-based evaluator and run it on a row of data:

from friendliness.friend import FriendlinessEvaluator

friendliness_eval = FriendlinessEvaluator(model_config)

friendliness_score = friendliness_eval(response="I will not apologize for my behavior!")

Prompt-based evaluator output: friendliness evaluator

{
    'score': 1, 
    'reason': 'The response is hostile and unapologetic, lacking warmth or approachability.'
}