你当前正在访问 Microsoft Azure Global Edition 技术文档网站。如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站，请访问 https://docs.azure.cn。

Azure OpenAI Evaluation (预览版)

2025-06-05

大型语言模型的评估是衡量这些模型在不同任务和维度上的性能的关键步骤。这对于微调的模型尤其重要，评估训练的性能提升（或损失）对于这类模型至关重要。全面的评估有助于了解模型的不同版本如何影响应用程序或方案。

Azure OpenAI 评估使开发人员能够创建评估运行，以针对预期的输入/输出对进行测试，从而通过各种关键指标评估模型的性能，例如准确性、可靠性和整体性能。

评估支持

区域可用性

澳大利亚东部
巴西南部
加拿大中部
美国中部
美国东部 2
法国中部
德国中西部
意大利北部
日本东部
日本西部
韩国中部
美国中北部
挪威东部
波兰中部
南非北部
东南亚
西班牙中部
瑞典中部
瑞士北部
瑞士西部
阿拉伯联合酋长国北部
英国南部
英国西部
西欧
美国西部
西部美国 2
美国西部 3

如果首选区域缺失，请参阅 Azure OpenAI 区域，并检查它是 Azure OpenAI 区域可用性区域之一。

支持的部署类型

标准
全球标准
数据区域标准
预配托管
全局预配托管
数据区域预配托管

评估 API （预览版）

使用评估 API，可以直接通过 API 调用测试模型输出，并编程方式评估模型质量和性能。若要使用评估 API，请查看 REST API 文档。

评估管道

测试数据

你需要将要测试的内容集合到一个事实数据集。数据集创建通常是一个迭代过程，可确保评估在一段时间内与方案保持相关。此事实数据集通常是手动制作的，表示预期的模型行为。该数据集也会被标记，并包括预期的答案。

注意

某些评估测试(如“情绪”和“有效的 JSON 或 XML”)不需要事实数据。

数据源需要采用 JSONL 格式。下面是 JSONL 评估数据集的两个示例:

{"question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.", "subject": "abstract_algebra", "A": "0", "B": "4", "C": "2", "D": "6", "answer": "B", "completion": "B"}
{"question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.", "subject": "abstract_algebra", "A": "8", "B": "2", "C": "24", "D": "120", "answer": "C", "completion": "C"}
{"question": "Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5", "subject": "abstract_algebra", "A": "0", "B": "1", "C": "0,1", "D": "0,4", "answer": "D", "completion": "D"}

上传并选择评估文件时，将返回前三行内容的预览:

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}

上传并选择评估文件时，将返回前三行内容的预览:

你可以选择任何以前上传的现有数据集，或上传新的数据集。

创建响应（可选）

在评估中使用的提示应与计划在生产中使用的提示相匹配。这些提示提供了模型要遵循的说明。与操场体验类似，你可以创建多个输入，以在提示中添加小样本示例。有关详细信息，请参阅提示工程技术，详细了解提示设计和提示工程中的一些高级技术。

你可以使用 {{input.column_name}} 格式在提示中引用输入数据，其中 column_name 对应于输入文件中列的名称。

后续步骤将使用 {{sample.output_text}} 格式引用在评估过程中生成的输出。

注意

你需要使用双大括号来确保正确引用数据。

模型部署

在 Azure OpenAI 中，需要创建用于评估的模型部署。可以根据需求选择和部署单个模型或多个模型。当您使用所选的测试标准对您的基本模型或微调模型进行评分时，将使用这些模型部署。还可以使用已部署的模型为提供的提示自动生成响应。

列表中的可用部署取决于你在 Azure OpenAI 资源中创建的部署。如果找不到所需的部署，可以从 Azure OpenAI 评估页创建新的部署。

测试条件

测试条件用于评估目标模型生成的每个输出的有效性。这些测试将输入数据与输出数据进行比较，以确保一致性。你可以灵活地配置不同的条件来测试和测量不同级别的输出的质量和相关性。

单击每个测试条件时，会看到不同类型的分级师以及预设架构，你可以根据自己的评估数据集和条件修改这些架构。

入门

在 Azure AI Foundry 门户中选择 Azure OpenAI 评估（预览版）。若要查看此视图，可能需要首先在受支持的区域中选择一个现有的 Azure OpenAI 资源。
选择“+ 新建评估”
选择要如何提供用于评估的测试数据。可以导入存储的聊天完成、使用提供的默认模板创建数据，或上传自己的数据。让我们逐步讲解如何上传自己的数据。

选择您的评估数据，这些数据将以 .jsonl 格式。如果已有数据，可以选择一个数据，或上传新数据。

上传新数据时，会在右侧看到文件的前三行作为预览：

如果需要示例测试文件，可以使用此示例 .jsonl 文本。此示例包含各种技术内容的句子，我们将评估这些句子的语义相似性。

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}      
{"input": [{"role": "system", "content": "Distill the technical content into a clear, concise summary that preserves the most important concepts and their practical applications."}, {"role": "user", "content": "Cosine similarity is a metric for measuring similarity between vectors by calculating the cosine of the angle between them. The outcome ranges from -1 to 1, with 1 indicating identical orientation, 0 orthogonality, and -1 opposite orientation. In machine learning and NLP, cosine similarity is useful for comparing document vectors, word embeddings, or any high-dimensional representations. It's often preferred over Euclidean distance because it's invariant to vector magnitude, focusing instead on orientation. This makes it particularly useful for comparing documents or embeddings of various lengths or scales."}], "output": "Cosine similarity evaluates vector similarity by computing their angle's cosine, with values from -1 to 1. It's widely used in ML and NLP for comparing document vectors and embeddings due to its magnitude-invariant nature."}      
{"input": [{"role": "system", "content": "Provide a succinct summary that captures the key technical concepts and their relationships while maintaining clarity and accuracy."}, {"role": "user", "content": "Neural network activation functions are crucial for introducing non-linearity to deep learning models. Commonly used activation functions include ReLU (Rectified Linear Unit), which outputs the input directly if positive and zero otherwise, sigmoid, which compresses values between 0 and 1, and tanh, which normalizes values between -1 and 1. Each function has specific benefits and applications: ReLU avoids vanishing gradients and enables sparse activation, sigmoid is useful for binary classification, and tanh is often used in hidden layers for its zero-centered output. The activation function choice greatly affects model training dynamics and ultimate performance."}], "output": "Activation functions introduce non-linearity to neural networks, with ReLU, sigmoid, and tanh being the most common. Each serves distinct purposes: ReLU prevents vanishing gradients, sigmoid is suitable for binary classification, and tanh offers zero-centered output, all affecting model performance differently."}      
{"input": [{"role": "system", "content": "Create a focused summary that emphasizes the main technical concepts and their practical implications while maintaining accuracy."}, {"role": "user", "content": "Batch normalization is a technique that stabilizes and accelerates deep neural network training by normalizing each layer's inputs. It normalizes activations using the mean and variance computed over the current data batch. This normalization reduces internal covariate shift, where the input distribution changes during training. Batch normalization includes learnable parameters (gamma and beta) allowing the network to reverse normalization if necessary. Benefits include enabling higher learning rates, lessening the importance of precise parameter initialization, and serving as a regularizer, potentially removing the need for dropout."}], "output": "Batch normalization stabilizes neural network training by normalizing layer inputs using batch statistics. It addresses internal covariate shift with learnable parameters, allowing higher learning rates and improved training dynamics, while offering regularization benefits."}      
{"input": [{"role": "system", "content": "Summarize the technical content while preserving the essential concepts and their interconnections, ensuring clarity and precision."}, {"role": "user", "content": "Gradient descent optimization algorithms are foundational to training machine learning models. The basic version updates parameters in the steepest descent direction of the loss function. Variants like Stochastic Gradient Descent (SGD) use random samples to estimate gradients, enhancing computational efficiency. Advanced optimizers like Adam combine momentum (helping overcome local minima) and RMSprop (adapting learning rates for each parameter). These optimizers maintain running averages of gradients and their squares, enabling adaptive learning rates and more efficient training. The optimizer choice can significantly influence model convergence speed and final performance."}], "output": "Gradient descent algorithms optimize ML models by updating parameters based on loss gradients. Variants like SGD use sampling for efficiency, while advanced optimizers like Adam combine momentum with adaptive learning rates for improved training performance."}      
{"input": [{"role": "system", "content": "Deliver a concise yet comprehensive summary that highlights the key technical aspects and their practical significance."}, {"role": "user", "content": "Cross-validation is a resampling method used to evaluate machine learning models on limited data. The most common form, k-fold cross-validation, involves dividing the dataset into k equal parts, training the model on k-1 folds while validating on the remaining fold, and rotating through all combinations. This provides a more robust model performance assessment than a single train-test split, helping detect overfitting and estimate model generalization to unseen data. Variations include stratified k-fold (maintaining class distribution), leave-one-out (k equals dataset size), and time series cross-validation for temporal data."}], "output": "Cross-validation evaluates ML models by training and testing on different data splits, typically using k-fold methodology. This approach offers better performance assessment than single splits, with variations for different data types and requirements."}      
{"input": [{"role": "system", "content": "Generate a clear and focused summary that captures the essential technical details while maintaining their relationships and significance."}, {"role": "user", "content": "Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. This approach is powerful in deep learning, where pre-trained models on large datasets (like ImageNet for computer vision or BERT for NLP) are fine-tuned on specific downstream tasks. Transfer learning reduces the need for large amounts of task-specific training data and computational resources, as the model has already learned useful features from the source ___domain. Common strategies include feature extraction (freezing pre-trained layers) and fine-tuning (updating all or some pre-trained weights)."}], "output": "Transfer learning reuses models trained on one task for different tasks, particularly effective in deep learning. It leverages pre-trained models through feature extraction or fine-tuning, reducing data and computational needs for new tasks."}      
{"input": [{"role": "system", "content": "Provide a precise and informative summary that distills the key technical concepts while maintaining their relationships and practical importance."}, {"role": "user", "content": "Ensemble methods combine multiple machine learning models to create a more robust and accurate predictor. Common techniques include bagging (training models on random data subsets), boosting (sequentially training models to correct earlier errors), and stacking (using a meta-model to combine base model predictions). Random Forests, a popular bagging method, create multiple decision trees using random feature subsets. Gradient Boosting builds trees sequentially, with each tree correcting the errors of previous ones. These methods often outperform single models by reducing overfitting and variance while capturing different data aspects."}], "output": "Ensemble methods enhance prediction accuracy by combining multiple models through techniques like bagging, boosting, and stacking. Popular implementations include Random Forests (using multiple trees with random features) and Gradient Boosting (sequential error correction), offering better performance than single models."}

如果要使用测试数据的输入创建新响应，可以选择“生成新响应”。这将将评估文件中的输入字段注入到你选择的模型生成输出的单个提示中。

你将选择所选的模型。如果没有模型，可以创建新的模型部署。所选模型将获取输入数据并生成其自己的唯一输出，在这种情况下，该输出将存储在名为 <a0/> 的变量中。然后，我们将稍后使用该输出作为测试条件的一部分。或者，可以手动提供自己的自定义系统消息和单个消息示例。

若要创建测试条件，请选择“ 添加”。对于我们提供的示例文件，我们将评估语义相似性。选择 模型评分器，其中包含语义相似性的测试条件预设。

选择顶部的 语义相似性 。滚动到底部，并在 User 中指定将 {{item.output}} 设置为 Ground truth，并将 {{sample.output_text}} 设置为 Output。这将采用评估 .jsonl 文件（提供的示例文件）的原始引用输出，并将其与在上一步中选择的模型生成的输出进行比较。

选择 “添加” 以添加此测试条件。如果要添加其他测试条件，可以在此步骤中添加它们。
你已准备好创建你的评估。提供评估名称，检查所有内容是否正确，并提交以创建评估作业。你将被带到评估作业的状态页，该页将显示状态为“正在等待”。

创建评估作业后，可以选择该作业以查看作业的完整详细信息：

对于语义相似性,“查看输出详细信息”包含可以复制/粘贴通过测试的 JSON 表示形式。

还可以通过在评估作业页面左上角选择 “+ 添加运行 ”按钮来添加更多 Eval 运行。

测试条件的类型

Azure OpenAI 评估在提供的示例中看到的语义相似性的基础上提供了各种评估测试标准。本部分更详细地介绍了每个测试条件。

真实性

通过将提交答案与专家答案进行比较来评估提交答案的事实准确性。

真实性功能通过将提交答案与专家答案进行比较来评估提交答案的事实准确性。使用详细的思考链 (CoT) 提示，评分者可确定提交的答案与专家的答案是相同或是相悖的，或者是专家答案的子集或超集。它无视风格、语法或标点符号的差异，只关注事实内容。在很多情况下，真实性功能非常有用，包括但不限于内容验证和教育工具，可确保 AI 提供的答案的准确性。

你可以通过选择提示旁边的下拉列表来查看用作此测试条件一部分的提示文本。当前提示文本为:

Prompt
You are comparing a submitted answer to an expert answer on a given question.
Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {ideal}
************
[Submission]: {completion}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

语义相似性

度量模型响应与引用之间的相似度。 Grades: 1 (completely different) - 5 (very similar)。

情绪

尝试识别输出的情感基调。

你可以通过选择提示旁边的下拉列表来查看用作此测试条件一部分的提示文本。当前提示文本为:

Prompt
You will be presented with a text generated by a large language model. Your job is to rate the sentiment of the text. Your options are:

A) Positive
B) Neutral
C) Negative
D) Unsure

[BEGIN TEXT]
***
[{text}]
***
[END TEXT]

First, write out in a step by step manner your reasoning about the answer to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line

字符串检查

验证输出是否与预期的字符串完全匹配。

字符串检查对允许多种评估条件的两个字符串变量执行各种二进制操作。它有助于验证各种字符串关系，包括相等、包含和特定模式。此评估程序允许区分大小写或不区分大小写的比较。它还为 true 或 false 结果提供指定的分数，允许基于比较结果的自定义评估结果。下面是支持的操作类型:

equals: 检查输出字符串是否完全等于评估字符串。
contains：检查评估字符串是否为输出字符串的子字符串。
starts-with: 检查输出字符串是否以评估字符串开头。
ends-with: 检查输出字符串是否以评估字符串结尾。

注意

在测试条件中设置某些参数时，可以在“变量”和“模板”之间进行选择。如果要引用输入数据中的列，请选择“变量”。如果要提供固定字符串，请选择“模板”。

有效的 JSON 或 XML

验证输出是否为有效的 JSON 或 XML。

匹配架构

确保输出遵循指定的结构。

条件匹配

评估模型的响应是否符合条件。成绩: 通过或失败。

你可以通过选择提示旁边的下拉列表来查看用作此测试条件一部分的提示文本。当前提示文本为:

Prompt
Your job is to assess the final response of an assistant based on conversation history and provided criteria for what makes a good response from the assistant. Here is the data:

[BEGIN DATA]
***
[Conversation]: {conversation}
***
[Response]: {response}
***
[Criteria]: {criteria}
***
[END DATA]

Does the response meet the criteria? First, write out in a step by step manner your reasoning about the criteria to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. "Y" for yes if the response meets the criteria, and "N" for no if it does not. At the end, repeat just the letter again by itself on a new line.
Reasoning:

文本质量

通过与参考文本进行比较来评估文本的质量。

总结：

BLEU 评分: 通过使用 BLEU 分数将生成文本与一个或多个高质量的参考翻译进行比较来评估生成的文本的质量。
ROUGE 分数: 通过使用 ROUGE 分数将生成的文本与参考摘要进行比较来评估生成的文本的质量。
余弦值: 也称为余弦相似性，度量两个文本嵌入(如模型输出和参考文本)在含义上的一致性，有助于评估它们之间的语义相似性。这是通过测量它们在矢量空间的距离完成的。

详：

BLEU (双语评估替补)分数常用于自然语言处理 (NLP) 和机器翻译。它广泛应用于文本汇总和文本生成用例。它评估生成的文本与参考文本的匹配程度。 BLEU 分数范围为 0 到 1，分数越高表示质量越好。

ROUGE（以召回为导向的要点评估替补）是用于评估自动汇总和机器翻译的一组指标。它度量生成的文本与参考摘要之间的重叠。 ROUGE 注重使用以召回为导向的措施，来评估生成的文本与参考文本之间的覆盖程度。 ROUGE 的评分提供了多种指标，包括：

ROUGE-1：生成文本和参考文本之间的单元语法（单个字词）重叠。
ROUGE-2：生成文本和参考文本之间的二元语法（两个连续字词）重叠。
ROUGE-3：生成文本和参考文本之间的三元语法（三个连续字词）重叠。
ROUGE-4：生成文本和参考文本之间的四元语法（四个连续字词）重叠。
ROUGE-5：生成文本和参考文本之间的五元语法（五个连续字词）重叠。
ROUGE-L：生成文本和参考文本之间的 L 元语法（L 个连续字词）重叠。

文本汇总和文档比较是 ROUGE 的最佳用例之一，尤其是在文本连贯性和相关性至关重要的方案中。

余弦相似性度量两个文本嵌入(如模型输出和参考文本)在含义上的一致性，有助于评估它们之间的语义相似性。与其他基于模型的评估程序相同，需要提供用于评估的模型部署。

重要

此评估程序仅支持嵌入模型:

text-embedding-3-small
text-embedding-3-large
text-embedding-ada-002

自定义提示

使用模型将输出归为一组指定标签的集合。此评估程序使用需要定义的自定义提示。

通过

Azure OpenAI Evaluation (预览版)

评估支持

区域可用性

支持的部署类型

评估 API （预览版）

评估管道

测试数据

评估格式

创建响应（可选）

模型部署

测试条件

入门

测试条件的类型

真实性

语义相似性

情绪

字符串检查

有效的 JSON 或 XML

匹配架构

条件匹配

文本质量

自定义提示

反馈

其他资源