
Azure OpenAI reasoning models

Azure OpenAI o-series models are designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, and math compared to previous iterations.

Key capabilities of the o-series models

  • Complex code generation: Capable of generating algorithms and handling advanced coding tasks to support developers.
  • Advanced problem solving: Ideal for comprehensive brainstorming sessions and addressing multifaceted challenges.
  • Complex document comparison: Well suited for analyzing contracts, case files, or legal documents to identify subtle differences.
  • Instruction following and workflow management: Particularly effective for managing workflows that require shorter contexts.

Availability

Region availability

| Model | Region | Limited access |
| --- | --- | --- |
| o3-pro | East US2 & Sweden Central (Global Standard) | Request access: o3 limited access model application. If you already have o3 access, no request is required for o3-pro. |
| codex-mini | East US2 & Sweden Central (Global Standard) | No access request needed. |
| o4-mini | Model availability | No access request is needed to use the core capabilities of this model. Request access: o4-mini reasoning summary feature |
| o3 | Model availability | Request access: o3 limited access model application |
| o3-mini | Model availability | Access is no longer restricted for this model. |
| o1 | Model availability | Access is no longer restricted for this model. |
| o1-preview | Model availability | This model is only available to customers who were granted access as part of the original limited access release. We aren't currently expanding access to o1-preview. |
| o1-mini | Model availability | No access request is needed for Global Standard deployments. Standard (regional) deployments are currently only available to select customers who were granted access as part of the o1-preview release. |

API & feature support

| Feature | codex-mini<br>2025-05-16 | o3-pro<br>2025-06-10 | o4-mini<br>2025-04-16 | o3<br>2025-04-16 | o3-mini<br>2025-01-31 | o1<br>2024-12-17 | o1-preview<br>2024-09-12 | o1-mini<br>2024-09-12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| API version | 2025-04-01-preview & v1 preview | 2025-04-01-preview & v1 preview | 2025-04-01-preview | 2025-04-01-preview | 2024-12-01-preview or later<br>2025-03-01-preview (recommended) | 2024-12-01-preview or later<br>2025-03-01-preview (recommended) | 2024-09-01-preview or later<br>2025-03-01-preview (recommended) | 2024-09-01-preview or later<br>2025-03-01-preview (recommended) |
| Developer messages | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
| Structured outputs | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
| Context window | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 200,000<br>Output: 100,000 | Input: 128,000<br>Output: 32,768 | Input: 128,000<br>Output: 65,536 |
| Reasoning effort | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
| Image input | ✅ | ✅ | ✅ | ✅ | - | ✅ | - | - |
| Chat Completions API | - | - | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Responses API | ✅ | ✅ | ✅ | ✅ | - | - | - | - |
| Functions/Tools | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
| Parallel tool calls | - | - | - | - | - | - | - | - |
| max_completion_tokens ¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| System messages ² | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - | - |
| Reasoning summary ³ | ✅ | - | ✅ | ✅ | - | - | - | - |
| Streaming ⁴ | - | - | ✅ | ✅ | ✅ | - | - | ✅ |

¹ Reasoning models only work with the max_completion_tokens parameter.

² The latest o*-series models support system messages to make migration easier. When you use a system message with o4-mini, o3, o3-mini, and o1, it's treated as a developer message. You shouldn't use both a developer message and a system message in the same API request.

³ Access to the chain-of-thought reasoning summary is limited to o3 and o4-mini.

⁴ Streaming for o3 is limited access only.

Note

  • To avoid timeouts, background mode is recommended for o3-pro (see the sketch after this list).
  • o3-pro doesn't currently support image generation.
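
As a hedged illustration of that recommendation, the sketch below submits an o3-pro request in background mode through the Responses API and polls until it completes. It assumes background mode is available to your deployment through the v1 preview Responses API shown later in this article, and that the model name `o3-pro` stands in for your own deployment name.

import time
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    base_url="https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",
    azure_ad_token_provider=token_provider,
    api_version="preview"
)

# Submit the request in background mode so a long-running o3-pro call
# doesn't hold the HTTP connection open until it times out.
response = client.responses.create(
    model="o3-pro",  # placeholder: replace with your o3-pro deployment name
    input="Draft a migration plan for moving a monolith to microservices.",
    background=True,
)

# Background responses start in a queued/in-progress state; poll until done.
while response.status in ("queued", "in_progress"):
    time.sleep(5)
    response = client.responses.retrieve(response.id)

print(response.status)
print(response.output_text)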

Not supported

The following are currently unsupported with reasoning models:

  • temperature, top_p, presence_penalty, frequency_penalty, logprobs, top_logprobs, logit_bias, max_tokens

Usage

These models don't currently support the same set of parameters as other models that use the chat completions API.

You'll need to upgrade your OpenAI client library for access to the latest parameters.

pip install openai --upgrade

If you're new to using Microsoft Entra ID for authentication, see How to configure Azure OpenAI in Azure AI Foundry Models with Microsoft Entra ID authentication.

import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), 
  azure_ad_token_provider=token_provider,
  api_version="2025-03-01-preview"
)

response = client.chat.completions.create(
    model="o1-new", # replace with the model deployment name of your o1-preview, or o1-mini model
    messages=[
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens = 5000

)

print(response.model_dump_json(indent=2))

Output:

{
  "id": "chatcmpl-AEj7pKFoiTqDPHuxOcirA9KIvf3yz",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Writing your first Python API is an exciting step in developing software that can communicate with other applications. An API (Application Programming Interface) allows different software systems to interact with each other, enabling data exchange and functionality sharing. Here are the steps you should consider when creating your first Python API...truncated for brevity.",
        "refusal": null,
        "role": "assistant",
        "function_call": null,
        "tool_calls": null
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "protected_material_code": {
          "filtered": false,
          "detected": false
        },
        "protected_material_text": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ],
  "created": 1728073417,
  "model": "o1-2024-12-17",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": "fp_503a95a7d8",
  "usage": {
    "completion_tokens": 1843,
    "prompt_tokens": 20,
    "total_tokens": 1863,
    "completion_tokens_details": {
      "audio_tokens": null,
      "reasoning_tokens": 448
    },
    "prompt_tokens_details": {
      "audio_tokens": null,
      "cached_tokens": 0
    }
  },
  "prompt_filter_results": [
    {
      "prompt_index": 0,
      "content_filter_results": {
        "custom_blocklists": {
          "filtered": false
        },
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "jailbreak": {
          "filtered": false,
          "detected": false
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ]
}

Reasoning effort

Note

Reasoning models have reasoning_tokens as part of completion_tokens_details in the model response. These are hidden tokens that aren't returned as part of the message response content, but the model uses them to help generate a final answer to your request. The 2024-12-01-preview API added an additional new parameter, reasoning_effort, which can be set to low, medium, or high with the latest o1 model. The higher the effort setting, the longer the model spends processing the request, which generally results in a larger number of reasoning_tokens.
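
For example, given the response object returned by the earlier Python chat completions call, the hidden reasoning token count can be read from the usage details, as in this minimal sketch (field names follow the JSON output shown above; it assumes that response object is still in scope).

# Inspect how many hidden reasoning tokens were consumed, using the
# `response` object from the earlier client.chat.completions.create call.
usage = response.usage
print("prompt tokens:", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
print("reasoning tokens:", usage.completion_tokens_details.reasoning_tokens)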

Developer messages

Functionally, developer messages ("role": "developer") are the same as system messages.

Adding a developer message to the previous code example looks as follows:

You'll need to upgrade your OpenAI client library for access to the latest parameters.

pip install openai --upgrade

If you're new to using Microsoft Entra ID for authentication, see How to configure Azure OpenAI with Microsoft Entra ID authentication.

import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"), 
  azure_ad_token_provider=token_provider,
  api_version="2025-03-01-preview"
)

response = client.chat.completions.create(
    model="o1-new", # replace with the model deployment name of your o1-preview, or o1-mini model
    messages=[
        {"role": "developer","content": "You are a helpful assistant."}, # optional equivalent to a system message for reasoning models 
        {"role": "user", "content": "What steps should I think about when writing my first Python API?"},
    ],
    max_completion_tokens = 5000,
    reasoning_effort = "medium" # low, medium, or high

)

print(response.model_dump_json(indent=2))

Reasoning summary

When using the latest o3 and o4-mini models with the Responses API, you can use the reasoning summary parameter to receive summaries of the model's chain-of-thought reasoning. This parameter can be set to auto, concise, or detailed. Access to this feature requires you to Request Access.

Note

Even when enabled, reasoning summaries aren't generated for every step/request. This is expected behavior.

You'll need to upgrade your OpenAI client library for access to the latest parameters.

pip install openai --upgrade

from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(  
  base_url = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/",  
  azure_ad_token_provider=token_provider,
  api_version="preview"
)

response = client.responses.create(
    input="Tell me about the curious case of neural text degeneration",
    model="o4-mini", # replace with model deployment name
    reasoning={
        "effort": "medium",
        "summary": "detailed" # auto, concise, or detailed (currently only supported with o4-mini and o3)
    }
)

print(response.model_dump_json(indent=2))

Output:

{
  "id": "resp_68007e26b2cc8190b83361014f3a78c50ae9b88522c3ad24",
  "created_at": 1744862758.0,
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "metadata": {},
  "model": "o4-mini",
  "object": "response",
  "output": [
    {
      "id": "rs_68007e2773bc8190b5b8089949bfe13a0ae9b88522c3ad24",
      "summary": [
        {
          "text": "**Summarizing neural text degeneration**\n\nThe user's asking about \"The Curious Case of Neural Text Degeneration,\" a paper by Ari Holtzman et al. from 2020. It explains how certain decoding strategies produce repetitive and dull text. In contrast, methods like nucleus sampling yield more coherent and diverse outputs. The authors introduce metrics like surprisal and distinct-n for evaluation and suggest that maximum likelihood decoding often favors generic continuations, leading to loops and repetitive patterns in longer texts. They promote sampling from truncated distributions for improved text quality.",
          "type": "summary_text"
        },
        {
          "text": "**Explaining nucleus sampling**\n\nThe authors propose nucleus sampling, which captures a specified mass of the predictive distribution, improving metrics such as coherence and diversity. They identify a \"sudden drop\" phenomenon in token probabilities, where a few tokens dominate, leading to a long tail. By truncating this at a cumulative probability threshold, they aim to enhance text quality compared to top-k sampling. Their evaluations include human assessments, showing better results in terms of BLEU scores and distinct-n measures. Overall, they highlight how decoding strategies influence quality and recommend adaptive techniques for improved outcomes.",
          "type": "summary_text"
        }
      ],
      "type": "reasoning",
      "status": null
    },
    {
      "id": "msg_68007e35c44881908cb4651b8e9972300ae9b88522c3ad24",
      "content": [
        {
          "annotations": [],
          "text": "Researchers first became aware that neural language models, when used to generate long stretches of text with standard “maximum‐likelihood” decoding (greedy search, beam search, etc.), often produce bland, repetitive or looping output. The 2020 paper “The Curious Case of Neural Text Degeneration” (Holtzman et al.) analyzes this failure mode and proposes a simple fix—nucleus (top‑p) sampling—that dramatically improves output quality.\n\n1. The Problem: Degeneration  \n   • With greedy or beam search, models tend to pick very high‑probability tokens over and over, leading to loops (“the the the…”) or generic, dull continuations.  \n   • Even sampling with a fixed top‑k (e.g. always sample from the 40 most likely tokens) can be suboptimal: if the model’s probability mass is skewed, k may be too small (overly repetitive) or too large (introducing incoherence).\n\n2. Why It Happens: Distributional Peakedness  \n   • At each time step the model’s predicted next‐token distribution often has one or two very high‑probability tokens, then a long tail of low‑probability tokens.  \n   • Maximum‐likelihood decoding zeroes in on the peak, collapsing diversity.  \n   • Uniform sampling over a large k allows low‑probability “wild” tokens, harming coherence.\n\n3. The Fix: Nucleus (Top‑p) Sampling  \n   • Rather than fixing k, dynamically truncate the distribution to the smallest set of tokens whose cumulative probability ≥ p (e.g. p=0.9).  \n   • Then renormalize and sample from that “nucleus.”  \n   • This keeps only the “plausible” mass and discards the improbable tail, adapting to each context.\n\n4. Empirical Findings  \n   • Automatic metrics (distinct‑n, repetition rates) and human evaluations show nucleus sampling yields more diverse, coherent, on‑topic text than greedy/beam or fixed top‑k.  \n   • It also outperforms simple temperature scaling (raising logits to 1/T) because it adapts to changes in the distribution’s shape.\n\n5. Takeaways for Practitioners  \n   • Don’t default to beam search for open-ended generation—its high likelihood doesn’t mean high quality.  \n   • Use nucleus sampling (p between 0.8 and 0.95) for a balance of diversity and coherence.  \n   • Monitor repetition and distinct‑n scores if you need automatic sanity checks.\n\nIn short, “neural text degeneration” is the tendency of likelihood‐maximizing decoders to produce dull or looping text. By recognizing that the shape of the model’s probability distribution varies wildly from step to step, nucleus sampling provides an elegant, adaptive way to maintain both coherence and diversity in generated text.",
          "type": "output_text"
        }
      ],
      "role": "assistant",
      "status": "completed",
      "type": "message"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [],
  "top_p": 1.0,
  "max_output_tokens": null,
  "previous_response_id": null,
  "reasoning": {
    "effort": "medium",
    "generate_summary": null,
    "summary": "detailed"
  },
  "status": "completed",
  "text": {
    "format": {
      "type": "text"
    }
  },
  "truncation": "disabled",
  "usage": {
    "input_tokens": 16,
    "output_tokens": 974,
    "output_tokens_details": {
      "reasoning_tokens": 384
    },
    "total_tokens": 990,
    "input_tokens_details": {
      "cached_tokens": 0
    }
  },
  "user": null,
  "store": true
}

Markdown output

By default, the o3-mini and o1 models won't attempt to produce output that includes markdown formatting. A common use case where this behavior is undesirable is when you want the model to output code contained within a markdown code block. When the model generates output without markdown formatting, you lose features like syntax highlighting and copyable code blocks in interactive playground experiences. To override this new default behavior and encourage markdown inclusion in model responses, add the string Formatting re-enabled to the beginning of your developer message.

Adding Formatting re-enabled to the beginning of your developer message doesn't guarantee that the model will include markdown formatting in its response; it only increases the likelihood. We have found in internal testing that Formatting re-enabled is less effective by itself with the o1 model than with o3-mini.

To improve the performance of Formatting re-enabled, you can further augment the beginning of the developer message, which will often produce the desired output. Rather than just adding Formatting re-enabled to the beginning of your developer message, you can experiment with adding a more descriptive initial instruction like one of the following examples:

  • Formatting re-enabled - please enclose code blocks with appropriate markdown tags.
  • Formatting re-enabled - code output should be wrapped in markdown.

Depending on your expected output, you may need to customize your initial developer message further to target your specific use case.
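
As a brief, hedged illustration of that guidance, the sketch below reuses the chat completions pattern from earlier in this article and places one of the augmented Formatting re-enabled instructions at the start of the developer message. The deployment name is a placeholder you would replace with your own.

import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
  azure_ad_token_provider=token_provider,
  api_version="2025-03-01-preview"
)

response = client.chat.completions.create(
    model="o3-mini",  # placeholder: replace with your reasoning model deployment name
    messages=[
        {
            # Start the developer message with "Formatting re-enabled" to
            # encourage markdown (for example, fenced code blocks) in the reply.
            "role": "developer",
            "content": "Formatting re-enabled - please enclose code blocks with appropriate markdown tags. You are a helpful coding assistant."
        },
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_completion_tokens=4000
)

print(response.choices[0].message.content)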