How to use the GPT-4o Realtime API via WebSockets (Preview)

Note

This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.

You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time.

Follow the instructions in this article to get started with the Realtime API via WebSockets. Use the Realtime API via WebSockets in server-to-server scenarios where low latency isn't a requirement.

Tip

In most cases, we recommend using the Realtime API via WebRTC for real-time audio streaming in client-side applications such as a web application or mobile app. WebRTC is designed for low-latency, real-time audio streaming and is the best choice for most use cases.

Supported models

The GPT-4o real-time models are available for global deployments in East US 2 and Sweden Central regions.

  • gpt-4o-mini-realtime-preview (2024-12-17)
  • gpt-4o-realtime-preview (2024-12-17)

You should use API version 2025-04-01-preview in the URL for the Realtime API.

For more information about supported models, see the models and versions documentation.

Prerequisites

Before you can use GPT-4o real-time audio, you need:

  • An Azure subscription.
  • An Azure OpenAI resource created in one of the supported regions (East US 2 or Sweden Central).
  • A deployment of the gpt-4o-realtime-preview or gpt-4o-mini-realtime-preview model in your Azure OpenAI resource.

Connection and authentication

The Realtime API (via /realtime) is built on the WebSockets API to facilitate fully asynchronous streaming communication between the end user and model.

The Realtime API is accessed via a secure WebSocket connection to the /realtime endpoint of your Azure OpenAI resource.

You can construct a full request URI by concatenating:

  • The secure WebSocket (wss://) protocol.
  • Your Azure OpenAI resource endpoint hostname, for example, my-aoai-resource.openai.azure.com.
  • The openai/realtime API path.
  • An api-version query string parameter for a supported API version, such as 2025-04-01-preview.
  • A deployment query string parameter with the name of your gpt-4o-realtime-preview or gpt-4o-mini-realtime-preview model deployment.

The following example is a well-constructed /realtime request URI:

wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2025-04-01-preview&deployment=gpt-4o-mini-realtime-preview-deployment-name
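As a sketch, you can also assemble the same URI programmatically. In the following Python snippet, the resource hostname and deployment name are placeholders; substitute your own values.

# Assemble a /realtime request URI from its parts.
# The hostname and deployment name below are placeholders.
from urllib.parse import urlencode

resource_host = "my-eastus2-openai-resource.openai.azure.com"
query = urlencode({
    "api-version": "2025-04-01-preview",
    "deployment": "gpt-4o-mini-realtime-preview-deployment-name",
})
realtime_uri = f"wss://{resource_host}/openai/realtime?{query}"
print(realtime_uri)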

To authenticate, use one of the following options (a minimal connection sketch follows the list):

  • Microsoft Entra (recommended): Use token-based authentication with the /realtime API for an Azure OpenAI resource with managed identity enabled. Apply the retrieved authentication token as a Bearer token in the Authorization header.
  • API key: An api-key can be provided in one of two ways:
    • Using an api-key connection header on the pre-handshake connection. This option isn't available in a browser environment.
    • Using an api-key query string parameter on the request URI. Query string parameters are encrypted when using https/wss.
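As a minimal connection sketch, the api-key header option might look like the following. This example assumes the third-party Python websockets package (not part of the service itself), and the URI and key are placeholders. Note that the header keyword argument is named additional_headers in websockets 14 and later (extra_headers in earlier releases).

import asyncio
import websockets  # third-party package: pip install websockets

REALTIME_URI = "wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2025-04-01-preview&deployment=gpt-4o-mini-realtime-preview-deployment-name"
API_KEY = "<your-api-key>"  # placeholder; prefer Microsoft Entra token auth in production

async def main():
    # Pass the api-key header on the pre-handshake connection.
    # For Microsoft Entra auth, send {"Authorization": f"Bearer {token}"} instead.
    async with websockets.connect(
        REALTIME_URI,
        additional_headers={"api-key": API_KEY},
    ) as ws:
        print("Connected to /realtime")

asyncio.run(main())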

Realtime API via WebSockets architecture

Once the WebSocket connection session to /realtime is established and authenticated, the functional interaction takes place via events for sending and receiving WebSocket messages. These events each take the form of a JSON object.

[Diagram of the Realtime API authentication and connection sequence.]
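For illustration, here's what a caller-sent event might look like, built as a Python dictionary and serialized to JSON before sending. The session.update event type and its fields follow the public Realtime API events reference.

import json

# session.update adjusts session-wide settings such as the voice and
# the output modalities; field names follow the Realtime API events reference.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "modalities": ["text", "audio"],
    },
}
payload = json.dumps(session_update)  # send this text frame over the WebSocket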

Events can be sent and received in parallel, and applications should generally handle them concurrently and asynchronously; a sketch of this pattern follows the list below.

  • A client-side caller establishes a connection to /realtime, which starts a new session.
  • A session automatically creates a default conversation. Multiple concurrent conversations aren't supported.
  • The conversation accumulates input signals until a response is started, either via a direct event by the caller or automatically by voice activity detection (VAD).
  • Each response consists of one or more items, which can encapsulate messages, function calls, and other information.
  • Each message item has content parts, allowing multiple modalities (text and audio) to be represented in a single item.
  • The session manages configuration of caller input handling (for example, user audio) and common output generation handling.
  • Each caller-initiated response.create can override some of the output response behavior, if desired.
  • Server-created items and the content parts in messages can be populated asynchronously and in parallel. For example, the client can receive audio, text, and function information concurrently in a round-robin fashion.
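The following sketch shows that concurrent pattern with asyncio, assuming an open connection object ws like the one from the earlier connection snippet: one task drains incoming server events while the caller sends a response.create event to start a response.

import asyncio
import json

async def receive_events(ws):
    # Server events arrive as JSON text frames; dispatch on their "type" field.
    async for raw in ws:
        event = json.loads(raw)
        print("received:", event.get("type"))

async def run(ws):
    # Handle receiving and sending concurrently; the event flow is full duplex.
    receiver = asyncio.create_task(receive_events(ws))
    # Ask the server to start generating a response on the default conversation.
    await ws.send(json.dumps({"type": "response.create"}))
    await receiver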

Try the quickstart

Now that you have the prerequisites, you can follow the instructions in the Realtime API quickstart to get started with the Realtime API via WebSockets.