Note
This feature is currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Azure OpenAI GPT-4o Realtime API for speech and audio is part of the GPT-4o model family that supports low-latency, "speech in, speech out" conversational interactions.
You can use the Realtime API via WebRTC or WebSocket to send audio input to the model and receive audio responses in real time.
Follow the instructions in this article to get started with the Realtime API via WebSockets. Use the Realtime API via WebSockets in server-to-server scenarios where low latency isn't a requirement.
Tip
In most cases, we recommend using the Realtime API via WebRTC for real-time audio streaming in client-side applications such as a web application or mobile app. WebRTC is designed for low-latency, real-time audio streaming and is the best choice for most use cases.
Supported models
The GPT-4o real-time models are available for global deployments in East US 2 and Sweden Central regions:
- gpt-4o-mini-realtime-preview (2024-12-17)
- gpt-4o-realtime-preview (2024-12-17)
You should use API version 2025-04-01-preview in the URL for the Realtime API.
For more information about supported models, see the models and versions documentation.
Prerequisites
Before you can use GPT-4o real-time audio, you need:
- An Azure subscription - Create one for free.
- An Azure OpenAI resource created in a supported region. For more information, see Create a resource and deploy a model with Azure OpenAI.
- A deployment of the gpt-4o-realtime-preview or gpt-4o-mini-realtime-preview model in a supported region, as described in the supported models section. You can deploy the model from the Azure AI Foundry portal model catalog or from your project in the Azure AI Foundry portal.
Connection and authentication
The Realtime API (via /realtime) is built on the WebSockets API to facilitate fully asynchronous streaming communication between the end user and the model. It's accessed via a secure WebSocket connection to the /realtime endpoint of your Azure OpenAI resource.
You can construct a full request URI by concatenating:
- The secure WebSocket (wss://) protocol.
- Your Azure OpenAI resource endpoint hostname, for example, my-aoai-resource.openai.azure.com.
- The openai/realtime API path.
- An api-version query string parameter for a supported API version, such as 2025-04-01-preview.
- A deployment query string parameter with the name of your gpt-4o-realtime-preview or gpt-4o-mini-realtime-preview model deployment.
The following example is a well-constructed /realtime request URI:
wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2025-04-01-preview&deployment=gpt-4o-mini-realtime-preview-deployment-name
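To illustrate the concatenation, here's a minimal Python sketch that assembles the same URI from its parts using only the standard library. The resource hostname and deployment name are the placeholder values from the example above, not real resources:

```python
from urllib.parse import urlencode

# Placeholder values; substitute your own resource and deployment names.
resource_host = "my-eastus2-openai-resource.openai.azure.com"
deployment = "gpt-4o-mini-realtime-preview-deployment-name"
api_version = "2025-04-01-preview"

# wss:// protocol + resource hostname + openai/realtime path + query string.
query = urlencode({"api-version": api_version, "deployment": deployment})
realtime_uri = f"wss://{resource_host}/openai/realtime?{query}"

print(realtime_uri)
# wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime?api-version=2025-04-01-preview&deployment=gpt-4o-mini-realtime-preview-deployment-name
```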
To authenticate:
- Microsoft Entra (recommended): Use token-based authentication with the /realtime API for an Azure OpenAI resource with managed identity enabled. Apply the retrieved authentication token as a Bearer token on the Authorization header.
- API key: An api-key can be provided in one of two ways:
  - Using an api-key connection header on the prehandshake connection. This option isn't available in a browser environment.
  - Using an api-key query string parameter on the request URI. Query string parameters are encrypted when using https/wss.
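To make the two options concrete, here's a minimal sketch using the azure-identity and websockets Python packages. Both packages are choices of this example rather than requirements of the API; any WebSocket client that can set connection headers works, and the URI is the placeholder from the earlier example:

```python
import asyncio
import websockets
from azure.identity import DefaultAzureCredential

# Placeholder URI from the earlier example; substitute your own resource and deployment.
REALTIME_URI = (
    "wss://my-eastus2-openai-resource.openai.azure.com/openai/realtime"
    "?api-version=2025-04-01-preview"
    "&deployment=gpt-4o-mini-realtime-preview-deployment-name"
)

async def connect_with_entra():
    # Option 1 (recommended): Microsoft Entra token applied as a Bearer token.
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")
    headers = {"Authorization": f"Bearer {token.token}"}
    # The keyword is additional_headers in websockets >= 14 (extra_headers in older releases).
    async with websockets.connect(REALTIME_URI, additional_headers=headers) as ws:
        ...  # exchange Realtime API events here

async def connect_with_api_key(api_key: str):
    # Option 2: api-key connection header (not available in a browser environment).
    async with websockets.connect(REALTIME_URI, additional_headers={"api-key": api_key}) as ws:
        ...  # exchange Realtime API events here

if __name__ == "__main__":
    asyncio.run(connect_with_entra())
```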
Realtime API via WebSockets architecture
Once the WebSocket connection to /realtime is established and authenticated, the functional interaction takes place via events for sending and receiving WebSocket messages. Each event takes the form of a JSON object. Events can be sent and received in parallel, and applications should generally handle them both concurrently and asynchronously.
- A client-side caller establishes a connection to /realtime, which starts a new session.
- A session automatically creates a default conversation. Multiple concurrent conversations aren't supported.
- The conversation accumulates input signals until a response is started, either via a direct event from the caller or automatically by voice activity detection (VAD).
- Each response consists of one or more items, which can encapsulate messages, function calls, and other information.
- Each message item has content_part elements, allowing multiple modalities (text and audio) to be represented across a single item.
- The session manages the configuration of caller input handling (for example, user audio) and common output generation handling.
- Each caller-initiated response.create can override some of the output response behavior, if desired.
- Server-created items and the content_part elements in messages can be populated asynchronously and in parallel, for example, receiving audio, text, and function information concurrently in a round-robin fashion.
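To make the event flow concrete, here's a minimal sketch of a send-and-receive loop, continuing the hypothetical websockets-based connection from the previous example. The session.update and response.create client events, the response.done server event, and the type field are part of the Realtime API event protocol; the specific session fields and instructions shown are illustrative, not required values:

```python
import json

async def run_conversation(ws):
    # Configure the default session created by the connection.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"modalities": ["text", "audio"], "voice": "alloy"},
    }))

    # Start a response for the default conversation instead of waiting for VAD.
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {"instructions": "Say hello to the user."},
    }))

    # Every server message is a JSON event object with a "type" field.
    async for message in ws:
        event = json.loads(message)
        print(event["type"])  # for example, response.audio.delta, response.done
        if event["type"] == "response.done":
            break
```

In a real application, the send and receive sides would typically run as separate concurrent tasks, since server events (such as streamed audio deltas) arrive while the client is still sending input.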
Try the quickstart
Now that you have the prerequisites, you can follow the instructions in the Realtime API quickstart to get started with the Realtime API via WebSockets.
Related content
- Try the real-time audio quickstart
- See the Realtime API reference
- Learn more about Azure OpenAI quotas and limits