你当前正在访问 Microsoft Azure Global Edition 技术文档网站。 如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站,请访问 https://docs.azure.cn。
APPLIES TO:
Azure Database for PostgreSQL - Flexible Server
Mirroring in Fabric provides an easy experience to avoid complex ETL (Extract Transform Load) and integrate your existing Azure Database for PostgreSQL flexible server estate with the rest of your data in Microsoft Fabric. You can continuously replicate your existing Azure Database for PostgreSQL flexible server directly into Fabric OneLake. Inside Fabric, you can unlock powerful business intelligence, artificial intelligence, Data Engineering, Data Science, and data sharing scenarios.
Architecture
Fabric mirroring in Azure Database for PostgreSQL flexible server is built on concepts like logical replication and Change Data Capture (CDC) design pattern.
Once Fabric mirroring is established for a database in Azure Database for a PostgreSQL flexible server, a PostgreSQL background process creates an initial snapshot for selected tables to be mirrored, which is shipped to a Fabric OneLake landing zone in Parquet format. A Replicator process running in Fabric takes these initial snapshot files and creates Delta tables in the Mirrored database artifact.
Subsequent changes applied to selected tables are also captured in the source database and shipped to the OneLake landing zone in batches to be applied to the respective Delta tables in the Mirrored database artifact.
What is Change Data Capture (CDC)?
Change Data Capture (CDC) is a method that enables applications to detect and capture changes made to a database.
It doesn't rely on explicit SQL queries to track changes.
Instead, it involves a continuous stream of change events published by the database server.
Clients can subscribe to this stream to monitor changes, focusing on specific databases, individual tables, or even subsets of columns within a table.
For Fabric mirroring, the CDC pattern is implemented in a proprietary PostgreSQL extension called azure_cdc. The Azure Database for PostgreSQL flexible server control plane is installed and registered in source databases during the Fabric mirroring enablement workflow.
Azure Change Data Capture (CDC) extension
Azure CDC is an extension for PostgreSQL that enhances the capabilities of logical decoding.
It interprets and transforms Write-Ahead Log (WAL) data into an understandable logical format.
The extension converts database modifications into a sequence of logical operations like INSERT, UPDATE, and DELETE.
Azure CDC is a layer on top of PostgreSQL's built-in logical decoding plugin, pgoutput
.
Azure CDC exports table snapshots and modifications as Parquet files and copies them to a Fabric OneLake landing zone for subsequent processing.
Enable Fabric mirroring in the Azure portal
Fabric mirroring in the Azure portal for Azure Database for PostgreSQL flexible server allows you to replicate your PostgreSQL databases into Microsoft Fabric. This feature helps you integrate your data seamlessly with other services in Microsoft Fabric, enabling advanced analytics, business intelligence, and data science scenarios. By following a few simple steps in the Azure portal, you can configure the necessary prerequisites and start mirroring your databases to use the full potential of Microsoft Fabric.
Prerequisites
Several prerequisites must be configured before using Fabric mirroring in Azure Database for the PostgreSQL flexible server.
System-assigned Managed Identity (SAMI) must be enabled.\
- This is the identity used by the Azure CDC to authenticate communications with Fabric OneLake, copy initial snapshots, and change batches to the landing zone.
wal_level server parameter must be set to "logical".
- Enables logical replication for the source server.
The Azure CDC extension (azure_cdc) is preloaded on the source server and registered for selected databases to mirror (it requires restart).
max_worker_processes server parameter must be increased to accommodate more background processes for mirroring.
A new page is available in the Azure portal to automate prerequisite configuration on the source server.
Select Get Started to initiate the enablement workflow.
This page shows the current status of the required prerequisites. If System Assigned Managed Identity (SAMI) isn't enabled for this server, select the link to be redirected to the page where you can enable this feature.
Once done, you can select the databases to enable Fabric mirroring (up to 3 by default, but this can be increased by changing the max_mirrored_databases server parameter) and then select Prepare.
The workflow presents a Restart Server pop-up, and by selecting Restart, you can start the process, which automates all remaining configuration steps, and you can start creating your mirrored database from Fabric user interface
Server parameters
These server parameters directly affect Fabric mirroring for Azure Database for PostgreSQL.
Azure.fabric_mirror_enabled: The default is off. This parameter specifies the flag indicating whether mirroring is enabled on the server. It's set automatically at the end of the server enablement workflow, so you shouldn't change it manually.
max_replication_slots: Default 10. We consume one replication slot per mirrored database, but customers might consider increasing this if they create more mirrors or have other replication slots created for other purposes (logical replication).
max_wal_senders: The default is 10. As with the previous parameter, we use one
wal
sender process per mirror, which should be increased when mirroring more databases.max_worker_processes: The default is 8. After the initial snapshot, we use one process per mirrored database or where mirroring is enabled (but no mirrored artifact has been created in Fabric yet). You must increase this value if you have other extensions or workloads using more worker processes.
max_parallel_workers: The default is 8, which limits the number of workers that can run simultaneously. If you enable multiple mirroring sessions on the same server, you might consider increasing this parameter to allow more parallel operations (for example, increasing parallelism in initial snapshots).
azure_cdc.max_fabric_mirrors Default is 3. Customers can increase this value if they need to mirror more than three databases on this server. It's important to consider that every new mirrored database consumes server resources (five background processes using CPU and Memory resources for snapshot creation and change batching), so depending on how busy your server is, you should monitor resource utilization and scale up your compute size to the next size available if CPU and memory utilization are constantly above 80% or performance aren't what you expect.
azure_cdc.max_snapshot_workers: Default is 3. Maximum number of worker processes used during initial snapshot creation. Increase this to speed up initial snapshot creation when increasing the number of mirrored databases. However, you should consider all the other background processes running in the system before doing that.
azure_cdc.change_batch_buffer_size: Default is 16 MB. Maximum buffer size (in MB) for change batch. The table shows much data is buffered up to this before being written to the local disk. Depending on the data change frequency on your mirrored databases, you could tweak this value to reduce the change batch frequency or increase it if you want to prioritize overall throughput.
azure_cdc.change_batch_export_timeout: Default is 30. Maximum idle time (in seconds) between change batch messages. When exceeded, we mark the current batch as complete. Depending on the data change frequency on your mirrored databases, you could tweak this value to reduce the change batch frequency or increase it if you want to prioritize overall throughput.
azure_cdc.parquet_compression: Default is ZSTD. This parameter is for internal use only, so you shouldn't modify it.
azure_cdc.snapshot_buffer_size: Defaults to 1000. The maximum size (in MB) of the initial snapshot buffer. Per the table, much data is buffered up to this before being sent to Fabric. Remember that azure_cdc.snapshot_buffer_size*azure_cdc.max_snapshot_workers is the total memory buffer used during the initial snapshot.
azure_cdc.snapshot_export_timeout: Defaults to 180. Maximum time (in minutes) to export initial snapshot. If the maximum time is exceeded, then it restarts.
Monitor
Monitoring Fabric mirroring in Azure Database for PostgreSQL flexible server is essential to ensure that the mirroring process is running smoothly and efficiently. By monitoring the status of the mirrored databases, you can identify any potential issues and take corrective actions as needed.
You can use several User Defined Functions and tables to monitor important CDC metrics in the Azure Database for the PostgreSQL flexible server and troubleshoot the Mirroring process to Fabric.
Monitoring functions
The mirroring function for Fabric mirroring in Azure Database for PostgreSQL allows you to replicate your PostgreSQL databases into Microsoft Fabric seamlessly, enabling advanced analytics and data integration scenarios.
azure_cdc.list_tracked_publications(): for each publication in the source flexible server, returns a comma-separated string containing the following information - publicationName (text) - includeData (bool) - includeChanges (bool) - active (bool) - baseSnapshotDone (bool) - generationId (int)
azure_cdc.publication_status('pub_name'): for each publication in the source, the flexible server returns a comma-separated string with the following information
- <status, start_lsn, stop_lsn, flush_lsn>.
- Status consists of ["Slot name," "Origin name," "CDC data destination path," "Active," "Snapshot Done," "Progress percentage," "Generation ID," "Completed Batch ID," "Uploaded Batch ID," "CDC start time"]
azure_cdc.is_table_mirrorable('schema_name','table_name'): Given schema and table name, it returns if the table is mirrorable. For a table to be mirrorable, it needs to satisfy the following:
- The column names don't contain any of the following characters:
[ ;{}\n\t=()]
- The column types are one of the following:
bigint
bigserial
boolean
bytes
character
character varying
date
double precision
integer
numeric
real
serial
oid
money
smallint
smallserial
text
time without time zone
time with time zone
timestamp without time zone
timestamp with time zone
uuid
- The table isn't a view, materialized view, foreign table, toast table, or partitioned table
- The table has a primary key or a unique, non-null, and nonpartial index
- The column names don't contain any of the following characters:
Tracking tables
- azure_cdc.tracked_publications: one row for each existing Mirrored database in Fabric. Query this table to understand the status of each publication.
Column Name | Postgres Type | Explanation |
---|---|---|
publication_id | oid | Oid of the publication |
destination_path | text | Path to the landing zone in Fabric OneLake |
destination_format | azure_cdc.data_format | Format of the data in Azure CDC |
include_data | bool | Whether to include initial snapshot data in the publication |
include_changes | bool | Whether to include changes in the publication |
active | bool | Whether the publication is active |
snapshot_done | bool | Whether the snapshot is completed |
snapshot_progress | smallint | Progress of the snapshot |
snapshot_progress_percentage | text | Percentage progress of the snapshot |
generation_id | int | Generation identifier |
stream_start_lsn | pg_lsn | Log sequence number where the change stream started |
stream_start_time | timestamp | Timestamp when the change stream started |
stream_stop_lsn | pg_lsn | Log sequence number where the change stream stopped |
snapshot_size | bigint | Total size of the snapshot (in bytes) |
total_time | int | Total time(in seconds) taken for the publication |
- azure_cdc.tracked_batches: one row for each change batch captured and shipped to Fabric OneLake. Query this table to understand which batch has already been captured and uploaded to Fabric OneLake. With the
last_written_lsn
column, you can understand if a given transaction in your source database has already been shipped to Fabric.
Name | Postgres Type | Explanation |
---|---|---|
publication_id | oid | Oid of the publication |
completed_batch_id | bigint | Sequence number(starting from 1) of the batch. Unique per publication |
last_written_lsn | pg_lsn | LSN of the last write of this batch |
last_received_lsn | pg_lsn | Last LSN received |
server_lsn | pg_lsn | current server LSN (at the time when capturing of this batch finalized) |
is_batch_uploaded | bool | Whether the batch is uploaded |
is_batch_acknowledged | bool | Whether we acknowledged wal_sender for this batch data(last_written_lsn) |
batch_start_time | TIMESTAMPTZ | Timestamp of the batch start |
batch_completion_time | TIMESTAMPTZ | Timestamp of the batch completion |
batch_uploaded_time | TIMESTAMPTZ | Timestamp of the batch upload |
batch_acknowledged_time | TIMESTAMPTZ | Timestamp of the batch when LSN is ack`ed to the publisher |
batch_size | int | Size of the batch (in bytes) |