Edit

Share via


Open mirroring landing zone requirements and format

This article details the landing zone and table/column operation requirements for open mirroring in Microsoft Fabric.

Once you have created your open mirrored database via the Fabric portal or public API in your Fabric workspace, you get a landing zone URL in OneLake in the Home page of your mirrored database item. This landing zone is where your application to create a metadata file and land data in Parquet or delimited text format, including CSV. Files can be uncompressed or compressed with Snappy, GZIP, or ZSTD. For more information, see supported data files and format.

Screenshot from the Fabric portal showing the Landing zone URL ___location in the Home page of the mirrored database item.

Landing zone

For every mirrored database, there is a unique storage ___location in OneLake for metadata and delta tables. Open mirroring provides a landing zone folder for application to create a metadata file and push data into OneLake. Mirroring monitors these files in the landing zone and read the folder for new tables and data added.

For example, if you have tables (Table A, Table B, Table C) to be created in the landing zone, create folders like the following URLs:

  • https://onelake.dfs.fabric.microsoft.com/<workspace id>/<mirrored database id>/Files/LandingZone/TableA
  • https://onelake.dfs.fabric.microsoft.com/<workspace id>/<mirrored database id>/Files/LandingZone/TableB
  • https://onelake.dfs.fabric.microsoft.com/<workspace id>/<mirrored database id>/Files/LandingZone/TableC

Metadata file in the landing zone

Every table folder must contain a _metadata.json file.

This table metadata file contains a JSON record to currently specify only the unique key columns as keyColumns.

For example, to declare columns C1 and C2 as a compound unique key for the table:

{
   "keyColumns" : ["C1", "C2"]
}

If keyColumns or _metadata.json is not specified, then update/deletes are not possible. This file can be added anytime, but once added keyColumns can't be changed.

Events file in the landing zone

If you are a partner implementing an open mirroring solution or a customer who'd like to provide additional details to us about the type of source you're mirroring into OneLake, we've added a new _partnerEvents.json file. This is not required but strongly recommended.

Example:

{
  "partnerName": "testPartner",
  "sourceInfo": {
    "sourceType": "SQL",
    "sourceVersion": "2019",
    "additionalInformation": {
      "testKey": "testValue"
    }
  }
}

Requirements of the _partnerEvents.json file:

  • The _partnerEvents.json file should be placed at the mirrored database level in the landing zone, not per table.
  • The sourceType can be any descriptive string representing the source. There are no constraints on this value, for example: "SQL", "Oracle", "Salesforce", etc.
  • The partnerName can be set to any name of your choosing and can be representative of your organization's name. Keep the name consistent across all mirror databases.

Data file and format in the landing zone

Open mirroring supports data intake in Parquet or delimited text formats. Files can be uncompressed or compressed with Snappy, GZIP, or ZSTD.

Parquet requirements

Delimited text requirements

  1. If the data is in a delimited text format, the file must have header row in the first row.

  2. For delimited text, provide additional information in your _metadata.json file. The FileExtension property is required. Delimited text files have the following properties and defaults:

    Property Description Notes
    FirstRowAsHeader True/false for first row header. Required to be true for delimited text files.
    RowSeparator Character used to separate rows. Default is \r\n. Also supports \n and \r.
    ColumnSeparator Character used to separate columns. Default is ,. Also supports ;, |, and \t.
    QuoteCharacter Character used to quote values containing delimiters. Default is ". Can also be ' or empty string.
    EscapeCharacter Used to escape quotes inside quoted values. Default is \. Can also be /, ", or empty.
    NullValue String representation of null values. Can be "", "N/A", "null", etc.
    Encoding Character encoding of the file. Default is UTF-8. Supports a wide range of encodings including ascii, utf-16, windows-1252, etc.
    SchemaDefinition Defines column names, types, and nullability. Schema evolution is not supported.
    FileFormat Format of the data file. Defaults to CSV if not specified. Must be "DelimitedText" for formats other than CSV.
    FileExtension Specifies file extension like .tsv, .psv. Required when using DelimitedText.

    For example, the _metadata.json file for a .tsv data file with four columns:

    {
    "KeyColumns": [ "id" ],
    "ConditionalUpdateColumn": "seqNum",
    "SchemaDefinition": {
                 "Columns": [
                             {
                             "Name": "id",
                             "DataType": "Int32"
                             },
                             {
                             "Name": "name",
                             "DataType": "String",
                             "IsNullable": true
                             },
                             {
                             "Name": "age",
                             "DataType": "Int32",
                             "IsNullable": true
                             },
                             {
                             "Name": "seqNum",
                             "DataType": "Int64",
                             "IsNullable": false
                             }
                            ]
                 },
    "FileFormat": "DelimitedText",
    "FileExtension": "tsv",
    "FileFormatTypeProperties": {
                               "FirstRowAsHeader": true,
                               "RowSeparator": "\r\n",
                               "ColumnSeparator": ",",
                               "QuoteCharacter": "'",
                               "EscapeCharacter": "\",
                               "NullValue": "N/A",
                               "Encoding": "UTF-8"
                            }
    }
    

Format requirements

All files written to the landing zone have the following format:

<rowMarker><DataColumns>

  • rowMarker: column name is __rowMarker__ (including two underscores before and after rowMarker). __rowMarker__ values and behaviors:

    __rowMarker__ (Scenario) If row doesn't exist with same key column(s) in the destination If row exists with same key column(s) in the destination
    0 (Insert) Insert the row to destination Insert the row to destination, no validation for dup key column check.
    1 (Update) Insert the row to destination, no validation/exception to check existence of row with same key column. Update the row with same key column.
    2 (Delete) No data change, no validation/exception to check existence of row with same key column. Delete the row with same key column.
    4 (Upsert) Insert the row to destination, no validation/exception to check existence of row with same key column. Update the row with same key column.
  • Row order: All the logs in the file should be in natural order as applied in transaction. This is important for the same row being updated multiple times. Open mirroring applies the changes using the order in the files.

  • File order: Files should be added in monotonically increasing numbers.

  • File name: File name is 20 digits, like 00000000000000000001.parquet for the first file, and 00000000000000000002.parquet for the second. File names should be in continuous numbers. Files will be deleted by the mirroring service automatically, but the last file will be left so that the publisher system can reference it to add the next file in sequence.

Important

The __rowMarker__ column needs to be the final column in the list

Initial load

For the initial load of data into an open mirrored database, __rowMarker__ in the initial data file is optional and not recommended. Mirroring treats the entire file as an INSERT when __rowMarker__ doesn't exist.

For better performance and accurate metrics, __rowMarker__ is a mandatory field only for incremental changes to apply update/delete/upsert operation.

Incremental changes

Open mirroring reads incremental changes in order and applies them to the target Delta table. Order is implicit in the change log and in the order of the files.

Data changes are considered as incremental changes once the __rowMarker__ column is found from any row/file.

Updated rows must contain the full row data, with all columns.

Here is some sample parquet data of the row history to change the EmployeeLocation for EmployeeID E0001 from Redmond to Bellevue. In this scenario, the EmployeeID column has been marked as a key column in the metadata file in the landing zone.

EmployeeID,EmployeeLocation,__rowMarker__
E0001,Redmond,0
E0002,Redmond,0
E0003,Redmond,0
E0001,Bellevue,1

If key columns are updated, then it should be presented by a DELETE on previous key columns and an INSERT rows with new key and data. For example, the row history to change the __rowMarker__ unique identifier for EmployeeID E0001 to E0002. You don't need to provide all column data for a DELETE row, only the key columns.

EmployeeID,EmployeeLocation,__rowMarker__
E0001,Bellevue,0
E0001,NULL,2
E0002,Bellevue,0

Table operations

Open mirroring supports table operations such as add, drop, and rename tables.

Add table

Open mirroring picks up any table added to landing zone by the application. Open mirroring scans for new tables in every iteration.

Drop table

Open mirroring keeps track of the folder name. If a table folder is deleted, open mirroring drops the table in the mirrored database.

If a folder is recreated, open mirroring drops the table and recreates it with the new data in the folder, accomplished by tracking the ETag for the folder.

When attempting to drop a table, you can try deleting the folder, but there is a chance that open mirroring is still using the data from the folder, causing a delete failure for publisher.

Rename table

To rename a table, drop and recreate the folder with initial and incremental data. Data will need to be repopulated to the renamed table.

Schema

A table path can be specified within a schema folder. A schema landing zone should have a <schemaname>.schema folder name. There can be multiple schemas and there can be multiple tables in a schema.

For example, if you have schemas (Schema1, Schema2) and tables (Table A, Table B, Table C) to be created in the landing zone, create folders like the following paths in OneLake:

  • https://onelake.dfs.fabric.microsoft.com/<workspace id>/<mirrored database id>/Files/LandingZone/Schema1.schema/TableA
  • https://onelake.dfs.fabric.microsoft.com/<workspace id>/<mirrored database id>/Files/LandingZone/Schema1.schema/TableB
  • https://onelake.dfs.fabric.microsoft.com/<workspace id>/<mirrored database id>/Files/LandingZone/Schema2.schema/TableC

Table columns and column operations

Column types

  • Simple parquet types are supported in the landing zone.
  • Complex types should be written as a JSON string.
  • Binary complex types like geography, images, etc. can be stored as binary type in the landing zone.

Add column

If new columns are added to the parquet or CSV files, open mirroring adds the columns to the delta tables.

Delete column

If a column is dropped from the new log files, open mirroring stores NULL for those columns in new rows, and old rows have the columns present in the data. To delete the column, drop the table and create the table folder in the landing zone again, which will result into recreation of the Delta table with new schema and data.

Open mirroring always unions all the columns from previous version of added data. To remove a column, recreate the table/folder.

Change column type

To change a column type, drop and recreate the folder with initial and incremental data with the new column type. Providing a new column type without recreating the table results an error, and replication for that table will stop. Once the table folder is recreated, replication resumes with new data and schema.

Rename column

To rename a column, delete the table folder and recreate the folder with all the data and with the new column name.

Cleanup process

A cleanup process for opening mirroring moves all processed files to a separate folder called _ProcessedFiles or _FilesReadyToDelete. After seven days, the files are removed from this folder.

Next step