Accessing Amazon S3 tables with the Amazon S3 Tables Catalog for Apache Iceberg
You can access S3 tables from open source query engines like Apache Spark by using the Amazon S3 Tables Catalog for Apache Iceberg client catalog. Amazon S3 Tables Catalog for Apache Iceberg is an open source library hosted by AWS Labs. It works by translating Apache Iceberg operations in your query engines (such as table discovery, metadata updates, and adding or removing tables) into S3 Tables API operations.
Amazon S3 Tables Catalog for Apache Iceberg is distributed as a Maven JAR called s3-tables-catalog-for-iceberg.jar. You can build the client catalog JAR from the AWS Labs GitHub repository.
Using the Amazon S3 Tables Catalog for Apache Iceberg with Apache Spark
You can use the Amazon S3 Tables Catalog for Apache Iceberg client catalog to connect to tables from open-source applications when you initialize a Spark session. In your session configuration, you specify the Iceberg and Amazon S3 dependencies and create a custom catalog that uses your table bucket as the metadata warehouse.
Prerequisites
An IAM identity with access to your table bucket and S3 Tables actions. For more information, see Access management for S3 Tables.
To initialize a Spark session using the Amazon S3 Tables Catalog for Apache Iceberg
- Initialize Spark using the following command. To use the command, replace the Amazon S3 Tables Catalog for Apache Iceberg version number with the latest version from the AWS Labs GitHub repository, and replace the table bucket ARN with your own table bucket ARN.

  spark-shell \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.4 \
    --conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
    --conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
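If you start Spark programmatically (for example, from a standalone Scala application) rather than through spark-shell, the same catalog settings can be supplied through the SparkSession builder. The following is a minimal sketch, not an official example; it assumes the Iceberg runtime and client catalog JARs are already on the application classpath, and it reuses the placeholder Region, account ID, and table bucket name from the command above.

  import org.apache.spark.sql.SparkSession

  // Sketch only: replace the Region, account ID, and table bucket name with your own values.
  val spark = SparkSession.builder()
    .appName("S3TablesExample") // hypothetical application name
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl", "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tablesbucket.warehouse", "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()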
Querying S3 tables with Spark SQL
Using Spark, you can run DQL, DML, and DDL operations on S3 tables. When you query tables, you use the fully qualified table name, including the session catalog name, following this pattern:

  CatalogName.NamespaceName.TableName
The following example queries show some ways you can interact with S3 tables. To use these example queries in your query engine, replace the user input placeholder values with your own.
To query tables with Spark
- Create a namespace

  spark.sql("CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.my_namespace")
- Create a table

  spark.sql("CREATE TABLE IF NOT EXISTS s3tablesbucket.my_namespace.`my_table` (id INT, name STRING, value INT) USING iceberg")
- Query a table

  spark.sql("SELECT * FROM s3tablesbucket.my_namespace.`my_table`").show()
- Insert data into a table

  spark.sql(
    """INSERT INTO s3tablesbucket.my_namespace.my_table
       VALUES (1, 'ABC', 100), (2, 'XYZ', 200)""")
- Load an existing data file into a table

  Read the data into Spark.

  val data_file_location = "Path such as S3 URI to data file"
  val data_file = spark.read.parquet(data_file_location)

  Write the data into an Iceberg table.

  data_file.writeTo("s3tablesbucket.my_namespace.my_table")
    .using("Iceberg")
    .tableProperty("format-version", "2")
    .createOrReplace()