Parquet

    This topic explains how to use the COPY TO statement to export data from a database to Amazon S3 or Google Cloud Storage (GCS) in Parquet format.

    Writing results in Parquet format is highly efficient due to its columnar storage, which enables better compression and faster query performance.

    Parquet files take less storage space, making them cost-effective for large-scale data storage and analytics.

    The format is optimized for read-heavy workloads, making it ideal for big data processing, data warehousing, and machine learning pipelines.

    To read or write data to or from external cloud storage, you must have the required permissions. For more information, see Cloud Read/Write Permissions.

    Supported Methods

    You can copy data using one of two methods:

    1. User-Defined Schema: You explicitly provide the schema in the COPY TO statement when you know the structure of the result or collection.

    2. Schema Inference: The system infers the schema from the data.

    Syntax

    CopyTo EBNF

    CopyTo               ::= "COPY" SourceDefinition
                             "TO" ContainerIdentifier
                             "AT" LinkQualifiedName
                             OutputClause WithClause

    SourceDefinition     ::= ( QualifiedName | "(" Query ")" ) ( "AS"? AliasIdentifier )?

    OutputClause         ::= "PATH" "(" OutputPathExpr ("," OutputPathExpr)* ")" OverClause? SchemaClause?

    OutputPathExpr       ::= Expr

    OverClause           ::= "OVER" "(" PartitionClause? OrderClause? ")"

    PartitionClause      ::= "PARTITION" "BY" PartitionExpr ("," PartitionExpr)*

    PartitionExpr        ::= Expression ("AS" AliasIdentifier)?

    OrderClause          ::= "ORDER" "BY" OrderExpr ("," OrderExpr)*

    OrderExpr            ::= Expression ("ASC" | "DESC")? ("NULLS" ("FIRST" | "LAST"))?

    SchemaClause         ::= "TYPE" "(" TypeExpression ")"

    TypeExpression       ::= FlatType | ObjectTypeDefinition | ArrayTypeDefinition

    ObjectTypeDefinition ::= "{" ObjectField ("," ObjectField)* "}"

    ArrayTypeDefinition  ::= "[" TypeExpression "]"

    ObjectField          ::= Identifier ":" TypeExpression

    WithClause           ::= "WITH" "{" NameValuePair ("," NameValuePair)* "}"

    NameValuePair        ::= NameStringLiteral ":" ValueLiteral

    Examples

    The following examples copy the Customer Data from Example Data in Parquet format with gzip compression.

    Example 1: Syntax for User-Defined Schema

      COPY (
        SELECT * FROM customers
      ) AS t
      TO myS3Bucket AT myS3Link
        PATH("commerce/Customers/zip=" || zip || "/")
        OVER(PARTITION BY t.address.zipcode AS zip)
      TYPE (
        {
          orderno : int,
          custid : string,
          order_date : date,
          ship_date : date,
          items : [ { itemno : int, qty : int, price : double } ]
        }
      )
      WITH {
        "format": "parquet",
        "compression": "gzip"
      }

    Example 2: Syntax for Schema Inference

      COPY (
        SELECT * FROM customers
      ) AS t
      TO myS3Bucket AT myS3Link
        PATH("commerce/Customers/zip=" || zip || "/")
        OVER(PARTITION BY t.address.zipcode AS zip)
      WITH {
        "format": "parquet",
        "compression": "gzip"
      }

    Arguments

    SourceDefinition

    As the source, you specify either the fully qualified name of a collection or a query.

    • If you specify a collection name, the whole collection, view, or synonym is the source of the data to copy, as shown in the sketch after this list.

    • If you specify a query, the result of that query is the source of the data.
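
    For example, here is a minimal sketch that copies an entire collection by name, assuming a collection named customers and the myS3Bucket and myS3Link names used in the examples above:

      -- Copy the whole collection; no query is needed
      COPY customers AS c
      TO myS3Bucket AT myS3Link
        PATH("commerce/Customers")
      WITH {
        "format": "parquet"
      }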

    TO

    The TO clause identifies the bucket name on the external data source, an Amazon S3 bucket in this case.

    AT

    The AT clause specifies the name of the link that contains credentials for the S3 bucket. The specified link must have a type of S3.

    OutputClause

    The OutputClause defines the destination path for the output objects. You supply one or more OutputPathExpr expressions to identify the path prefixes. You can include aliases. If you supply more than one expression, Capella Columnar concatenates the values of all OutputPathExpr and supplies / characters as the path separators. As a result, you do not need to include slash characters between OutputPathExpr expressions.

    The target directory that you specify in the destination path must be empty. The operation fails if the target directory is not empty.
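
    For example, the following sketch supplies two literal OutputPathExpr expressions, assuming the same myS3Bucket and myS3Link as the examples above. Because Capella Columnar adds the / separators, the objects are written under the single prefix commerce/Customers/:

      -- PATH("commerce", "Customers") is concatenated to commerce/Customers/
      COPY (
        SELECT * FROM customers
      ) AS t
      TO myS3Bucket AT myS3Link
        PATH("commerce", "Customers")
      WITH {
        "format": "parquet"
      }
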
    OverClause

    You supply an optional OverClause to specify output partitioning with a PartitionClause and ordering with an OrderClause. This is similar to the OVER clause used with window functions.

    • If you specify a PartitionClause, Capella Columnar evaluates the OutputPathExpr once per logical data partition and refers to aliases if defined by an AS sub-clause.

    • If you do not specify a PartitionClause, Capella Columnar evaluates the OutputPathExpr once for the whole COPY TO output dataset. That is, all of the files end up in the same directory.

    You use the OrderClause to define the order of output objects, either within each partition or for the whole dataset, as in the sketch below.
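
    For example, the following sketch partitions the output by zipcode and orders the objects within each partition. The name field used for ordering is hypothetical and only illustrates the OrderClause:

      COPY (
        SELECT * FROM customers
      ) AS t
      TO myS3Bucket AT myS3Link
        PATH("commerce/Customers/zip=" || zip || "/")
        OVER(
          PARTITION BY t.address.zipcode AS zip   -- one output directory per zipcode
          ORDER BY t.name ASC                     -- order objects within each partition
        )
      WITH {
        "format": "parquet"
      }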

    SchemaClause

    The SchemaClause defines the schema for the output Parquet files. You specify the schema using a JSON-like format: { field-name1: type1, field-name2: type2, … }. The types can be flat types, array types, or object types. Supported flat types are listed in the Supported Data Types section.

    WITH

    You use the WITH clause to specify the following additional parameters.

    Name | Description | Schema

    format (optional)

    Allowed values: "parquet".
    Default: JSON.

    String enum

    max-objects-per-file

    Maximum number of objects per file.
    Default: 10000.

    int

    max-schemas (optional)

    Maximum number of heterogeneous schemas allowed. This value cannot be greater than 10. It applies only in schemaless mode, that is, when you do not provide a TYPE clause.
    Default: 5.

    int

    compression (optional)

    Allowed values: "none", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd".
    Default: none.

    String enum

    row-group-size (optional)

    Row group size of the Parquet file, as a byte value.
    Default: 10 MB.

    String

    page-size (optional)

    Page size of the Parquet file, as a byte value.
    Default: 8 KB.

    String

    version (optional)

    Parquet writer version.
    Allowed values: 1, 2.
    Default: 1.

    String enum
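
    As a sketch, the following statement combines several of these parameters with schema inference; the value formats shown (for example "16 MB" for row-group-size and "2" for version) are assumptions based on the defaults and allowed values listed above:

      COPY (
        SELECT * FROM customers
      ) AS t
      TO myS3Bucket AT myS3Link
        PATH("commerce/Customers")
      WITH {
        "format": "parquet",
        "compression": "zstd",
        "max-objects-per-file": 5000,
        "max-schemas": 8,
        "row-group-size": "16 MB",
        "page-size": "8 KB",
        "version": "2"
      }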

    Supported Data Types

    The supported Analytics types and the corresponding Parquet types are listed in the following table:

    Analytics Type            Parquet Type

    boolean                   BOOLEAN
    string                    BINARY(STRING)
    tinyint, smallint, int    INT32
    bigint                    INT64
    float                     FLOAT
    double                    DOUBLE
    date                      INT32(DATE)
    time                      INT32(TIME_MILLIS)
    datetime                  INT64(TIMESTAMP_MILLIS)
    UUID                      FIXED_LEN_BYTE_ARRAY(UUID)
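
    For instance, here is a hedged sketch of a user-defined schema that exercises several of these flat types, with the Parquet type each field maps to noted in a comment; the field names are hypothetical:

      TYPE (
        {
          id : UUID,              -- FIXED_LEN_BYTE_ARRAY(UUID)
          name : string,          -- BINARY(STRING)
          age : int,              -- INT32
          balance : double,       -- DOUBLE
          signup_date : date,     -- INT32(DATE)
          last_login : datetime   -- INT64(TIMESTAMP_MILLIS)
        }
      )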