'Add partition projection to AWS Athena table using Cloudformation

I have an Athena table defined with a template specified like so in cloudformation:

Cloudformation Create

EventsTable:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: !Ref DatabaseName
    TableInput:
      Description: "My Table"
      Name: !Ref TableName
      TableType: EXTERNAL_TABLE
      StorageDescriptor:
        Compressed: True
        Columns:
          - Name: account_id
            Type: string
            Comment: "Account Id of the account making the request"
            ...
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        SerdeInfo:
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        Location: !Sub "s3://${EventsBucketName}/events/"


This works well and deploys. I also found out I can have partition projections created as per this doc and this doc

And can make that work with a direct table creation, roughly:

SQL Create

CREATE EXTERNAL TABLE `performance_data.events`
(
  `account_id`  string,
...
)
   PARTITIONED BY (
     `day` string)
    ROW FORMAT SERDE
        'org.openx.data.jsonserde.JsonSerDe'
    STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT
          'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
        's3://my-bucket/events/'
    TBLPROPERTIES (
        'has_encrypted_data' = 'false',
        'projection.enabled' = 'true',
        'projection.day.type' = 'date',
        'projection.day.format' = 'yyyy/MM/dd',
        'projection.day.range' = '2020/01/01,NOW',
        'projection.day.interval' = '1',
        'projection.day.interval.unit' = 'DAYS',
        'storage.location.template' = 's3://my-bucket/events/${day}/'
)

But I can't find the docs to convert into the cloud formation structure. So my question is, how can I achieve the partition projection shown in the SQL code in cloudformation?



Solution 1:[1]

I now have a working solution. The missing piece was really a missing parameter, here is the solution:


MyTableResource:
  Type: AWS::Glue::Table
  Properties:
    CatalogId: MyAccountId
    DatabaseName: MyDatabase
    TableInput:
      Description: "My Table"
      Name: mytable
      TableType: EXTERNAL_TABLE
      PartitionKeys:
        - Name: day
          Type: string
          Comment: Day partition
      Parameters:
        "projection.enabled": "true"
        "projection.day.type": "date"
        "projection.day.format": "yyyy/MM/dd"
        "projection.day.range": "2020/01/01,NOW"
        "projection.day.interval": "1"
        "projection.day.interval.unit": "DAYS"
        "storage.location.template":  "s3://my-bucket/events/${day}/"


      StorageDescriptor:
        Compressed: True
        Columns:
          ...

        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        SerdeInfo:
          Parameters:
            serialization.format: '1'
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        OutputFormat: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
        Location: "s3://my-bucket/events/"

The key addition was:

serialization.format: '1'

This now completely works and one can do a query that using the partition as:


select * from mytable where day > '2022/05/03'

Solution 2:[2]

Referring to the CloudFormation reference for the Glue Table TableInput, you can specify PartitionKeys and Parameters. This is the equivalent of PARTITIONED BY and TBLPROPERTIES in the query.

EDIT

As an example, you can refer to this article. The sample below shows how to define the PartitionKeys and how to define a JSON for the Parameters. In your case, you just have to add the projection keys (such as projection.enabled) and values (true).

# Create an Amazon Glue table
  CFNTableFlights:
    # Creating the table waits for the database to be created
    DependsOn: CFNDatabaseFlights
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref CFNDatabaseName
      TableInput:
        Name: !Ref CFNTableName1
        Description: Define the first few columns of the flights table
        TableType: EXTERNAL_TABLE
        Parameters: {
    "classification": "csv"
  }
#       ViewExpandedText: String
        PartitionKeys:
        # Data is partitioned by month
        - Name: mon
          Type: bigint
        StorageDescriptor:
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Columns:
          - Name: year
            Type: bigint
          - Name: quarter
            Type: bigint
          - Name: month
            Type: bigint
          - Name: day_of_month
            Type: bigint            
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          Location: s3://crawler-public-us-east-1/flight/2016/csv/
          SerdeInfo:
            Parameters:
              field.delim: ","
            SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 silverdagger
Solution 2