Copy and merge files to another S3 bucket
I have a source bucket where small 5KB JSON files will be inserted every second. I want to use AWS Athena to query the files via an AWS Glue data source and crawler. For better query performance, AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 events to put a message in AWS SQS for each file created, then a Lambda will be invoked with a batch of x SQS messages, read the data in those files, combine them, and save the result to the destination bucket. bucket2 will then be the source for the AWS Glue crawler.
Will this be the best approach or am I missing something?
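For reference, a rough sketch of the merge Lambda I have in mind (the bucket names and output key layout are placeholders, and the batch size would come from the SQS trigger configuration):

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "source-bucket"  # placeholder
DEST_BUCKET = "bucket2"          # placeholder

def handler(event, context):
    """Invoked with a batch of SQS messages, each wrapping an S3 ObjectCreated event."""
    merged = []
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        for s3_record in s3_event.get("Records", []):
            key = s3_record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)
            merged.append(json.loads(obj["Body"].read()))

    if merged:
        # Write one JSON object per line so Athena/Glue can read the merged file.
        body = "\n".join(json.dumps(doc) for doc in merged)
        s3.put_object(
            Bucket=DEST_BUCKET,
            Key=f"merged/{uuid.uuid4()}.json",
            Body=body.encode("utf-8"),
        )
```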
Solution 1:[1]
Instead of receiving a 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup:
- When a file is uploaded to S3, trigger an AWS Lambda function
- The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
- Kinesis Firehose then batches the data by size or time
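A minimal sketch of that forwarding Lambda, assuming a delivery stream already exists (the stream name below is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

DELIVERY_STREAM = "json-compactor"  # placeholder name

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; forwards the object body to Firehose."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Firehose buffers these records and writes fewer, larger objects to S3.
        firehose.put_record(
            DeliveryStreamName=DELIVERY_STREAM,
            Record={"Data": body + b"\n"},
        )
```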
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output them into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second, so it is difficult to query the incoming files in batches (so you know which files have been loaded and which ones have not). A kludge could be a script that does the following:
- Create an external table in Athena that points to a batching directory (eg `batch/`)
- Create an external table in Athena that points to the final data (eg `final/`)
- Have incoming files come into `incoming/`
- At regular intervals, trigger a Lambda function that will list the objects in `incoming/`, copy them to `batch/` and delete those source objects from `incoming/` (any objects that arrive during this copy process will be left for the next batch)
- In Athena, run `INSERT INTO final SELECT * FROM batch`
- Delete the contents of the `batch/` directory
This will append the data into the `final` table in Athena, in a format that is good for querying.
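A rough sketch of that interval Lambda, assuming all three prefixes live in the same bucket and that the `final` and `batch` tables already exist; every name below is a placeholder:

```python
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

BUCKET = "bucket2"                              # placeholder
ATHENA_OUTPUT = "s3://bucket2/athena-results/"  # placeholder query result location

def handler(event, context):
    # Move whatever is currently in incoming/ to batch/; later arrivals wait for the next run.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="incoming/"):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    for key in keys:
        dest_key = "batch/" + key[len("incoming/"):]
        s3.copy_object(Bucket=BUCKET, Key=dest_key,
                       CopySource={"Bucket": BUCKET, "Key": key})
        s3.delete_object(Bucket=BUCKET, Key=key)

    if keys:
        # Append the batch into the final table; once this query finishes,
        # the batch/ prefix can be emptied.
        athena.start_query_execution(
            QueryString="INSERT INTO final SELECT * FROM batch",
            ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
        )
```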
However, the Kinesis Firehose option is simpler, even if you need to trigger a Lambda function to send the files to Firehose.
Solution 2:[2]
You can probably achieve that using Glue itself. Have a look here: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
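The linked sample walks through joining and relationalizing data; for the compaction described in the question, a Glue job along these lines (paths and partition count are placeholders, not taken from the sample) could rewrite the small JSON objects as a few larger Parquet files:

```python
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the small JSON objects straight from the source bucket (path is a placeholder).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://source-bucket/"]},
    format="json",
)

# Coalesce to a handful of partitions so the output is a few large files.
compacted = DynamicFrame.fromDF(source.toDF().coalesce(4), glue_context, "compacted")

# Write Snappy-compressed Parquet (the Glue/Parquet default) to the destination bucket.
glue_context.write_dynamic_frame.from_options(
    frame=compacted,
    connection_type="s3",
    connection_options={"path": "s3://bucket2/compacted/"},
    format="parquet",
)

job.commit()
```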
Solution 3:[3]
This is what I think will be simpler:
- Have an input folder `input/` where the 5KB/1KB files land, and a `data/` folder that holds merged JSON files with a maximum size of 200MB.
- Have a Lambda (in, say, Go or Java) that runs every minute, reads a set of files from `input/`, and appends them to the last file in the `data/` folder.
- The Lambda (with max concurrency of 1) copies a set of 5KB files from `input/` and the current `X`MB file from `data/` into its `/tmp` folder, merges them, uploads the merged file back to `data/`, and deletes the processed files from `input/`.
- Whenever the file size crosses 200MB, start a new file in the `data/` folder.
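A minimal sketch of that compaction Lambda (shown in Python rather than Go/Java; the bucket name and key scheme are placeholders, the 200MB threshold comes from the steps above):

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "bucket2"            # placeholder
MAX_SIZE = 200 * 1024 * 1024  # 200MB threshold from the steps above

def handler(event, context):
    """Runs on a schedule with max concurrency 1: folds input/ files into the latest data/ file."""
    incoming = s3.list_objects_v2(Bucket=BUCKET, Prefix="input/").get("Contents", [])
    if not incoming:
        return

    # Find the newest data/ file and keep appending to it until it crosses 200MB.
    data_objs = s3.list_objects_v2(Bucket=BUCKET, Prefix="data/").get("Contents", [])
    data_objs.sort(key=lambda o: o["LastModified"])
    if data_objs and data_objs[-1]["Size"] < MAX_SIZE:
        target_key = data_objs[-1]["Key"]
        merged = s3.get_object(Bucket=BUCKET, Key=target_key)["Body"].read()
    else:
        # The current file crossed 200MB (or none exists yet), so start a new one.
        target_key = f"data/part-{context.aws_request_id}.json"
        merged = b""

    # Append each small file as newline-delimited JSON.
    for obj in incoming:
        merged += s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read() + b"\n"

    s3.put_object(Bucket=BUCKET, Key=target_key, Body=merged)
    for obj in incoming:
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```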
The advantage here is that at any instant, if somebody wants the data, it is the union of the `input/` and `data/` folders. In other words, with a few tweaks here and there you can expose a view on top of the `input/` and `data/` folders that presents a de-duplicated snapshot of the final data.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | John Rotenstein |
| Solution 2 | Gaurav Lanjekar |
| Solution 3 | chen |