exclusions doesn't work in AWS Glue ETL job s3 connection
According to the AWS Glue documentation, we can use exclusions to exclude files when the connection type is s3:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
My S3 bucket looks like the following, and I want to exclude the test1 folder.
/mykkkkkk-test
    test1/
        testfolder/
            11.json
            22.json
    test2/
        1.json
    test3/
        2.json
    test4/
        3.json
    test5/
        4.json
I use the following code to exclude the test1 folder, but it still ETLs the files under test1; the exclusion doesn't work:
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://mykkkkkk-test/"],
     'exclusions': "[\"test1/**\"]",
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format="json",
    transformation_ctx="datasource0")
Does exclusions really work in an ETL PySpark script? I also tried the following, but none of them work:
'exclusions': "[\"test1/**\"]",
'exclusions': ["test1/**"],
'exclusions': "[\"test1\"]",
Solution 1:
Try using the full path for exclusion.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        "paths": [
            's3://bucket/sample_data/'
        ],
        "recurse": True,
        "exclusions": "[\"s3://bucket/sample_data/temp/**\"]"
    },
    "json",
    transformation_ctx="datasource0")
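Applied to the bucket from the question, that means giving the exclusion pattern as a full s3:// URI rather than a relative path (a sketch combining the question's code with the answer above, not verified against the original poster's job):

datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://mykkkkkk-test/"],
     # Full-path glob: a relative pattern like "test1/**" may not match,
     # so spell out the bucket in the exclusion.
     'exclusions': "[\"s3://mykkkkkk-test/test1/**\"]",
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format="json",
    transformation_ctx="datasource0")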
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow