exclusions doesn't work in AWS Glue ETL job s3 connection
According to the AWS Glue documentation, we can use exclusions to exclude files when the connection type is s3:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
My S3 bucket looks like the following, and I want to exclude the test1 folder.
/mykkkkkk-test
    test1/
        testfolder/
            11.json
            22.json
    test2/
        1.json
    test3/
        2.json
    test4/
        3.json
    test5/
        4.json
I use the following code to exclude the test1 folder, but it still ETLs the files under test1; the exclusion doesn't work:
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://mykkkkkk-test/"],
     'exclusions': "[\"test1/**\"]",
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format="json",
    transformation_ctx="datasource0")
Does exclusions really work in an ETL PySpark script? I also tried the following, but none of them work:
'exclusions': "[\"test1/**\"]",
'exclusions': ["test1/**"],
'exclusions': "[\"test1\"]",
Solution 1:
Try using the full path for exclusion.
datasource0 = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        "paths": [
            's3://bucket/sample_data/'
        ],
        "recurse": True,
        "exclusions": "[\"s3://bucket/sample_data/temp/**\"]"
    },
    "json",
    transformation_ctx="datasource0")
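Applied to the bucket from the question, that means giving the exclusion pattern as a full s3:// URI rather than a relative path (a sketch combining the question's code with the answer above, not verified against the original poster's job):

datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://mykkkkkk-test/"],
     # Full-path glob: a relative pattern like "test1/**" may not match,
     # so spell out the bucket in the exclusion.
     'exclusions': "[\"s3://mykkkkkk-test/test1/**\"]",
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format="json",
    transformation_ctx="datasource0")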
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow