Can you get the filename from input_file_name() in AWS S3 when using gzipped files?

I've been searching for an answer to this for quite a while now but can't seem to figure it out. I've read Why is input_file_name() empty for S3 catalog sources in pyspark? and tried everything in that question, but none of it worked. I'm trying to get the filename of each record in the source S3 bucket, but a blank value keeps getting returned. I suspect it may be because the files are gzipped, as it worked perfectly before they were compressed, but I can't find anywhere saying this should be an issue. Does anyone know if this is the problem, or if it is something else to do with my code?

Thank you!

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

# DATABASE and TABLE are defined elsewhere as the Glue catalog
# database name and table name for the bronze data

def main():

    glue_context = GlueContext(SparkContext.getOrCreate())

    # create a source dynamic frame for the bronze table
    dyf_bronze_table = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE
        , table_name=TABLE
        , groupFiles='none'
    )

    # Add the file location to join the postgres database on
    bronze_df = dyf_bronze_table.toDF()
    bronze_df = bronze_df.withColumn("s3_location", input_file_name())
    bronze_df.show()


Solution 1:[1]

The problem was in my Terraform file. I had set both

compressionType = "gzip"

and

format = gzip

Once I removed these, the filename was populated.
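If you want to double-check what the catalog table ended up with, one way is to read it back through the Glue API. This is only a minimal sketch, assuming boto3 credentials are already configured; my-database and my-table are hypothetical names to replace with your own, and depending on how the table was created the compression hint may sit in the table-level Parameters or in the StorageDescriptor parameters.

import boto3

glue = boto3.client("glue")

# "my-database" / "my-table" are placeholders for your own catalog entries
response = glue.get_table(DatabaseName="my-database", Name="my-table")
table = response["Table"]

# The compression setting may appear at either level depending on how the
# table was defined, so print both and look for compressionType
print(table.get("Parameters", {}))
print(table["StorageDescriptor"].get("Parameters", {}))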

After reading through some of the documentation, though, I wouldn't recommend gzipping the files (maybe use Parquet instead): gzip isn't splittable, so instead of spreading the work across multiple DPUs, Spark has to work through each file individually.
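For illustration, once the data has been read, writing it back out as Parquet is a single call on the DataFrame. This is just a minimal sketch; the s3://my-bucket/bronze_parquet/ path is a hypothetical target to replace with your own bucket and prefix.

# Minimal sketch: write the bronze data (including the s3_location column)
# back out as Parquet. The target path below is hypothetical.
bronze_df.write.mode("overwrite").parquet("s3://my-bucket/bronze_parquet/")

Because Parquet is splittable, downstream jobs can then spread the work across multiple DPUs instead of handling each file on its own.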

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 A_Lopez