Where to set the S3 configuration in Spark locally?

I've set up a Docker container that starts a Jupyter notebook using Spark. I've added the necessary jars to Spark's jars directory so that it can access the S3 filesystem. My Dockerfile:

FROM jupyter/pyspark-notebook

EXPOSE 8080 7077 6066

RUN conda install -y --prefix /opt/conda pyspark==3.2.1
USER root

RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar)
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.213/aws-java-sdk-bundle-1.12.213.jar )

# The aws sdk relies on guava, but the default guava lib in jars is too old to be compatible
RUN rm /usr/local/spark/jars/guava-14.0.1.jar
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/google/guava/guava/29.0-jre/guava-29.0-jre.jar )

USER jovyan

ENV AWS_ACCESS_KEY_ID=XXXXX
ENV AWS_SECRET_ACCESS_KEY=XXXXX
ENV PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter

This works nicely so far. However, every time I create a kernel session in Jupyter, I need to set up the EnvironmentVariableCredentialsProvider manually, because by default it expects the IAMInstanceCredentialsProvider to deliver the credentials, which obviously isn't available here. Because of this, I need to set this in Jupyter every time:

spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
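
For reference, a full notebook cell currently looks roughly like this (the bucket and path below are just placeholders):

from pyspark.sql import SparkSession

# Reuse the running session, then point S3A at the environment-variable credentials
spark = SparkSession.builder.getOrCreate()
spark._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
)

# Only after that does a read from S3 succeed (placeholder bucket/key)
df = spark.read.csv("s3a://some-bucket/some/prefix/data.csv", header=True)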

Can I configure this somewhere in a file, so that the credentials provider is set correctly by default?

I've tried creating ~/.aws/credentials to see if Spark would read the credentials from there by default, but it doesn't.



Solution 1:[1]

After a few days of browsing the web, I found the correct property (not in the official documentation, though) that was missing from spark-defaults.conf:

spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider
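
In this image the file would presumably live at /usr/local/spark/conf/spark-defaults.conf. A quick way to check that the value was picked up is to read it back from the Hadoop configuration in a notebook cell, for example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Should print the EnvironmentVariableCredentialsProvider class name if the
# spark-defaults.conf entry above was applied
print(spark._jsc.hadoopConfiguration().get("fs.s3a.aws.credentials.provider"))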

Solution 2:[2]

The s3a connector actually looks for s3a options first, then environment variables, before IAM credentials: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3

Something may be wrong with your spark-defaults.conf file.
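
To illustrate the first step in that lookup order, the keys can also be passed directly as s3a options when building the session; a minimal sketch (with placeholder values) would look something like this:

from pyspark.sql import SparkSession

# fs.s3a.access.key / fs.s3a.secret.key are checked before environment
# variables and IAM credentials (placeholder values shown here)
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.access.key", "XXXXX")
    .config("spark.hadoop.fs.s3a.secret.key", "XXXXX")
    .getOrCreate()
)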

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Mayak
Solution 2: stevel