Where to set the S3 configuration in Spark locally?
I've set up a Docker container that starts a Jupyter notebook using Spark. I've added the necessary jars to Spark's jars directory so it can access the S3 filesystem. My Dockerfile:
FROM jupyter/pyspark-notebook
EXPOSE 8080 7077 6066
RUN conda install -y --prefix /opt/conda pyspark==3.2.1
USER root
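# Add the Hadoop AWS connector and the AWS SDK bundle so Spark can access S3 via the s3a:// filesystem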
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar)
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.213/aws-java-sdk-bundle-1.12.213.jar )
# The aws sdk relies on guava, but the default guava lib in jars is too old for being compatible
RUN rm /usr/local/spark/jars/guava-14.0.1.jar
RUN (cd /usr/local/spark/jars && wget https://repo1.maven.org/maven2/com/google/guava/guava/29.0-jre/guava-29.0-jre.jar )
USER jovyan
ENV AWS_ACCESS_KEY_ID=XXXXX
ENV AWS_SECRET_ACCESS_KEY=XXXXX
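# Launch Jupyter Notebook (without opening a browser) when pyspark is started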
ENV PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser"
ENV PYSPARK_DRIVER_PYTHON=/opt/conda/bin/jupyter
This works nicely so far. However, every time I create a kernel session in Jupyter, I need to set the EnvironmentVariableCredentialsProvider manually, because by default it expects the IAMInstanceCredentialsProvider to deliver the credentials, which obviously isn't there. Because of this, I need to run this in Jupyter every time:
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
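In full, the per-session boilerplate looks roughly like this (a sketch, assuming the SparkSession is created in the notebook itself):
from pyspark.sql import SparkSession

# Sketch: create the session, then override the S3A credentials provider so it
# reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
spark = SparkSession.builder.appName("s3-notebook").getOrCreate()
spark._jsc.hadoopConfiguration().set(
    "fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.EnvironmentVariableCredentialsProvider",
)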
Can I configure this somewhere in a file, so that the credentials provider is set correctly by default?
I've also tried creating ~/.aws/credentials to see if Spark would read the credentials from there by default, but it doesn't.
Solution 1:[1]
After a few days of browsing the web, I found the correct property (not in the official documentation, though) that was missing from spark-defaults.conf:
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider
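To bake this into the image, the property can be appended to spark-defaults.conf from the Dockerfile. A minimal sketch, assuming the Spark install lives at /usr/local/spark as in the Dockerfile above (run it while still USER root, since that directory is usually root-owned):
# Persist the credentials provider so every new Spark session picks it up by default
RUN echo "spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.EnvironmentVariableCredentialsProvider" \
    >> /usr/local/spark/conf/spark-defaults.conf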
Solution 2:[2]
The s3a connector actually looks for s3a options, then env vars, before IAM properties: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3
Something may be wrong with your spark-defaults config file.
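To check whether that file is actually being picked up, you can inspect the effective Hadoop configuration from a notebook. A quick sketch, assuming an existing spark session as in the question:
# Sketch: print the credentials provider Spark actually resolved,
# to confirm that spark-defaults.conf is being read.
hconf = spark._jsc.hadoopConfiguration()
print(hconf.get("fs.s3a.aws.credentials.provider"))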
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mayak |
| Solution 2 | stevel |