AWS EMR s3a filesystem not found
I am running an EMR cluster. It was working fine, but it suddenly started throwing the error below whenever my Python Spark script tries to access files on S3:
py4j.protocol.Py4JJavaError: An error occurred while calling o36.json.
: java.lang.RuntimeException: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
How can we resolve this?
Thanks in advance.
Solution 1:[1]
It was an issue with Spark's dependencies. I had to add the jars config in spark-defaults.conf:
spark.jars.packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
For details, see: https://gist.github.com/eddies/f37d696567f15b33029277ee9084c4a0
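If you would rather not edit spark-defaults.conf, the same packages can be supplied at submit time. A minimal sketch (the script name is a placeholder; the coordinates match the config line above):

```shell
# Pass the AWS SDK and hadoop-aws dependencies on the command line
# instead of spark-defaults.conf (script name is hypothetical)
spark-submit \
  --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2 \
  my_s3_script.py
```

Spark resolves the listed Maven coordinates at launch and puts them on both the driver and executor classpaths.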
Solution 2:[2]
- Download hadoop-aws-3.2.1.jar (or any version above 2.7.10, depending on your EMR version) and put it in /usr/lib/spark/jars
- Download the latest AWS SDK jar and put it in /usr/lib/spark/jars
- Update /usr/lib/spark/conf/spark-defaults.conf
- Update spark.driver.extraClassPath: append the full paths of these 2 new jars, separated by a colon
- Run spark-submit after that
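The steps above can be sketched as a few commands. The jar versions, Maven URLs, and paths are examples (aws-java-sdk-bundle 1.11.375 is the SDK version hadoop-aws 3.2.1 was built against); match them to your EMR release:

```shell
# Assumed jar versions and mirror URLs -- adjust to your EMR release
cd /usr/lib/spark/jars
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.1/hadoop-aws-3.2.1.jar
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar

# Append both jars to the driver classpath (colon-separated, as described above)
echo "spark.driver.extraClassPath /usr/lib/spark/jars/hadoop-aws-3.2.1.jar:/usr/lib/spark/jars/aws-java-sdk-bundle-1.11.375.jar" \
  | sudo tee -a /usr/lib/spark/conf/spark-defaults.conf
```

Note that if spark-defaults.conf already sets spark.driver.extraClassPath, you should append to the existing value rather than add a second line.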
Note: I used AWS EMR version 6.0+
Solution 3:[3]
For Amazon EMR, use the "s3://" prefix. The S3A connector is the ASF's open-source one; Amazon has their own (closed-source) connector, which is the only one they support.
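In a PySpark script on EMR that looks like the following. This is a sketch, not a tested example: the bucket and key are placeholders, and `spark` is assumed to be the active SparkSession:

```python
# On EMR, the supported connector is selected by the "s3://" scheme (EMRFS),
# so no extra s3a jars are needed. Bucket and path are hypothetical.
df = spark.read.json("s3://my-bucket/path/data.json")  # instead of "s3a://..."
```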
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Raghav salotra |
| Solution 2 | Shyam Prasad |
| Solution 3 | stevel |