AWS EMR s3a filesystem not found

I am running an EMR instance. It was working fine, but it suddenly started throwing the error below when I try to access S3 files from a Python Spark script:

py4j.protocol.Py4JJavaError: An error occurred while calling o36.json.: 
   java.lang.RuntimeException: 
     java.lang.ClassNotFoundException: 
       Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
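
A minimal sketch of the kind of read that triggers it (the bucket name and path are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-read").getOrCreate()

    # Reading over the s3a:// scheme fails with ClassNotFoundException because
    # hadoop-aws (which provides S3AFileSystem) is not on the classpath.
    df = spark.read.json("s3a://my-bucket/path/to/data.json")
    df.show()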

How can we resolve this?

Thanks in advance.



Solution 1:[1]

It was an issue with Spark's dependencies. I had to add a jars config in spark-defaults.conf:

spark.jars.packages                com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2

See this gist for details: https://gist.github.com/eddies/f37d696567f15b33029277ee9084c4a0
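
If you prefer not to edit spark-defaults.conf, the same packages can be pulled in from the SparkSession builder when the script is launched with plain python rather than spark-submit. A sketch, reusing the versions from the conf line above (adjust them to your Hadoop/EMR release; the bucket path is a placeholder):

    from pyspark.sql import SparkSession

    # spark.jars.packages must be set before the session (and its JVM) starts,
    # so it goes on the builder, not on an already-created SparkSession.
    spark = (
        SparkSession.builder
        .appName("s3a-example")
        .config(
            "spark.jars.packages",
            "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2",
        )
        .getOrCreate()
    )

    df = spark.read.json("s3a://my-bucket/path/to/data.json")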

Solution 2:[2]

  1. Download hadoop-aws-3.2.1.jar (or any version above 2.7.10, matching your EMR version) and put it in /usr/lib/spark/jars.
  2. Download the latest AWS SDK jar and put it in /usr/lib/spark/jars.
  3. Update /usr/lib/spark/conf/spark-defaults.conf.
  4. Update spark.driver.extraClassPath: append the full paths of these two new jars, separated by a colon (see the sketch below).
  5. Run spark-submit after that.

Note: I used AWS EMR version 6.0+
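
For step 4, the resulting line in /usr/lib/spark/conf/spark-defaults.conf might look roughly like the sketch below. The existing classpath value and the SDK jar version are placeholders; keep whatever value EMR already set and append the two new paths to it:

    spark.driver.extraClassPath    <existing value>:/usr/lib/spark/jars/hadoop-aws-3.2.1.jar:/usr/lib/spark/jars/aws-java-sdk-bundle-<version>.jar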

Solution 3:[3]

For Amazon EMR, use the "s3:" prefix. The S3A connector is the ASF's open source one; Amazon has its own (closed-source) connector, which is the only one it supports.
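
On EMR that usually means just changing the URI scheme in the read call, e.g. (bucket and path are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emrfs-read").getOrCreate()

    # On EMR the s3:// scheme is handled by Amazon's own EMRFS connector,
    # so no hadoop-aws / S3AFileSystem classes are needed.
    df = spark.read.json("s3://my-bucket/path/to/data.json")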

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Raghav salotra
Solution 2   Shyam Prasad
Solution 3   stevel