Spark Catalog w/ AWS Glue: database not found
I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via:
spark.catalog.setCurrentDatabase("test")
spark.catalog.listTables
However, when I submit a job via spark-submit, I get a fatal error:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Database 'test' does not exist.;
The job submitted via spark-submit creates its SparkSession with:
SparkSession.builder.enableHiveSupport.getOrCreate
Solution 1:[1]
Adding the hive.metastore.client.factory.class configuration to the code that initializes the Spark session solved the issue for me:
SparkSession spark = SparkSession.builder()
    ...
    // Point the Hive metastore client at the AWS Glue Data Catalog
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate();
This is the same configuration described in the AWS docs (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html) and added to the cluster configuration when you check "Use for Hive table metadata" at cluster creation, but for some reason it doesn't take effect as expected (I'm using EMR 5.12.0).
Solution 2:[2]
I had the same issue: spark-submit will not discover the AWS Glue libraries, but spark-shell running on the master node will.
It turns out that my spark-submit job uses a fat .jar that was compiled with the standard org.apache.spark and org.apache.hive libraries. The libraries bundled in the jar were being used instead of the custom classes installed on EMR.
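To check whether this applies to your job, a small diagnostic along these lines can help. It is only a sketch and not part of the original answer: the class name ClassOrigin and the helper originOf are made up for illustration, and the Hive class name is just one representative class to probe. It uses only standard JDK calls to print which jar each class was loaded from.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /** Returns where a class was loaded from, or a marker string if that cannot be determined. */
    static String originOf(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader have no code source
            return source == null ? "<bootstrap or unknown>" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not on classpath>";
        }
    }

    public static void main(String[] args) {
        // Representative classes to probe; adjust the list to your own dependencies
        String[] classNames = {
            "org.apache.spark.sql.SparkSession",
            "org.apache.hadoop.hive.metastore.HiveMetaStoreClient",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        };
        for (String name : classNames) {
            System.out.println(name + " -> " + originOf(name));
        }
    }
}
```

Calling originOf from inside the submitted job (for example, logging the results at startup) shows what the job actually sees. If the Spark or Hive classes resolve to your own fat .jar rather than to a jar from the cluster's installation, the bundled libraries are likely shadowing the EMR ones.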
If this is the case for you, make sure to exclude all 'org.apache.spark:', 'org.apache.hive:', and 'org.apache.hadoop:' modules from your .jar.
Here is the reference I used for Gradle: http://unethicalblogger.com/2015/07/15/gradle-goodness-excluding-depends-from-shadow.html. Adding the compileOnly keyword in front of all Spark libraries fixed it.
Solution 3:[3]
Our issue was IAM permissions on the EMR cluster; make sure that the cluster's IAM instance profile has full access to Glue.
Solution 4:[4]
My problem ended up being that another classification configuration had been interfering with the spark-hive-site one. I deleted all the others, and it was finally able to connect.
Solution 5:[5]
EMR 5.9.0 has just been released. Please give it a shot; it should work for you.
Relevant documentation:
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Avi |
Solution 2 | Mirela Spasova |
Solution 3 | autodidacticon |
Solution 4 | Laurel |
Solution 5 | Al Belsky |