Spark SQL error from EMR notebook with AWS Glue table partition
I'm testing some PySpark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. All of my tables and metadata are integrated with the AWS Glue Data Catalog so that I can read and write to them through Spark.
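For context, the session is the stock EMR notebook session. A rough sketch of the equivalent setup is below; on EMR the Glue metastore integration is normally configured at the cluster level, so the explicit factory-class config is illustrative only:
from pyspark.sql import SparkSession

# Illustrative only: on EMR the Glue Data Catalog integration is usually
# enabled when the cluster is created, so the notebook's `spark` session
# already uses Glue as its Hive metastore.
spark = (
    SparkSession.builder
    .appName('glue-partition-test')
    .config('spark.hadoop.hive.metastore.client.factory.class',
            'com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory')
    .enableHiveSupport()
    .getOrCreate()
)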
The first part of the code reads some data from S3/Glue, does some transformations and whatnot, then writes the resulting dataframe to S3/Glue like so:
df.repartition('datekey', 'coeff') \
    .write \
    .format('parquet') \
    .partitionBy('datekey', 'coeff') \
    .mode('overwrite') \
    .option('path', S3_PATH) \
    .saveAsTable('hive_tables.my_table')
I then try to access this table with Spark SQL, but when I run something as simple as
spark.sql('select * from hive_tables.my_table where datekey=20210506').show()
it throws this:
An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
I've learned this happens only when specifying the datekey partition. For example, both of the following commands work fine:
spark.sql('select * from hive_tables.my_table where coeff=0.5').show()
and
spark.sql('select * from hive_tables.my_table').show()
I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena - just not Spark SQL.
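For reference, this is roughly how I verified it (standard Spark SQL, nothing exotic):
# List the partitions the catalog has registered for the table
spark.sql('show partitions hive_tables.my_table').show(truncate=False)

# An unfiltered count also comes back non-zero, so the data itself is readable
spark.sql('select count(*) from hive_tables.my_table').show()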
Also, Glue definitely has the two partition columns recognized:
datekey: int
coeff: double
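Pulling the table definition straight from Glue with boto3 shows the same thing (the database and table names here are mine; this is just how I double-checked the types):
import boto3

# Inspect the partition key names and types Glue has registered for the table
glue = boto3.client('glue')
table = glue.get_table(DatabaseName='hive_tables', Name='my_table')['Table']
for key in table['PartitionKeys']:
    print(key['Name'], key['Type'])
# prints:
#   datekey int
#   coeff double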
Any ideas here? I've tried everything I can think of and it just isn't making any sense.
Solution 1:[1]
I would still like a straightforward solution to this, but currently this workaround suffices:
I first read the table straight from the S3 path
temp_df = spark.read.parquet(S3_PATH)
so that it doesn't use the Glue catalog for the metadata. Then I register it as a global temporary view:
temp_df.createGlobalTempView('my_table')
which allows me to query it using Spark SQL with the global_temp database:
spark.sql('select * from global_temp.my_table where datekey=20210506').show()
and this works.
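A session-scoped variant of the same workaround, if you'd rather avoid the global_temp prefix, is createOrReplaceTempView (the view name here is just illustrative):
# Read the parquet files directly from S3, bypassing the Glue catalog metadata
temp_df = spark.read.parquet(S3_PATH)

# Session-scoped view: queried without the global_temp database prefix
temp_df.createOrReplaceTempView('my_table_tmp')

spark.sql('select * from my_table_tmp where datekey=20210506').show()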
Solution 2:[2]
I had the same error on EMR 6.3.0 (Spark 3.1.1). After upgrading to EMR 6.5.0 (Spark 3.1.2), it was resolved.
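To confirm which Spark version the notebook session is actually running before and after the upgrade, the session object exposes it:
# Prints e.g. '3.1.1' on EMR 6.3.0 and '3.1.2' on EMR 6.5.0
print(spark.version)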
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Powerful Lee |