Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata integrated with the AWS Glue catalog so that I can read and write to them through spark.

The first part of the code reads some data from S3/Glue, does some transformations and what not, then writes the resulting dataframe to S3/Glue like so:

df.repartition('datekey','coeff')\
    .write\
    .format('parquet')\
    .partitionBy('datekey','coeff')\
    .mode('overwrite')\
    .option("path", S3_PATH)\
    .saveAsTable('hive_tables.my_table')

I then try to access this table with Spark SQL, but when I run something as simple as

spark.sql('select * from hive_tables.my_table where datekey=20210506').show()

it throws this:

An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"

I've learned this happens only when specifying the datekey partition. For example, both of the following commands work fine:

spark.sql('select * from hive_tables.my_table where coeff=0.5').show()

spark.sql('select * from hive_tables.my_table').show()

I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena - just not Spark SQL.
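A check along these lines is enough to confirm both, using the table and column names from above:

# list the partitions registered for the table (this goes through the Glue catalog)
spark.sql('show partitions hive_tables.my_table').show(truncate=False)

# confirm each datekey partition actually contains rows
spark.sql('select datekey, count(*) as cnt from hive_tables.my_table group by datekey').show()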

Also Glue definitely has the two partition columns recognized:

datekey: int
coeff: double
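The same schema can also be confirmed from the notebook with boto3 (a minimal sketch, assuming the Glue database is likewise named hive_tables):

import boto3

# fetch the table definition straight from the Glue Data Catalog
glue = boto3.client('glue')
table = glue.get_table(DatabaseName='hive_tables', Name='my_table')['Table']
print(table['PartitionKeys'])  # expect datekey (int) and coeff (double) here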

Any ideas here? I've tried everything I can think of and it just isn't making any sense.



Solution 1:[1]

I would still like a straightforward solution to this, but for now this workaround suffices:

I first read the table directly from the S3 path:

temp_df = spark.read.parquet(S3_PATH)

so that it doesn't use the Glue catalog for the metadata. Then I create a temp view for the session:

temp_df.createGlobalTempView('my_table')

which allows me to query it using Spark SQL with the global_temp database:

spark.sql('select * from global_temp.my_table where datekey=20210506').show()

and this works.
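If the view is only needed within the current session, a regular temp view is a slightly simpler variant of the same workaround and doesn't require the global_temp prefix:

# session-scoped view; queried without the global_temp database qualifier
temp_df.createOrReplaceTempView('my_table')

spark.sql('select * from my_table where datekey=20210506').show()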

Solution 2:[2]

I had the same error on EMR 6.3.0 (Spark 3.1.1).

After upgrading to EMR 6.5.0 (Spark 3.1.2), the error was resolved.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1
[2] Solution 2: Powerful Lee