PySpark SQL: cannot resolve 'explode()' due to data type mismatch

When I run a PySpark script, I get the following error, depending on which XML file I query:

cannot resolve 'explode(...)' due to data type mismatch

The PySpark code:

from pyspark.sql import SparkSession

JOB_NAME = "Complex file to delimited files transformer"
spark = SparkSession.builder.appName(JOB_NAME)\
    .config("spark.scheduler.mode", "FAIR")\
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.12:0.12.0')\
    .getOrCreate()

sql_script = "select create_date, item['_id'], item['_VALUE'] from my_data lateral view explode(items.item) t as item"

# Works fine: loads every file in ./xml
read_options = {"rowTag": "my_data"}
df = spark.read\
    .format("xml")\
    .options(**read_options)\
    .load("./xml")
df.createOrReplaceTempView("my_data")
spark.sql(sql_script).show()

# Fails with the data type mismatch error: loads test2.xml only
df2 = spark.read\
    .format("xml")\
    .options(**read_options)\
    .load("./xml/test2.xml")
df2.createOrReplaceTempView("my_data")
spark.sql(sql_script).show()

Both XML files are in the xml folder.

test1.xml:

<my_data>
    <create_date>2021-05-01</create_date>
    <items>
        <item id="1">item 1</item>
        <item id="2">item 2</item>
    </items>
</my_data>

test2.xml:

<my_data>
    <create_date>2021-06-01</create_date>
    <items>
        <item id="3">item 3</item>
    </items>
</my_data>

Expected result: the same SQL statement should work on every run, including runs where <items> happens to contain only one <item>.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
