How to create a child dataframe from an XML file using PySpark?

I have all the supporting libraries in PySpark, and I am able to create the parent dataframe:

def xmlReader(root, row, filename):
  df = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
  xref = df.select("genericEntity.entityId", "genericEntity.entityName",
                   "genericEntity.entityType", "genericEntity.inceptionDate",
                   "genericEntity.updateTimestamp", "genericEntity.entityLongName")
  return xref

df1 = xmlReader("BOBML","entityTransaction","s3://dev.xml")

df1.head()

But I am unable to create the child dataframe:

def xmlReader(root, row, filename):
  df2 = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
  xref = df2.select("genericEntity.entityDetail", "genericEntity.entityDetialId",
                    "genericEntity.updateTimestamp")
  return xref

df3 = xmlReader("BOBML", "entityTransaction", "s3://dev.xml")

df3.head()

I am not getting any output, and I was planning to do a union between the parent and child dataframes. Any help will be truly appreciated!



Solution 1:[1]

After more than 24 hours, I was able to solve the problem. Thanks to everyone who at least looked at it.

Solution:

Step 1: Import a couple of libraries:

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # legacy API; on Spark 2+ the SparkSession (spark) is preferred

Step 2 (Parent): Read the XML file, print the schema, register a temp table, and create the dataframe.

Step 3 (Child): Repeat Step 2.

Step 4: Create the final dataframe by joining the child and parent dataframes.

Step 5: Write the data to S3 (e.g. df.write.csv("s3://path")) or to a database.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
