How to create a child DataFrame from an XML file using PySpark?
I have all the supporting libraries available in PySpark, and I am able to create a DataFrame for the parent:
def xmlReader(root, row, filename):
    df = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
    xref = df.select("genericEntity.entityId", "genericEntity.entityName", "genericEntity.entityType", "genericEntity.inceptionDate", "genericEntity.updateTimestamp", "genericEntity.entityLongName")
    return xref

df1 = xmlReader("BOBML", "entityTransaction", "s3://dev.xml")
df1.head()
I am unable to create the child DataFrame:
def xmlReader(root, row, filename):
    df2 = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
    xref = df2.select("genericEntity.entityDetail", "genericEntity.entityDetialId", "genericEntity.updateTimestamp")
    return xref

df3 = xmlReader("BOBML", "s3://dev.xml")
df3.head()
I am not getting any output. I was planning to union the parent and child DataFrames. Any help would be truly appreciated!
Solution 1:[1]
After more than 24 hours, I was able to solve the problem. Thanks to everyone who at least looked at it.
Solution:
Step 1: Import a couple of libraries
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
# Assumes sc is an existing SparkContext (for example, the one the PySpark shell provides).
sqlContext = SQLContext(sc)
Step 2 (Parent): Read the XML files, print the schema, register temp tables, and create the DataFrame.
Step 3 (Child): Repeat step 2 for the child.
Step 4: Create the final DataFrame by joining the child and parent DataFrames.
Step 5: Load the data into S3 (write.csv/S3://Path) or a database.
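The steps above could be sketched roughly as follows. This is only a minimal sketch, not the exact code I ran. It assumes `spark` is an existing SparkSession with the spark-xml package available, that the child fields sit under the same `entityTransaction` row tag as the parent, and that `entityId` can serve as the join key. The helper names (`qualify`, `read_xml`, `build_final`, `write_csv`), the `CHILD_FIELDS` list, and `output_path` are hypothetical and need adjusting to the real schema.

```python
# Sketch only: `spark` is assumed to be an existing SparkSession with the
# spark-xml package on the classpath. Nothing here starts Spark by itself.

# Parent columns taken from the question.
PARENT_FIELDS = ["entityId", "entityName", "entityType",
                 "inceptionDate", "updateTimestamp", "entityLongName"]

# Assumption: the child rows carry entityId as a key next to entityDetail.
CHILD_FIELDS = ["entityId", "entityDetail"]


def qualify(struct_name, fields):
    """Build dotted column paths such as 'genericEntity.entityId'."""
    return [f"{struct_name}.{field}" for field in fields]


def read_xml(spark, root_tag, row_tag, path, columns):
    """Read an XML file with spark-xml and keep only the given columns."""
    df = (spark.read.format("com.databricks.spark.xml")
          .options(rowTag=row_tag, rootTag=root_tag)
          .load(path))
    return df.select(*columns)


def build_final(spark, path, join_key="entityId"):
    """Steps 2 to 4: read parent and child, then join them on join_key."""
    parent = read_xml(spark, "BOBML", "entityTransaction", path,
                      qualify("genericEntity", PARENT_FIELDS))
    child = read_xml(spark, "BOBML", "entityTransaction", path,
                     qualify("genericEntity", CHILD_FIELDS))
    # Joining on the column name keeps a single entityId column in the result.
    return parent.join(child, on=join_key, how="left")


def write_csv(df, output_path):
    """Step 5: write the joined DataFrame as CSV with a header row."""
    df.write.mode("overwrite").csv(output_path, header=True)


# Hypothetical usage (output_path is a placeholder for your own location):
# final_df = build_final(spark, "s3://dev.xml")
# final_df.printSchema()
# write_csv(final_df, output_path)
```

Calling `printSchema()` on each DataFrame before selecting is the quickest way to check the real field names. Note that the CSV writer cannot store nested struct or array columns, so if `entityDetail` is nested it would need to be flattened before step 5.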
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |