'How to read nested xml file with python pandas?
I am using python and pandas library to read xml file. And it's great but doesn't read complete file.
python code:
df = pd.read_xml('filename.xml')
df.to_excel('filename.xlsx', index=False)
xml file:
<?xml version="1.0" encoding="UTF-8"?>
<artikli>
<artikal>
<id>197</id>
<sifra>15138</sifra>
<barKod>8605026307328</barKod>
<naziv><![CDATA[Torbica pojas sjajna za Nokia N95]]></naziv>
<kategorija1><![CDATA[MOBILNI]]></kategorija1>
<kategorija2><![CDATA[!TORBICE NA RASPRODAJI PO MODELU]]></kategorija2>
<kategorija3><![CDATA[TORBICE ZA NOKIA]]></kategorija3>
<vpCena>1.0800</vpCena>
<mpCena>998.9760</mpCena>
<dostupan>1</dostupan>
<opis><![CDATA[]]></opis>
<slike>
<slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890.jpg]]></slika>
<slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10064.jpg]]></slika>
<slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10065.jpg]]></slika>
<slika><![CDATA[http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10066.jpg]]></slika>
</slike>
</artikal>
<artikal>
<id>198</id>
<sifra>15163</sifra>
<barKod>8605026396155</barKod>
<naziv><![CDATA[Torbica pojas sjajna za Nokia 6303]]></naziv>
<kategorija1><![CDATA[MOBILNI]]></kategorija1>
<kategorija2><![CDATA[!TORBICE NA RASPRODAJI PO MODELU]]></kategorija2>
<kategorija3><![CDATA[TORBICE ZA NOKIA]]></kategorija3>
<vpCena>1.0800</vpCena>
<mpCena>998.9760</mpCena>
<dostupan>1</dostupan>
<opis><![CDATA[]]></opis>
<slike>
<slika><![CDATA[http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132477.png]]></slika>
<slika><![CDATA[http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132478.png]]></slika>
<slika><![CDATA[http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132479.png]]></slika>
</slike>
</artikal>
</artikli>
Output excel file:
id sifra barKod naziv kategorija1 kategorija2 kategorija3 vpCena mpCena dostupan opis slike
197 15138 8605026307328 Torbica pojas sjajna za Nokia N95 MOBILNI !TORBICE NA RASPRODAJI PO MODELU TORBICE ZA NOKIA 1,08 998,976 1
198 15163 8605026396155 Torbica pojas sjajna za Nokia 6303 MOBILNI !TORBICE NA RASPRODAJI PO MODELU TORBICE ZA NOKIA 1,08 998,976 1
So I am missing data in column "slike" from file. How to get that data? Desired table should looks like this:
artikal|id artikal|sifra artikal|barKod artikal|naziv artikal|kategorija1 artikal|kategorija2 artikal|kategorija3 artikal|vpCena artikal|mpCena artikal|dostupan artikal|opis artikal|slike|slika
197 15138 8605026307328 Torbica pojas sjajna za Nokia N95 MOBILNI !TORBICE NA RASPRODAJI PO MODELU TORBICE ZA NOKIA 1.0800 998.9760 1 [http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890.jpg, http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10064.jpg, http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10065.jpg, http://company3g.com/p/19/197/torbica-pojas-sjajna-za-nokia-n95-1890-10066.jpg]
198 15163 8605026396155 Torbica pojas sjajna za Nokia 6303 MOBILNI !TORBICE NA RASPRODAJI PO MODELU TORBICE ZA NOKIA 1.0800 998.9760 1 [http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132477.png, http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132478.png, http://company3g.com/p/19/198/torbica-pojas-sjajna-za-nokia-6303-1891-132479.png]
Solution 1:[1]
While XML as a data format can take many forms from flat to deeply nested, data frames must adhere to a single structure of two dimensions: row by column. Hence, as noted in docs, pandas.read_xml
, is a convenience method best for flatter, shallow XML files. You can use xpath
to traverse different areas of the document, not just the default /*
.
However, you can use XSLT 1.0 (special purpose language designed to transform XML files) with the default parser, lxml
, to transform any XML to the needed flat format of data frame. Below stylesheet will restyle the <slike>
node for comma-separated text of its children <slika>
:
XSLT (save as .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="slike">
<xsl:copy>
<xsl:for-each select="*">
<xsl:value-of select="text()"/>
<xsl:if test="position() != last()">
<xsl:text>,</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Python
artikal_df = pd.read_xml("my_filename.xml", stylesheet="my_style.xsl")
# CONVERT COMMA-SEPARATED VALUES TO EMBEDDED LISTS
artikal_df["slike"] = artikal_df["slike"].str.split(',')
# PREFIX PARENT NODE NAME
artikal_df = artikal_df.add_prefix('artikal_')
artikal_df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 2 entries, 0 to 1
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 artikal_id 2 non-null int64
# 1 artikal_sifra 2 non-null int64
# 2 artikal_barKod 2 non-null int64
# 3 artikal_naziv 2 non-null object
# 4 artikal_kategorija1 2 non-null object
# 5 artikal_kategorija2 2 non-null object
# 6 artikal_kategorija3 2 non-null object
# 7 artikal_vpCena 2 non-null float64
# 8 artikal_mpCena 2 non-null float64
# 9 artikal_dostupan 2 non-null int64
# 10 artikal_opis 0 non-null float64
# 11 artikal_slike 2 non-null object
# dtypes: float64(3), int64(4), object(5)
# memory usage: 320.0+ bytes
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |