Spark: skip top rows with spark-excel
I have an Excel file whose first three rows are damaged and need to be skipped. I'm using the spark-excel library to read the file, but I can't find such an option documented on its GitHub page. Is there a way to achieve this?
This is my code:
Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
    .option("location", filePath)
    .option("sheetName", "Feuil1")
    .option("useHeader", "true")
    .option("delimiter", "|")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "true")
    .option("addColorColumns", "false")
    .load(filePath);
Solution 1:[1]
This issue is fixed in spark-excel 0.9.16; see the corresponding issue on GitHub.
Solution 2:[2]
I have looked at the source code, and there is no option for this.
You should fix your Excel file by removing the first 3 rows. Otherwise you would need to create a patched version of the library that supports skipping rows, which would be far more effort than correcting the Excel sheet.
Solution 3:[3]
You can use the skipFirstRows option (I believe it was deprecated after version 0.11).
Library dependency: "com.crealytics" %% "spark-excel" % "0.10.2"
Sample code:
val df = sparkSession.read.format("com.crealytics.spark.excel")
  .option("location", inputLocation)
  .option("sheetName", "sheet1")
  .option("useHeader", "true")
  .option("skipFirstRows", "2") // Mention the number of top rows to be skipped
  .load(inputLocation)
Hope it helps! Feel free to let me know in comments if you have any doubts/issues. Thanks!
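For library versions where skipFirstRows is no longer accepted, the spark-excel README describes a dataAddress option that takes a start cell or range such as 'My Sheet'!B3. Assuming that option is available in your version, skipping N top rows amounts to starting the read at row N + 1. The sketch below only builds that address string; the class and method names are hypothetical helpers, not part of spark-excel, and the Spark call is shown as a comment because it depends on your session and library version.

```java
// Hypothetical helper, not part of spark-excel: builds a start-cell address
// that begins reading below a given number of damaged top rows.
final class ExcelDataAddress {
    private ExcelDataAddress() {}

    // Returns an address like 'Feuil1'!A4 when rowsToSkip is 3.
    // Sheet names that themselves contain an apostrophe are not handled here.
    static String startCellAfterSkipping(String sheetName, int rowsToSkip) {
        if (rowsToSkip < 0) {
            throw new IllegalArgumentException("rowsToSkip must be >= 0");
        }
        // Excel rows are 1-based, so skipping N rows starts at row N + 1.
        return "'" + sheetName + "'!A" + (rowsToSkip + 1);
    }

    public static void main(String[] args) {
        String address = startCellAfterSkipping("Feuil1", 3);
        System.out.println(address);

        // Assumed usage, only if your spark-excel version supports dataAddress:
        // Dataset<Row> ds = session.read().format("com.crealytics.spark.excel")
        //     .option("dataAddress", address)
        //     .load(filePath);
    }
}
```

With the question's sheet name and three damaged rows, the address is 'Feuil1'!A4, so the header row would be read from row 4. Check the README for your exact version before relying on this, since option names have changed between releases.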
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Abdennacer Lachiheb |
Solution 2 | Tarun Lalwani |
Solution 3 | Hema Priya Velaga |