'Processing large XLSX file in python
I have a large xlsx Excel file (56mb, 550k rows) from which I tried to read the first 10 rows. I tried using xlrd
, openpyxl
, and pyexcel-xlsx
, but they always take more than 35 mins because it loads the whole file in memory.
I unzipped the Excel file and found out that the xml
which contains the data I need is 800mb unzipped.
When you load the same file in Excel it takes 30 seconds. I'm wondering why it takes that much time in Python?
Solution 1:[1]
Use openpyxl's read-only mode to do this.
You'll be able to work with the relevant worksheet instantly.
Solution 2:[2]
Here is it, i found a solution. The fastest way to read an xlsx sheet.
56mb file with over 500k rows and 4 sheets took 6s to proceed.
import zipfile
from bs4 import BeautifulSoup
paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
file = zipfile.ZipFile(filename, "r")
for name in file.namelist():
if name == 'xl/workbook.xml':
data = BeautifulSoup(file.read(name), 'html.parser')
sheets = data.find_all('sheet')
for sheet in sheets:
paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])
for path in paths:
if path[0] == mySheet:
with file.open(path[1]) as reader:
for row in reader:
print(row) ## do what ever you want with your data
reader.close()
Enjoy and happy coding.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Ty Hitzeman |
Solution 2 | Amine |