'Python script to read multiple excel files in one directory and convert them to .csv files in another directory
I am relatively new to python and Stackoverflow but hoping anyone can shed some light of my current problem. I have a python script that takes excel files (.xls, and .xlsx) from one directory and converts them to .csv files to another directory. It works perfectly fine on my sample excel files (consisted of 4 columns and 1 row for the purpose of testing), but when I try to run my script against a different directory that has excel files (alot larger in file size) I am getting an assertion error. I have attached my code and the error. Looking forward to have some guidance on this problem. Thanks!
import os
import pandas as pd
source = "C:/.../TestFolder"
output = "C:/.../OutputCSV"
dir_list = os.listdir(source)
os.chdir(source)
for i in range(len(dir_list)):
filename = dir_list[i]
book = pd.ExcelFile(filename)
#writing to csv
if filename.endswith('.xlsx') or filename.endswith('.xls'):
for i in range(len(book.sheet_names)):
df = pd.read_excel(book, book.sheet_names[i])
os.chdir(output)
new_name = filename.split('.')[0] + str(book.sheet_names[i])+'.csv'
df.to_csv(new_name, index = False)
os.chdir(source)
print "New files: ", os.listdir(output)
Solution 1:[1]
Since you use Windows, consider the Jet/ACE SQL engine (Windows .dll files) to query Excel workbooks and export to CSV files, bypassing needs to load/export with pandas dataframes.
Specifically, use pyodbc
to make the ODBC connection to Excel files, iterate through each sheet and export to csv files using SELECT * INTO ...
SQL action query. The openpyxl
module is used to retrieve sheet names. Below script does not rely on relative paths so can be run from anywhere. It is assumed each Excel file has complete header columns (no missing cells in used range of top row).
import os
import pyodbc
from openpyxl import load_workbook
source = "C:/Path/To/TestFolder"
output = "C:/Path/To/OutputCSV"
dir_list = os.listdir(source)
for xlfile in dir_list:
strfile = os.path.join(source, xlfile)
if strfile.endswith('.xlsx') or strfile.endswith('.xls'):
# CONNECT TO WORKBOOK
conn = pyodbc.connect(r'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};' + \
'DBQ={};'.format(strfile), autocommit=True)
# RETRIEVE WORKBOOK SHEETS
sheets = load_workbook(filename = strfile, use_iterators = True).get_sheet_names()
# ITERATIVELY EXPORT SHEETS TO CSV IN OUTPUT FOLDER
for s in sheets:
outfile = os.path.join(output, '{0}_{1}.csv'.format(xlfile.split('.')[0], s))
if os.path.exists(outfile): os.remove(outfile)
strSQL = " SELECT * " + \
" INTO [text;HDR=Yes;Database={0};CharacterSet=65001].[{1}]" + \
" FROM [{2}$]"
conn.execute(strSQL.format(output, os.path.basename(outfile, s))
conn.close()
**Note: this process creates a schema.ini
file that concatenates with each iteration. Can be deleted.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Parfait |