'Python script to read multiple excel files in one directory and convert them to .csv files in another directory

I am relatively new to python and Stackoverflow but hoping anyone can shed some light of my current problem. I have a python script that takes excel files (.xls, and .xlsx) from one directory and converts them to .csv files to another directory. It works perfectly fine on my sample excel files (consisted of 4 columns and 1 row for the purpose of testing), but when I try to run my script against a different directory that has excel files (alot larger in file size) I am getting an assertion error. I have attached my code and the error. Looking forward to have some guidance on this problem. Thanks!

import os
import pandas as pd

source = "C:/.../TestFolder"
output = "C:/.../OutputCSV"

dir_list = os.listdir(source)

os.chdir(source)

for i in range(len(dir_list)):
    filename = dir_list[i]
    book = pd.ExcelFile(filename)

    #writing to csv
    if filename.endswith('.xlsx') or filename.endswith('.xls'):
        for i in range(len(book.sheet_names)):
            df = pd.read_excel(book, book.sheet_names[i])

            os.chdir(output)

            new_name = filename.split('.')[0] + str(book.sheet_names[i])+'.csv'
            df.to_csv(new_name, index = False)

        os.chdir(source)

print "New files: ", os.listdir(output)

The error message I receive when executing this script in the terminal shows as below:



Solution 1:[1]

Since you use Windows, consider the Jet/ACE SQL engine (Windows .dll files) to query Excel workbooks and export to CSV files, bypassing needs to load/export with pandas dataframes.

Specifically, use pyodbc to make the ODBC connection to Excel files, iterate through each sheet and export to csv files using SELECT * INTO ... SQL action query. The openpyxl module is used to retrieve sheet names. Below script does not rely on relative paths so can be run from anywhere. It is assumed each Excel file has complete header columns (no missing cells in used range of top row).

import os
import pyodbc
from openpyxl import load_workbook

source = "C:/Path/To/TestFolder"
output = "C:/Path/To/OutputCSV"

dir_list = os.listdir(source)

for xlfile in dir_list:
    strfile = os.path.join(source, xlfile)

    if strfile.endswith('.xlsx') or strfile.endswith('.xls'):
        # CONNECT TO WORKBOOK
        conn = pyodbc.connect(r'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};' + \
                               'DBQ={};'.format(strfile), autocommit=True)
        # RETRIEVE WORKBOOK SHEETS
        sheets = load_workbook(filename = strfile, use_iterators = True).get_sheet_names()

        # ITERATIVELY EXPORT SHEETS TO CSV IN OUTPUT FOLDER
        for s in sheets:
            outfile = os.path.join(output, '{0}_{1}.csv'.format(xlfile.split('.')[0], s))
            if os.path.exists(outfile): os.remove(outfile)

            strSQL = " SELECT * " + \
                     " INTO [text;HDR=Yes;Database={0};CharacterSet=65001].[{1}]" + \
                     " FROM [{2}$]"            
            conn.execute(strSQL.format(output, os.path.basename(outfile, s))
        conn.close()

**Note: this process creates a schema.ini file that concatenates with each iteration. Can be deleted.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Parfait