'Using Python, Create Data Frame from Excel Table and Add Columns Containing Values from Certain Cells

I have a multiple excel files with multiple sheets and am trying to consolidate everything into a single data frame with certain cells added as new columns. Using the example below, each sheet has the team name in cell A1 and the Date in cell A2 and I want to convert those cells to columns in the output file as in the second screenshot so when each sheet and workbook book are combined, they are organized. I can do this for a single file using the code below, but don't know how to do it for every file in a given directory.

Raw Table Format

Desired Format

import pandas as pd
from openpyxl import load_workbook

path = "[path]"

#data table starts in row 3
data = pd.read_excel(path,skiprows=2,sheet_name=None)

teams = []
dates = []
data2 = []

wb = load_workbook(path)
 
#getting the teams and dates from the specified cell in each sheet.
for i, sheet in enumerate(wb):
    teams.append(sheet['A1'].value)
for i, sheet in enumerate(wb):
    dates.append(sheet['A2'].value)

wb = None
sheet = None

#adding the team and date values as a new column
for i, item in enumerate(data.items()):
   data2.append(item)
   data2[i][1].insert(loc=0,column="Team",value=teams[i])
   data2[i][1].insert(loc=1,column="Date",value=dates[i])
 
#combining everything into a single data frame
output = pd.DataFrame(columns = data2[1][1].columns)
for i,item in enumerate(data2):
    output = pd.concat([output,data2[i][1]],ignore_index=True)


Solution 1:[1]

Welcome to StackOverflow.... you can use the below updated code to run through all files in DataFiles directory under the directory where you are running the program. Also, the program will pick up only XLSX and XLSM files, ignoring any other files that may be present. You change it using the ALLOWED_EXTENSIONS list. As earlier, the output dataframe will have all data

import pandas as pd
from openpyxl import load_workbook

import os
currdir = os.getcwd()
DirPath = os.path.join(currdir,"DataFiles",'')  #Assuming the files are in DataFiles folder under DIR where you are running code

ALLOWED_EXTENSIONS = set(['xlsx', 'xlsm'])

def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

#Dataframe created - add everything into this
output = pd.DataFrame(columns = ['Team', 'Date', 'Player', 'Rank', 'Points'])

for filename in os.listdir(DirPath):
    path = os.path.join(DirPath, filename)
    if os.path.isfile(path) and allowed_file(filename):
        #data table starts in row 3
        data = pd.read_excel(path,skiprows=2,sheet_name=None)

        teams = []
        dates = []
        data2 = []

        wb = load_workbook(path)

        #getting the teams and dates from the specified cell in each sheet.
        for i, sheet in enumerate(wb):
            teams.append(sheet['A1'].value)
        for i, sheet in enumerate(wb):
            dates.append(sheet['A2'].value)

        wb = None
        sheet = None

        #adding the team and date values as a new column
        for i, item in enumerate(data.items()):
            data2.append(item)
            data2[i][1].insert(loc=0,column="Team",value=teams[i])
            data2[i][1].insert(loc=1,column="Date",value=dates[i])

        for i,item in enumerate(data2):
            output = pd.concat([output,data2[i][1]],ignore_index=True)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Redox