'Join large set of CSV files where the header is the timestamp for the file
I have a large set of CSV files. Approx. 15 000 files. And would like to figure out how to join them together as one file for data processing.
Each file is in a simple pattern with timestamp that corresponds to a period of time that represent the data in the each CSV file.
Ex.
file1.csv
2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
file2.csv
2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0
Each file starts with the timestamp. I would like to join all the files in a directory, and transpose the "Unit.Device" to columns. And then use the original header as a timestamp column. For each file add a new row with the corresponding "ErrorCode" to each column.
Like this:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....
Any simple tools for this, or Python routines?
Solution 1:[1]
Thanks for the reply on my first question here! I will also contribute with a solution for this problem.
I did some read up on Pandas after I found something similar to what I wanted to do. I found that the transform method was very easy to use, and put together this snippet of Python code instead.
import pandas as pd
import os
folder = 'in'
df_out = pd.DataFrame()
for filename in os.scandir(folder):
if filename.is_file():
print('Working on file' + filename.path)
df = pd.read_csv(filename.path, encoding='utf-16', sep=';',header =[0])
# Transpose data with timestamp header to columns
df_tranposed = df.T
df_out = df_out.append(df_tranposed)
df_out.to_csv('output.csv')
Solution 2:[2]
Try the following Pandas approach:
import pandas as pd
import glob
import os
dfs = []
for csv_filename in glob.glob('./file*.csv'):
print('Working on file', csv_filename)
# Read the CSV file, assume no header and two columns
df = pd.read_csv(csv_filename, sep=';', names=[0, 1], header=None)
# Transpose from the 2nd row (skip the timestamp)
df_transposed = df[1:].T
# Allocate the column names from the first row and 'Timestamp'
df_transposed.columns = df_transposed.iloc[0] + 'Timestamp'
# Copy the timestamp into the transposed dataframe as a datetime value
df_transposed['Timestamp'] = pd.to_datetime(df.iloc[0, 0])
# Remove the first row (containing the names)
df_transposed = df_transposed[1:]
dfs.append(df_transposed)
# Concatenate all dataframes together and sort by Timestamp
df_output = pd.concat(dfs).sort_values(by='Timestamp')
# Sort the header columns and output to a CSV file
df_output.reindex(sorted(df_output.columns), axis=1).to_csv('output.csv', index=None)
Alternatively, it could be done using standard Python:
from datetime import datetime
import csv
import glob
data = []
fieldnames = set()
for fn in glob.glob('file*.csv'):
with open(fn) as f_input:
csv_input = csv.reader(f_input, delimiter=';')
timestamp = next(csv_input)[0]
row = {'Timestamp' : timestamp}
for device, error_code in csv_input:
row[device] = error_code
fieldnames.add(device)
data.append(row)
with open('output.csv', 'w', newline='') as f_output:
csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
csv_output.writeheader()
csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))
This gives output.csv
as:
Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0
How does this work
- First iterate over all
.csv
files in a given folder. - For each file open it using a
csv.reader()
- Read the header row as a special case, storing the value as a
Timestamp
entry in a dictionary row. - For each row, store additional key value entries in the row dictionary.
- Keep a note of each device name using a set.
- Append the complete row into a
data
list. - It is now possible to create an
output.csv
file. The full list of columns names can be assigned asfieldnames
and acsv.DictWriter()
used. - Write the header.
- Use
writerows()
to write all thedata
rows sorted by timestamp. To do this convert each row'sTimestamp
entry into a datetime value for sorting.
This approach will also work if the CSV files happen to have different types of devices e.g. Unit.Device.No42.ErrorCode
.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | AndreasF |
Solution 2 |