'Join large set of CSV files where the header is the timestamp for the file

I have a large set of CSV files. Approx. 15 000 files. And would like to figure out how to join them together as one file for data processing.

Each file is in a simple pattern with timestamp that corresponds to a period of time that represent the data in the each CSV file.

Ex.

file1.csv

2021-07-23 08:00:00
Unit.Device.No03.ErrorCode;11122233
Unit.Device.No04.ErrorCode;0
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0

file2.csv

2021-07-23 08:15:00
Unit.Device.No03.ErrorCode;0
Unit.Device.No04.ErrorCode;44556666
Unit.Device.No05.ErrorCode;0
Unit.Device.No11.ErrorCode;0

Each file starts with the timestamp. I would like to join all the files in a directory, and transpose the "Unit.Device" to columns. And then use the original header as a timestamp column. For each file add a new row with the corresponding "ErrorCode" to each column.

Like this:

Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode..
2021-07-23 08:00:00;11122233;0;0;0;0....
2021-07-23 08:15:00;0;44556666;0;0;0....

Any simple tools for this, or Python routines?



Solution 1:[1]

Thanks for the reply on my first question here! I will also contribute with a solution for this problem.

I did some read up on Pandas after I found something similar to what I wanted to do. I found that the transform method was very easy to use, and put together this snippet of Python code instead.

import pandas as pd
import os

folder = 'in'

df_out = pd.DataFrame()

for filename in os.scandir(folder):
    if filename.is_file():
        print('Working on file' + filename.path)

        df = pd.read_csv(filename.path, encoding='utf-16', sep=';',header =[0])

        # Transpose data with timestamp header to columns
        df_tranposed = df.T
        df_out = df_out.append(df_tranposed)


df_out.to_csv('output.csv')

Solution 2:[2]

Try the following Pandas approach:

import pandas as pd
import glob
import os

dfs = []

for csv_filename in glob.glob('./file*.csv'):
    print('Working on file', csv_filename)
    # Read the CSV file, assume no header and two columns
    df = pd.read_csv(csv_filename, sep=';', names=[0, 1], header=None)
    # Transpose from the 2nd row (skip the timestamp)
    df_transposed = df[1:].T
    # Allocate the column names from the first row and 'Timestamp'
    df_transposed.columns = df_transposed.iloc[0] + 'Timestamp'
    # Copy the timestamp into the transposed dataframe as a datetime value
    df_transposed['Timestamp'] = pd.to_datetime(df.iloc[0, 0])
    # Remove the first row (containing the names)
    df_transposed = df_transposed[1:]
    dfs.append(df_transposed)

# Concatenate all dataframes together and sort by Timestamp
df_output = pd.concat(dfs).sort_values(by='Timestamp')
# Sort the header columns and output to a CSV file
df_output.reindex(sorted(df_output.columns), axis=1).to_csv('output.csv', index=None)

Alternatively, it could be done using standard Python:

from datetime import datetime
import csv
import glob

data = []
fieldnames = set()

for fn in glob.glob('file*.csv'):
    with open(fn) as f_input:
        csv_input = csv.reader(f_input, delimiter=';')
        timestamp = next(csv_input)[0]
        row = {'Timestamp' : timestamp}
        
        for device, error_code in csv_input:
            row[device] = error_code
            fieldnames.add(device)
            
        data.append(row)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=['Timestamp', *sorted(fieldnames)], delimiter=';')
    csv_output.writeheader()
    csv_output.writerows(sorted(data, key=lambda x: datetime.strptime(x['Timestamp'], '%Y-%m-%d %H:%M:%S')))

This gives output.csv as:

Timestamp;Unit.Device.No03.ErrorCode;Unit.Device.No04.ErrorCode;Unit.Device.No05.ErrorCode;Unit.Device.No11.ErrorCode
2021-07-23 08:00:00;11122233;0;0;0
2021-07-23 08:15:00;0;44556666;0;0

How does this work

  1. First iterate over all .csv files in a given folder.
  2. For each file open it using a csv.reader()
  3. Read the header row as a special case, storing the value as a Timestamp entry in a dictionary row.
  4. For each row, store additional key value entries in the row dictionary.
  5. Keep a note of each device name using a set.
  6. Append the complete row into a data list.
  7. It is now possible to create an output.csv file. The full list of columns names can be assigned as fieldnames and a csv.DictWriter() used.
  8. Write the header.
  9. Use writerows() to write all the data rows sorted by timestamp. To do this convert each row's Timestamp entry into a datetime value for sorting.

This approach will also work if the CSV files happen to have different types of devices e.g. Unit.Device.No42.ErrorCode.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 AndreasF
Solution 2