How to optimise reading multiple files containing bytes into a NumPy array

I currently have ~1000 files containing bytes. Each file holds a few thousand messages, and every message has the same data layout.

I've tried several ways of reading this into a NumPy array, but all of them are fairly slow, so I'm curious how fast this can realistically be made.

For getting the raw bytes from the files into Python, I've found it's much faster to create a bytearray of the required size up front and use file.readinto() than to call file.read().
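
Concretely, the difference I mean is between these two read patterns (just a sketch; path stands in for one of my files):

import os

with open(path, 'rb') as f:
    data = f.read()                           # read() allocates a fresh bytes object

buffer = bytearray(os.path.getsize(path))     # pre-sized, mutable buffer
with open(path, 'rb') as f:
    f.readinto(buffer)                        # fills the existing buffer in place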

That leaves the problem of getting the bytes into a NumPy array. Below is my first iteration; 87.5% of the time is spent in the if-else block appending NumPy arrays.

import numpy as np
import os

numpy_types = np.dtype([('col1','i8'),('col2','i8'),('col3','i8'),('col4','f4')])
for count, file in enumerate(files):
    # Pre-size the buffer to the file's length and fill it in place
    byte_array = bytearray(os.path.getsize(file))
    with open(file, 'rb') as f:
        f.readinto(byte_array)
        array = np.frombuffer(byte_array, numpy_types)
        if count == 0:
            numpy_array = array
        else:
            numpy_array = np.append(numpy_array, array)

In case anyone wants to try this at home, here is the above example again, along with another attempt, in a form you can copy and paste.


1st attempt

Read each file into an individual numpy array and append them together

import numpy as np
import time
start = time.time()
byte_array = b''
bytes1 = b'\x00\xe8n\x14Z\x1d\xd8\x08\xff\xff\xff\xff\xff\xff\xff\xff\x00\xdd\x90\xa7\x16/\xd8\x08ff\xe0A'
# Create the byte array identical to what would be read in from each file
for i in range(1000):
    byte_array += bytes1

numpy_dtypes = np.dtype([('col1','i8'), ('col2', 'i8'), ('col3', 'i8'), ('col4', 'f4')])
total_time = 0
# Imitate loop of reading in multiple files
for i in range(1000):
    array = np.frombuffer(byte_array, numpy_dtypes)
    start2 = time.time()
    if i == 0:
        numpy_array = array
    else:
        numpy_array = np.append(numpy_array, array)
    total_time += (time.time() - start2)
print(f'took {total_time} to append numpy arrays together')
print(f'took {time.time()-start:.2f} seconds in total')
  • took 12.19652795791626 to append numpy arrays together
  • took 12.21 seconds in total

2nd attempt

I tried concatenating all the bytes into a single bytes object first, then reading it into a NumPy array in one go

import numpy as np
import time
start = time.time()
byte_array = b''
bytes1 = b'\x00\xe8n\x14Z\x1d\xd8\x08\xff\xff\xff\xff\xff\xff\xff\xff\x00\xdd\x90\xa7\x16/\xd8\x08ff\xe0A'
# Create the byte array identical to what would be read in from each file
for i in range(1000):
    byte_array += bytes1

numpy_dtypes = np.dtype([('col1','i8'), ('col2', 'i8'), ('col3', 'i8'), ('col4', 'f4')])
# Imitate loop of reading in multiple files
total_bytes = b''
start2 = time.time()
for i in range(1000):
    total_bytes += byte_array
print(f'took {time.time()-start2:.2f} seconds to append bytes together')
numpy_array = np.frombuffer(total_bytes, numpy_dtypes)
print(f'took {time.time()-start:.2f} seconds')
  • took 12.67 seconds to append bytes together
  • took 12.67 seconds

Why does the majority of the processing time come from appending the data together? Is there a better way to approach this, since the bottleneck seems to be either the appending itself or the way everything is read in initially? I have also tried struct.unpack, but that is still quite slow, and as far as I'm aware NumPy is quicker at decoding bytes.
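
For completeness, the struct-based attempt was along these lines (a rough sketch rather than my exact code; '<qqqf' matches the i8, i8, i8, f4 record layout):

import struct

# Each record is 3 x int64 + 1 x float32 = 28 bytes, little-endian
rows = list(struct.iter_unpack('<qqqf', byte_array))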



Solution 1:[1]

Each time you append, you actually create a new array that is the concatenation of the two inputs: NumPy arrays have a fixed size (and the same applies to concatenating bytes objects, which are immutable), so the existing data has to be copied into a fresh buffer on every iteration. As a result, you copy the ever-growing array over and over; with 1000 appends that is roughly 1 + 2 + ... + 1000 ≈ 500,000 chunks' worth of copying instead of 1000, which explains the bottleneck you observe.

Instead of creating new, ever-larger arrays on each iteration, it is more efficient to just store a reference to each array in a list and concatenate them all once at the end, e.g. with the following code:

%%timeit
# %%timeit is Jupyter/IPython cell magic, so run this in a notebook
for i in range(1000):
    array = np.frombuffer(byte_array, numpy_dtypes)
    if i == 0:
        numpy_array2 = [array]
    else:
        numpy_array2.append(array)
numpy_array2 = np.concatenate(numpy_array2)

On my system, your original code takes ~8 seconds (versus the 12 s you measured); this version brings it down to mere milliseconds (measured with %%timeit):

>>> 5.1 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
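
Applied to your original file-reading loop, the same idea looks roughly like this (a sketch that assumes files is a list of file paths and reuses your dtype):

import os
import numpy as np

numpy_types = np.dtype([('col1','i8'),('col2','i8'),('col3','i8'),('col4','f4')])

arrays = []
for file in files:
    # Read each file into a pre-sized buffer, then wrap it without copying
    byte_array = bytearray(os.path.getsize(file))
    with open(file, 'rb') as f:
        f.readinto(byte_array)
    arrays.append(np.frombuffer(byte_array, numpy_types))

# A single concatenation at the end instead of ~1000 incremental copies
numpy_array = np.concatenate(arrays)

This way each file's buffer is wrapped by np.frombuffer without copying, and the only full copy happens once in np.concatenate.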

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Sander