How to efficiently serialize a Python dict with a known schema to binary?

I have a lot of python dicts with known schema. For example, the schema is defined as Pyspark StructType like this:

from pyspark.sql.types import *
dict_schema = StructType([
        StructField("upload_time", TimestampType(), True),        
        StructField("name", StringType(), True),
        StructField("value", StringType(), True),
    ])

I want to efficiently serialize each dict object into a byte array. Which serialization method will give me the smallest payload? I don't want to use pickle because its payload is very large (it embeds the schema into each serialized object).



Solution 1:[1]

You can use the built-in struct module. Simply "pack" the values:

import struct
struct.pack('Q10s20s', time, name, value)  # name and value must be bytes in Python 3

That's assuming time is a 64-bit int, name is at most 10 characters, and value is at most 20 characters. You'll need to tune that. You might also consider storing the strings as null-terminated byte sequences if the names and values do not have consistent lengths (you don't want to waste space on padding); see the sketch below.
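A minimal sketch of that variable-length idea, swapping the null terminators for 2-byte length prefixes (which also tolerate embedded null bytes); pack_record and unpack_record are hypothetical helper names, not part of the answer above:

import struct

def pack_record(upload_time, name, value):
    # 64-bit timestamp, then each string as a 2-byte length prefix plus raw bytes
    name_b = name.encode('utf-8')
    value_b = value.encode('utf-8')
    return (struct.pack('<QH', upload_time, len(name_b)) + name_b +
            struct.pack('<H', len(value_b)) + value_b)

def unpack_record(payload):
    upload_time, name_len = struct.unpack_from('<QH', payload, 0)
    offset = 10  # 8 bytes timestamp + 2 bytes length
    name = payload[offset:offset + name_len].decode('utf-8')
    offset += name_len
    (value_len,) = struct.unpack_from('<H', payload, offset)
    offset += 2
    value = payload[offset:offset + value_len].decode('utf-8')
    return {'upload_time': upload_time, 'name': name, 'value': value}

Each record then costs exactly its content plus 12 bytes of fixed overhead, with no padding.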

Another good way is to use NumPy, assuming the strings have fairly consistent lengths:

import numpy as np
a = np.empty(1000, [('time', 'u8'), ('name', 'S10'), ('value', 'S20')])
np.save(filename, a)

This will include a "schema" of sorts (the .npy header) at the top of the file; you could write the raw array bytes without that header if you really want to, as sketched below.
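A hedged sketch of that raw, header-less form, assuming the reader re-supplies the dtype when loading (records.bin is a placeholder filename):

import numpy as np

dtype = [('time', 'u8'), ('name', 'S10'), ('value', 'S20')]
a = np.empty(1000, dtype)

# write only the packed records, no .npy header
with open('records.bin', 'wb') as f:
    f.write(a.tobytes())

# reading requires the same dtype, since no schema is stored in the file
with open('records.bin', 'rb') as f:
    b = np.frombuffer(f.read(), dtype=dtype)

With this dtype each record is exactly 38 bytes (8 + 10 + 20), and nothing else is stored in the file.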

Solution 2:[2]

I use msgpack (https://msgpack.org/) and order the dict values based on the sorted keys. I use zip to rebuild the dict from the keys and the unpacked values; see the sketch below.
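A minimal sketch of that approach against the schema in the question, assuming upload_time is already an epoch integer (msgpack has no native datetime type without its Timestamp extension):

import msgpack

KEYS = ('name', 'upload_time', 'value')  # sorted key order, fixed by the schema

def to_bytes(record):
    # pack only the values, in sorted-key order; the schema stays in the code
    return msgpack.packb([record[k] for k in KEYS], use_bin_type=True)

def from_bytes(payload):
    # zip the known keys back onto the unpacked values
    return dict(zip(KEYS, msgpack.unpackb(payload, raw=False)))

record = {'upload_time': 1525000000, 'name': 'sensor-1', 'value': '42'}
assert from_bytes(to_bytes(record)) == record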

To my knowledge, msgpack generally has size performance comparable to struct.

In terms of CPU performance, msgpack is about 10x slower to encode and 50% slower to decode. msgpack natively handles many formats, though, and can be extended with encoding/decoding hooks (much like JSON), which makes things easier; see the sketch below.
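A hedged sketch of such a hook, using msgpack's default= callback to handle datetime values; converting to an ISO string here is just one possible choice, and decoding it back is left to the reader (or to an ext type):

import datetime
import msgpack

def encode_hook(obj):
    # called by msgpack for types it cannot serialize natively
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError('cannot serialize %r' % obj)

payload = msgpack.packb(
    [datetime.datetime(2018, 5, 1, 12, 0), 'name', 'value'],
    default=encode_hook,
    use_bin_type=True,
)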

In terms of speed, here is a quick benchmark:

import time
import timeit

import msgpack
import struct

_time = int(time.time())
_keys = ('time', 'name', 'value')
_values = (_time, 'name', 'value')

# struct's 's' format expects bytes, not str, in Python 3
as_msgpack = msgpack.packb(_values, use_bin_type=True)
as_struct = struct.pack('Q10s5s', _time, b'name', b'value')

def test_so_msgpack_encode():
    as_msgpack = msgpack.packb(_values, use_bin_type=True)
    return as_msgpack

def test_so_struct_encode():
    as_struct = struct.pack('Q10s5s', _time, b'name', b'value')
    return as_struct

def test_so_msgpack_decode():
    _decoded = msgpack.unpackb(as_msgpack)
    return dict(zip(_keys, _decoded))

def test_so_struct_decode():
    _decoded = struct.unpack('Q10s5s', as_struct)
    return dict(zip(_keys, _decoded))

print(timeit.timeit("test_so_msgpack_encode()", setup="from __main__ import test_so_msgpack_encode", number=10000))
print(timeit.timeit("test_so_struct_encode()", setup="from __main__ import test_so_struct_encode", number=10000))
print(timeit.timeit("test_so_msgpack_decode()", setup="from __main__ import test_so_msgpack_decode", number=10000))
print(timeit.timeit("test_so_struct_decode()", setup="from __main__ import test_so_struct_decode", number=10000))

In terms of speed, while there is a 10x factor, it's not likely going to be an issue. To illustrate, I ran the above on a 10-year-old (2008) computer over 10,000 iterations:

encoding:

0.0745489597321  # msgpack
0.00702214241028  # struct

decoding:

0.0458550453186  # msgpack
0.0313770771027  # struct

So it's possible to make it run faster with struct, but IMHO msgpack offers more.

Note: the above isn't a perfectly equal test; the struct unpack leaves the values padded with a few null bytes. That could be fixed in the pack or unpack step and would affect the result; a sketch of trimming the padding follows.
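A minimal sketch of trimming that padding on decode, assuming the packed strings never contain embedded null bytes:

import struct

as_struct = struct.pack('Q10s5s', 1525000000, b'name', b'value')

_time, name, value = struct.unpack('Q10s5s', as_struct)
# fixed-width 's' fields come back null-padded; strip the padding
name = name.rstrip(b'\x00').decode('ascii')
value = value.rstrip(b'\x00').decode('ascii')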

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: John Zwinck
Solution 2: Jonathan Vanasco