How to parse the collections in `mongodump` archive output using Python?


I have a MongoDB database that is backed up every day with the following command:

mongodump --gzip --numParallelCollections=1 --oplog --archive=/tmp/dump.gz --readPreference=primary

I want to parse this dump file using only Python, extract all the underlying BSON documents, and convert them to JSON.

What I tried

Let's say I have a single db named my_db and a single collection named my_employees, which contains only two documents:

{"name": "john doe"}
{"name": "jane doe"}

I dumped this single collection using the following command:

mongodump --readPreference=primary --gzip --archive=/tmp/dump.gz --numParallelCollections=1 --db=my_db --collection=my_employees

I gunzipped the dump file.

Now I am trying to parse the file using only Python and pymongo, taking inspiration from this Go parser.

I don't know Go, but as far as I understand, the dump file contains zero or more blocks, and each block has the following structure:

terminator_or_size_of_bson: 4 bytes
bson_document: N bytes

Here is the code I came up with (it doesn't handle a lot of things, but it's a quick draft):

import bson

dump = open("/tmp/dump", "rb").read() # I gunzipped the file beforehand
file_size = len(dump)

i = 0
nb_bsons_to_parse = 10 # I try to print the first 10 BSONS
bsons_parsed = 0

while i < file_size and bsons_parsed < nb_bsons_to_parse:
    bson_size = int.from_bytes(dump[i: i+4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    bson_document_bytes = dump[i+4: i + 4 + bson_size]
    bson_document = bson.decode_all(bson_document_bytes)
    bsons_parsed += 1  # count parsed documents; do not move the limit
    i += 4 + bson_size  # advance past this block only

Here is the error I get:

here is the bson_size  2174345837
here is the bson_size in bytes  b'm\xe2\x99\x81'
Traceback (most recent call last):
  File "..../read_bson_from_dump.py", line 27, in <module>
    bson_document = bson.decode_all(bson_document_bytes)
bson.errors.InvalidBSON: invalid message size

You can see that the first four bytes have the value 2174345837, which far exceeds the maximum allowed BSON document size of 16 MB.
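As a quick sanity check on that number, here is a small sketch that decodes the four bytes printed above and compares the result with the 16 MB limit. `MAX_BSON_SIZE` is just an illustrative name for that limit, not something taken from pymongo.

```python
# The four bytes printed by the script above.
raw = b"m\xe2\x99\x81"

# Interpret them as a little-endian unsigned integer, as the script does.
value = int.from_bytes(raw, "little")

# Illustrative constant for the 16 MB document limit mentioned above.
MAX_BSON_SIZE = 16 * 1024 * 1024

print(value)                  # 2174345837
print(hex(value))             # 0x8199e26d
print(value > MAX_BSON_SIZE)  # True
```

The same four bytes appear at the start of the file every time, which suggests they are a fixed marker rather than a length.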

I then tried a different BSON API:

# ... only this loop changes 
while i < file_size and bsons_parsed < nb_bsons_to_parse:
    bson_size = int.from_bytes(dump[i: i+4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    bson_document_bytes = dump[i+4: i + 4 + bson_size]
    itr = bson.decode_iter(bson_document_bytes)
    for rec in itr:
        bsons_parsed += 1  # count parsed documents; do not move the limit
    i += 4 + bson_size  # advance past this block only

And here is the result I get:

here is the bson_size  2174345837
here is the bson_size in bytes  b'm\xe2\x99\x81'
{'concurrent_collections': 1, 'version': '0.1', 'server_version': '4.4.13', 'tool_version': '100.5.2'}
{'db': 'my_db', 'collection': 'my_employees', 'metadata': '{"indexes":[{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_"}],"uuid":"525124e3292340ce92048df1bc16189c","collectionName":"my_employees","type":"collection"}', 'size': 0, 'type': 'collection'}
Traceback (most recent call last):
  File "..../read_bson_from_dump.py", line 28, in <module>
    for rec in itr:
  File ".../env/lib/python3.9/site-packages/bson/__init__.py", line 1061, in decode_iter
    yield _bson_to_dict(elements, codec_options)
bson.errors.InvalidBSON: not enough data for a BSON document

I don't want to use mongoexport or mongorestore to parse the archive dump.

Thanks for your help

Solution 1:[1]

I have already parsed a mongodump archive using Node.js. The first 4 bytes of the archive are a magic number, not the size of a document (the bytes b'm\xe2\x99\x81' in your output are that magic number). After the magic number, the first BSON document is the archive header, with values like:

  concurrent_collections: 4,
  version: '0.1',
  server_version: '4.2.2',
  tool_version: 'r4.2.2'
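Here is a minimal Python sketch of that first step, assuming the archive has already been gunzipped and read into memory. `read_header` is a made-up helper name, and the magic bytes are the ones printed in the question. One detail this relies on that is not stated above: a BSON document's length prefix counts its own four bytes, so the header starts at offset 4 and the prefix is part of the document.

```python
import struct

# The four bytes printed in the question, treated here as the archive's magic number.
MAGIC_BYTES = b"m\xe2\x99\x81"


def read_header(data):
    """Return (raw_header_bson, next_offset) for an uncompressed archive."""
    if data[:4] != MAGIC_BYTES:
        raise ValueError("unexpected magic number; is the file still gzipped?")
    # The header's BSON length prefix sits right after the magic number.
    (size,) = struct.unpack_from("<i", data, 4)
    if size < 5 or 4 + size > len(data):
        raise ValueError(f"implausible BSON size {size} at offset 4")
    # The prefix counts itself, so the document spans data[4 : 4 + size].
    return data[4:4 + size], 4 + size
```

With pymongo installed, `bson.decode(raw_header)` should turn the returned bytes into a dict like the one shown above.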

The BSON documents after the header describe your collections, including their index information. Loop over them until you reach the first terminator, which has the value 0xffffffff. The BSON data after that terminator is your documents, up to the last terminator.
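The whole walk could be sketched in Python roughly as follows, assuming the file has already been gunzipped. The names `iter_archive`, `split_archive` and `to_json` are made up for illustration, and `to_json` assumes pymongo's `bson` package is installed. Two assumptions go beyond what is described above: each BSON length prefix counts its own four bytes, and each block after the first terminator may start with a small document naming the db and collection rather than user data, so inspect the first document of each block before treating it as a record.

```python
import struct

MAGIC = 0x8199E26D       # little-endian value of the bytes b"m\xe2\x99\x81"
TERMINATOR = 0xFFFFFFFF  # terminator value described above


def iter_archive(data):
    """Yield each raw BSON document as bytes, and None for each terminator."""
    (magic,) = struct.unpack_from("<I", data, 0)
    if magic != MAGIC:
        raise ValueError("unexpected magic number; is the file still gzipped?")
    pos = 4
    while pos + 4 <= len(data):
        (size,) = struct.unpack_from("<I", data, pos)
        if size == TERMINATOR:
            yield None
            pos += 4
            continue
        if size < 5 or pos + size > len(data):
            raise ValueError(f"implausible BSON size {size} at offset {pos}")
        # The length prefix is part of the document, so slice from pos itself.
        yield data[pos:pos + size]
        pos += size


def split_archive(data):
    """Return (prelude, blocks): documents before the first terminator,
    then one list of documents per terminator-delimited block."""
    prelude, blocks, current = [], [], []
    in_prelude = True
    for item in iter_archive(data):
        if item is None:
            if in_prelude:
                in_prelude = False
            else:
                blocks.append(current)
                current = []
        elif in_prelude:
            prelude.append(item)
        else:
            current.append(item)
    if current:  # tolerate a missing final terminator
        blocks.append(current)
    return prelude, blocks


def to_json(raw_doc):
    """Decode one raw BSON document and dump it as extended JSON (needs pymongo)."""
    import bson
    from bson import json_util
    return json_util.dumps(bson.decode(raw_doc))


# Hypothetical usage, with the path taken from the question:
#   with open("/tmp/dump", "rb") as f:
#       prelude, blocks = split_archive(f.read())
#   for block in blocks:
#       for raw in block:
#           print(to_json(raw))
```

Keeping the framing code on `struct` alone makes it easy to test without a MongoDB install; only the final JSON conversion depends on pymongo.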


This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 francojohnc