How to parse the collections in `mongodump` archive output using Python?
Context
I have a MongoDB database that is backed up every day using the following command:
mongodump --gzip --numParallelCollections=1 --oplog --archive=/tmp/dump.gz --readPreference=primary
I want to parse this dump file using Python only, to get all the underlying BSON documents and convert them into JSON.
What I tried
Let's say I have a single db named my_db and a single collection named my_employees, which contains only two documents:
{"name": "john doe"}
{"name": "jane doe"}
I dumped this single collection using the following command
mongodump --readPreference=primary --gzip --archive=/tmp/dump.gz --numParallelCollections=1 --db=my_db --collection=my_employees
I gunzip the dump file.
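The decompression step can also stay in Python by using the standard-library gzip module. This is a minimal sketch that assumes the whole archive fits in memory; the function name read_archive_bytes is hypothetical and not part of any library.

```python
import gzip


def read_archive_bytes(path: str) -> bytes:
    """Return the decompressed contents of a gzip-compressed file."""
    # gzip.open transparently decompresses while reading.
    with gzip.open(path, "rb") as fh:
        return fh.read()
```

For example, read_archive_bytes("/tmp/dump.gz") would return the same bytes as reading the gunzipped file directly.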
Now I try to parse the file using only Python and pymongo. I try to take inspiration from this Go parser.
I don't know Go, but what I understood is that the dump file contains zero or more blocks, and each block has the following structure:
terminator_or_size_of_bson: 4 bytes
bson_document: N bytes
Here is the code I came up with (it doesn't handle a lot of things, but it's a quick draft):
import bson

dump = open("/tmp/dump", "rb").read()  # I gunzipped the file beforehand
file_size = len(dump)
i = 0
nb_bsons_to_parse = 10  # I try to print the first 10 BSON documents
bsons_parsed = 0
while i < file_size and bsons_parsed < nb_bsons_to_parse:
    bson_size = int.from_bytes(dump[i: i+4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    bson_document_bytes = dump[i+4: i + 4 + bson_size]
    bson_document = bson.decode_all(bson_document_bytes)
    print(bson_document)
    bsons_parsed += 1  # count the document that was just parsed
    i += 4 + bson_size  # move past the 4-byte prefix and the document
Here is the error I get:
here is the bson_size 2174345837
here is the bson_size in bytes b'm\xe2\x99\x81'
Traceback (most recent call last):
File "..../read_bson_from_dump.py", line 27, in <module>
bson_document = bson.decode_all(bson_document_bytes)
bson.errors.InvalidBSON: invalid message size
You can see that the first four bytes have the value 2174345837, which exceeds the maximum allowed document size of 16 MB.
I then tried a different BSON API:
# ... only this loop changes
while i < file_size and bsons_parsed < nb_bsons_to_parse:
    bson_size = int.from_bytes(dump[i: i+4], "little")
    print("here is the bson_size ", bson_size)
    print("here is the bson_size in bytes ", dump[i: i + 4])
    bson_document_bytes = dump[i+4: i + 4 + bson_size]
    itr = bson.decode_iter(bson_document_bytes)
    for rec in itr:
        print(rec)
    bsons_parsed += 1  # count the chunk that was just parsed
    i += 4 + bson_size  # move past the 4-byte prefix and the chunk
And here is the result I get:
here is the bson_size 2174345837
here is the bson_size in bytes b'm\xe2\x99\x81'
{'concurrent_collections': 1, 'version': '0.1', 'server_version': '4.4.13', 'tool_version': '100.5.2'}
{'db': 'my_db', 'collection': 'my_employees', 'metadata': '{"indexes":[{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_"}],"uuid":"525124e3292340ce92048df1bc16189c","collectionName":"my_employees","type":"collection"}', 'size': 0, 'type': 'collection'}
Traceback (most recent call last):
File "..../read_bson_from_dump.py", line 28, in <module>
for rec in itr:
File ".../env/lib/python3.9/site-packages/bson/__init__.py", line 1061, in decode_iter
yield _bson_to_dict(elements, codec_options)
bson.errors.InvalidBSON: not enough data for a BSON document
I don't want to use mongoexport or mongorestore to parse the archive dump.
Thanks for your help
Solution 1:[1]
I have already parsed a mongodump archive using Node.js. The first 4 bytes of the archive are a magic number, not the size of a document. After the magic number, the first BSON document is the archive header, with a value such as:
{
  concurrent_collections: 4,
  version: '0.1',
  server_version: '4.2.2',
  tool_version: 'r4.2.2'
}
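The magic-number point can be checked against the four bytes printed in the question. This is a minimal sketch: ARCHIVE_MAGIC and has_archive_magic are names chosen here for illustration, and the constant is simply the little-endian value of those four bytes, so treating it as the fixed magic number of every archive is an assumption.

```python
import struct

# Assumption: little-endian value of the bytes b'm\xe2\x99\x81' from the question.
ARCHIVE_MAGIC = 0x8199E26D


def has_archive_magic(data: bytes) -> bool:
    """Return True if data starts with the assumed archive magic number."""
    if len(data) < 4:
        return False
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == ARCHIVE_MAGIC


first_four = b"m\xe2\x99\x81"
print(int.from_bytes(first_four, "little"))       # 2174345837
print(hex(int.from_bytes(first_four, "little")))  # 0x8199e26d
```

The value 2174345837 that looked like an oversized document length is therefore just these four bytes read as an integer, and parsing should start at offset 4.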
The next BSON documents describe all your collections, including their index information; keep reading them until the first terminator, which has the value 0xffffffff. After that, the BSON data is your documents, up to the last terminator.
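The layout described above could be walked in Python roughly as follows. This is a sketch under stated assumptions, not a complete parser: the names iter_items and split_blocks are hypothetical, the data is assumed to be already gunzipped and held in memory, and the code only splits the stream into groups of raw BSON documents separated by terminators without interpreting them. One detail comes from the BSON format itself: a document's 4-byte length prefix counts the prefix too, so each document is sliced starting at the length field, not 4 bytes after it.

```python
import struct
from typing import Iterator, List, Optional

TERMINATOR = 0xFFFFFFFF  # terminator value given in the answer


def iter_items(data: bytes, offset: int = 4) -> Iterator[Optional[bytes]]:
    """Yield each raw BSON document, or None for each terminator.

    The default offset of 4 skips the magic number.
    """
    while offset + 4 <= len(data):
        (size,) = struct.unpack_from("<I", data, offset)
        if size == TERMINATOR:
            yield None
            offset += 4
            continue
        # The smallest valid BSON document (empty) is 5 bytes long.
        if size < 5 or offset + size > len(data):
            raise ValueError(f"bad BSON size {size} at offset {offset}")
        # The size includes the 4-byte prefix, so slice from offset.
        yield data[offset:offset + size]
        offset += size


def split_blocks(data: bytes) -> List[List[bytes]]:
    """Group raw BSON documents into blocks separated by terminators."""
    blocks: List[List[bytes]] = [[]]
    for item in iter_items(data):
        if item is None:
            blocks.append([])
        else:
            blocks[-1].append(item)
    if not blocks[-1]:
        blocks.pop()  # drop the empty block after the final terminator
    return blocks
```

Following the answer, the first block should hold the archive header and then the collection metadata, and the later blocks the documents. With pymongo installed, each raw document can be turned into a dict with bson.decode(raw) and then into JSON with bson.json_util.dumps. Whether the later blocks also contain extra bookkeeping documents is not covered by the answer, so it is worth printing the first document of each block before relying on its contents.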
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | francojohnc |