Concatenating PDF files in memory with PyPDF2
I wish to concatenate (append) a bunch of small PDFs together efficiently in memory, in pure Python. Specifically, a typical case is 500 single-page PDFs, each about 400 kB in size, to be merged into one. Let's say the PDFs are available as an iterable in memory, say a list:
my_pdfs = [pdf1_fileobj, pdf2_fileobj, ..., pdfn_fileobj] # type is BytesIO
where each pdf_fileobj is of type BytesIO. The base memory usage is then about 200 MB (500 PDFs, 400 kB each).
Ideally, I would want the following code to do the concatenation using no more than 400-500 MB of memory in total (including my_pdfs). However, that doesn't seem to be the case: the debugging statement on the last line reports a maximum memory usage of almost 700 MB, and the macOS resource monitor shows about 600 MB allocated when the last line is reached.
Running gc.collect() reduces this to 350 MB (almost too good?). Why do I have to run garbage collection manually to get rid of merging garbage in this case? I have seen this (probably) causing memory build-up in a slightly different scenario that I'll skip for now.
import io
import resource  # For debugging
from PyPDF2 import PdfFileMerger

def merge_pdfs(iterable):
    """Merge pdfs in memory"""
    merger = PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)

    myio = io.BytesIO()
    merger.write(myio)
    merger.close()

    myio.seek(0)
    return myio

my_concatenated_pdf = merge_pdfs(my_pdfs)

# Print the maximum memory usage
print("Memory usage: %s (kB)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
Question summary
- Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimize it?
- Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
- What about this general approach? Is BytesIO suitable to use in this case? merger.write(myio) does seem to run somewhat slowly, given that everything happens in RAM.
Thank you!
Solution 1:[1]
Q: Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimise it?
A: Because .append creates a new stream object, and merger.write(myio) then creates yet another one, while the 200 MB of source PDFs is still held in memory, so roughly 3 * 200 MB.
Q: Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
A: It is a known issue in PyPDF2.
Q: What about this general approach? Is BytesIO suitable to use in this case?
A: Considering the memory issues, you might want to try a different approach: merge the PDFs one by one, temporarily saving the intermediate result to disk and clearing the already-merged sources from memory, as sketched below.
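A minimal sketch of that one-at-a-time, disk-backed strategy (the function name merge_pdfs_incrementally and the out_path parameter are illustrative, not from the original post):

import os
import shutil
import tempfile
from PyPDF2 import PdfFileMerger

def merge_pdfs_incrementally(iterable, out_path):
    """Merge PDFs one at a time, keeping the running result on disk so that
    only one source PDF plus the intermediate file are open at any moment."""
    partial_path = None
    for pdf_fileobj in iterable:
        merger = PdfFileMerger()
        if partial_path is not None:
            merger.append(partial_path)   # previously merged pages, read from disk
        merger.append(pdf_fileobj)        # the next in-memory PDF
        fd, new_path = tempfile.mkstemp(suffix=".pdf")
        with os.fdopen(fd, "wb") as fh:
            merger.write(fh)
        merger.close()
        pdf_fileobj.close()               # the source buffer is no longer needed
        if partial_path is not None:
            os.remove(partial_path)
        partial_path = new_path
    if partial_path is not None:
        shutil.move(partial_path, out_path)

This trades memory for extra disk I/O: the already merged pages are re-read and re-written on every iteration, so it is slower but keeps the resident memory roughly constant.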
Solution 2:[2]
The PyMuPDF library may also be a good alternative for working around the performance issues of PdfFileMerger from PyPDF2.
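For example, a minimal sketch using PyMuPDF, assuming a recent release where the snake_case API (insert_pdf, tobytes) is available; merge_pdfs_pymupdf is an illustrative name:

import io
import fitz  # PyMuPDF

def merge_pdfs_pymupdf(iterable):
    """Merge in-memory PDFs with PyMuPDF and return the result as BytesIO."""
    out = fitz.open()                      # new, empty PDF document
    for pdf_fileobj in iterable:
        src = fitz.open(stream=pdf_fileobj.getvalue(), filetype="pdf")
        out.insert_pdf(src)                # copy all pages of src into out
        src.close()
        pdf_fileobj.close()                # free the source buffer
    merged = io.BytesIO(out.tobytes())     # serialize the merged document
    out.close()
    return merged

PyMuPDF is a binding to the MuPDF C library and is generally much faster than pure-Python PyPDF2 for this kind of page copying.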
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | spedy
Solution 2 | Guibod