Render PDF into an image (self-contained, no external command-line dependencies) for use on AWS Lambda

I need a simple Python library to convert a PDF to an image (render the PDF as is). After hours of searching, I keep hitting the same wall: the libraries I find, such as pdf2image and many similar ones, depend on external applications or wrap command-line tools.

Although there are workarounds that allow these libraries to be used in serverless settings, they would all complicate our deployment and require things like custom execution environments or extra Lambda layers, which eat into the small size allowance for a Lambda deployment.

Is there a self-contained mechanism (one that does not depend on command-line tools) for this seemingly simple task?

Also, is there a reason (licensing or patents) for the scarcity of tools that deal with PDFs? They are mostly commercial or under strict AGPL licenses.



Solution 1:[1]

You said "Ended up using pdf2image"

pdf2image (MIT) is a Python (3.6+) module that wraps pdftoppm (GPL?) and pdftocairo (GPL?) to convert a PDF to a PIL Image object.

Poppler (GPL) is a spinoff of the open-source Xpdf (GPL), which has

  • pdftopng:
  • pdftoppm:
  • pdfimages:

and a third-party pdftotiff.

Solution 2:[2]

You can convert PDFs to images without external dependencies using PyMuPDF. I use it for Azure Functions.

Install with pip install PyMuPDF

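In your Python file:

import fitz  # PyMuPDF

pdfDoc = fitz.open(filepath)  # filepath: path to the PDF on disk
# Render the first page; Matrix(2, 2) scales both axes by 2
img = pdfDoc[0].get_pixmap(matrix=fitz.Matrix(2, 2))
bytesimg = img.tobytes()  # encoded image bytes (PNG by default)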

This takes the first page of the PDF and converts it to an image. The matrix sets the resolution: Matrix(2, 2) scales the page by 2 in each direction, which gives 144 DPI instead of the default 72.
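If you would rather think in DPI than in scale factors, or you need every page rather than just the first, something along these lines might work. It is only a sketch: zoom_for_dpi and render_pages_to_png are names made up for illustration (not part of PyMuPDF), and it assumes PDF user space is 72 points per inch, which is PyMuPDF's default rendering resolution.

```python
def zoom_for_dpi(dpi, base_dpi=72.0):
    """Return the scale factor that maps 72-points-per-inch PDF space to `dpi`."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / base_dpi


def render_pages_to_png(pdf_path, dpi=144):
    """Render every page of the PDF at `pdf_path` and return a list of PNG bytes."""
    # Imported here so zoom_for_dpi stays usable without PyMuPDF installed.
    import fitz  # PyMuPDF

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # same zoom on both axes keeps the aspect ratio
    images = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)
            images.append(pix.tobytes("png"))
    return images
```

For example, render_pages_to_png("report.pdf", dpi=144) should produce the same resolution as Matrix(2, 2) above. Keep in mind that memory use grows with the square of the zoom, which matters on small Lambda configurations.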

You can also open a stream instead of a file on disk:

pdfDoc = fitz.open(stream = pdfstream, filetype="pdf")
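Since the question is about AWS Lambda, here is a rough sketch of how the stream variant could sit inside a handler. The event shape is an assumption (a base64-encoded PDF in event["body"], as an API Gateway proxy integration would typically deliver binary data), and handler and render_first_page_png are hypothetical names, so adapt them to however your function actually receives the file.

```python
import base64


def render_first_page_png(pdf_bytes, zoom=2.0):
    """Render page 0 of an in-memory PDF and return PNG bytes."""
    # Imported lazily so requests rejected below never pay the import cost.
    import fitz  # PyMuPDF

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        pix = doc[0].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pix.tobytes("png")


def handler(event, context):
    """Hypothetical Lambda entry point: base64 PDF in, base64 PNG out."""
    body = event.get("body")
    if not body:
        return {"statusCode": 400, "body": "missing PDF payload"}

    # Assumption: the PDF always arrives base64-encoded in the request body.
    pdf_bytes = base64.b64decode(body)
    png_bytes = render_first_page_png(pdf_bytes)

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "image/png"},
        "isBase64Encoded": True,
        "body": base64.b64encode(png_bytes).decode("ascii"),
    }
```

Nothing here touches the filesystem or shells out to another program, so the only thing to package is the PyMuPDF wheel built for the Lambda runtime's platform.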

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Jacob-Jan Mosselman