PyPDF2 extracts empty text
I am using PyPDF2 to extract text from a PDF. All the examples I found on Google look like my code:
import PyPDF2
reader = PyPDF2.PdfFileReader("test2.pdf")
page = reader.getPage(0)
text = page.extractText()
print(text.encode("utf-8"))
However, the console shows only empty text:
b''
I have tested this code on several different PDFs, and the extracted text was empty for all of them.
UPD: reader.getDocumentInfo() returns:
{'/Producer': 'Skia/PDF m75'}
Solution 1:[1]
It looks like some font/text combos make the text unreadable by PyPDF2, PyPDF3 or PyPDF4.
To extract the text from these PDFs, you can use the dedicated PDF text extraction package pdfminer.six.
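from pdfminer import high_level
local_pdf_filename = "/path/to/pdf/you_want_to_extract_text_from.pdf"
pages = [0]  # zero-based page indices; just the first page
# Keyword arguments make the empty password and the page selection explicit.
extracted_text = high_level.extract_text(local_pdf_filename, password="", page_numbers=pages)
print(extracted_text)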
It works on all the PDFs that were failing for me and is quick to implement as a fallback. The extract_text function is described in full in the pdfminer.six documentation.
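Since the answer suggests pdfminer.six as a fallback, here is a minimal sketch of how that fallback could be wired up. The helper names (extract_with_pypdf2, extract_with_pdfminer, first_nonempty_text) are hypothetical and not part of either library; the PyPDF2 calls are the legacy ones from the question, which newer PyPDF2 releases have renamed.

```python
from typing import Callable, Sequence


def extract_with_pypdf2(path: str, page_index: int = 0) -> str:
    """Extract one page with the legacy PyPDF2 API used in the question."""
    import PyPDF2  # imported lazily so a missing package just triggers the fallback

    reader = PyPDF2.PdfFileReader(path)
    return reader.getPage(page_index).extractText()


def extract_with_pdfminer(path: str, page_index: int = 0) -> str:
    """Extract one page with pdfminer.six."""
    from pdfminer.high_level import extract_text

    return extract_text(path, page_numbers=[page_index])


def first_nonempty_text(
    path: str,
    page_index: int = 0,
    extractors: Sequence[Callable[[str, int], str]] = (
        extract_with_pypdf2,
        extract_with_pdfminer,
    ),
) -> str:
    """Return the first non-blank result from the extractors, tried in order.

    Hypothetical helper: an extractor that raises or returns blank text
    is skipped, and an empty string is returned if none of them succeed.
    """
    for extractor in extractors:
        try:
            text = extractor(path, page_index)
        except Exception:
            # Treat any failure (missing package, parse error) as "no text".
            continue
        if text and text.strip():
            return text
    return ""
```

Calling first_nonempty_text("test2.pdf") would try PyPDF2 first and only reach pdfminer.six if the first result is blank or the call fails. If both return nothing, the PDF may contain no text layer at all (for example, scanned images), which neither library can read.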
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | timhj |