Extracting text from a PDF file at a URL with Python
I want to extract text from a PDF file that is hosted on a website. The website contains a link to the PDF document, but when I click that link it automatically downloads the file. Is it possible to extract text from that file without downloading it?
import fitz # this is pymupdf lib for text extraction
from bs4 import BeautifulSoup
import requests
from io import StringIO
url = "https://www.blv.admin.ch/blv/de/home/lebensmittel-und-ernaehrung/publikationen-und-forschung/statistik-und-berichte-lebensmittelsicherheit.html"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
all_news = soup.select("div.mod.mod-download a")[0]
pdf = "https://www.blv.admin.ch"+all_news["href"]
#https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf
This is the code for extracting text from a PDF. It works well when the file has already been downloaded:
my_pdf_doc = fitz.open(pdf)
text = ""
for page in my_pdf_doc:
    text += page.get_text()  # spelled getText() in older PyMuPDF releases
print(text)
The same question applies when the link does not download the PDF file automatically, for example this link:
"https://amsoldingen.ch/images/files/Bekanntgabe-Stimmausschuss-13.12.2020.pdf"
How can I extract text from that file?
I have also tried this:
pdf_content = requests.get(pdf)
print(type(pdf_content.content))
file = StringIO()
print(file.write(pdf_content.content.decode("utf-32")))
But I get this error:
Traceback (most recent call last):
  File "/Users/aleksandardevedzic/Desktop/pdf extraction scrapping.py", line 25, in <module>
    print(file.write(pdf_content.content.decode("utf-32")))
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
Solution 1:[1]
Here is an example using PyPDF2.
To install:
pip install PyPDF2
import requests, PyPDF2
from io import BytesIO
url = 'https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf'
response = requests.get(url)
my_raw_data = response.content
with BytesIO(my_raw_data) as data:
    read_pdf = PyPDF2.PdfFileReader(data)
    for page in range(read_pdf.getNumPages()):
        print(read_pdf.getPage(page).extractText())
Output:
' 1/21 Fad \nŒ 24.08.2020\n Bericht 2017\n Œ 2019: Öffentliche Warnungen, \nRückrufe und Schnellwarnsystem RASFF\n '
Solution 2:[2]
PyMuPDF allows us to open a BytesIO stream directly, as mentioned in the documentation.
import requests
import fitz
import io
url = "your-url.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)
pdf = fitz.open(stream=filestream, filetype="pdf")
The resulting pdf object can then be parsed like a regular PyMuPDF document.
P.S. This is my first answer on Stack Overflow, and any improvements/suggestions are welcome.
Solution 3:[3]
It is impossible to read a remote application/pdf file, that is, one stored on a server, without downloading it. The browser, reader or text extractor runs locally, and HTTPS delivers the file to the client as a local copy (unless the server is specifically configured to let clients edit the files it serves, which is unlikely).
Both of your example links download immediately in my browser, because my browser settings are set to download PDFs rather than open them in the potentially exploitable in-browser viewer.
So to extract text you need a temporary copy in local memory or on the local file system (this often uses the hard drive cache). Others have suggested doing this with an in-memory Python IO stream; however, that is not much different from how a download works.
The file can be transferred efficiently as raw bytes into a local file using
curl -O https://www.blv.admin.ch/dam/blv/de/dokumente/lebensmittel-und-ernaehrung/publikationen-forschung/jahresbericht-2017-2019-oew-rr-rasff.pdf.download.pdf/Jahresbericht_2017-2019_DE.pdf
and then, from Python, run the corresponding OS command(s). The "-" tells pdftotext to write to standard output so that the text can be piped:
pdftotext Jahresbericht_2017-2019_DE.pdf - | find "whatever you need"
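One way to drive that command from Python is sketched below with the standard subprocess module. It assumes the pdftotext executable is installed and on PATH. The function names are hypothetical, and the filtering is done in Python rather than through a shell pipe.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str) -> list[str]:
    """Build the argument list; "-" makes pdftotext write to stdout instead of a .txt file."""
    return ["pdftotext", pdf_path, "-"]


def pdf_file_to_text(pdf_path: str) -> str:
    """Run pdftotext on a local file and return its output (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",  # assumes pdftotext's default UTF-8 output
        check=True,        # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


def lines_containing(text: str, needle: str) -> list[str]:
    """Return the lines of text that contain needle, similar to piping through find."""
    return [line for line in text.splitlines() if needle in line]
```

For example, lines_containing(pdf_file_to_text("Jahresbericht_2017-2019_DE.pdf"), "RASFF") would list the lines of the downloaded report that mention RASFF.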
Solution 4:[4]
I used @Vihaan Thora's solution, and it worked for me:
!pip install PyMuPDF
import requests
import fitz
import io
url = "https://www.livelaw.in/pdf_upload/vsa02052022matfc1162021145829-416435.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)
with fitz.open(stream=filestream, filetype="pdf") as doc:
    detail_judgement = ""
    for page in doc:
        detail_judgement += page.get_text()
print(detail_judgement)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | |
Solution 4 | PlutoSenthil |