Scrapy: scraping large PDF files without keeping response body in memory

Let's say I want to scrape a 1 GB PDF with Scrapy and then use the scraped PDF data in further Requests down the line. How do I do this without keeping the 1 GB response body in memory?

(pseudo code:)

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # start_requests must yield requests rather than return a single one
        yield scrapy.Request('https://my-large-pdf.pdf', self.parse_pdf)

    def parse_pdf(self, pdf_response):
        # the full 1 GB PDF body is held in memory here
        some_calculated_value = my_pdf_reader(pdf_response.body)
        for i in range(1000):
            yield scrapy.Request(
                f'another-interesting-file-{i}.html',
                self.parse_interesting_file,
                cb_kwargs=dict(some_calculated_value=some_calculated_value),
            )

    def parse_interesting_file(self, file_response, some_calculated_value=None):
        title = file_response.css('h1::text').get()
        yield {'title': f'{title} {some_calculated_value}'}

The 1 GB stays in memory until all items from the 1000 interesting files are scraped, even though I no longer need the PDF response body at that point (only the calculated value that I'm passing down).



Solution 1:[1]

Instead of loading the whole PDF file in the parse_pdf function, try loading the PDF inside start_requests and then use a Python generator to get the URLs/patterns from the PDF file, branching out from there with Scrapy's usual yield Request(...) pattern. This should save you from loading the whole PDF file into memory while still giving you access to its content. A sketch of this idea follows below.
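A minimal sketch of that approach, assuming the PDF is fetched outside Scrapy's response handling with the requests library and streamed to a temporary file. The requests usage, the example.com target URLs, and my_pdf_reader accepting a file path are all assumptions for illustration, not part of the original question:

import tempfile

import requests
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # Stream the PDF to a temporary file in small chunks instead of
        # letting Scrapy buffer the whole 1 GB body as a Response.
        with tempfile.NamedTemporaryFile(suffix='.pdf') as tmp:
            with requests.get('https://my-large-pdf.pdf', stream=True) as resp:
                for chunk in resp.iter_content(chunk_size=1024 * 1024):
                    tmp.write(chunk)
            tmp.flush()
            # my_pdf_reader is the question's placeholder parser, assumed
            # here to accept a file path instead of a bytes body.
            some_calculated_value = my_pdf_reader(tmp.name)

        # start_requests is itself a generator, so only the small calculated
        # value travels with each Request via cb_kwargs.
        for i in range(1000):
            yield scrapy.Request(
                f'https://example.com/another-interesting-file-{i}.html',
                self.parse_interesting_file,
                cb_kwargs=dict(some_calculated_value=some_calculated_value),
            )

    def parse_interesting_file(self, file_response, some_calculated_value=None):
        title = file_response.css('h1::text').get()
        yield {'title': f'{title} {some_calculated_value}'}

Once the temporary file goes out of scope, only the calculated value remains in memory for the rest of the crawl.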

Apart from the above suggestion, did you try extracting the content from the PDF once and storing it in a CSV file for later use? That should save you from all the trouble the huge PDF file(s) are causing.
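A rough sketch of that caching idea; the cache file name and the get_calculated_value helper are hypothetical, and my_pdf_reader is again the question's placeholder parser:

import csv
import os

CACHE_FILE = 'pdf_calculated_value.csv'  # hypothetical cache location


def get_calculated_value(pdf_path):
    # Reuse the value computed on an earlier run, if it was cached.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, newline='') as f:
            row = next(csv.reader(f), None)
            if row:
                return row[0]

    # Otherwise parse the huge PDF once and store the result for later runs.
    value = my_pdf_reader(pdf_path)  # placeholder parser from the question
    with open(CACHE_FILE, 'w', newline='') as f:
        csv.writer(f).writerow([value])
    return value

With the value cached, later runs of the spider can skip loading the PDF entirely.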

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources
Solution 1: Janib Soomro