How to use Scrapy to parse PDFs?

I would like to download all PDFs found on a site, e.g. https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html. I also tried to use rules, but I think they're not necessary here.

This is my approach:

import scrapy
from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS
from scrapy.spiders import Rule

# Allow .pdf links through the link extractor, which skips them by default.
CUSTOM_IGNORED_EXTENSIONS = IGNORED_EXTENSIONS.copy()
CUSTOM_IGNORED_EXTENSIONS.remove('pdf')

class PDFParser(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'

    # URL of the index page that lists the PDF files
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']

    rules = (
        Rule(LinkExtractor(allow=r'.*\.pdf', deny_extensions=CUSTOM_IGNORED_EXTENSIONS), callback='parse', follow=True),
    )

    def parse(self, response):
        # Select every <a> element whose href contains 'pdf'.
        for pdf in response.xpath("//a[contains(@href, 'pdf')]"):
            yield scrapy.Request(
                url=response.urljoin(pdf),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
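
As an aside, the rules attribute only takes effect on scrapy.spiders.CrawlSpider subclasses; a plain scrapy.Spider silently ignores it. For comparison, a minimal rules-based sketch (assuming the PDF links on the page end in .pdf) could look like this:

import scrapy
from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS
from scrapy.spiders import CrawlSpider, Rule

CUSTOM_IGNORED_EXTENSIONS = IGNORED_EXTENSIONS.copy()
CUSTOM_IGNORED_EXTENSIONS.remove('pdf')

class PDFCrawlSpider(CrawlSpider):
    name = 'stadt_koeln_amtsblatt_crawl'
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']

    rules = (
        # Follow every link ending in .pdf and hand the response to save_pdf.
        # The callback must not be named 'parse': CrawlSpider uses that name internally.
        Rule(LinkExtractor(allow=r'\.pdf$', deny_extensions=CUSTOM_IGNORED_EXTENSIONS),
             callback='save_pdf'),
    )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)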

It seems there are two problems with my approach. The first occurs when extracting all the PDF links with XPath:

TypeError: Cannot mix str and non-str arguments
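
That TypeError is most likely raised because response.urljoin() receives a Selector object rather than a string: the XPath above yields whole <a> elements, not their href values. A sketch of a corrected loop, as a drop-in replacement for the parse() method above:

    def parse(self, response):
        # Iterate over the href strings themselves, so that
        # response.urljoin() gets a str rather than a Selector.
        for href in response.xpath("//a[contains(@href, 'pdf')]/@href").getall():
            yield scrapy.Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )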

The second problem is about handling the PDF file itself: I just want to store it locally in a specific folder or similar. It would be really great if someone had a working example for this kind of site.
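
For the second problem, one minimal option is to build the target path inside save_pdf(), creating the directory on first use; the pdfs folder name here is just an assumption, adjust as needed:

import os  # at module level

    def save_pdf(self, response):
        # Store each PDF under an assumed local 'pdfs' directory.
        os.makedirs('pdfs', exist_ok=True)
        path = os.path.join('pdfs', response.url.split('/')[-1])
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

For anything beyond a quick script, Scrapy's built-in FilesPipeline is the more idiomatic route: enable scrapy.pipelines.files.FilesPipeline in ITEM_PIPELINES, point FILES_STORE at the target folder, and yield items carrying a file_urls field; the pipeline then downloads and deduplicates the files for you.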


