Scrapy: follow external links with one depth only

Imagine I am crawling foo.com. foo.com has several internal links to itself, and it has some external links like:

foo.com/hello
foo.com/contact
bar.com
holla.com

I would like Scrapy to crawl all the internal links, but follow external links to a depth of one only. That is, I want Scrapy to go to bar.com or holla.com, but I don't want it to follow any further links within bar.com, so only a depth of one.

Is this possible? What would be the configuration for this case?

Thanks.



Solution 1:[1]

You can base your spider on the CrawlSpider class and use rules with a process_links callable passed to each Rule. That callable filters unwanted links before they get followed. From the documentation:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
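A minimal sketch of that approach, assuming the site from the question (foo.com); the spider name, parse_item and filter_external are illustrative. The first rule follows internal links recursively, while the second fetches external links with follow=False so nothing found on them is crawled further; process_links is wired in as the filtering hook the documentation describes.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class OneDepthExternalSpider(CrawlSpider):
    # "foo.com" stands in for the site being crawled; adjust to taste.
    name = "one_depth_external"
    start_urls = ["https://foo.com"]

    rules = (
        # Internal links: crawl them and keep following what they link to.
        Rule(LinkExtractor(allow_domains=["foo.com"]),
             callback="parse_item", follow=True),
        # External links: fetch the page once; follow=False stops the
        # crawl from extracting any further links from its response.
        Rule(LinkExtractor(deny_domains=["foo.com"]),
             callback="parse_item", follow=False,
             process_links="filter_external"),
    )

    def filter_external(self, links):
        # process_links hook from the quoted docs: drop links you never
        # want requested at all (the .pdf filter is just an example).
        return [link for link in links if not link.url.endswith(".pdf")]

    def parse_item(self, response):
        yield {"url": response.url}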

Solution 2:[2]

I found a solution by passing an argument to the callback function via meta. If the URL is an internal link, I set the flag to True (otherwise False), and if the flag is False (an external link), the crawler does not extract new links. Here is my sample code:

from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class BrokenLinksSpider(CrawlSpider):
    name = "test"
    start_urls = ["your_url"]  # placeholder from the original answer

    def parse(self, response):
        flag = response.meta.get('flag')
        if flag or flag is None:
            extractor = LinkExtractor()
            links = extractor.extract_links(response)
            for link in links:
                if link.url.startswith("your_url"):
                    # Internal link: keep extracting links from its response.
                    new_request = Request(link.url, callback=self.parse, meta={'flag': True})
                else:
                    # External link: fetch the page, but its links won't be extracted.
                    new_request = Request(link.url, callback=self.parse, meta={'flag': False})
                yield new_request

Solution 3:[3]

There is no built-in solution; I believe you will have to interrupt the recursion yourself. You could easily do that by keeping a set of domains in your spider and interrupting or ignoring links as needed (a fuller sketch follows the snippet below).

Something of the sort:

from urllib.parse import urlparse

# in the spider's __init__ (or as a class attribute)
self.track = set()

...

# inside the parse callback
domain = urlparse(response.url).netloc
self.track.add(domain)
if len(self.track) > MAX_RECURSION:
    self.track.remove(domain)
    # (a bare `return` is enough if you're within a generator)
    return None
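Fleshed out into a runnable sketch (the spider and attribute names are illustrative, and instead of counting domains against MAX_RECURSION this version simply stops following links as soon as the response is not on the start domain):

from urllib.parse import urlparse

import scrapy


class OneHopSpider(scrapy.Spider):
    # Illustrative names; the idea is the same: track domains yourself and
    # stop extracting links once the response is off the start domain.
    name = "one_hop"
    start_urls = ["https://foo.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.track = set()  # domains the crawl has entered so far

    def parse(self, response):
        home = urlparse(self.start_urls[0]).netloc
        domain = urlparse(response.url).netloc
        self.track.add(domain)

        yield {"url": response.url, "domain": domain}

        # One hop only: an external page is scraped, but none of its links
        # are followed any further.
        if domain != home:
            return

        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)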

Solution 4:[4]

To complement @mcavdar's answer (Solution 2), responses carry a depth value at response.meta['depth'] that can be used instead of having to set any flag.

from scrapy.http import Request, Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class BrokenLinksSpider(CrawlSpider):
    name = "test"
    start_urls = ["your_url"]  # placeholder from the original answer
    MAX_DEPTH = 1  # 2nd level

    def parse(self, response: Response):
        current_depth = response.meta["depth"]
        if current_depth < self.MAX_DEPTH:
            extractor = LinkExtractor()
            links = extractor.extract_links(response)
            for link in links:
                yield Request(link.url, callback=self.parse)
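
The depth value used here is populated by Scrapy's DepthMiddleware, which is enabled by default, so no extra configuration is needed for response.meta['depth'] to be available.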

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Tomáš Linhart
Solution 2: mcavdar
Solution 3: Maresh
Solution 4: Alvin Sartor