Scrapy: follow external links with one depth only
Imagine I am crawling foo.com. foo.com has several internal links to itself, and it has some external links like:
foo.com/hello
foo.com/contact
bar.com
holla.com
I would like Scrapy to crawl all of the internal links, but follow external links to a depth of one only: Scrapy should visit bar.com or holla.com, but it should not follow any further links within bar.com.
Is this possible? What would the configuration look like for this case?
Thanks.
Solution 1:[1]
You can base your spider on the CrawlSpider class and use Rules with a process_links method that you pass to the Rule. That method will filter unwanted links before they get followed. From the documentation:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
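A minimal sketch of how this could look, reusing the domains from the question. Note that the depth-one behaviour for external sites actually comes from follow=False on the second rule, while filter_links only illustrates the process_links hook described above (the "logout" filter is a placeholder):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FooSpider(CrawlSpider):
    name = "foo"
    # No allowed_domains here, so requests to external domains are not filtered out.
    start_urls = ["https://foo.com"]

    rules = (
        # Internal links: parse them and keep following the links found on them.
        Rule(
            LinkExtractor(allow_domains=["foo.com"]),
            callback="parse_item",
            follow=True,
            process_links="filter_links",
        ),
        # External links: fetch the page once, but do not follow its links.
        Rule(
            LinkExtractor(deny_domains=["foo.com"]),
            callback="parse_item",
            follow=False,
        ),
    )

    def filter_links(self, links):
        # process_links hook: receives every batch of extracted links and
        # returns only the ones that should actually be requested.
        return [link for link in links if "logout" not in link.url]

    def parse_item(self, response):
        yield {"url": response.url}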
Solution 2:[2]
I found a solution by passing an argument to the callback function. If the URL is an internal link, I set the flag to True (otherwise False). If the flag comes back False (external link), the crawler does not extract new links. Here is my sample code:
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class BrokenLinksSpider(CrawlSpider):
    name = "test"
    start_urls = ["your_url"]

    def parse(self, response):
        # The flag is missing on the very first response, so treat None as internal.
        flag = response.meta.get('flag')
        if flag or flag is None:
            extractor = LinkExtractor()
            links = extractor.extract_links(response)
            for link in links:
                # Internal links keep the crawl going; external links are
                # requested once but marked so their own links are not extracted.
                if link.url.startswith(self.start_urls[0]):
                    new_request = Request(link.url, callback=self.parse, meta={'flag': True})
                else:
                    new_request = Request(link.url, callback=self.parse, meta={'flag': False})
                yield new_request
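For completeness, one way to run a standalone spider like this outside a full Scrapy project is with CrawlerProcess (the LOG_LEVEL setting is just an example):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(BrokenLinksSpider)
process.start()  # blocks until the crawl is finished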
Solution 3:[3]
Not a built-in solution, but I believe you will have to interrupt the recursion yourself. You can easily do that by keeping a set of domains in your spider and interrupting or ignoring once a limit is reached.
Something along these lines:
from urllib.parse import urlparse

# somewhere in the spider, e.g. in __init__
self.track = set()
...
domain = urlparse(response.url).netloc
self.track.add(domain)
if len(self.track) > MAX_RECURSION:
    self.track.remove(domain)
    # raise StopIteration  # if you're within a generator
    return None
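To show where that bookkeeping could live, here is a minimal sketch that embeds it in a plain scrapy.Spider; the DomainTrackingSpider name and the MAX_RECURSION value are illustrative, not from the original answer:

from urllib.parse import urlparse

import scrapy
from scrapy.linkextractors import LinkExtractor

MAX_RECURSION = 5  # illustrative cap on the number of distinct domains


class DomainTrackingSpider(scrapy.Spider):
    name = "domain_tracking"
    start_urls = ["https://foo.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.track = set()  # domains seen so far

    def parse(self, response):
        domain = urlparse(response.url).netloc
        self.track.add(domain)
        if len(self.track) > MAX_RECURSION:
            # Too many distinct domains already: do not follow this page's links.
            self.track.remove(domain)
            return
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)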
Solution 4:[4]
To complement @mcavdar's answer, responses carry their crawl depth at response.meta['depth'] (maintained by Scrapy's DepthMiddleware), which can be used instead of having to set any flag.
from scrapy.http import Request, Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class BrokenLinksSpider(CrawlSpider):
    name = "test"
    start_urls = ["your_url"]
    MAX_DEPTH = 1  # 2nd level

    def parse(self, response: Response):
        # 'depth' is 0 for start_urls responses and grows by 1 per followed link.
        current_depth = response.meta["depth"]
        if current_depth < self.MAX_DEPTH:
            extractor = LinkExtractor()
            links = extractor.extract_links(response)
            for link in links:
                yield Request(link.url, callback=self.parse)
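As a side note, if a hard depth cap for every domain (not just external ones) is acceptable, Scrapy's built-in DEPTH_LIMIT setting, enforced by the same DepthMiddleware, achieves a similar effect without any manual check. A minimal sketch, with an illustrative spider name:

import scrapy


class DepthLimitedSpider(scrapy.Spider):
    name = "depth_limited"
    start_urls = ["your_url"]
    custom_settings = {
        "DEPTH_LIMIT": 1,  # requests deeper than one level are silently dropped
    }

    def parse(self, response):
        # Links can be yielded unconditionally; DepthMiddleware filters out
        # any request whose depth would exceed DEPTH_LIMIT.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)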
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source | 
|---|---|
| Solution 1 | Tomáš Linhart | 
| Solution 2 | mcavdar | 
| Solution 3 | Maresh | 
| Solution 4 | Alvin Sartor | 
