Scrapy: follow external links with one depth only
Imagine I am crawling foo.com. It has several internal links to itself, and it has some external links like:
foo.com/hello
foo.com/contact
bar.com
holla.com
I would like Scrapy to crawl all the internal links, but follow external links only to a depth of one: Scrapy should visit bar.com or holla.com, but it should not follow any further links within bar.com.
Is this possible? What would be the config for this case?
Thanks.
Solution 1:[1]
You can base your spider on the CrawlSpider class and use Rules with an implemented process_links method that you pass to the Rule. That method will filter unwanted links before they get followed. From the documentation:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
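For the question's foo.com example, a minimal sketch of this approach could look like the following. The spider name, the skip_domains set, and the exact rule layout are illustrative assumptions rather than anything given in the original answer: one rule keeps following links on internal pages, a second rule visits external pages once without following their links, and process_links is where the extracted links can be filtered before they are scheduled.

```python
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class OneHopSpider(CrawlSpider):
    name = "one_hop"                    # placeholder name
    start_urls = ["https://foo.com"]
    # Do not set allowed_domains here, or requests to bar.com / holla.com
    # would be dropped by the offsite middleware.

    skip_domains = {"ads.example.com"}  # hypothetical domains to drop entirely

    rules = (
        # Internal pages: parse them and keep following their links.
        Rule(
            LinkExtractor(allow_domains=["foo.com"]),
            callback="parse_item",
            follow=True,
        ),
        # External pages: parse them once, but do not follow their links,
        # which gives the "one depth only" behaviour for external sites.
        Rule(
            LinkExtractor(deny_domains=["foo.com"]),
            callback="parse_item",
            process_links="filter_links",
            follow=False,
        ),
    )

    def filter_links(self, links):
        # Called with the list of extracted links before requests are
        # scheduled; return only the links you actually want to visit.
        return [
            link for link in links
            if urlparse(link.url).netloc not in self.skip_domains
        ]

    def parse_item(self, response):
        yield {"url": response.url}
```

With follow=False on the external rule, pages on bar.com or holla.com are still downloaded and handed to parse_item, but no links found on them are scheduled, which is the one-hop behaviour the question asks for.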
Solution 2:[2]
I found a solution by passing an argument to the callback function. If the URL is an internal link, I set a flag to true (otherwise false). If the flag is false (external link), the crawler does not extract new links. Here is my sample code:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class BrokenLinksSpider(CrawlSpider):
    name = "test"
    start_urls = ["your_url"]

    def parse(self, response):
        # True for internal pages, False for external ones; None only on the first response.
        flag = response.meta.get('flag')
        if flag or flag is None:
            extractor = LinkExtractor()
            links = extractor.extract_links(response)
            for link in links:
                if link.url.startswith("your_url"):
                    # Internal link: keep crawling and extracting links from it.
                    new_request = Request(link.url, callback=self.parse, meta={'flag': True})
                else:
                    # External link: visit it, but do not extract further links.
                    new_request = Request(link.url, callback=self.parse, meta={'flag': False})
                yield new_request
Solution 3:[3]
There is no built-in solution for this; I believe you will have to interrupt the recursion yourself. You can easily do that by keeping a set of domains in your spider and interrupting or ignoring once the limit is reached.
Something of the sort:
from urllib.parse import urlparse

# somewhere in your spider's __init__ (or as a class attribute):
self.track = set()
...
# inside the parse callback:
domain = urlparse(response.url).netloc
self.track.add(domain)
if len(self.track) > MAX_RECURSION:
    self.track.remove(domain)
    # simply returning stops a generator (raising StopIteration is not allowed in Python 3.7+)
    return None
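A fuller sketch of where those pieces could live in a plain spider is shown below; the spider name, start URL, and MAX_RECURSION value are illustrative placeholders, and the logic is simply the fragment above placed inside the parse callback.

```python
from urllib.parse import urlparse

import scrapy
from scrapy.linkextractors import LinkExtractor

MAX_RECURSION = 2  # placeholder: roughly the start domain plus one external domain


class DomainTrackingSpider(scrapy.Spider):
    name = "domain_tracking"           # placeholder name
    start_urls = ["https://foo.com"]   # placeholder start URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.track = set()  # domains seen so far

    def parse(self, response):
        domain = urlparse(response.url).netloc
        self.track.add(domain)
        if len(self.track) > MAX_RECURSION:
            # Too many distinct domains: stop following links from here.
            self.track.remove(domain)
            return

        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)
```

Note that this is a domain-counting heuristic rather than a strict per-link depth limit, so it behaves somewhat differently from the meta-flag and depth-based approaches in the other solutions.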
Solution 4:[4]
To complement @mcavdar's answer, responses carry a depth value at response.meta['depth'] that can be used instead of having to set any flag.
from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider


class BrokenLinksSpider(CrawlSpider):
    name = "test"
    start_urls = ["your_url"]
    MAX_DEPTH = 1  # 2nd level

    def parse(self, response: Response):
        # "depth" is filled in by Scrapy's built-in DepthMiddleware.
        current_depth = response.meta.get("depth", 0)
        if current_depth < self.MAX_DEPTH:
            extractor = LinkExtractor()
            links = extractor.extract_links(response)
            for link in links:
                yield Request(link.url, callback=self.parse)
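To try any of these spiders quickly without setting up a full Scrapy project, the code can be saved to a file and run with Scrapy's runspider command (the filename here is only an example):

```
scrapy runspider broken_links_spider.py
```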
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tomáš Linhart |
| Solution 2 | mcavdar |
| Solution 3 | Maresh |
| Solution 4 | Alvin Sartor |