Which parse method does Scrapy use to parse start_urls?
I want Scrapy to scrape some start URLs and then follow the links in those pages according to rules. My spider inherits from CrawlSpider and has start_urls and rules set, but it doesn't seem to use the parse function I defined to parse the start_urls. Here is the code:
<!-- language: lang-python -->
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ZhihuSpider(CrawlSpider):
        start_urls = ["https://www.zhihu.com/topic/19778317/organize/entire",
                      "https://www.zhihu.com/topic/19778287/organize/entire"]
        # Trailing comma makes rules a tuple; request_tagInfoPage is not shown here
        rules = (Rule(LinkExtractor(allow=(r'topic/\d+/organize/entire')),
                      process_request='request_tagInfoPage', callback='parse_tagPage'),)

        # this is the parse_tagPage() scrapy should use to scrape
        def parse_tagPage(self, response):
            print("start scraping!")  # Explicitly print to show that scraping starts
            # do_something
However, the console shows that Scrapy crawled the start_urls but printed nothing, so I am pretty sure that parse_tagPage() isn't called, even though Scrapy reports the URLs as crawled:

    [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/topic/19778317/organize/entire> (referer: http://www.zhihu.com)

Any hints on why this would happen and how to set scrapy to use parse_tagPage()?
Solution 1:[1]
First, the CrawlSpider class uses its default parse() method to handle ALL requests that don't specify a callback function, which in my case includes the requests made from start_urls. This parse() method only applies the rules to extract links and doesn't parse the start_urls pages at all. That's why I can't scrape anything from the start_urls pages.
Second, the LinkExtractor somehow extracts only the first links from the start_urls pages, and unfortunately those first links are the start_urls themselves. Scrapy's internal duplicate-filtering mechanism therefore blocks those pages from being parsed. That's why the callback function parse_tagPage() is never called.
I am working on fixing the LinkExtractor.
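One part of the second point can be checked without running a crawl: whether the rule's allow pattern matches the start URLs themselves. The sketch below uses only the standard library and assumes the pattern is applied as a regular-expression search against the absolute URL; `matches_rule` is a hypothetical helper name, not part of Scrapy.

```python
import re

# Pattern copied from the Rule in the question.
ALLOW = re.compile(r"topic/\d+/organize/entire")

# Start URLs copied from the question.
START_URLS = [
    "https://www.zhihu.com/topic/19778317/organize/entire",
    "https://www.zhihu.com/topic/19778287/organize/entire",
]


def matches_rule(url):
    """Return True if the URL would pass the allow pattern (hypothetical helper)."""
    return ALLOW.search(url) is not None


for url in START_URLS:
    # Both start URLs match, so a link pointing back to a start page passes the rule.
    print(url, matches_rule(url))
```

If both lines print True, any link on a start page that points back to a start URL satisfies the rule, which is consistent with the behaviour described above.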
Solution 2:[2]
You can override parse_start_url() to parse the start_urls responses.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Skywalker326 |
Solution 2 | lius |