Which parse method does Scrapy use to parse start_urls?

I want Scrapy to scrape some start URLs and then follow the links in those pages according to rules. My spider inherits from CrawlSpider and has start_urls and rules set, but it doesn't seem to use the parse function I defined to parse the start_urls. Here is the code:

<!-- language: lang-python -->
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZhihuSpider(CrawlSpider):

    start_urls = ["https://www.zhihu.com/topic/19778317/organize/entire",
        "https://www.zhihu.com/topic/19778287/organize/entire"]

    # rules must be an iterable of Rule objects, hence the trailing comma
    rules = (
        Rule(LinkExtractor(allow=r'topic/\d+/organize/entire'),
             process_request='request_tagInfoPage', callback='parse_tagPage'),
    )

    # this is the parse_tagPage() scrapy should use to scrape
    def parse_tagPage(self, response):
        print("start scraping!") # Explicitly print to show that scraping starts
        # do_something

However, the console shows that Scrapy crawled the start_urls but printed nothing, so I am fairly sure that parse_tagPage() isn't called, even though the log shows that the URLs were crawled: [scrapy] DEBUG: Crawled (200) <GET https://www.zhihu.com/topic/19778317/organize/entire> (referer: http://www.zhihu.com)

Any hints on why this would happen and how to set scrapy to use parse_tagPage()?



Solution 1:[1]

First, the CrawlSpider class uses a default parse() method to handle ALL requests that don't specify a callback function, which in my case includes the requests made from start_urls. This parse() method only applies the rules to extract links; it doesn't scrape the start_urls pages themselves, because the parse_start_url() hook it calls for them returns nothing by default. That's why I can't scrape anything from the start_urls pages.

Second, the LinkExtractor somehow extracts only the first link from each start_urls page, and unfortunately those first links are the start_urls themselves. Scrapy's built-in duplicate filter then drops the requests for those pages because they have already been requested. That's why the callback function parse_tagPage() is never called.

I am working on fixing the LinkExtractor.

Solution 2:[2]

You can override parse_start_url() to parse the responses for start_urls.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Skywalker326
Solution 2 lius