Scrape Goodreads.com with Python Scrapy: How to Scrape a next_page Link That Includes an AJAX Request

I am trying to scrape the titles of the books, and all reviews of those books, from the Cozy Mystery Series list.

I have written the spider code below.

import scrapy
from ..items import GoodreadsItem
from scrapy import Request
from urllib.parse import urljoin
from urllib.parse import urlparse
import re



class CrawlnscrapeSpider(scrapy.Spider):
    name = 'crawlNscrape'
    allowed_domains = ['www.goodreads.com']
    start_urls = ['https://www.goodreads.com/list/show/702.Cozy_Mystery_Series_First_Book_of_a_Series']

    def parse(self, response):
        
        
        # collect all book links on this page, then request each one
        # and handle it in parse_page
        for href in response.css("a.bookTitle::attr(href)") :
            url=response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_page)
            
        
        # follow the link to the next list page and call parse on it again
        next_page = response.xpath("(//a[@class='next_page'])[1]/@href")
        if next_page:
            url= response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)
        
        
            

    def parse_page(self, response):
        
        # create an empty GoodreadsItem (dict-like) named book
        book = GoodreadsItem()
        title = response.css("#bookTitle::text").get()
        reviews = response.css(".readable span:nth-child(2)::text").getall()
        
        # store the extracted title and reviews in the item
        book['title'] = title
        book['reviews'] = reviews  # all reviews found on this single page
        
        
        # I want to extract every review page for each book,
        # but the next_page button fires an AJAX request from its onclick
        # attribute, so I cannot scrape the link to the next page.
        next_page = response.xpath("(//a[@class='next_page'])[1]/@onclick")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url,callback=self.parse_page)

        yield book
            

This code collects the links of all books in the Cozy Mystery Series list, so I have collected links to almost 985 books. My primary goal is to scrape the title and all reviews of each book. The code easily collects the title and the reviews on the first page, but I want every review for every book, so I need the link behind the next_page button of the reviews. These next_page buttons trigger an AJAX request instead of carrying a plain href, so I cannot follow them with Scrapy the way I follow the list pages. How can I handle this problem?

onclick="new Ajax.Request('/book/reviews/7061.The_No_1_Ladies_Detective_Agency?hide_last_page=true&language_code=en&page=2',
{asynchronous:true, evalScripts:true, method:'get',
parameters:'authenticity_token=' + encodeURIComponent('wyYE+RP05rXM9QL8REVMokP0PeCwfC1MNR5UoBAESfks4dEAZ2pTpDBlGFUXN/E1lo68vTB2+d36x6yJTYwvxA==')});
return false;"

The snippet above shows the onclick attribute of the next_page button on the review pages. How can I extract the link from this next_page button?
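
One possible direction, offered as a minimal sketch rather than a confirmed solution: the URL is the first quoted argument of the Ajax.Request call in the onclick attribute, so a regular expression can pull it out. The helper name extract_ajax_url and the base_url default are hypothetical, chosen only for illustration, and the sample string is a shortened copy of the onclick value quoted above (the parameters part is left out).

```python
import re
from urllib.parse import urljoin

# Matches the first single-quoted argument of Ajax.Request(...), i.e. the URL.
AJAX_URL_RE = re.compile(r"Ajax\.Request\(\s*'([^']+)'")


def extract_ajax_url(onclick, base_url="https://www.goodreads.com"):
    """Return the absolute URL found in an onclick Ajax.Request call, or None.

    extract_ajax_url and base_url are hypothetical names for this sketch.
    """
    if not onclick:
        return None
    match = AJAX_URL_RE.search(onclick)
    if match is None:
        return None
    # Drop whitespace that line wrapping may have introduced inside the URL.
    relative = re.sub(r"\s+", "", match.group(1))
    return urljoin(base_url, relative)


# Shortened copy of the onclick value from the question.
onclick = (
    "new Ajax.Request('/book/reviews/7061.The_No_1_Ladies_Detective_Agency?"
    "hide_last_page=true&language_code=en&page=2', {asynchronous:true, "
    "evalScripts:true, method:'get'}); return false;"
)
print(extract_ajax_url(onclick))
```

In parse_page, the value from the existing XPath (read with .get() on the @onclick selector) could be passed through this helper and the result handed to scrapy.Request in place of the response.urljoin call. Whether that endpoint returns plain HTML or a JavaScript fragment is an assumption I have not verified, so the review selectors may need adjusting for the pages fetched this way.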

Thanks in advance for your help :)



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow