Scrapy: Can't Crawl App Store Reviews Page

Hi guys, I'm having trouble getting data from this App Store reviews page: https://apps.apple.com/us/app/mathy-cool-math-learner-games/id1476596747#see-all/reviews

First, I want to retrieve the string showing the rating that each user gave the app. The ratings are inside figure tags with class = "we-star-rating ember-view we-customer-review__rating we-star-rating--large", in the @aria-label attribute.

Here is my code:

from scrapy import Selector
import requests

html = requests.get('https://apps.apple.com/us/app/mathy-cool-math-learner-games/id1476596747#see-all/reviews').content

sel = Selector(text = html)

sel.xpath('//figure[@class="we-star-rating ember-view we-customer-review__rating we-star-rating--large"]/@aria-label').extract()

But it returns only the first 3 matches:

['5 out of 5', '5 out of 5', '5 out of 5']

What I want is to retrieve all the ratings from all the reviews on that page.

Can someone give me a clue?



Solution 1:[1]

The Problem

The top three reviews are part of the HTML, but the rest are loaded by JavaScript, which is why you're only getting the first three results.

I'm not sure whether this is all of your Scrapy code, and I'd be interested in why you chose that part of Scrapy.

Dealing with JavaScript is a huge part of scraping modern websites. I'm also not sure whether Scrapy is your main scraping tool, but there are a few options for handling JavaScript with Scrapy.

Information on dynamic Web Scraping

First, know that websites these days grab information on the fly, using JavaScript to invoke HTTP requests. These are called AJAX requests (Asynchronous JavaScript and XML). The script makes either a POST or a GET request to an API/server, and the HTTP response carries the information back. In this case, they have preloaded 3 results into the HTML and asked for the rest of the reviews with JavaScript when the page loads.

In general there are two ways to deal with JavaScript-oriented websites.

  1. Re-engineer the HTTP requests. This is the most efficient way to get the data you want: you mimic the HTTP requests that the JavaScript is invoking. This sometimes requires you to send headers, parameters and cookies, but if you can do that, you can get the data you want.
  2. Use some form of browser automation. Selenium is the package of choice, although it was never originally meant to be used in this way. It's slow, inefficient and brittle on larger datasets.

Solution

For your particular website, you can re-engineer the HTTP requests to get the information you want. This is the ideal situation.

But how did I know that? Well, one of the things you can do in Chrome is turn off JavaScript. You have to inspect the page and go to the settings (click the three dots on the very right hand side of the page -> more tools -> settings) and disable JavaScript there. Refresh the page without JavaScript and you'll see there are only three reviews.

Chrome DevTools is very informative for understanding what's happening. If you right-click, inspect the page and go to the Network tab, you'll see all the requests made to the server and their responses. The XHR tab is where you'll find the requests that return the data you want. Here you have a bunch of requests and their responses.

I've inspected the page, gone to the Network tab and refreshed the page, which records the browser's requests and responses.

You can see there are about 6 requests: 5 GET requests and one POST request. If you click a request, a panel opens on the right-hand side with the request headers, preview and response.

Here I've clicked the first request and then Preview; if you click through it, you can see some reviews.

In the HTTP request for that data I can see an offset of 10, which means it's grabbing the next 10 reviews.

So I'm going to alter that offset to see if I can get the first 10 and then the second 10 (there are 20 reviews on this page).
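The offset arithmetic can be sketched as below. This is only an illustration: the endpoint and query parameters are the ones from the copied request shown later in this answer, the page size of 10 is inferred from the offset seen in the Network tab, and `page_offsets` and `page_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Endpoint as seen in the Network tab for this app.
REVIEWS_URL = "https://amp-api.apps.apple.com/v1/catalog/us/apps/1476596747/reviews"


def page_offsets(total, page_size=10):
    """Return the offset of each page needed to cover `total` reviews."""
    # page_size=10 is an assumption based on the offset the browser sent.
    return list(range(0, total, page_size))


def page_url(offset):
    """Build the URL for one page of reviews, changing only the offset."""
    params = [
        ("l", "en-US"),
        ("offset", str(offset)),
        ("platform", "web"),
        ("additionalPlatforms", "appletv,ipad,iphone,mac"),
    ]
    return f"{REVIEWS_URL}?{urlencode(params)}"


# 20 reviews at 10 per page means two requests: offset 0 and offset 10.
urls = [page_url(offset) for offset in page_offsets(20)]
for url in urls:
    print(url)
```

Only the offset changes between the two URLs; everything else stays exactly as the browser sent it.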

Rather than typing in the parameters and headers by hand, you can copy the request as cURL (right-click the request in the Network tab). That command can then be converted into Python with a site like curl.trillworks.com.

Now it's worthwhile looking at the preview data, because you'll be processing it with requests. The response is a JSON object; you can tell from the Accept header of the GET request, which is application/json.

Copying this request into curl.trillworks.com gives the code below.

Coding Example

import requests

headers = {
    'Accept': 'application/json',
    'Referer': 'https://apps.apple.com/us/app/mathy-cool-math-learner-games/id1476596747',
    'Authorization': 'Bearer eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IldlYlBsYXlLaWQifQ.eyJpc3MiOiJBTVBXZWJQbGF5IiwiaWF0IjoxNTk2NTc1NTY4LCJleHAiOjE2MTIxMjc1Njh9.jnEuBNEVWhKGqI10W6dfhJFtYJtd74Nbu1NueZrPgYjU2K34LwXPQClcus8S9Jit5ayK5MOr0bIpcDx821RI4Q',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}

params = (
    ('l', 'en-US'),
    ('offset', '1'),
    ('platform', 'web'),
    ('additionalPlatforms', 'appletv,ipad,iphone,mac'),
)

response = requests.get('https://amp-api.apps.apple.com/v1/catalog/us/apps/1476596747/reviews', headers=headers, params=params)
response.json()

I mentioned above that sometimes you need headers and parameters. You can play about with requests.get() here and see what gets you the data. In this case you need both the parameters and the headers. That is not always the case, so start with a bare requests.get() and build it up from there. The json() method parses the JSON response into a Python dictionary so we can access the data easily.

As I said, the preview shows us the keys and values we need to access the data. Sometimes the data is nested very deeply; here it's not, so we can simply loop over this dictionary to get what we want. We have to make two HTTP requests, one with an offset of 0 and one with an offset of 10, to get all 20 star ratings.
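One thing worth checking when a copied request stops working is the Authorization header. The Bearer value is a JWT, and its middle segment is base64url-encoded JSON that normally carries an `exp` (expiry) timestamp. The sketch below decodes that segment without verifying the signature; `jwt_claims`, `bearer_expiry` and `is_expired` are hypothetical helper names, not part of requests.

```python
import base64
import json
from datetime import datetime, timezone


def jwt_claims(token):
    """Decode the (unverified) payload segment of a JWT into a dict."""
    payload = token.split(".")[1]
    # base64url segments drop their '=' padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def bearer_expiry(authorization_header):
    """Return the expiry time (UTC) of a 'Bearer <jwt>' header value."""
    token = authorization_header.split(" ", 1)[1]
    return datetime.fromtimestamp(jwt_claims(token)["exp"], tz=timezone.utc)


def is_expired(authorization_header, now=None):
    """True if the token's exp claim is at or before `now` (default: current time)."""
    now = now or datetime.now(timezone.utc)
    return bearer_expiry(authorization_header) <= now
```

Calling `is_expired(headers['Authorization'])` on the headers above tells you whether to copy a fresh token from the browser. The payload of the token shown here decodes to an `iat` of 1596575568 and an `exp` of 1612127568, which is 31 January 2021 UTC, so a newer token is likely needed to reproduce this today.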

Final Code Example

import requests

headers = {
    'Accept': 'application/json',
    'Referer': 'https://apps.apple.com/us/app/mathy-cool-math-learner-games/id1476596747',
    'Authorization': 'Bearer eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6IldlYlBsYXlLaWQifQ.eyJpc3MiOiJBTVBXZWJQbGF5IiwiaWF0IjoxNTk2NTc1NTY4LCJleHAiOjE2MTIxMjc1Njh9.jnEuBNEVWhKGqI10W6dfhJFtYJtd74Nbu1NueZrPgYjU2K34LwXPQClcus8S9Jit5ayK5MOr0bIpcDx821RI4Q',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}
for i in range(0, 20, 10):
    # One request per page of 10 reviews: offsets 0 and 10.
    params = (
        ('l', 'en-US'),
        ('offset', str(i)),
        ('platform', 'web'),
        ('additionalPlatforms', 'appletv,ipad,iphone,mac'),
    )
    # Pass the headers as well; the explanation below notes they were needed.
    response = requests.get('https://amp-api.apps.apple.com/v1/catalog/us/apps/1476596747/reviews', headers=headers, params=params)
    data = response.json()['data']
    for a in data:
        print(a['attributes']['rating'])

Output

5
5
5
5
5
...

Explanation of Code

  1. We set the headers needed to make the request (I managed to get some data without them, but the request for all the data seemed to need them; you can play about with this).

  2. We loop over the offsets we need, 0 and 10 in this case, using range(0,20,10). For each offset we make an HTTP GET request with the headers and those specific parameters.

  3. We convert the response into a Python dictionary using response.json().

  4. The dictionary holds lots of data; what we need is under the data key. If you print it as below, each item looks like the output that follows.

    data = response.json()['data']
    print(data)

Output

{'id': '5632394152',
 'type': 'user-reviews',
 'attributes': {'review': "I really like it. I have been using a memory trainer, but this one is a little bit more fun. I had to learn my skills the hard way so to speak, but it is really fun! I can't wait to buy it again! ??",
  'rating': 5,
  'title': 'Cow force than',
  'date': '2020-03-08T11:37:29Z',
  'userName': 'MY VICELER',
  'isEdited': False}}
  5. So you can see that the data we want is behind the attributes key and then the rating key. Hence we loop over response.json()['data'] and access a['attributes']['rating'], which gives us the output.
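The lookup described in the steps above can be wrapped in a small function that tolerates missing keys. This is a minimal sketch: `extract_ratings` is a hypothetical name, and the sample payload is a trimmed copy of the output above plus one made-up entry without a rating.

```python
def extract_ratings(payload):
    """Collect the star rating of every review in one parsed API response."""
    ratings = []
    for review in payload.get("data", []):
        rating = review.get("attributes", {}).get("rating")
        if rating is not None:
            ratings.append(rating)
    return ratings


# A trimmed-down payload with the same shape as the output above.
sample = {
    "data": [
        {
            "id": "5632394152",
            "type": "user-reviews",
            "attributes": {"rating": 5, "title": "Cow force than", "isEdited": False},
        },
        # Hypothetical entry with no rating, to show it is skipped.
        {"id": "0", "type": "user-reviews", "attributes": {}},
    ]
}

print(extract_ratings(sample))  # [5]
```

In the loop from the final code example, you would call `extract_ratings(response.json())` once per offset and extend a single list with the results.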

Solution 2:[2]

You can use

  1. http://itunes.apple.com/lookup?id=APPID
  2. https://itunes.apple.com/rss/customerreviews/id=APPID/sortBy=mostRecent/json

Those are APIs that Apple provides, and they are much better than scraping the website directly.
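A rough sketch of reading ratings from the second endpoint follows. The URL pattern is the one listed above; the JSON layout (`feed` -> `entry`, with each value wrapped in a `label` object) is an assumption that you should confirm against a real response, and `reviews_feed_url` and `feed_ratings` are hypothetical helper names.

```python
def reviews_feed_url(app_id):
    """Fill an app id into the customer-reviews feed URL given above."""
    return (
        "https://itunes.apple.com/rss/customerreviews/"
        f"id={app_id}/sortBy=mostRecent/json"
    )


def feed_ratings(feed):
    """Return (author, rating) pairs from a parsed feed.

    ASSUMPTION: reviews sit under feed -> entry, and each value is wrapped
    in a {"label": ...} object, e.g. entry["im:rating"]["label"] == "5".
    """
    entries = feed.get("feed", {}).get("entry", [])
    if isinstance(entries, dict):
        # Defensive: tolerate a lone entry delivered as an object, not a list.
        entries = [entries]
    pairs = []
    for entry in entries:
        rating = entry.get("im:rating", {}).get("label")
        if rating is None:
            continue  # skip entries that carry no rating
        author = entry.get("author", {}).get("name", {}).get("label", "")
        pairs.append((author, int(rating)))
    return pairs


# Hypothetical sample in the assumed layout, so the parsing can be tried offline.
sample_feed = {
    "feed": {
        "entry": [
            {"author": {"name": {"label": "some user"}}, "im:rating": {"label": "5"}},
            {"author": {"name": {"label": "another user"}}, "im:rating": {"label": "3"}},
        ]
    }
}

print(reviews_feed_url(1476596747))
print(feed_ratings(sample_feed))
```

With requests, the live call would be `feed_ratings(requests.get(reviews_feed_url(1476596747)).json())`, with no Authorization header to copy from the browser.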

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 AaronS
Solution 2 Mikrasya