How to use Scrapy to scrape Google Play reviews of applications?
I wrote this spider to scrape app reviews from Google Play. It is only partially successful: I can extract the name, date, and review text, but not the rating.
My questions:
- How can I get all the reviews? I am only getting 41.
- How can I get the rating from the <div>?
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    rating = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()


class criticspider(CrawlSpider):
    name = "gaana"
    allowed_domains = ["play.google.com"]
    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]
    # rules = (
    #     Rule(
    #         SgmlLinkExtractor(allow=('search=jabong&page=1/+',)),
    #         callback="parse_start_url",
    #         follow=True),
    # )

    def parse(self, response):
        sites = response.xpath('//div[@class="single-review"]')
        items = []
        for site in sites:
            item = CompItem()
            item['data'] = site.xpath('.//div[@class="review-body"]/text()').extract()
            item['name'] = site.xpath('.//div/div/span[@class="author-name"]/a/text()').extract()[0]
            item['date'] = site.xpath('.//span[@class="review-date"]/text()').extract()[0]
            item['rating'] = site.xpath('div[@class="review-info-star-rating"]/aria-label/text()').extract()
            items.append(item)
        return items
Solution 1:[1]
You have:
item['rating'] = site.xpath('div[@class="review-info-star-rating"]/aria-label/text()').extract()
Shouldn't it be something like this, with a leading .// so the path is searched relative to the current review?
item['rating'] = site.xpath('.//div[@class="review-info-star-rating"]/aria-label/text()').extract()
I'm not sure it will work, but give it a try.
Solution 2:[2]
You can try this one out:
item['rating'] = site.xpath('.//div[@class="tiny-star star-rating-non-editable-container"]/@aria-label').extract()
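The difference between the two answers comes down to aria-label being an attribute rather than a child element. Below is a minimal sketch of that point, using the standard library's ElementTree on a made-up, simplified fragment. The markup, the label text, and the variable names are assumptions for illustration, not Google Play's actual HTML. In Scrapy the attribute would be selected with /@aria-label, as in the line above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Google Play page differs and changes over time.
SAMPLE = """
<div class="single-review">
  <div class="review-info-star-rating">
    <div class="tiny-star star-rating-non-editable-container"
         aria-label="Rated 4 stars out of five stars"></div>
  </div>
</div>
"""

review = ET.fromstring(SAMPLE)

# aria-label is an attribute, so searching for a child element with that name finds nothing.
as_child = review.find(".//div[@class='review-info-star-rating']/aria-label")

# Reading the attribute from the star container does work.
star = review.find(".//div[@class='tiny-star star-rating-non-editable-container']")
label = star.get("aria-label")

# Pull the first integer out of the label text to get a numeric rating.
match = re.search(r"\d+", label)
rating = int(match.group()) if match else None

print(as_child, label, rating)
```

If the label wording on the live page differs from the assumed one, only the regular expression should need adjusting.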
Solution 3:[3]
To parse all reviews, you need to extract the next page token, which is located in the <script> tags, and pass it in a POST request. You can find that request in the Network tab of DevTools, where it is called batchexecute?...
The same page token appears in the page source, somewhere inside the <script> tags.
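The control flow of token-based pagination can be sketched without touching Google at all. The example below does not reproduce the batchexecute payload, which is undocumented; fetch_page, iter_reviews, and the fake pages are hypothetical names and data introduced only to show the loop: request a page, collect its reviews, and continue while a next page token is returned.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# A page is modelled as (reviews, next_page_token); the token is None on the last page.
Page = Tuple[List[str], Optional[str]]


def iter_reviews(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[str]:
    """Yield reviews page by page until no next page token is returned."""
    token: Optional[str] = None
    while True:
        reviews, token = fetch_page(token)
        yield from reviews
        if not token:
            break


# Stand-in for the real POST request: maps a token to the page it would return.
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: (["review 1", "review 2"], "token-a"),
    "token-a": (["review 3"], "token-b"),
    "token-b": (["review 4"], None),
}


def fake_fetch(token: Optional[str]) -> Page:
    return FAKE_PAGES[token]


all_reviews = list(iter_reviews(fake_fetch))
print(all_reviews)
```

In a real spider, fake_fetch would be replaced by the request that sends the token and parses the response, which is the part that has to be worked out from the Network tab.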
As an alternative, you can use the Google Play Apps Store API from SerpApi. It is a paid API with a free plan that bypasses blocks from Google and scales, so you don't have to build the parser from scratch.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os


def serpapi_scrape_all_reviews():
    # https://docs.python.org/3/library/os.html#os.getenv
    params = {
        "api_key": os.getenv("API_KEY"),     # your serpapi api key
        "engine": "google_play_product",     # search engine
        "store": "apps",
        "gl": "us",                          # country to search from: United States
        "product_id": "com.nintendo.zara",   # app ID
        "all_reviews": "true"                # show all reviews
    }

    search = GoogleSearch(params)            # where data extraction happens

    # page number
    index = 0

    reviews_is_present = True
    while reviews_is_present:
        results = search.get_dict()          # JSON -> Python dict

        # update page number
        index += 1
        for review in results.get("reviews", []):
            print(f"\npage #: {index}\n")
            print(json.dumps(review, indent=2))

        # grab the next page if it's there and pass it to GoogleSearch(), otherwise stop.
        if "next" in results.get("serpapi_pagination", {}):
            search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
        else:
            reviews_is_present = False


serpapi_scrape_all_reviews()
Output:
page #: 1
{
  "title": "cohen rigg",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
  "rating": 2.0,
  "snippet": "This game is a good game. It's fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
  "likes": 26,
  "date": "April 16, 2022"
}
page #: 13
{
  "title": "Caol Huff",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gjsbukt1CmC-F5VWvfIhPA13GOAbAIr3JV40aCnnA",
  "rating": 4.0,
  "snippet": "Fun, very simple game. No in app purchases. Just a one time purchase to unlock all the levels. Super annoying that you can't play without an internet connection. That means no playing on an airplane.",
  "likes": 24,
  "date": "April 21, 2021"
}
... other results
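The pagination line in the code above packs several calls together, so here it is unpacked using only the standard library. The next_url value is a hypothetical link shaped like a paginated API URL, and its token is made up; the point is only how urlsplit and parse_qsl turn a URL's query string into a dict that can be merged into the existing parameters.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical "next" link; the host, parameter names, and token value are illustrative only.
next_url = (
    "https://example.com/search.json"
    "?engine=google_play_product"
    "&product_id=com.nintendo.zara"
    "&next_page_token=EXAMPLE_TOKEN"
)

# Parameters used for the first request.
params = {
    "engine": "google_play_product",
    "product_id": "com.nintendo.zara",
    "all_reviews": "true",
}

# urlsplit isolates the query string; parse_qsl turns it into (key, value) pairs.
next_params = dict(parse_qsl(urlsplit(next_url).query))

# Merging keeps the original parameters and adds or overwrites the ones from the link.
params.update(next_params)

print(params)
```

Updating the existing dict rather than replacing it means settings that are absent from the link, such as all_reviews here, carry over to the next request.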
Disclaimer: I work for SerpApi.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Inês Martins |
| Solution 2 | Pinal |
| Solution 3 | Dmitriy Zub |