Google Play review scraping changes

Over the past year or so I have written a number of scripts that scrape Android app reviews from Google Play. Until recently this worked fine: the scripts mimicked the Google Play interface by calling https://play.google.com/store/getreviews with the necessary parameters and parsing the HTML results.

The recent updates to the Google Play interface changed the HTML structure and also seem to add some kind of protection against scraping. There is now a "token" parameter that changes, presumably some kind of session ID, which I have not been able to generate because I don't know what seeds it. Google Play also seems to block clients that make multiple calls that don't conform to the interface: after an unsuccessful call I can't even load the Google Play interface in any browser. The block seems to time out after a while. I'm not certain of this, but it's what I've concluded from what I'm seeing.

Has anyone run into a similar problem and found a way around it?

Thanks



Solution 1:[1]

To get the review data, extract it from the inline <script> tags with a regular expression, then use further regular expressions to pull out the user name, avatar, comment, and so on.

To parse all reviews, you also need to extract the next page token with a regular expression and pass it in a POST request instead of a GET request.

The same page token also appears in the page source, somewhere in the inline <script> tags.
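As a rough sketch of the token step, the snippet below pulls a token-like string out of inline script text with a regular expression. The sample markup, the `TOKEN_PATTERN` regex, and the `extract_page_token` name are all assumptions made up for illustration. The real page embeds the token in a much larger array, so the pattern has to be adapted after inspecting the page source. The POST request itself is left out because its URL and body need to be copied from the browser's Network tab.

```python
import re
from typing import Optional

# Hypothetical pattern: a quoted run of 40+ URL-safe characters that directly
# follows `null,`. This is an assumption for the demo, not a documented format.
TOKEN_PATTERN = re.compile(r'null,\s*"([A-Za-z0-9_\-]{40,})"')


def extract_page_token(script_text: str) -> Optional[str]:
    """Return the first token-like string in the script text, or None if absent."""
    match = TOKEN_PATTERN.search(script_text)
    return match.group(1) if match else None


# Synthetic stand-in for an inline <script> payload (not real Google Play markup).
fake_token = "C" * 48
sample_script = (
    '<script>var data = [["gp:abc", "Nice app"], [null, "'
    + fake_token
    + '"]];</script>'
)

print(extract_page_token(sample_script) == fake_token)
print(extract_page_token("<script>var data = [];</script>"))
```

Returning `None` when nothing matches gives the caller a clear signal that there is no further page, or that the pattern no longer fits the page.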


Code to scrape the first 40 reviews:

# everything here could be refactored and simplified

from bs4 import BeautifulSoup
import requests, lxml, re, json
from datetime import datetime

# user-agent headers to act as a "real" user visit
headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

# search query params
params = {
    "id": "com.nintendo.zara",  # app ID
    "gl": "ES"                  # country
}


html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# temporary store user comments
app_user_comments = []

# https://regex101.com/r/SrP5DS/1
app_user_reviews_data = re.findall(r"(\[\"gp.*?);</script>",
                                str(soup.select("script")), re.DOTALL)

for review in app_user_reviews_data:
    # https://regex101.com/r/M24tiM/1
    user_name = re.findall(r"\"gp:.*?\",\s?\[\"(.*?)\",", str(review))
    
    # https://regex101.com/r/TGgR45/1
    user_avatar = [avatar.replace('"', "") for avatar in re.findall(r"\"gp:.*?\"(https.*?\")", str(review))]

    # strip all single/double quotes from the comment text
    # https://regex101.com/r/iHPOrI/1
    user_comments = [comment.replace('"', "").replace("'", "") for comment in
                    re.findall(r"gp:.*?https:.*?]]],\s?\d+?,.*?,\s?(.*?),\s?\[\d+,", str(review))]

    # https://regex101.com/r/Z7vFqa/1
    user_comment_app_rating = re.findall(r"\"gp.*?https.*?\],(.*?)?,", str(review))
    
    # https://regex101.com/r/jRaaQg/1
    user_comment_likes = re.findall(r",?\d+\],?(\d+),?", str(review))
    
    # comment utc timestamp
    # use datetime.utcfromtimestamp(int(date)).date() to have only a date
    user_comment_date = [str(datetime.utcfromtimestamp(int(date))) for date in re.findall(r"\[(\d+),", str(review))]
    
    # https://regex101.com/r/GrbH9A/1
    user_comment_id = [ids.replace('"', "") for ids in re.findall(r"\[\"(gp.*?),", str(review))]
    
    for index, (name, avatar, comment, date, comment_id, likes, user_app_rating) in enumerate(zip(
        user_name,
        user_avatar,
        user_comments,
        user_comment_date,
        user_comment_id,
        user_comment_likes,
        user_comment_app_rating), start=1):

        app_user_comments.append({
            "position": index,
            "user_name": name,
            "user_avatar": avatar,
            "comment": comment,
            "app_rating": user_app_rating,
            "comment_likes": likes,
            "comment_published_at": date,
            "comment_id": comment_id
        })
        
print(json.dumps(app_user_comments, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "position": 1,
    "user_name": "cohen rigg",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
    "comment": "This game is a good game. Its fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
    "app_rating": "2",
    "comment_likes": "22",
    "comment_published_at": "2022-04-16 09:01:36",
    "comment_id": "gp:AOqpTOGljqOaIofAehHYtGN2ay-hnYEigfYD4hgzPoLseth5l-BzPn-RaIShuKzakplra0V1E3KJIu-AfsG5mA"
  }, ... other results
  {
    "position": 40,
    "user_name": "Claire Barrett",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14GgeG3YaXc7tvjnl7kom2vYaTm4lXwS8UEDiZZV4BA",
    "comment": "After purchasing this is a super fun game, the game modes are fun and is a well executed idea. That is if you can get a \\good\\ network connection. Consistently when playing remix 10 I will get error after error and need to close the game, swipe it out of my recents, open it and wait for it to load all over again. A minor detail on top of this is the game is incredibly loud, I cant listen with music because the sound effects completely cover any music I put on.",
    "app_rating": "3",
    "comment_likes": "28",
    "comment_published_at": "2019-01-05 07:01:35",
    "comment_id": "gp:AOqpTOEOm_ilgrrHynfDLHvEusMHgvXtlwjSY-7SHBxH1Z-jgQQF62TRcFU4TQBQsFBaN1hNid3-yufUOV4IcQ"
  }
]
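Two small hardening ideas for the parser above, offered as a sketch rather than as part of the original answer. First, `datetime.utcfromtimestamp` is deprecated as of Python 3.12, so the helper below uses a timezone-aware conversion that produces the same string format. Second, `zip` silently stops at the shortest list, so a regex that misses one field drops reviews without warning; an explicit length check makes that failure visible. The names `to_utc_string` and `combine_fields` are hypothetical, and only three of the fields are shown to keep it short.

```python
from datetime import datetime, timezone


def to_utc_string(timestamp) -> str:
    """Convert a Unix timestamp (int or numeric string) to 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def combine_fields(names, comments, dates):
    """Zip parallel field lists into review dicts, failing loudly on a length mismatch."""
    if len({len(names), len(comments), len(dates)}) != 1:
        raise ValueError("regex results have different lengths; a pattern likely missed a field")
    return [
        {
            "position": position,
            "user_name": name,
            "comment": comment,
            "comment_published_at": to_utc_string(date),
        }
        for position, (name, comment, date) in enumerate(zip(names, comments, dates), start=1)
    ]


# 1650099696 corresponds to the first review's date in the output above.
reviews = combine_fields(["cohen rigg"], ["This game is a good game."], ["1650099696"])
print(reviews[0]["comment_published_at"])
```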

Alternatively, you can use the Google Play Product API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to figure out how to parse the data and maintain the parser over time, how to bypass blocks from Google, or how to implement pagination. Check out the playground.

Example code to parse the first 40 results:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "API_KEY",                # your serpapi api key
  "engine": "google_play_product",     # search engine
  "store": "apps",                     
  "gl": "es",                          # country to search from: Spain
  "product_id": "com.nintendo.zara",   # app ID
  "all_reviews": "true"                # show all reviews
}

search = GoogleSearch(params)          # where data extraction happens
results = search.get_dict()            # JSON -> Python dict

for review in results["reviews"]:
    print(json.dumps(review, indent=2))

Part of the output:

{
  "title": "cohen rigg",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
  "rating": 2.0,
  "snippet": "This game is a good game. It's fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
  "likes": 22,
  "date": "April 16, 2022"
} ... other results
{
  "title": "Claire Barrett",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14GgeG3YaXc7tvjnl7kom2vYaTm4lXwS8UEDiZZV4BA",
  "rating": 3.0,
  "snippet": "After purchasing this is a super fun game, the game modes are fun and is a well executed idea. That is if you can get a \"good\" network connection. Consistently when playing remix 10 I will get error after error and need to close the game, swipe it out of my recents, open it and wait for it to load all over again. A minor detail on top of this is the game is incredibly loud, I can't listen with music because the sound effects completely cover any music I put on.",
  "likes": 28,
}

To implement pagination, you can do something like this:

from serpapi import GoogleSearch
import json
from urllib.parse import urlsplit, parse_qsl


params = {
  "api_key": "API_KEY",                # your serpapi api key
  "engine": "google_play_product",     # search engine
  "store": "apps",                     
  "gl": "es",                          # country to search from: Spain
  "product_id": "com.nintendo.zara",   # app ID
  "all_reviews": "true"                # show all reviews
}

search = GoogleSearch(params)

# just to track what page is currently parsed
index = 0

reviews_is_present = True
while reviews_is_present:
    results = search.get_dict()        # JSON -> Python dict

    # update page number
    index += 1
    for review in results.get("reviews", []):
        
        print(f"\npage #: {index}\n")
        print(json.dumps(review, indent=2))
        
    if "next" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
    else:
        reviews_is_present = False

Part of the output:


page #: 1

{
  "title": "cohen rigg",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
  "rating": 2.0,
  "snippet": "This game is a good game. It's fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
  "likes": 22,
  "date": "April 16, 2022"
} ... other results

page #: 3

{
  "title": "Abbas Katebi",
  "avatar": "https://play-lh.googleusercontent.com/a/AATXAJx8y5Om_FMp3cpzCcQFlgSE7BYngAM6xtyZDuME=mo",
  "rating": 1.0,
  "snippet": "I purchased the game but the restore purchase button doesn't work and it says you have no content can be restored I have been trying to play world 2 for 8 days but still can't access to the full game I have been sending inquiries for 8 days but every time I sent inquiries they said restart the app how many times should I say I restarted the app for many times and it doesn't work solve my problem or give my money back I wonder why I bought the game it's support doesn't care about its customers",
  "likes": 29,
  "date": "March 10, 2022"
} ... other results
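The pagination loop above runs until no "next" link is left, which can use up a lot of API credits on an app with many reviews. Below is a minimal sketch of the same idea with a page cap. It takes the fetch step as a function so it can be run without an API key: `collect_reviews`, `next_page_params`, and `fake_fetch` are hypothetical names, and the example URL and the `next_page_token` value in the stand-in data are made up.

```python
from typing import Callable, Optional
from urllib.parse import parse_qsl, urlsplit


def next_page_params(results: dict) -> Optional[dict]:
    """Return the query parameters of the "next" URL, or None on the last page."""
    next_url = results.get("serpapi_pagination", {}).get("next")
    if not next_url:
        return None
    return dict(parse_qsl(urlsplit(next_url).query))


def collect_reviews(fetch_page: Callable[[dict], dict], params: dict, max_pages: int = 3) -> list:
    """Follow "next" links for at most max_pages pages and return all reviews."""
    params = dict(params)  # work on a copy so the caller's dict is not mutated
    reviews = []
    for _ in range(max_pages):
        results = fetch_page(params)
        reviews.extend(results.get("reviews", []))
        update = next_page_params(results)
        if update is None:
            break
        params.update(update)
    return reviews


# Stand-in for the real API call; the URL and token value are made up.
def fake_fetch(params: dict) -> dict:
    if params.get("next_page_token") == "T2":
        return {"reviews": [{"title": "second page review"}]}
    return {
        "reviews": [{"title": "first page review"}],
        "serpapi_pagination": {"next": "https://example.invalid/search?next_page_token=T2"},
    }


all_reviews = collect_reviews(fake_fetch, {"engine": "google_play_product"})
print(len(all_reviews))
```

With the real client, `fetch_page` could be something like `lambda p: GoogleSearch(p).get_dict()`, using the same `params` dictionary as in the examples above.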

Disclaimer, I work for SerpApi.

Solution 2:[2]

Give this a try: www.scrape4me.com

It does show an error, but it outputs content:

http://scrape4me.com/api?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.com2us.golfstarworldtour.normal.freefull.google.global.android.common&elm=&ch=ch

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dmitriy Zub
Solution 2 Youss