Scraping Google Play reviews

I am new to programming and recently tried to scrape Google Play reviews with Python using the following program:

from bs4 import BeautifulSoup
import urllib.request

url = input("Enter URL: ")
open_url = urllib.request.urlopen(url)

soup = BeautifulSoup(open_url, "html.parser")

reviews = []
for i in soup.find_all("div", {"jscontroller" : "X"}, {"class" : "X"}):
    per_review = i.find("X")
    reviews.append(per_review)

print(reviews)  

The problem is in this section:

for i in soup.find_all("div", {"jscontroller" : "X"}, {"class" : "X"}):
    per_review = i.find("X")
    reviews.append(per_review) 

I have tried many parent nodes, as well as the nodes that directly contain the reviews, but the output is always an empty list. Could somebody demonstrate how to achieve what I was intending to do? Thanks.

Edit

For example, if I use the URL for Super Mario Run with the following parameters:

reviews = []
for i in soup.find_all("div", {"jscontroller" : "LVJlx"}, {"class" : "UD7Dzf"}):
    per_review = i.find("span")
    reviews.append(per_review)

print(reviews)    

The output is an empty list.



Solution 1:[1]

The jscontroller and class values won't be consistent across different URLs. You could try something like:

soup.find_all('div', {'jscontroller': True}) 

But that will not give you all the reviews as they are dynamically added when you scroll down the page.

That means you need to scrape the page with an actual browser or you can try to reverse engineer the API calls by using Developer Tools.

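A side note on the loop in the question: it passes two dictionaries to find_all. BeautifulSoup's find_all takes its arguments in the order name, attrs, recursive, so the second dictionary ends up as the recursive argument and the class filter is never applied. The sketch below shows the difference on made-up markup (the attribute values "abc" and "review" are assumptions for illustration, not real Google Play values). It does not solve the dynamic loading problem described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the attribute values are invented for this example.
html = """
<div jscontroller="abc" class="review"><span>Great game</span></div>
<div jscontroller="abc" class="other"><span>Not a review</span></div>
<div class="review"><span>No controller here</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Two dicts: the second one is taken as `recursive`, so class is not checked.
loose = soup.find_all("div", {"jscontroller": "abc"}, {"class": "review"})

# One dict: both attributes have to match.
strict = soup.find_all("div", {"jscontroller": "abc", "class": "review"})

# True matches any div that has the attribute, whatever its value.
any_controller = soup.find_all("div", {"jscontroller": True})

loose_texts = [div.span.get_text() for div in loose]
strict_texts = [div.span.get_text() for div in strict]
print(loose_texts)
print(strict_texts)
print(len(any_controller))
```

Putting every attribute in a single dictionary is the form that filters on both at once.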

Solution 2:[2]

You can extract the review data from the inline <script> tags with a regular expression, and then use further regular expressions to pull out the user name, avatar, comment, and so on.

Pagination works in a similar way: parse the next page token with a regular expression, then make a POST request instead of a GET.

(Screenshots omitted: the Network tab, and the page source where the same page token appears somewhere in the <script> tags.)
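To make the idea concrete before the full script, here is a toy version on a made-up page. The callback name initData, the payload layout, and the IDs are all assumptions; the real Google Play payload is far more nested. Note that this sketch swaps the per-field regular expressions for json.loads on the captured array, which is a different technique from the one used in the script that follows.

```python
import json
import re

# Hypothetical page source; the callback name and payload are invented.
html = """
<html><body>
<script>var unrelated = 1;</script>
<script>initData({key: 'reviews', data: [["gp:id-1", "Alice", 5], ["gp:id-2", "Bob", 3]]});</script>
</body></html>
"""

# Capture the array literal that follows "data:" inside the inline script.
match = re.search(r"data:\s*(\[\[.*?\]\])\s*\}\);", html, re.DOTALL)

reviews = []
if match:
    # The captured text is valid JSON here, so it can be decoded directly.
    rows = json.loads(match.group(1))
    reviews = [
        {"comment_id": comment_id, "name": name, "rating": rating}
        for comment_id, name, rating in rows
    ]

print(json.dumps(reviews, indent=2))
```

Decoding the whole array once tends to be less brittle than one regular expression per field, but it only works when the captured text is valid JSON.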


Code to scrape the first 40 reviews:

# everything here could be refactored to look more simplified

from bs4 import BeautifulSoup
import requests, lxml, re, json
from datetime import datetime

# user-agent headers to act as a "real" user visit
headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
}

# search query params
params = {
    "id": "com.nintendo.zara",  # app ID (package name)
    "gl": "ES"                  # country
}


html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# temporary store user comments
app_user_comments = []

# https://regex101.com/r/SrP5DS/1
app_user_reviews_data = re.findall(r"(\[\"gp.*?);</script>",
                                str(soup.select("script")), re.DOTALL)

for review in app_user_reviews_data:
    # https://regex101.com/r/M24tiM/1
    user_name = re.findall(r"\"gp:.*?\",\s?\[\"(.*?)\",", str(review))
    
    # https://regex101.com/r/TGgR45/1
    user_avatar = [avatar.replace('"', "") for avatar in re.findall(r"\"gp:.*?\"(https.*?\")", str(review))]

    # replace single/double quotes at the start/end of a string
    # https://regex101.com/r/iHPOrI/1
    user_comments = [comment.replace('"', "").replace("'", "") for comment in
                    re.findall(r"gp:.*?https:.*?]]],\s?\d+?,.*?,\s?(.*?),\s?\[\d+,", str(review))]

    # https://regex101.com/r/Z7vFqa/1
    user_comment_app_rating = re.findall(r"\"gp.*?https.*?\],(.*?)?,", str(review))
    
    # https://regex101.com/r/jRaaQg/1
    user_comment_likes = re.findall(r",?\d+\],?(\d+),?", str(review))
    
    # comment utc timestamp
    # use datetime.utcfromtimestamp(int(date)).date() to have only a date
    user_comment_date = [str(datetime.utcfromtimestamp(int(date))) for date in re.findall(r"\[(\d+),", str(review))]
    
    # https://regex101.com/r/GrbH9A/1
    user_comment_id = [ids.replace('"', "") for ids in re.findall(r"\[\"(gp.*?),", str(review))]
    
    for index, (name, avatar, comment, date, comment_id, likes, user_app_rating) in enumerate(zip(
        user_name,
        user_avatar,
        user_comments,
        user_comment_date,
        user_comment_id,
        user_comment_likes,
        user_comment_app_rating), start=1):

        app_user_comments.append({
            "position": index,
            "user_name": name,
            "user_avatar": avatar,
            "comment": comment,
            "app_rating": user_app_rating,
            "comment_likes": likes,
            "comment_published_at": date,
            "comment_id": comment_id
        })
        
print(json.dumps(app_user_comments, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "position": 1,
    "user_name": "cohen rigg",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
    "comment": "This game is a good game. Its fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
    "app_rating": "2",
    "comment_likes": "22",
    "comment_published_at": "2022-04-16 09:01:36",
    "comment_id": "gp:AOqpTOGljqOaIofAehHYtGN2ay-hnYEigfYD4hgzPoLseth5l-BzPn-RaIShuKzakplra0V1E3KJIu-AfsG5mA"
  }, ... other results
  {
    "position": 40,
    "user_name": "Claire Barrett",
    "user_avatar": "https://play-lh.googleusercontent.com/a-/AOh14GgeG3YaXc7tvjnl7kom2vYaTm4lXwS8UEDiZZV4BA",
    "comment": "After purchasing this is a super fun game, the game modes are fun and is a well executed idea. That is if you can get a \\good\\ network connection. Consistently when playing remix 10 I will get error after error and need to close the game, swipe it out of my recents, open it and wait for it to load all over again. A minor detail on top of this is the game is incredibly loud, I cant listen with music because the sound effects completely cover any music I put on.",
    "app_rating": "3",
    "comment_likes": "28",
    "comment_published_at": "2019-01-05 07:01:35",
    "comment_id": "gp:AOqpTOEOm_ilgrrHynfDLHvEusMHgvXtlwjSY-7SHBxH1Z-jgQQF62TRcFU4TQBQsFBaN1hNid3-yufUOV4IcQ"
  }
]
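The comment_published_at values above come from Unix timestamps in seconds. The script uses datetime.utcfromtimestamp, which is deprecated since Python 3.12. A minimal sketch of the timezone-aware equivalent follows; the sample timestamp is an assumption, picked so that it corresponds to the first review's date shown above.

```python
from datetime import datetime, timezone

# Assumed sample value: a Unix timestamp in seconds, as a string like re.findall returns.
raw_timestamp = "1650099696"

# Build an aware datetime in UTC instead of a naive one.
published = datetime.fromtimestamp(int(raw_timestamp), tz=timezone.utc)

# Same text format as str(datetime.utcfromtimestamp(...)) in the script.
full = published.strftime("%Y-%m-%d %H:%M:%S")

# Date only, for when the time of day is not needed.
date_only = published.date().isoformat()

print(full)
print(date_only)
```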

Alternatively, you can use the Google Play Product API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to figure out how to parse the data and then maintain the parser over time, how to bypass blocks from Google, or how to implement pagination. Check out the playground.

Example code to parse the first 40 results:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "API_KEY",                # your serpapi api key
  "engine": "google_play_product",     # search engine
  "store": "apps",                     
  "gl": "es",                          # country to search from: Spain
  "product_id": "com.nintendo.zara",   # app ID
  "all_reviews": "true"                # show all reviews
}

search = GoogleSearch(params)          # where data extraction happens
results = search.get_dict()            # JSON -> Python dict

for review in results["reviews"]:
    print(json.dumps(review, indent=2))

Part of the output:

{
  "title": "cohen rigg",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
  "rating": 2.0,
  "snippet": "This game is a good game. It's fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
  "likes": 22,
  "date": "April 16, 2022"
} ... other results
{
  "title": "Claire Barrett",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14GgeG3YaXc7tvjnl7kom2vYaTm4lXwS8UEDiZZV4BA",
  "rating": 3.0,
  "snippet": "After purchasing this is a super fun game, the game modes are fun and is a well executed idea. That is if you can get a \"good\" network connection. Consistently when playing remix 10 I will get error after error and need to close the game, swipe it out of my recents, open it and wait for it to load all over again. A minor detail on top of this is the game is incredibly loud, I can't listen with music because the sound effects completely cover any music I put on.",
  "likes": 28,
}

To implement pagination, you can do something like this:

from serpapi import GoogleSearch
import json
from urllib.parse import urlsplit, parse_qsl


params = {
  "api_key": "API_KEY",                # your serpapi api key
  "engine": "google_play_product",     # search engine
  "store": "apps",                     
  "gl": "es",                          # country to search from: Spain
  "product_id": "com.nintendo.zara",   # app ID
  "all_reviews": "true"                # show all reviews
}

search = GoogleSearch(params)

# just to track what page is currently parsed
index = 0

reviews_is_present = True
while reviews_is_present:
    results = search.get_dict()        # JSON -> Python dict

    # update page number
    index += 1
    for review in results.get("reviews", []):
        
        print(f"\npage #: {index}\n")
        print(json.dumps(review, indent=2))
        
    if "next" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
    else:
        reviews_is_present = False

Part of the output:


page #: 1

{
  "title": "cohen rigg",
  "avatar": "https://play-lh.googleusercontent.com/a-/AOh14Gj8OIwh0fMd13LXsIhaULJtqe39k3HwjH7AIZjTdQ",
  "rating": 2.0,
  "snippet": "This game is a good game. It's fun to simply pick up and play. However I have two major problems that make me never want to play this game again. 1. Microtransactions. You have to buy the entire game/story mode and wait for tickets to play different modes like remix 10 or toad rally. But you could just pay up and not wait. 2. Wifi problems. Not sure if this is my problem but the game never wants to work. At the end of the day there are better games on mobile worth more of your time.",
  "likes": 22,
  "date": "April 16, 2022"
} ... other results

page #: 3

{
  "title": "Abbas Katebi",
  "avatar": "https://play-lh.googleusercontent.com/a/AATXAJx8y5Om_FMp3cpzCcQFlgSE7BYngAM6xtyZDuME=mo",
  "rating": 1.0,
  "snippet": "I purchased the game but the restore purchase button doesn't work and it says you have no content can be restored I have been trying to play world 2 for 8 days but still can't access to the full game I have been sending inquiries for 8 days but every time I sent inquiries they said restart the app how many times should I say I restarted the app for many times and it doesn't work solve my problem or give my money back I wonder why I bought the game it's support doesn't care about its customers",
  "likes": 29,
  "date": "March 10, 2022"
} ... other results
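The key step in the pagination loop above is turning the "next" URL into a dictionary of query parameters and merging it into the current ones. Here is that step in isolation, using only the standard library. The URL and the token value below are placeholders made up for illustration, not a real response.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical "next" URL; the token is a placeholder, not a real one.
next_url = (
    "https://serpapi.com/search.json"
    "?engine=google_play_product"
    "&product_id=com.nintendo.zara"
    "&next_page_token=PLACEHOLDER"
)

# Parameters of the current request.
params = {
    "engine": "google_play_product",
    "product_id": "com.nintendo.zara",
    "all_reviews": "true",
}

# urlsplit isolates the query string, parse_qsl turns it into (key, value) pairs.
next_params = dict(parse_qsl(urlsplit(next_url).query))

# Existing keys are overwritten, new ones (the page token) are added.
params.update(next_params)

print(params)
```

Because update only touches keys present in the "next" URL, parameters such as all_reviews carry over to the following request unchanged.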

Disclaimer: I work for SerpApi.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2