Creating a Python web scraper to get metadata for Google Play Store apps

I am very new to Python and am really interested in learning more. I have been given a task by a course I am doing currently...

  • Please write a small Python script that crawls the Google Play web store (https://play.google.com/store) for a particular app's listing, and stores the app store listing information in an output folder.
  • The script should extract the following information from the app's page: icon, title, description and screenshots.
  • I should be able to run the script with the following command: python app_fetcher.py <app_id>. The metadata should then be stored in a folder in the current directory (e.g. ./<app_id>)
  • Bonus Points! Also fetch the app's store listing subtitle, or anything else you find interesting.

I have made a start on this but am not sure how to go about the web scraping part of the script. Would anyone be able to advise? I don't know which libraries to use or which functions to call. I have looked online, but everything I found involves installing additional packages. Here is what I have so far; any help would be appreciated!

# Function to crawl Google Play Store and obtain data
import os, sys
import urllib.request, urllib.error

def web_crawl(app_id):
    try:
        # Build the URL for the app
        url = "https://play.google.com/store/apps/details?id=" + app_id

        # Open URL for reading
        response = urllib.request.urlopen(url)

        # Get path of py file to store txt file locally
        fpath = os.path.dirname(os.path.realpath(sys.argv[0]))

        # Open file to store app metadata (the with block closes it automatically)
        with open(os.path.join(fpath, "web_crawl.txt"), "w") as f:
            f.write("Google Play Store Web Crawler \n")
            f.write("Metadata for " + app_id + "\n")
            f.write("***************************************  \n")
            f.write("Icon: " + "\n")
            f.write("Title: " + "\n")
            f.write("Description: " + "\n")
            f.write("Screenshots: " + "\n")

            # Added subtitle
            f.write("Subtitle: " + "\n")
    except urllib.error.HTTPError as e:
        print("HTTP Error: ")
        print(e.code)
    except urllib.error.URLError as e:
        print("URL Error: ")
        print(e.args)

# Call web_crawl function
web_crawl("com.cmplay.tiles2")


Solution 1:[1]

I advise you to use BeautifulSoup. To start, use this code:

import requests
from bs4 import BeautifulSoup

url = "https://play.google.com/store/apps/details?id=com.cmplay.tiles2"  # app ID from the question
r = requests.get(url)
# optionally check status code here
soup = BeautifulSoup(r.text, "html.parser")

Using the soup object, you can use selectors to extract elements from the page.

Read more here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
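
To make the selector part concrete, here is a minimal sketch. The HTML below is made up for illustration and does not match the real Play Store markup, so the class names and selectors are assumptions you would need to replace after inspecting the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real Play Store page differs.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Example App</h1>
  <img class="icon" src="https://example.com/icon.png">
  <div class="screenshots">
    <img src="https://example.com/shot1.png">
    <img src="https://example.com/shot2.png">
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select_one() returns the first matching element (or None);
# select() returns a list of all matching elements.
title = soup.select_one("h1.title").get_text(strip=True)
icon_url = soup.select_one("img.icon")["src"]
screenshot_urls = [img["src"] for img in soup.select("div.screenshots img")]

print(title, icon_url, screenshot_urls)
```

On a real page, select_one() can return None when a selector no longer matches, so it is worth checking the result before reading attributes from it.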

Solution 2:[2]

To get the icon, title, description, and especially the screenshots, you have to extract them from the inline JSON using regular expressions. You could also do it with browser automation, but that would be slower.

Parsing the inline JSON is safer than parsing with CSS selectors, because the selectors are likely to change in the future.

Code and full example in the online IDE, using requests, beautifulsoup, lxml and regular expressions:

from bs4 import BeautifulSoup
import requests, lxml, re, json


def scrape_google_play_app(appname: str) -> list[dict]:

    headers = {
        "user-agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    }

    params = {
        "id": appname,
        "gl": "us"      # country
        # other search parameters
    }

    html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=10)
    soup = BeautifulSoup(html.text, "lxml")

    # where all app data will be stored
    app_data = []

    # The position of this <script> tag does not change, which is why index [12] is used; the positions of the other <script> tags do change.
    # Index [12] holds the basic app information.
    # https://regex101.com/r/DrK0ih/1
    basic_app_info = json.loads(re.findall(r"<script nonce=\".*\" type=\"application/ld\+json\">(.*?)</script>",
                                            str(soup.select("script")[12]), re.DOTALL)[0])

    app_name = basic_app_info["name"]
    app_type = basic_app_info["@type"]
    app_url = basic_app_info["url"]
    app_description = basic_app_info["description"].replace("\n", "")  # remove newline characters
    app_category = basic_app_info["applicationCategory"]
    app_operating_system = basic_app_info["operatingSystem"]
    app_main_thumbnail = basic_app_info["image"]

    app_content_rating = basic_app_info["contentRating"]
    app_rating = round(float(basic_app_info["aggregateRating"]["ratingValue"]), 1)  # 4.287856 -> 4.3
    app_reviews = basic_app_info["aggregateRating"]["ratingCount"]

    app_author = basic_app_info["author"]["name"]
    app_author_url = basic_app_info["author"]["url"]

    # https://regex101.com/r/VX8E7U/1
    app_images_data = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", str(soup.select("script")))
    # keep only URLs that occur exactly once (URLs that repeat are dropped entirely)
    app_images = [item for item in app_images_data if app_images_data.count(item) == 1]

    app_data.append({
        "app_name": app_name,
        "app_type": app_type,
        "app_url": app_url,
        "app_main_thumbnail": app_main_thumbnail,
        "app_description": app_description,
        "app_content_rating": app_content_rating,
        "app_category": app_category,
        "app_operating_system": app_operating_system,
        "app_rating": app_rating,
        "app_reviews": app_reviews,
        "app_author": app_author,
        "app_author_url": app_author_url,
        "app_screenshots": app_images
    })

    return app_data

print(json.dumps(scrape_google_play_app(appname="com.nintendo.zara"), indent=2))

Define a function and annotate the returned value:

def scrape_google_play_app(appname: str) -> list[dict]:
    ...  # function body goes here
  • appname should be a string.
  • the return value will be a list of dictionaries.

Create headers and search query params:

headers = {
    "user-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
}

params = {
    "id": appname,  # app name
    "gl": "US"      # country
}

Pass headers, params, make a request and create a BeautifulSoup object where all HTML processing will happen:

html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=10)
soup = BeautifulSoup(html.text, "lxml")
  • timeout tells requests to stop waiting for a response after 10 seconds.
  • lxml is an XML/HTML parser.
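
For intuition about what params does: requests encodes the dictionary into the URL's query string. A rough standard-library equivalent is sketched below; build_url is a hypothetical helper name, not part of requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://play.google.com/store/apps/details"


def build_url(app_id: str, country: str = "us") -> str:
    """Return the details-page URL with the query string encoded."""
    query = urlencode({"id": app_id, "gl": country})
    return f"{BASE_URL}?{query}"


url = build_url("com.nintendo.zara")
print(url)
```

Passing params to requests.get() is still the more convenient option, since it handles the encoding for you.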

Create a list to hold the app data, then match the app info from the inline JSON using a regular expression:

app_data = []

# https://regex101.com/r/DrK0ih/1
basic_app_info = json.loads(re.findall(r"<script nonce=\".*\" type=\"application/ld\+json\">(.*?)</script>",
                                        str(soup.select("script")[12]), re.DOTALL)[0])
  • json.loads() converts a JSON string to a Python dictionary.
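
The fixed [12] index only works as long as the page keeps its script order. As a possible alternative, here is a sketch that finds the JSON-LD block by its type attribute instead of its position. The sample HTML is a made-up, heavily trimmed stand-in for the real page source, and find_ld_json is a hypothetical helper name.

```python
import json
import re

# Hypothetical, heavily trimmed page source for illustration.
SAMPLE_HTML = (
    '<script nonce="abc">var x = 1;</script>'
    '<script type="application/ld+json" nonce="xyz">'
    '{"@type": "SoftwareApplication", "name": "Example App"}'
    "</script>"
)

# Match a <script> tag whose attributes include type="application/ld+json",
# regardless of where the tag sits in the document.
LD_JSON_RE = re.compile(
    r'<script\b[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def find_ld_json(html: str):
    """Return the first JSON-LD block as a dict, or None if there is none."""
    match = LD_JSON_RE.search(html)
    return json.loads(match.group(1)) if match else None


info = find_ld_json(SAMPLE_HTML)
print(info["name"])
```

Returning None when nothing matches lets the caller decide how to handle a page layout change, instead of failing with an IndexError.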

Get the data from the parsed JSON string:

app_name = basic_app_info["name"]
app_type = basic_app_info["@type"]
app_url = basic_app_info["url"]
app_description = basic_app_info["description"].replace("\n", "")  # remove newline characters
app_category = basic_app_info["applicationCategory"]
app_operating_system = basic_app_info["operatingSystem"]
app_main_thumbnail = basic_app_info["image"]

app_content_rating = basic_app_info["contentRating"]
app_rating = round(float(basic_app_info["aggregateRating"]["ratingValue"]), 1)  # 4.287856 -> 4.3
app_reviews = basic_app_info["aggregateRating"]["ratingCount"]

app_author = basic_app_info["author"]["name"]
app_author_url = basic_app_info["author"]["url"]
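
Direct indexing such as basic_app_info["aggregateRating"] raises a KeyError if a key is absent. Assuming some listings may lack some of these keys (for example, an app with no ratings yet), a more tolerant variant could look like the sketch below. extract_fields is a hypothetical helper, and it replaces newlines with a space rather than removing them, which is a different choice from the code above.

```python
def extract_fields(info: dict) -> dict:
    """Pull a few fields from the JSON-LD dict, tolerating missing keys."""
    rating = info.get("aggregateRating") or {}
    value = rating.get("ratingValue")
    return {
        "app_name": info.get("name"),
        # A space keeps sentences from running together.
        "app_description": info.get("description", "").replace("\n", " "),
        "app_rating": round(float(value), 1) if value is not None else None,
        "app_reviews": rating.get("ratingCount"),
    }


# Hypothetical input without a rating block.
sample = {"name": "Example App", "description": "Line one.\nLine two."}
fields = extract_fields(sample)
print(fields)
```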

Match screenshots data via regular expression and filter duplicates:

# https://regex101.com/r/VX8E7U/1
app_images_data = re.findall(r",\[\d{3,4},\d{3,4}\],.*?(https.*?)\"", str(soup.select("script")))
# keep only URLs that occur exactly once (URLs that repeat are dropped entirely)
app_images = [item for item in app_images_data if app_images_data.count(item) == 1]
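
Note that the count(item) == 1 filter removes every copy of a URL that appears more than once, rather than keeping one copy. If you would rather keep the first occurrence of each URL, an order-preserving de-duplication is a small change. The file names below are placeholders, not real data.

```python
# Placeholder values for illustration.
urls = ["a.png", "b.png", "a.png", "c.png"]

# Filter used above: keeps only URLs that occur exactly once.
only_unique = [u for u in urls if urls.count(u) == 1]

# Order-preserving de-duplication: keeps the first copy of every URL.
deduplicated = list(dict.fromkeys(urls))

print(only_unique)
print(deduplicated)
```

Which of the two is right depends on whether the repeated URLs are screenshots you want or other images you are happy to discard.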

Append the data to a temporary list and return it:

app_data.append({
    "app_name": app_name,
    "app_type": app_type,
    "app_url": app_url,
    "app_main_thumbnail": app_main_thumbnail,
    "app_description": app_description,
    "app_content_rating": app_content_rating,
    "app_category": app_category,
    "app_operating_system": app_operating_system,
    "app_rating": app_rating,
    "app_reviews": app_reviews,
    "app_author": app_author,
    "app_author_url": app_author_url,
    "app_screenshots": app_images
})

return app_data

Print the data:

print(json.dumps(scrape_google_play_app(appname="com.nintendo.zara"), indent=2))

Full output:

[
  {
    "app_name": "Super Mario Run",
    "app_type": "SoftwareApplication",
    "app_url": "https://play.google.com/store/apps/details/Super_Mario_Run?id=com.nintendo.zara&hl=en_US&gl=US",
    "app_main_thumbnail": "https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E",
    "app_description": "A new kind of Mario game that you can play with one hand.You control Mario by tapping as he constantly runs forward. You time your taps to pull off stylish jumps, midair spins, and wall jumps to gather coins and reach the goal!Super Mario Run can be downloaded for free and after you purchase the game, you will be able to play all the modes with no additional payment required. You can try out all four modes before purchase: World Tour, Toad Rally, Remix 10, and Kingdom Builder.\u25a0World TourRun and jump with style to rescue Princess Peach from Bowser\u2019s clutches! Travel through plains, caverns, ghost houses, airships, castles, and more.Clear the 24 exciting courses to rescue Princess Peach from Bowser, waiting in his castle at the end. There are many ways to enjoy the courses, such as collecting the 3 different types of colored coins or by competing for the highest score against your friends. You can try courses 1-1 to 1-4 for free.After rescuing Princess Peach, a nine-course special world, World Star, will appear.\u25a0Remix 10Some of the shortest Super Mario Run courses you'll ever play!This mode is Super Mario Run in bite-sized bursts! You'll play through 10 short courses one after the other, with the courses changing each time you play. Daisy is lost somewhere in Remix 10, so try to clear as many courses as you can to find her!\u25a0Toad RallyShow off Mario\u2019s stylish moves, compete against your friends, and challenge people from all over the world.In this challenge mode, the competition differs each time you play.Compete against the stylish moves of other players for the highest score as you gather coins and get cheered on by a crowd of Toads. Fill the gauge with stylish moves to enter Coin Rush Mode to get more coins. If you win the rally, the cheering Toads will come live in your kingdom, and your kingdom will grow. 
\u25a0Kingdom BuilderGather coins and Toads to build your very own kingdom.Combine different buildings and decorations to create your own unique kingdom. There are over 100 kinds of items in Kingdom Builder mode. If you get more Toads in Toad Rally, the number of buildings and decorations available will increase. With the help of the friendly Toads you can gradually build up your kingdom.\u25a0What You Can Do After Purchasing All Worlds\u30fb All courses in World Tour are playableWhy not try out the bigger challenges and thrills available in all courses?\u30fb Easier to get Rally TicketsIt's easier to get Rally Tickets that are needed to play Remix 10 and Toad Rally. You can collect them in Kingdom Builder through Bonus Game Houses and ? Blocks, by collecting colored coins in World Tour, and more.\u30fb More playable charactersIf you rescue Princess Peach by completing course 6-4 and build homes for Luigi, Yoshi, and Toadette in Kingdom Builder mode, you can get them to join your adventures as playable characters. They play differently than Mario, so why not put their special characteristics to good use in World Tour and Toad Rally?\u30fb More courses in Toad RallyThe types of courses available in Toad Rally will increase to seven different types of courses, expanding the fun! Along with the new additions, Purple and Yellow Toads may also come to cheer for you.\u30fb More buildings and decorations in Kingdom BuilderThe types of buildings available will increase, so you'll be able to make your kingdom even more lively. You can also place Rainbow Bridges to expand your kingdom.\u30fb Play Remix 10 without having to waitYou can play Remix 10 continuously, without having to wait between each game.*Internet connectivity required to play. Data charges may apply. May contain advertisements.",
    "app_content_rating": "Everyone",
    "app_category": "GAME_ACTION",
    "app_operating_system": "ANDROID",
    "app_rating": 4.0,
    "app_reviews": "1619972",
    "app_author": "Nintendo Co., Ltd.",
    "app_author_url": "https://supermariorun.com/",
    "app_screenshots": [
      "https://play-lh.googleusercontent.com/dcv6Z-pr3MsSvxYh_UiwvJem8fktDUsvvkPREnPaHYienbhT31bZ2nUqHqGpM1jdal8",
      "https://play-lh.googleusercontent.com/SVYZCU-xg-nvaBeJ-rz6rHSSDp20AK-5AQPfYwI38nV8hPzFHEqIgFpc3LET-Dmu-Q",
      "https://play-lh.googleusercontent.com/Nne-dalTl8DJ9iius5oOLmFe-4DnvZocgf92l8LTV0ldr9JVQ2BgeW_Bbjb5nkVngrQ",
      "https://play-lh.googleusercontent.com/yIqljB_Jph_T_ITmVFTpmDV0LKXVHWmsyLOVyEuSjL2794nAhTBaoeZDpTZZLahyRsE",
      "https://play-lh.googleusercontent.com/5HdGRlNsBvHTNLo-vIsmRLR8Tr9degRfFtungX59APFaz8OwxTnR_gnHOkHfAjhLse7e",
      "https://play-lh.googleusercontent.com/bPhRpYiSMGKwO9jkjJk1raR7cJjMgPcUFeHyTg_I8rM7_6GYIO9bQm6xRcS4Q2qr6mRx",
      "https://play-lh.googleusercontent.com/7DOCBRsIE5KncQ0AzSA9nSnnBh0u0u804NAgux992BhJllLKGNXkMbVFWH5pwRwHUg",
      "https://play-lh.googleusercontent.com/PCaFxQba_CvC2pi2N9Wuu814srQOUmrW42mh-ZPCbk_xSDw3ubBX7vOQeY6qh3Id3YE",
      "https://play-lh.googleusercontent.com/fQne-6_Le-sWScYDSRL9QdG-I2hWxMbe2QbDOzEsyu3xbEsAb_f5raRrc6GUNAHBoQ",
      "https://play-lh.googleusercontent.com/ql7LENlEZaTq2NaPuB-esEPDXM2hs1knlLa2rWOI3uNuQ77hnC1lLKNJrZi9XKZFb4I",
      "https://play-lh.googleusercontent.com/UIHgekhfttfNCkd5qCJNaz2_hPn67fOkv40_5rDjf5xot-QhsDCo2AInl9036huUtCwf",
      "https://play-lh.googleusercontent.com/7iH7-GjfS_8JOoO7Q33JhOMnFMK-O8k7jP0MUI75mYALK0kQsMsHpHtIJidBZR46sfU",
      "https://play-lh.googleusercontent.com/czt-uL-Xx4fUgzj_JbNA--RJ3xsXtjAxMK7Q_wFZdoMM6nL_g-4S5bxxX3Di3QTCwgw",
      "https://play-lh.googleusercontent.com/e5HMIP0FW9MCoAEGYzji9JsrvyovpZ3StHiIANughp3dovUxdv_eHiYT5bMz38bowOI",
      "https://play-lh.googleusercontent.com/nv2BP1glvMWX11mHC8GWlh_UPa096_DFOKwLZW4DlQQsrek55pY2lHr29tGwf2FEXHM",
      "https://play-lh.googleusercontent.com/xwWDr_Ib6dcOr0H0OTZkHupwSrpBoNFM6AXNzNO27_RpX_BRoZtKIULKEkigX8ETOKI",
      "https://play-lh.googleusercontent.com/AxHkW996UZvDE21HTkGtQPU8JiQLzNxp7yLoQiSCN29Y54kZYvf9aWoR6EzAlnoACQ",
      "https://play-lh.googleusercontent.com/xFouF73v1_c5kS-mnvQdhKwl_6v3oEaLebsZ2inlJqIeF2eenXjUrUPJsjSdeAd41w",
      "https://play-lh.googleusercontent.com/a1pta2nnq6f_b9uV0adiD9Z1VVQrxSfX315fIQqgKDcy8Ji0BRC1H7z8iGnvZZaeg80",
      "https://play-lh.googleusercontent.com/SDAFLzC8i4skDJ2EcsEkXidcAJCql5YCZI76eQB15fVaD0j-ojxyxea00klquLVtNAw",
      "https://play-lh.googleusercontent.com/H7BcVUoygPu8f7oIs2dm7g5_vVt9N9878f-rGd0ACd-muaDEOK2774okryFfsXv9FaI",
      "https://play-lh.googleusercontent.com/5LIMaa7WTNy34bzdFhBETa2MRj7mFJZWb8gCn_uyxQkUvFx_uOFCeQjcK16c6WpBA3E",
      "https://play-lh.googleusercontent.com/DGQjTn_Hp32i88g2YrbjrCwl0mqCPCzDjTwMkECh3wXyTv4y6zECR5VNbAH_At89jGgSJDQuSKsPSB-wVQ",
      "https://play-lh.googleusercontent.com/pzvdI66OFjncahvxJN714Tu5pHUJ_nJK--vg0tv5cpgaGNvjfwsxC-SKxoQh9_n_wEcCdSQF9FeuZeI"
    ]
  }
]
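
The original task asks for the metadata to be stored in a folder named after the app ID (./<app_id>). A minimal sketch of that step is shown here, assuming the scraped data is a dictionary like the one above. save_metadata and the metadata.json file name are hypothetical choices, and downloading the icon and screenshot files is left out.

```python
import json
import tempfile
from pathlib import Path


def save_metadata(app_id: str, app_data: dict, root: Path = Path(".")) -> Path:
    """Write app_data as JSON to <root>/<app_id>/metadata.json and return the path."""
    out_dir = root / app_id
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "metadata.json"
    out_file.write_text(
        json.dumps(app_data, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_file


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = save_metadata("com.example.app", {"app_name": "Example App"}, Path(tmp))
    saved = json.loads(path.read_text(encoding="utf-8"))

print(saved)
```

In a script run as python app_fetcher.py <app_id>, the app ID would come from sys.argv[1] and root could stay at its default of the current directory.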

Alternatively, you can use the Google Play Product API from SerpApi, which is a paid API with a free plan to test out. Check out the playground.

The difference is that you don't have to figure out how to bypass blocks from Google, solve CAPTCHAs, scale the scraper if you need to, or maintain the parser over time.

Example code to integrate:

from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os


# https://docs.python.org/3/library/os.html#os.getenv
params = {
    "api_key": os.getenv("API_KEY"),     # your serpapi api key
    "engine": "google_play_product",     # search engine
    "store": "apps",                     
    "gl": "us",                          # country to search from: United States
    "product_id": "com.nintendo.zara",   # app ID
    "all_reviews": "true"                # show all reviews
}

search = GoogleSearch(params)          # where data extraction happens

# page number
index = 0

reviews_is_present = True
while reviews_is_present:
    results = search.get_dict()        # JSON -> Python dict

    # update page number
    index += 1
    for review in results.get("reviews", []):
        
        print(f"\npage #: {index}\n")
        print(json.dumps(review, indent=2))
    
    # check if next page is present
    # if present -> splits URL in parts as a dict
    # and passes to GoogleSearch() class with new page data
    if "next" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next")).query)))
    else:
        reviews_is_present = False

Outputs:

page #: 1

{
  "title": "Hervey Carraway",
  "avatar": "https://play-lh.googleusercontent.com/a/AATXAJzsEl1do3ADXzVM157yNQWAu-osNvMg3nyDykNq=mo",
  "rating": 4.0,
  "snippet": "Re-installed Super Mario Run on a new device, and having had my Nintendo Account and Google Play Account previously linked to my game, some of my unlocks were reflected, characters unlocked, achievements and cosmetics were still in my kingdom, but the game is trying to prompt me to re-purchase the full-access to play the full Tour mode. Fun game for the occasional burst of Mario action on the go, but can't recommend at all if progress isn't retained even after doing the proper backup steps.",
  "likes": 357,
  "date": "February 21, 2022"
} ... other reviews

page #: 4

{
  "title": "Ellie-Ann Cowan",
  "avatar": "https://play-lh.googleusercontent.com/a/AATXAJxM17sISGHGOPIHsJMhOCAvWpDNr5o2rGZrVOkj=mo",
  "rating": 4.0,
  "snippet": "Great game!I only rated \ud83c\udf1f \ud83c\udf1f \ud83c\udf1f\ud83c\udf1f because of the thing where u buy the rest of the game I know it's only cheap but I never buy from any game, not even to complete it.but there is another reason 4 it to.it never saves when u delete it.becaise I accidentally deleted it by accedent.I would defo recommend to everyone interested in mario.",
  "likes": 122,
  "date": "March 07, 2022"
} ... other reviews
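
The pagination step in the loop above turns the "next" URL into a dictionary of query parameters before updating the search. Here is that conversion in isolation, using only the standard library. The URL and its parameter names (including next_page_token) are made up for illustration and may not match what the API actually returns.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical "next" URL; the real one comes from results["serpapi_pagination"]["next"].
next_url = (
    "https://serpapi.com/search.json"
    "?engine=google_play_product&product_id=com.nintendo.zara&next_page_token=abc123"
)

# urlsplit() isolates the query string, parse_qsl() turns it into (key, value)
# pairs, and dict() makes it usable with params_dict.update().
next_params = dict(parse_qsl(urlsplit(next_url).query))

print(next_params)
```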

I also wrote a line-by-line blog post tutorial at SerpApi, Scrape Google Play Store App in Python. Additionally, an easier approach would be to use the Python google-play-search-scraper package to do everything for you.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Zun
Solution 2