How to scrape data from a dynamic website containing JavaScript using Python?
I am trying to scrape data from https://www.doordash.com/food-delivery/chicago-il-restaurants/
The idea is to scrape all the data regarding the different restaurant listings on the website. The site is divided into different cities, but I only require restaurant data for Chicago.
All restaurant listings for the city have to be scraped, along with any other relevant data about the respective restaurants (e.g. reviews, rating, cuisine, address, state, etc.). I need to capture all of these details (currently 4,326 listings) for the city in Excel.
I have tried to extract the restaurant name, cuisine, ratings, and reviews inside the class named "StoreCard_root___1p3uN", but no data is displayed. The output is blank.
from selenium import webdriver

chrome_path = r"D:\python project\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.doordash.com/food-delivery/chicago-il-restaurants/")
driver.find_element_by_xpath("""//*[@id="SeoApp"]/div/div[1]/div/div[2]/div/div[2]/div/div[2]/div[1]/div[3]""").click()

posts = driver.find_elements_by_class_name("StoreCard_root___1p3uN")
for post in posts:
    print(post.text)
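For reference, a likely reason the output is blank is that the store cards are rendered by client-side JavaScript after the initial page load, so the elements do not exist yet when the script looks for them. Below is a minimal sketch of waiting for them explicitly, using current Selenium 4 syntax; the class name is the generated CSS-module name from the question and may change at any time.

```python
# A minimal sketch, assuming the cards are injected by JavaScript after load.
# The class name below comes from the question and is a generated name that may change.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Selenium 4.6+ locates the driver binary automatically
driver.get("https://www.doordash.com/food-delivery/chicago-il-restaurants/")

# Wait up to 20 seconds for at least one store card to appear in the DOM
wait = WebDriverWait(driver, 20)
cards = wait.until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "StoreCard_root___1p3uN"))
)

for card in cards:
    print(card.text)

driver.quit()
```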
Solution 1:[1]
You can use the API URL instead, as the data is actually rendered from it via an XHR request. Iterate over the API link below and scrape whatever you want.
You just loop over the offset parameter, starting at offset=0 and increasing it by 50 each time, as each page returns 50 items, until you reach 4300, which is the last page: simply range(0, 4350, 50).
import requests
import pandas as pd

data = []
# Each request returns 50 stores; step the offset by 50 until the last page
for offset in range(0, 4350, 50):
    print(f"Extracting offset# {offset}")
    r = requests.get(
        f"https://api.doordash.com/v2/seo_city_stores/?delivery_city_slug=chicago-il-restaurants&store_only=true&limit=50&offset={offset}").json()
    for store in r['store_data']:
        data.append((store['name'], store['city'], store['category'],
                     store['num_ratings'], store['average_rating'], store['average_cost']))

df = pd.DataFrame(
    data, columns=['Name', 'City', 'Category', 'Num Ratings', 'Average Ratings', 'Average Cost'])
df.to_csv('output.csv', index=False)
print("done")
Solution 2:[2]
I was faced with this issue too, but I solved it using Selenium and BeautifulSoup by doing the following:
- Make sure the script clicks the button to reveal the menu and prices, if necessary.
- The menu and prices have to be post-processed, because they might come out as nested lists after parsing, so the get_text() function won't work on them right away (a minimal sketch of this flattening step follows this list). The code and explanation can be found in this Medium article.
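A minimal, illustrative sketch of the flattening described in the second point, assuming BeautifulSoup elements and made-up markup; the real DoorDash structure and class names will differ:

```python
# Illustrative only: flatten nested menu/price elements parsed with BeautifulSoup.
# The HTML below is a stand-in; the real DoorDash markup and class names will differ.
from bs4 import BeautifulSoup

html = """
<div class="menu">
  <ul>
    <li>Burger <ul><li>$9.99</li></ul></li>
    <li>Fries <ul><li>$3.49</li></ul></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() on the whole block would mash names and prices together,
# so separate the innermost <li> tags (prices) from the outer ones (names).
prices = [li.get_text(strip=True) for li in soup.select("li") if li.find("ul") is None]
names = [li.contents[0].strip() for li in soup.select("li") if li.find("ul") is not None]

print(list(zip(names, prices)))  # [('Burger', '$9.99'), ('Fries', '$3.49')]
```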
Solution 3:[3]
I have checked out the API mentioned in Solution 1. They also had an endpoint for restaurant info.
URL https://api.doordash.com/v2/restaurant/[restaurantId]/
It was working until recently when it started returning {"detail":"Request was throttled."}
Has anyone had the same issue / found a workaround?
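One generic workaround for throttled responses (not confirmed to work against this particular endpoint) is to slow down and retry with exponential backoff. A minimal sketch, using the endpoint above with a placeholder restaurant ID:

```python
# Illustrative sketch: polite retries with exponential backoff when the API
# answers with a throttling response. The restaurant ID is a placeholder.
import time
import requests

def get_restaurant(restaurant_id, max_retries=5):
    url = f"https://api.doordash.com/v2/restaurant/{restaurant_id}/"
    delay = 2  # seconds; doubled after each throttled attempt
    for attempt in range(max_retries):
        r = requests.get(url)
        if r.status_code == 429 or "throttled" in r.text.lower():
            time.sleep(delay)
            delay *= 2
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Still throttled after {max_retries} attempts")

# Example call with a made-up ID:
# info = get_restaurant(12345)
```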
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 |
Solution 3 | Stuart Murless