Web scraping Airbnb price data with Python
I have been trying to scrape an Airbnb listing page to obtain the price, without much luck. I have successfully pulled in the other fields of interest (home description, home location, reviews, etc.). Below is what I've tried unsuccessfully. I suspect the issue is that the "price" on the web page is in a 'span class' while the others are in a 'div class', but I'm speculating.
A listing URL can be placed as the input in the code below.
Any assistance would be greatly appreciated.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from bs4 import BeautifulSoup
import requests
from IPython.display import IFrame
input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1 , '.) ' , airbnb_list[x])
        x=x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")
a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
                   "Location":'', "Home Type":'', "House Desc":''}])
for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    stars = soup.find(class_='_c7v1se').get_text()
    desc = soup.find(class_='_12nksyy').get_text()
    size = soup.find(class_='_jro6t0').get_text()
    #checkIn = soup.find(class_='_1acx77b').get_text()
    checkIn = soup.find(class_='_12aeg4v').get_text()
    #checkOut = soup.find(class_='_14tl4ml5').get_text()
    checkOut = soup.find(class_='_12aeg4v').get_text()
    Rules = soup.find(class_='cihcm8w dir dir-ltr').get_text()
    #location = soup.find(class_='_9ns6hl').get_text()
    location = soup.find(class_='_152qbzi').get_text()
    HomeType = soup.find(class_='_b8stb0').get_text()
    title = soup.title.string
    print('Stars: ', stars)
    print('')
    #Home Type
    print('Home Type: ', HomeType)
    print('')
    #Space Description
    print('Description: ', desc)
    print('')
    print('Rental size: ',size)
    print('')
    #CheckIn
    print('Check In: ', checkIn)
    print('')
    #CheckOut
    print('Check Out: ', checkOut)
    print('')
    #House Rules
    print('House Rules: ',Rules)
    print('')
    #print(soup.find("button", {"id":"#Id name of the button"}))
    #Home Location
    print('Home location: ', location)
    #Dates available
    #print('Dates available: ', soup.find(class_='_1yhfti2').get_text())
    print('===================================================================================')
    df = pd.DataFrame([{"Title":title, "Stars": stars, "Size":size, "Check In":checkIn, "Check Out":checkOut, "Rules":Rules,
                        "Location":location, "Home Type":HomeType, "House Desc":desc}])
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    a = pd.concat([a, df])
#Attempting to print the price tag on the website
print(soup.find_all('span', {'class': '_tyxjp1'}))
print(soup.find(class_='_tyxjp1').get_text())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-10-2d9689dbc836> in <module>
1 #print(soup.find_all('span', {'class': '_tyxjp1'}))
----> 2 print(soup.find(class_='_tyxjp1').get_text())
AttributeError: 'NoneType' object has no attribute 'get_text'
Solution 1:[1]
I see you are using the requests module to scrape Airbnb.
That module is extremely versatile and works well on websites that have static content.
However, it has one major drawback: it doesn't render content created by JavaScript.
This is a problem, as most websites these days create additional HTML elements using JavaScript once the user lands on the web page.
The Airbnb price block is created exactly like that, using JavaScript. It is therefore missing from the HTML that requests downloads, so soup.find returns None and calling get_text() on it raises the AttributeError. The fact that the price sits in a span rather than a div is not the cause.
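One way to confirm this diagnosis before reaching for a browser is to check whether the class name appears at all in the HTML as served. The sketch below uses only the standard library's HTMLParser so it runs without network access. The two HTML strings are made-up stand-ins for what requests.get(url).text might return, and ClassCollector and class_in_static_html are hypothetical helper names, not part of any library.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a class attribute may hold several names
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def class_in_static_html(html, class_name):
    """Return True if class_name occurs in the HTML as served (no JavaScript is run)."""
    collector = ClassCollector()
    collector.feed(html)
    return class_name in collector.classes


# Made-up examples: the first mimics the raw download, the second the rendered page.
static_html = '<div class="_12nksyy">Cozy flat</div><script>/* builds the price later */</script>'
rendered_html = static_html + '<span class="_tyxjp1">$120</span>'

print(class_in_static_html(static_html, "_tyxjp1"))    # False
print(class_in_static_html(rendered_html, "_tyxjp1"))  # True
```

If the check comes back False for the downloaded page but the element is visible in the browser's inspector, the element is most likely being added by JavaScript after the page loads.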
There are many ways to scrape that kind of content. My favourite way is to use Selenium, a library that lets you launch a real browser and control it from your programming language of choice.
Here's how you can use Selenium.
First, set it up. Notice the headless option, which can be toggled on and off. Toggle it off if you want to see how the browser loads the web page.
# set up Selenium (I am using Chrome here, so Chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# remove the next line if you want to see how the Chrome window loads Airbnb - useful for debugging
# (options.headless = True was deprecated and later removed in Selenium 4)
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
Then, navigate to the website
# navigate to airbnb
driver.get(url)
Next, wait until the price block loads. It might appear near-instantaneous to us, but depending on the speed of your internet connection it might take a few seconds.
# wait until the price block loads
timeout = 10
expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
price_element = WebDriverWait(driver, timeout).until(expectation)
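For intuition, an explicit wait like the one above boils down to polling a condition until it succeeds or a deadline passes. The sketch below is a plain-Python illustration of that idea, not Selenium's actual implementation. The names wait_until, WaitTimeout and price_block_ready are invented for the example, and the toy condition simply succeeds on its third call.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Toy condition: the "element" shows up on the third check,
# much like a price block that JavaScript adds after the page loads.
attempts = {"count": 0}


def price_block_ready():
    attempts["count"] += 1
    return "$120" if attempts["count"] >= 3 else None


print(wait_until(price_block_ready, timeout=2.0, poll_interval=0.01))  # $120
```

With Selenium itself, a wait that runs out raises selenium.common.exceptions.TimeoutException, so wrapping the until(...) call in a try/except for that exception lets the loop skip a listing whose price never appears.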
And finally, print the price
# print the price
print(price_element.get_attribute('innerHTML'))
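The innerHTML comes back as a string, so it usually needs converting before it can go into a numeric column. A minimal sketch follows, assuming the text looks like "$120" or "$1,250" with comma thousands separators and a dot for decimals. That format is an assumption about what the page shows, and parse_price is a hypothetical helper name.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first number found in text as a Decimal, or None if there is none."""
    # digits, optional thousands commas, optional decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


print(parse_price("$1,250"))        # 1250
print(parse_price("$89.50 night"))  # 89.50
print(parse_price("n/a"))           # None
```

Listings priced in currencies that use other separators (for example "1.250,00 €") would need a different pattern.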
I added my code to your example so you could play around with it
import chromedriver_autoinstaller
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.common.by import By
input_string = input("""Enter URLs for AirBnB sites that you want webscraped AND separate by a ',' : """)
airbnb_list = []
try:
    airbnb_list = input_string.split(",")
    x = 0
    y = len(airbnb_list)
    while y >= x:
        print(x+1 , '.) ' , airbnb_list[x])
        x=x+1
        if y == x:
            break
    #print(airbnb_list[len(airbnb_list)])
except:
    print("""Please separate list by a ','""")
a = pd.DataFrame([{"Title":'', "Stars": '', "Size":'', "Check In":'', "Check Out":'', "Rules":'',
                   "Location":'', "Home Type":'', "House Desc":''}])
# set up Selenium (I am using Chrome here, so Chrome has to be installed on your system)
chromedriver_autoinstaller.install()
options = Options()
# remove the next line if you want to see how the Chrome window loads Airbnb - useful for debugging
# (options.headless = True was deprecated and later removed in Selenium 4)
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
for x in range(len(airbnb_list)):
    url = airbnb_list[x]
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # navigate to airbnb
    driver.get(url)
    # wait until the price block loads
    timeout = 10
    expectation = EC.presence_of_element_located((By.CSS_SELECTOR, '._tyxjp1'))
    price_element = WebDriverWait(driver, timeout).until(expectation)
    # print the price
    print(price_element.get_attribute('innerHTML'))
# close the browser once all URLs have been processed
driver.quit()
Keep in mind that your IP might eventually get banned for scraping Airbnb. To work around that, it is a good idea to use proxy IPs and rotate them. Follow a rotating proxies tutorial to avoid getting blocked.
Hope that helps!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Zyy |