How to scrape all data from the first page to the last page using BeautifulSoup
I have been trying to scrape all data from the first page to the last page, but it returns only the first page as the output. How can I solve this? Below is my code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

pages = np.arange(2, 1589, 20)

for page in pages:
    page = requests.get("https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page=" + str(page))
    sleep(randint(2, 10))
    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('div', class_="project-card-vertical h-full flex flex-col rounded border-thin border-inactive-blue overflow-hidden pointer")
    for list in lists:
        title = list.find('p', class_="project-location text-body text-base mb-3").text.replace('\n', '').strip()
        location = list.find('span', class_="text-gray-1").text.replace('\n', '').strip()
        status = list.find('span', class_="text-purple-1 font-bold").text.replace('\n', '').strip()
        units = list.find('span', class_="text-body font-semibold").text.replace('\n', '').strip()
        info = [title, location, status, units]
        print(info)
Solution 1:[1]
The page is loaded dynamically through an API, so a regular GET request for the HTML will always return only the first page of results. You need to study how the page communicates with the server in the browser's developer tools and find the request you need. I wrote an example for review:
import requests

def get_info(page):
    # Query the JSON API directly instead of parsing the rendered HTML
    url = f"https://services.estateintel.com/api/v2/properties?type[]=residential&page={page}"
    headers = {
        'accept': 'application/json',
        'authorization': 'false',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    for data in response.json()['data']:
        print(data['name'])
        print(data['area'], data['state'])
        print(data['status'])
        print(data['size']['value'], data['size']['unit'])
        print('------')

for page in range(1, 134):
    get_info(page)
This is just an example; you can choose whichever fields you need, and you can also collect them into a DataFrame instead of printing (a sketch follows the output below). Output:
Twin Oaks Apartment
Kilimani Nairobi
Completed
0 units
------
Duchess Park
Lavington Nairobi
Completed
62 units
------
Greenvale Apartments
Kileleshwa Nairobi
Completed
36 units
------
The Urban apartments & Suites
Osu Greater Accra
Completed
28 units
------
Chateau Towers
Osu Greater Accra
Completed
120 units
------
Cedar Haus Gardens
Oluyole Oyo
Under Construction
38 units
------
10 Agoro Street
Oluyole Oyo
Completed
1 units
..............
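As a minimal sketch of the "collect into a DataFrame" suggestion above, the same records can be accumulated into a list of dicts and handed to pandas. The endpoint, headers, field names, and the page count of 133 are taken from the example above; the scrape_all helper and rows list are illustrative names, not part of the original answer:

import pandas as pd
import requests

HEADERS = {
    'accept': 'application/json',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'
}

def scrape_all(last_page):
    # Collect every page's records into a list of dicts, then build a DataFrame
    rows = []
    for page in range(1, last_page + 1):
        url = f"https://services.estateintel.com/api/v2/properties?type[]=residential&page={page}"
        payload = requests.get(url, headers=HEADERS).json()
        for item in payload['data']:
            rows.append({
                'name': item['name'],
                'location': f"{item['area']} {item['state']}",
                'status': item['status'],
                'units': f"{item['size']['value']} {item['size']['unit']}",
            })
    return pd.DataFrame(rows)

df = scrape_all(133)
print(df.head())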
Solution 2:[2]
Think it is working well, it just needs time because of the sleep between requests. Just in case, you could select your elements more specifically, e.g. with CSS selectors, and store the information in a list of dicts instead of just printing it.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint

data = []

for page in range(1, 134):
    print(page)
    response = requests.get("https://estateintel.com/app/projects/search?q=%7B%22sectors%22%3A%5B%22residential%22%5D%7D&page=" + str(page))
    sleep(randint(2, 10))
    soup = BeautifulSoup(response.content, 'html.parser')

    for item in soup.select('div.project-grid > a'):
        data.append({
            'title': item.h3.text.strip(),
            'location': item.find('span', class_="text-gray-1").text.strip(),
            'status': item.find('span', class_="text-purple-1 font-bold").text.strip(),
            'units': item.find('span', class_="text-body font-semibold").text.strip()
        })

pd.DataFrame(data)
Output
 | title | location | status | units |
---|---|---|---|---|
0 | Twin Oaks Apartment | Kilimani, Nairobi | Completed | Size: -- |
1 | Duchess Park | Lavington, Nairobi | Completed | Size: 62 units |
2 | Greenvale Apartments | Kileleshwa, Nairobi | Completed | Size: 36 units |
3 | The Urban apartments & Suites | Osu, Greater Accra | Completed | Size: 28 units |
4 | Chateau Towers | Osu, Greater Accra | Completed | Size: 120 units |
5 | Cedar Haus Gardens | Oluyole, Oyo | Under Construction | Size: 38 units |
6 | 10 Agoro Street | Oluyole, Oyo | Completed | Size: 1 units |
7 | Villa O | Oluyole, Oyo | Completed | Size: 2 units |
8 | Avenue Road Apartments | Oluyole, Oyo | Completed | Size: 6 units |
9 | 15 Alafia Street | Oluyole, Oyo | Completed | Size: 4 units |
10 | 12 Saint Mary Street | Oluyole, Oyo | Nearing Completion | Size: 8 units |
11 | RATCON Estate | Oluyole, Oyo | Completed | Size: -- |
12 | 1 Goodwill Road | Oluyole, Oyo | Completed | Size: 4 units |
13 | Anike's Court | Oluyole, Oyo | Completed | Size: 3 units |
14 | 9 Adeyemo Quarters | Oluyole, Oyo | Completed | Size: 4 units |
15 | Marigold Residency | Nairobi West, Nairobi | Under Construction | Size: -- |
16 | Kings Distinction | Kilimani, Nairobi | Completed | Size: -- |
17 | Riverview Apartments | Kyumvi, Machakos | Completed | Size: -- |
18 | Serene Park | Kyumvi, Machakos | Under Construction | Size: -- |
19 | Gitanga Duplexes | Lavington, Nairobi | Under Construction | Size: 36 units |
20 | Westpointe Apartments | Upper Hill, Nairobi | Completed | Size: 254 units |
21 | 10 Olaoluwa Street | Oluyole, Oyo | Under Construction | Size: 12 units |
22 | Rosslyn Grove | Nairobi West, Nairobi | Under Construction | Size: 90 units |
23 | 7 Kamoru Ajimobi Street | Oluyole, Oyo | Completed | Size: 2 units |
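Note that the units column keeps the raw "Size: …" text from the page. If you need numbers, a small post-processing sketch (the regex assumes the string format shown in the table above; "Size: --" rows become NaN):

import pandas as pd

df = pd.DataFrame(data)
# Extract the digits from strings like "Size: 62 units"; rows with
# "Size: --" contain no digits and are coerced to NaN.
df['units_num'] = pd.to_numeric(df['units'].str.extract(r'(\d+)')[0], errors='coerce')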
Solution 3:[3]
An asynchronous alternative: fetch pages concurrently from the API with httpx and trio instead of making sequential requests.
# pip install trio httpx pandas
import trio
import httpx
import pandas as pd

allin = []
keys1 = ['name', 'area', 'state']
keys2 = ['value', 'unit']

async def scraper(client, page):
    # Pass the page as a per-request parameter; mutating the shared
    # client.params from concurrent tasks would be a race condition.
    r = await client.get('/properties', params={'page': page})
    allin.extend([[i.get(k, 'N/A') for k in keys1] +
                  [i['size'].get(b, 'N/A') for b in keys2]
                  for i in r.json()['data']])

async def main():
    async with httpx.AsyncClient(timeout=None, base_url='https://services.estateintel.com/api/v2') as client, trio.open_nursery() as nurse:
        client.params = {
            'type[]': 'residential'
        }
        for page in range(1, 3):
            nurse.start_soon(scraper, client, page)

    # The nursery has finished all its tasks once the async with block exits
    df = pd.DataFrame(allin, columns=keys1 + keys2)
    print(df)

if __name__ == "__main__":
    trio.run(main)
Output:
0 Cedar Haus Gardens Oluyole Oyo 38 units
1 10 Agoro Street Oluyole Oyo 1 units
2 Villa O Oluyole Oyo 2 units
3 Avenue Road Apartments Oluyole Oyo 6 units
4 15 Alafia Street Oluyole Oyo 4 units
5 12 Saint Mary Street Oluyole Oyo 8 units
6 RATCON Estate Oluyole Oyo 0 units
7 1 Goodwill Road Oluyole Oyo 4 units
8 Anike's Court Oluyole Oyo 3 units
9 9 Adeyemo Quarters Oluyole Oyo 4 units
10 Marigold Residency Nairobi West Nairobi 0 units
11 Riverview Apartments Kyumvi Machakos 0 units
12 Socian Villa Apartments Kileleshwa Nairobi 36 units
13 Kings Pearl Residency Lavington Nairobi 55 units
14 Touchwood Gardens Kilimani Nairobi 32 units
15 Panorama Apartments Upper Hill Nairobi 0 units
16 Gitanga Duplexes Lavington Nairobi 36 units
17 Serene Park Kyumvi Machakos 25 units
18 Kings Distinction Kilimani Nairobi 48 units
19 Twin Oaks Apartment Kilimani Nairobi 0 units
20 Duchess Park Lavington Nairobi 70 units
21 Greenvale Apartments Kileleshwa Nairobi 36 units
22 The Urban apartments & Suites Osu Greater Accra 28 units
23 Chateau Towers Osu Greater Accra 120 units
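One caveat with the concurrent version: starting one task per page fires all requests at once. A minimal sketch of bounding the concurrency with trio.CapacityLimiter, as a drop-in variant of the scraper coroutine above (the limit of five is an illustrative value, not part of the original answer):

import trio

limiter = trio.CapacityLimiter(5)  # at most five requests in flight (illustrative)

async def scraper(client, page):
    # Wait for a free slot before issuing the request, so only a
    # bounded number of pages are fetched concurrently.
    async with limiter:
        r = await client.get('/properties', params={'page': page})
        allin.extend([[i.get(k, 'N/A') for k in keys1] +
                      [i['size'].get(b, 'N/A') for b in keys2]
                      for i in r.json()['data']])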
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Sergey K |
Solution 2 | HedgeHog |
Solution 3 | αԋɱҽԃ αмєяιcαη |