Multi-threaded Python scraper does not execute functions
I am writing a multi-threaded Python scraper. I am facing an issue where my script quits after running for 0.39 seconds without any error. It seems that the parse_subcategory() function is never run from parse_category(). Without multi-threading everything seemed to work fine, and I just can't find the reason why the function is not running. What may be the problem: is it failing to pass headers to get the proper HTML response, or is the ThreadPoolExecutor.map() call not written properly?
Code:
```python
from bs4 import BeautifulSoup
import requests
import concurrent.futures

BASEURL = 'https://www.motomoto.lt'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

page = requests.get(BASEURL, headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')

item_list = []

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        parse_category(soup, executor)

def parse_category(soup, executor):
    executor.map(
        lambda url: parse_subcategory(url, executor),
        *[BASEURL + a['href'] for a in soup.find_all('a', class_='subcategory-name', href=True)])

def parse_subcategory(url, executor):
    subcategoryPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(subcategoryPage.content, 'html.parser')
    executor.map(
        lambda url: parse_products(url, executor),
        *[BASEURL + a['href'] for a in soup.find_all('a', class_='subcategory-image', href=True)])

def parse_products(url, executor):
    productsPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(productsPage.content, 'html.parser')
    executor.map(
        lambda url: parse_item(url, executor),
        *[a['href'] for a in soup.find_all('a', class_='thumbnail product-thumbnail', href=True)])
    this = soup.find('a', attrs={'class': 'next'}, href=True)
    if this is not None:
        nextpage = BASEURL + this['href']
        print('-' * 70)
        parse_products(nextpage)

def parse_item(url):
    itemPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(itemPage.content, 'html.parser')
    title = get_title(soup)
    price = get_price(soup)
    category = get_category(soup)
    item = {
        'Title': title,
        'Price': price,
        'Category': category
    }
    item_list.append(item)
    print(item)

def get_title(soup):
    title = soup.find('h1', class_='h1')
    title_value = title.string
    title_string = title_value.strip()
    return title_string

def get_price(soup):
    price = soup.find('span', attrs={'itemprop': 'price'}).string.strip()
    return price

def get_category(soup):
    category = soup.find_all("li", attrs={'itemprop': 'itemListElement'})[1].find('span', attrs={'itemprop': 'name'}).getText()
    return category

if __name__ == "__main__":
    main()
```
Solution 1:[1]
ThreadPoolExecutor.map() is asynchronous: it submits the tasks and returns immediately, without waiting for them to complete. The procedure goes like this:

- a. In main(), parse_category() submits a set of tasks which will execute parse_subcategory() later.
- b. Then parse_category() returns immediately.
- c. Back in main(), executor.shutdown() is invoked, because a with statement is used.

At this time parse_subcategory() should still be in progress; however, executor.shutdown() causes its attempts to submit parse_products tasks to fail.

We can wrap the body of parse_subcategory() in a try...except to check whether any unexpected exception occurred; in this case, we will get RuntimeError('cannot schedule new futures after shutdown').
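To make the failure visible, here is a minimal, self-contained sketch of the same pattern (not the original scraper; the sleep stands in for the HTTP request):

```python
import concurrent.futures
import time

def child(executor):
    time.sleep(0.1)  # stand-in for the requests.get() call
    try:
        # By the time this runs, the with-block below has usually already
        # called executor.shutdown(), so this nested submission fails.
        executor.map(print, ['grandchild task'])
    except RuntimeError as e:
        print('hidden error:', e)  # cannot schedule new futures after shutdown

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(lambda _: child(executor), [None])
# Without the try...except, the error is stored in a never-consumed future,
# so the script simply exits with no output and no traceback, matching the
# silent quick exit described in the question.
```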
A simple solution is to wait for executor.map() to complete at the proper places; see the question How do I wait for ThreadPoolExecutor.map to finish.
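Because map() returns a lazy iterator, draining it (for example with list()) blocks until every submitted task has finished and re-raises any exception a task hit. A sketch of that applied to the question's parse_category() (one possible placement, not the only one):

```python
def parse_category(soup, executor):
    urls = [BASEURL + a['href']
            for a in soup.find_all('a', class_='subcategory-name', href=True)]
    # Consuming the iterator blocks here until every parse_subcategory task
    # has finished, and surfaces any exception raised inside a task.
    list(executor.map(lambda url: parse_subcategory(url, executor), urls))
```

The same draining would be needed inside parse_subcategory() and parse_products(); note that blocking inside worker threads on further nested tasks can tie up all 8 workers, which is one reason the counter approach described next waits only once, in main().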
Another simple solution is to count the pending tasks: increase the count by one when a task is submitted, decrease it by one when a task completes, and in main() just wait for the counter to become 0.
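A sketch of that counter, built on threading.Condition (TaskCounter is an illustrative name, not something from the original post):

```python
import threading

class TaskCounter:
    """Tracks in-flight tasks so the main thread can block until all finish."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def task_started(self):
        with self._cond:
            self._count += 1

    def task_done(self):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_all(self):
        with self._cond:
            self._cond.wait_for(lambda: self._count == 0)
```

Usage would be: call task_started() just before each submission, call task_done() in a try...finally at the end of each task body, and call wait_all() in main() before the with block exits, so the executor is not shut down while nested tasks are still being submitted.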
By the way, there are some syntax errors in the above code:

- The second parameter of executor.map is a set of argument iterables (one iterable per argument), so there is no need for the leading * here unless you are passing a list of iterables. Just remove the *!
- In parse_products(): parse_item() only accepts a single parameter but is called with two; the same issue affects parse_products() itself, whose recursive call at the end omits the executor argument. Both fixes are sketched below.
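For illustration, this is how parse_products() might look with those two call sites corrected, leaving everything else from the question unchanged (the shutdown problem described above still applies):

```python
def parse_products(url, executor):
    productsPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(productsPage.content, 'html.parser')
    urls = [a['href']
            for a in soup.find_all('a', class_='thumbnail product-thumbnail', href=True)]
    executor.map(parse_item, urls)          # no leading *; parse_item takes one argument
    this = soup.find('a', attrs={'class': 'next'}, href=True)
    if this is not None:
        nextpage = BASEURL + this['href']
        print('-' * 70)
        parse_products(nextpage, executor)  # pass the executor through
```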
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |