Multi-threaded Python scraper does not execute functions

I am writing a multi-threaded Python scraper. My script quits after running for 0.39 seconds without reporting any error, and it seems that parse_subcategory() is never run from parse_category(). Without multi-threading everything worked fine, and I can't figure out why the function is not being called. What could the problem be: am I failing to pass the headers needed to get a proper HTML response, or is the ThreadPoolExecutor .map() call written incorrectly?

Code:

from bs4 import BeautifulSoup
import requests
import concurrent.futures


BASEURL = 'https://www.motomoto.lt'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

page = requests.get(BASEURL, headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')

item_list = []

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        parse_category(soup, executor)

def parse_category(soup, executor):
    executor.map(
        lambda url: parse_subcategory(url, executor),
        *[BASEURL + a['href'] for a in soup.find_all('a', class_='subcategory-name', href=True)])

def parse_subcategory(url, executor):
    subcategoryPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(subcategoryPage.content, 'html.parser')
    executor.map(
        lambda url: parse_products(url, executor),
        *[BASEURL + a['href'] for a in soup.find_all('a', class_='subcategory-image', href=True)])

def parse_products(url, executor):
    productsPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(productsPage.content, 'html.parser')
    executor.map(
        lambda url: parse_item(url, executor),
        *[a['href'] for a in soup.find_all('a', class_='thumbnail product-thumbnail', href=True)])

    this = soup.find('a', attrs={'class':'next'}, href=True)
    if this is not None:
        nextpage = BASEURL + this['href']
        print('-' * 70)
        parse_products(nextpage)
        
        
def parse_item(url):
    itemPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(itemPage.content, 'html.parser')
    title = get_title(soup)
    price = get_price(soup)
    category = get_category(soup)
    
    item = {
        'Title': title,
        'Price': price,
        'Category': category
    }

    item_list.append(item)
    print(item)
    
def get_title(soup):
    title = soup.find('h1', class_='h1')
    title_value = title.string
    title_string = title_value.strip()
    return title_string

def get_price(soup):
    price = soup.find('span', attrs={'itemprop':'price'}).string.strip()
    return price

def get_category(soup):
    category = soup.find_all("li", attrs={'itemprop':'itemListElement'})[1].find('span', attrs={'itemprop':'name'}).getText()
    return category

if __name__ == "__main__":
    main() 


Solution 1:[1]

ThreadPoolExecutor.map() is asynchronous: it schedules the tasks and returns without waiting for the futures to complete. The procedure goes like this:

  • a. in main(), parse_category() submits a set of tasks that will execute parse_subcategory() later
  • b. parse_category() then returns immediately
  • c. back in main(), executor.shutdown() is invoked because the with statement exits

At this point parse_subcategory() may still be in progress, but because the executor has already started shutting down, its attempt to submit the parse_products tasks fails.
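
A minimal, self-contained demonstration of that sequence (my own sketch, not the question's or the answerer's code):

import concurrent.futures
import time

def outer(executor):
    time.sleep(0.1)                      # the outer task is still running...
    executor.map(print, ['inner task'])  # ...when it tries to schedule more work

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(outer, [executor])      # submits outer() and returns immediately
# Leaving the with-block calls executor.shutdown(); the inner map() then raises
# RuntimeError: cannot schedule new futures after shutdown. Because the results
# of the outer map() are never read, the error is swallowed and the script
# simply exits, just like the scraper above.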

We can wrap the body of parse_subcategory() in a try...except to check whether any unexpected exception occurred (see the sketch below); in this case we get

RuntimeError('cannot schedule new futures after shutdown')
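
For example, a wrapper along these lines (my own sketch; traceback is only used to make the hidden error visible):

import traceback

def parse_subcategory(url, executor):
    try:
        ...  # the original body of parse_subcategory(), unchanged
    except Exception:
        # An exception raised in a worker is stored on its future and, since the
        # map() results are never read, silently discarded; print it to see it.
        traceback.print_exc()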

A simple solution is to wait for executor.map() to complete at the proper places; see the question How do I wait for ThreadPoolExecutor.map to finish.
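
For illustration, the top-level call could block on its map() results like this (a sketch against the question's names; consuming the returned iterator waits for every mapped call and re-raises any worker exception):

def parse_category(soup, executor):
    urls = [BASEURL + a['href']
            for a in soup.find_all('a', class_='subcategory-name', href=True)]
    # list() consumes the iterator returned by map(), so this blocks until
    # every parse_subcategory call has finished and surfaces its exceptions.
    list(executor.map(lambda u: parse_subcategory(u, executor), urls))

Be aware that blocking the same way inside the nested parse_* functions means worker threads wait on other tasks in the same fixed-size pool, which can starve the pool; the counting scheme below keeps all of the waiting in the main thread.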

Another simple approach is to count the pending tasks: increase the counter by one when a task is submitted and decrease it by one when a task completes, then have main() simply wait for the counter to reach 0.
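
One way to implement that counter (my own sketch of the idea, not the answerer's code) is a small Condition-protected class:

import threading

class PendingCounter:
    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def submitted(self):
        # call once for every task handed to the executor
        with self._cond:
            self._count += 1

    def done(self):
        # call once when a task finishes (e.g. in a finally block)
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_all(self):
        # block until every submitted task has reported done()
        with self._cond:
            while self._count > 0:
                self._cond.wait()

Each parse_* function would call submitted() for every task it schedules and each scheduled task would call done() when it finishes; main() then keeps the with block open by calling wait_all() right after parse_category(soup, executor).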


By the way, there are also some errors in how the functions in the above code are called (a corrected sketch follows this list):

  1. After the function itself, executor.map() expects one iterable per positional argument of that function. The mapped lambda takes a single url, so pass the list of URLs as one iterable; the leading * unpacks every URL string into a separate iterable argument. Just remove the *!

  2. In parse_products(): parse_item() is defined with a single parameter but is called with two (url and executor), and the recursive parse_products(nextpage) call at the end of parse_products() has the opposite problem, passing one argument to a two-parameter function.
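
A sketch of just those call fixes, reusing the question's imports, constants and helper functions (this does not by itself address the shutdown/waiting issue discussed above):

def parse_category(soup, executor):
    urls = [BASEURL + a['href']
            for a in soup.find_all('a', class_='subcategory-name', href=True)]
    # the list itself is the single iterable argument -- no leading *
    executor.map(lambda u: parse_subcategory(u, executor), urls)

def parse_products(url, executor):
    productsPage = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(productsPage.content, 'html.parser')
    # parse_item() takes a single url, so it can be mapped directly
    executor.map(parse_item,
                 [a['href'] for a in soup.find_all('a', class_='thumbnail product-thumbnail', href=True)])

    nextlink = soup.find('a', attrs={'class': 'next'}, href=True)
    if nextlink is not None:
        # the recursive call must match parse_products' two-parameter signature
        parse_products(BASEURL + nextlink['href'], executor)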

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source

[1] Solution 1