Combining concurrent.futures.as_completed() with a dictionary using zip()

I am a first-time user of concurrent.futures and am following the official guides.

Problem: Inside the as_completed() loop, how do I access the k and v used in the comprehension that builds future_to_url?

The k variable is vital.

Using something like:

for (future, k,v) in zip(concurrent.futures.as_completed(future_to_url), urls.items()):

I stumbled on a related post, but I cannot decipher the syntax well enough to reproduce it.

Original

def start():
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        future_to_url = {executor.submit(visit_url, v): v for k, v in urls.items()}
        for future in concurrent.futures.as_completed(future_to_url):
            data = future.result()
            json = data.json()
            print(f"k: {future[k]}")

Second Attempt - Using zip which breaks

def start():
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        future_to_url = {executor.submit(visit_url, v): v for k, v in urls.items()}
        for (future, k, v) in zip(concurrent.futures.as_completed(future_to_url), urls.items()):
            data = future.result()
            json = data.json()
            print(f"k: {k}")

Third Broken Attempt - Using map

for future, (k, v) in map(concurrent.futures.as_completed(future_to_url), scraping_robot_urls.items()):

TypeError: 'generator' object is not callable

Fourth Broken Attempt - Storing the k,v pairs before the as_completed() loop and pairing them with an enumerate index

    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        future_to_url = {executor.submit(get_response, v): v for k, v in scraping_robot_urls.items()}
        info = {k: v for k, v in scraping_robot_urls.items()}
        for i, future in enumerate(concurrent.futures.as_completed(future_to_url)):
            url = future_to_url[future]
            data = future.result()
            print(f"data: {data}")
            print(f"key: {list(info)[i]} / url: {url}")

This does not work: the URL does not match the key. The pairs come out mismatched, so I cannot rely on this approach.
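The mismatch can be reproduced without any network access. Below is a minimal sketch, assuming a hypothetical slow_echo worker that sleeps for a given delay (the delays and helper names are illustrative, not part of the original code). It shows that as_completed() yields futures in the order they finish, not the order they were submitted, so an enumerate index cannot be used to look up the matching key.

```python
import concurrent.futures
import time


def slow_echo(key, delay):
    # Hypothetical stand-in for a network call: wait, then hand the key back.
    time.sleep(delay)
    return key


def completion_order(delays):
    """Submit one task per key and return the keys in the order they finished."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        futures = [executor.submit(slow_echo, k, d) for k, d in delays.items()]
        return [f.result() for f in concurrent.futures.as_completed(futures)]


# Submitted in this order, but the slowest task comes first.
delays = {"id123": 0.5, "id456": 0.3, "id789": 0.1}

if __name__ == "__main__":
    print("submitted:", list(delays))
    print("completed:", completion_order(delays))
```

With these delays the completed list is typically the reverse of the submitted list, which is why list(info)[i] points at the wrong key.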

For completeness, here are the dependencies

def visit_url(url):
    return requests.get(url)

urls = {
  'id123': 'www.google.com', 
  'id456': 'www.bing.com', 
  'id789': 'www.yahoo.com'
}



Solution 1:[1]

This has less to do with futures and more to do with how the dict comprehension is scoped.

    future_to_url = {executor.submit(visit_url, v): v for k, v in urls.items()}

This loops over every item in the urls dict, unpacks the key and value (k, v), and submits v to the executor to run visit_url. k and v are not available outside the comprehension, because their scope is limited to the comprehension itself.
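The scoping rule can be checked in isolation. This is a small sketch with illustrative names (url_lengths, comprehension_variable_leaked and url_to_key are not from the original code): after a comprehension has run, its loop variables are gone, so anything you want to keep has to be stored in the structure the comprehension builds.

```python
urls = {
    "id123": "http://www.google.com",
    "id456": "http://www.bing.com",
}

# k and v exist only while the comprehension is being evaluated.
url_lengths = {k: len(v) for k, v in urls.items()}


def comprehension_variable_leaked():
    """Return True if the comprehension's k is still visible afterwards."""
    try:
        k  # deliberate bare name lookup; raises NameError in Python 3
    except NameError:
        return False
    return True


# To keep the key, store it in the mapping being built.
url_to_key = {v: k for k, v in urls.items()}

if __name__ == "__main__":
    print(comprehension_variable_leaked())
    print(url_to_key)
```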

If you want both the result of the call and the ID it belongs to, you can pass the ID into the function and return it as part of a tuple:

import concurrent.futures

import requests


def visit_url(url_id, url):
    # Return the ID alongside the response so the caller can match them up.
    return url_id, requests.get(url)


def start():
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        future_to_url = {executor.submit(visit_url, k, v): v for k, v in urls.items()}
        for future in concurrent.futures.as_completed(future_to_url):
            url_id, response = future.result()
            print(f"id: {url_id}")
            # Assumes the endpoint returns JSON, as in the question.
            print(f"data: {response.json()}")


# requests needs a scheme (http:// or https://) on every URL.
urls = {
  'id123': 'http://www.google.com',
  'id456': 'http://www.bing.com',
  'id789': 'http://www.yahoo.com'
}

After comments from the OP (mainly that it feels dirty to route the key through visit_url just to get context back after execution), I can propose a more object-oriented way of doing this:

import concurrent.futures

import requests


class URL:
    def __init__(self, id, url):
        self.id = id
        self.url = url
        self.response = None

    def visit(self):
        # Store the response on the object and return the object itself.
        self.response = requests.get(self.url)
        return self

def start():
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        future_to_url = {executor.submit(c.visit): c for c in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            data = future.result()
            print(f"response: {data.response}")
            print(f"id: {data.id}")

urls = [
  URL('id123', 'http://www.google.com'),
  URL('id456', 'http://www.bing.com'),
  URL('id789', 'http://www.yahoo.com')
]

start()

This ensures the response, ID and URL are together in their class which might be cleaner for some. The for loop to submit to the executor is simplified as well.

Solution 2:[2]

I came up with this simple solution to using as_completed with a dictionary.

Run as_completed() on the dictionary's values(), then match each completed future against the futures stored in the dictionary to retrieve its key.

Retrieve the result and assign it to a dictionary using the key.

data = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    future_to_url = {k: executor.submit(visit_url, v) for k, v in urls.items()}
    # Iterate over the futures (the dict values), not the string keys.
    for done in concurrent.futures.as_completed(future_to_url.values()):
        # Linear scan to find which key owns the completed future.
        for k, v in future_to_url.items():
            if v == done:
                data[k] = done.result()
print(data)

It seems like it would be simple to build something like this into as_completed() itself. If the object passed to as_completed() were a dictionary, it could return the key, or the key and value, through something like as_completed(dict).items(). No such feature exists today; this is only a suggestion.
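In the meantime, the same idea can be approximated with a small wrapper. This is only a sketch: as_completed_items is a hypothetical helper, not part of concurrent.futures, and fake_visit stands in for visit_url so the example runs without network access. It inverts the key-to-future dictionary once, which avoids the nested scan over all items for every completed future.

```python
import concurrent.futures


def as_completed_items(key_to_future, timeout=None):
    """Yield (key, future) pairs in completion order (hypothetical helper)."""
    # Futures are hashable, so they can serve as dictionary keys.
    future_to_key = {future: key for key, future in key_to_future.items()}
    for future in concurrent.futures.as_completed(future_to_key, timeout=timeout):
        yield future_to_key[future], future


def fake_visit(url):
    # Stand-in for visit_url; returns a string instead of an HTTP response.
    return f"response from {url}"


def collect(urls):
    """Run fake_visit for every URL and return the results keyed by ID."""
    data = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        key_to_future = {k: executor.submit(fake_visit, v) for k, v in urls.items()}
        for key, future in as_completed_items(key_to_future):
            data[key] = future.result()
    return data


if __name__ == "__main__":
    print(collect({"id123": "http://www.google.com", "id456": "http://www.bing.com"}))
```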

Solution 3:[3]

For posterity, I was inspired by testfile's response.

I resolved this issue by sneaking the k inside the visit_url() function.

def visit_url(url, k):
    return k, requests.get(url)

I now have access to the key inside the as_completed() loop. It is predictable, as the key and URL will always match. This is unlike pairing an outer loop's index with the as_completed() loop, which behaved irregularly because the external requests resolve in arbitrary order.

    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        future_to_url = {executor.submit(visit_url, v, k): v for k, v in scraping_robot_urls.items()}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            key, data = future.result()
            print(f"key: {key} / url: {url}")

This resolution feels to me like a hack, as I am using the scope of another function to pass "state/variable" to something else.

I am going to bounty this question, as I would appreciate learning how to better handle this situation and educate myself.
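One possible alternative that leaves visit_url untouched is to store the key in the dictionary's value, so it can be looked up from the future itself. This is a minimal sketch under stated assumptions: fake_visit and fetch_all are illustrative names, and fake_visit replaces the real requests.get call so the example runs offline.

```python
import concurrent.futures


def fake_visit(url):
    # Stand-in for visit_url(url); the real function would call requests.get.
    return f"response from {url}"


def fetch_all(urls, visit=fake_visit):
    """Return {key: (url, result)} for every entry in urls."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        # Map each future to the (key, url) pair it was created for.
        future_to_item = {
            executor.submit(visit, v): (k, v) for k, v in urls.items()
        }
        for future in concurrent.futures.as_completed(future_to_item):
            key, url = future_to_item[future]
            results[key] = (url, future.result())
    return results


if __name__ == "__main__":
    for key, (url, data) in fetch_all({"id123": "http://www.google.com"}).items():
        print(f"key: {key} / url: {url} / data: {data}")
```

The worker only needs to know about the URL; the association between a future and its key lives entirely in the calling code.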

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 dimButTries