Is there a faster way to get multiple keys from a dictionary?
I have a dictionary:
d = {'a':1, 'b':2, 'c':3, 'd':4}
Then I have a list of keys:
l = ['a', 'b', 'z']
My desired result is:
[1, 2, None]
What I'm doing so far is:
[d.get(k) for k in l]
Is there a faster way? Perhaps without a for loop?
Solution 1:[1]
You could use:
>>> list(map(d.get, l))
[1, 2, None]
It has two advantages:
- It performs the d.get lookup only once, not on each iteration (see the sketch below for getting the same effect in a list comprehension).
- CPython only: because dict.get is implemented in C and map is implemented in C, it can avoid the Python layer in the function call (roughly speaking; the details are a bit more complicated).
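If you want that single-lookup advantage while keeping a list comprehension, you can bind the method to a local name first. A minimal sketch of that idea (Solution 2 below times exactly this variant):
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
l = ['a', 'b', 'z']
get = d.get                    # look up the bound method once
result = [get(k) for k in l]   # no d.get attribute lookup on each iteration
print(result)                  # [1, 2, None]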
As for timings (performed on Python 3.6 in a Jupyter notebook):
d = {'a':1, 'b':2, 'c':3, 'd':4}
l = ['a', 'b', 'z']
%timeit list(map(d.get, l))
594 ns ± 41.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [d.get(k) for k in l]
508 ns ± 17.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Note that this is actually slower in this case! That's because for short iterables the map and list call overhead dominates. So if you want it to be faster on short iterables, stick with your approach.
With a longer l you see that list(map(...)) eventually becomes faster:
import random
import string

d = {'a':1, 'b':2, 'c':3, 'd':4}
l = [random.choice(string.ascii_lowercase) for _ in range(10000)]
%timeit list(map(d.get, l))
663 µs ± 64.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit [d.get(k) for k in l]
1.13 ms ± 7.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
However that's still "just" a factor of 2 faster.
Solution 2:[2]
It is important to note that the bottleneck here is not the size of the dictionary but the size of the list of keys (and, of course, the method lookup time, the time it takes to hash the key being looked up, etc.).
Consider the following:
from timeit import Timer

d = {1: 'a'}
keys = [1 for _ in range(1000)]

def _map():
    return list(map(d.get, keys))

def _map_single_lookup():
    g = d.get  # bind the method once, outside the loop
    return list(map(g, keys))

def _list_comp():
    return [d.get(key) for key in keys]

def _list_comp_single_lookup():
    g = d.get  # bind the method once, outside the loop
    return [g(key) for key in keys]

print(min(Timer(_map).repeat(100, 100)))
print(min(Timer(_map_single_lookup).repeat(100, 100)))
print(min(Timer(_list_comp).repeat(100, 100)))
print(min(Timer(_list_comp_single_lookup).repeat(100, 100)))
# 0.009307396466818774
# 0.009261678214412816
# 0.018456645101335933
# 0.011634828724497837
Solution 3:[3]
A much faster alternative is to use itemgetter together with a defaultdict (the defaultdict is needed because itemgetter, unlike dict.get, doesn't support a default value for missing keys):
from collections import defaultdict
from timeit import Timer
from operator import itemgetter

d = defaultdict(lambda: None)
d[1] = 'a'
keys = [1 for _ in range(1000)]

def _map():
    return list(map(d.get, keys))

def _getter():
    return list(itemgetter(*keys)(d))

print(min(Timer(_map).repeat(100, 100)))
print(min(Timer(_getter).repeat(100, 100)))
# 0.0074976040767260055
# 0.0021861597102568187
Edit: added timings for non-existent keys (integers and strings). No significant impact on performance.
from collections import defaultdict
from timeit import Timer
from operator import itemgetter

d = defaultdict(lambda: None)
d[1] = 'a'
non_existing_keys_int = [2 for _ in range(1000)]
non_existing_keys_string = ['a' for _ in range(1000)]

def get_non_existing_keys_int():
    return list(itemgetter(*non_existing_keys_int)(d))

def get_non_existing_keys_string():
    return list(itemgetter(*non_existing_keys_string)(d))
print(min(Timer(get_non_existing_keys_int).repeat(100, 100)))
print(min(Timer(get_non_existing_keys_string).repeat(100, 100)))
# 0.002647169132724954
# 0.002408539023506795
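One caveat the timings above don't cover: itemgetter called with a single key returns the bare value rather than a 1-tuple, so list(itemgetter(*keys)(d)) only behaves as intended when keys has at least two entries. Also note that indexing a defaultdict inserts missing keys as a side effect, so d grows as you look things up. A minimal sketch of a guard for the length edge case (the helper name get_many is purely illustrative):
from collections import defaultdict
from operator import itemgetter

d = defaultdict(lambda: None)
d[1] = 'a'

def get_many(mapping, keys):
    # itemgetter(k)(m) returns a bare value, itemgetter(k1, k2, ...)(m) a tuple,
    # so the zero- and one-key cases must be handled before calling list().
    if not keys:
        return []
    if len(keys) == 1:
        return [mapping[keys[0]]]
    return list(itemgetter(*keys)(mapping))

print(get_many(d, [1, 2]))  # ['a', None]
print(get_many(d, [1]))     # ['a']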
Solution 4:[4]
getpy is a library that allows fast parallel hashmap operations, built on the aptly named C++ library parallel-hashmap. Here's an IPython session showing that reading 500 keys in one vectorized call is over 200x faster than looking them up one at a time!
In [1]: import numpy as np
...: import getpy as gp
In [2]: key_type = np.dtype('u8')
...: value_type = np.dtype('u8')
In [3]: keys = np.random.randint(1, 1000, size=10**2, dtype=key_type)
...: values = np.random.randint(1, 1000, size=10**2, dtype=value_type)
...:
...: gp_dict = gp.Dict(key_type, value_type, default_value=42)
...: gp_dict[keys] = values
...:
...: random_keys = np.random.randint(1, 1000, size=500, dtype=key_type)
In [4]: %timeit random_values = gp_dict[random_keys]
2.19 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [7]: %timeit [gp_dict[k] for k in random_keys]
491 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
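One difference from the question's desired output: missing keys come back as the default_value passed to gp.Dict (42 in the session above), not as None, and the result is a NumPy array. A rough sketch of that behaviour, assuming default_value is returned for keys that were never inserted (as the session above suggests):
import numpy as np
import getpy as gp

key_type = np.dtype('u8')
value_type = np.dtype('u8')

gp_dict = gp.Dict(key_type, value_type, default_value=42)
gp_dict[np.array([1, 2], dtype=key_type)] = np.array([10, 20], dtype=value_type)

query = np.array([1, 2, 999], dtype=key_type)
print(gp_dict[query])  # expected: [10 20 42] -- 42 standing in for the missing key 999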
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | crizCraig |