'Get certain number of elements from Itertools.combinations
I need to split the total number of elements in iterator :
tot= itertools.combinations(dict1.keys(), 2)
into 3 parts.
The size of dict1 = 285056 Total combinations possible = 40billion
My goal is to somehow divide these 40billion into 3 parts of 13.5 billion elements each to process on different processors parallely. At the moment i am naively iterating the 40billion and dumping pickle files when i reach 13.5 billion which isnt efficient as each 13.5 billion pickle is 160gb on disk (much larger when loaded in memory)
So is there any way I could iterate the 40billion till 13.5billionth element in one code and then start from 13.6 billionth element in code 2 and so on without iteration like i did. Below code i use to get certain number of elements from combinations iterable.
def grouper(n, iterable):
it = iter(iterable)
while True:
chunk = tuple(itertools.islice(it, n))
if not chunk:
return
yield chunk
for first_chunk in grouper(1350000000,tot ):
Solution 1:[1]
It is easy to create this kind of split with itertools
.
Given a set of elements, we can test if the first part of
the generated combination belongs to the computation for
machine i.
In the code below, I show a crude solution for this, with the code in the for loop intended to be split over 3 machines. Machine i will run the code for the i-th segment of the keys for the first element of the combination, combined with the full set for the second element.
The combinations are supposed to be processed in the line
where cnt2
is calculated. Replace that with the kind of for
loop you want to process your combinations with.
Compared with generating and storing all combinations, this solution does not store any, but it will (internally) generate all. But what's a couple of billion combinations between friends?
import itertools
def is_not_for_machine(i, t):
""" t is in the set if first element
in my_set_prefix[i] for machine i """
if my_set_prefix[i][0] <= t[0] < my_set_prefix[i][1]:
return False
return True
my_set_prefix = []
for i in range(3):
my_set_prefix.append((len(my_keys)*i//3, len(my_keys)*(i+1)//3))
print(f"== partition: {my_set_prefix}")
my_keys = range(12)
all = itertools.combinations(my_keys, 2)
cnt = len([_ for _ in all])
print(f"== total set size {cnt}")
for i in range(3):
all = itertools.combinations(my_keys, 2)
cnt2 = len([_ for _ in itertools.filterfalse(lambda t: is_not_for_machine(i, t), all)])
print(f"== set size for prefix {my_set_prefix[i]}: {cnt2}")
The output shows that some load balancing might be necessary, since this partition is "triangular descending", with the highest count for the first combinations.
== partition: [(0, 4), (4, 8), (8, 12)]
== total set size 66
== set size for prefix (0, 4): 38
== set size for prefix (4, 8): 22
== set size for prefix (8, 12): 6
Solution 2:[2]
why not directly use the math.comb command to get the number of combinations?
Just go to the question there.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | VirtualScooter |
Solution 2 | Fang WU |