'Get certain number of elements from Itertools.combinations

I need to split the total number of elements in iterator : tot= itertools.combinations(dict1.keys(), 2) into 3 parts.

The size of dict1 = 285056 Total combinations possible = 40billion

My goal is to somehow divide these 40billion into 3 parts of 13.5 billion elements each to process on different processors parallely. At the moment i am naively iterating the 40billion and dumping pickle files when i reach 13.5 billion which isnt efficient as each 13.5 billion pickle is 160gb on disk (much larger when loaded in memory)

So is there any way I could iterate the 40billion till 13.5billionth element in one code and then start from 13.6 billionth element in code 2 and so on without iteration like i did. Below code i use to get certain number of elements from combinations iterable.

def grouper(n, iterable):
      it = iter(iterable)
      while True:
          chunk = tuple(itertools.islice(it, n))
          if not chunk:
              return
          yield chunk
for first_chunk in grouper(1350000000,tot ): 


Solution 1:[1]

It is easy to create this kind of split with itertools. Given a set of elements, we can test if the first part of the generated combination belongs to the computation for machine i.

In the code below, I show a crude solution for this, with the code in the for loop intended to be split over 3 machines. Machine i will run the code for the i-th segment of the keys for the first element of the combination, combined with the full set for the second element.

The combinations are supposed to be processed in the line where cnt2 is calculated. Replace that with the kind of for loop you want to process your combinations with.

Compared with generating and storing all combinations, this solution does not store any, but it will (internally) generate all. But what's a couple of billion combinations between friends?

import itertools

def is_not_for_machine(i, t):
    """ t is in the set if first element
        in my_set_prefix[i] for machine i """
    if my_set_prefix[i][0] <= t[0] < my_set_prefix[i][1]:
        return False
    return True

my_set_prefix = []
for i in range(3):
    my_set_prefix.append((len(my_keys)*i//3, len(my_keys)*(i+1)//3))
print(f"== partition: {my_set_prefix}")
my_keys = range(12)
all = itertools.combinations(my_keys, 2)
cnt = len([_ for _ in all])
print(f"== total set size {cnt}")
for i in range(3):
    all = itertools.combinations(my_keys, 2)
    cnt2 = len([_ for _ in itertools.filterfalse(lambda t: is_not_for_machine(i, t), all)])
    print(f"== set size for prefix {my_set_prefix[i]}: {cnt2}")

The output shows that some load balancing might be necessary, since this partition is "triangular descending", with the highest count for the first combinations.

== partition: [(0, 4), (4, 8), (8, 12)]
== total set size 66
== set size for prefix (0, 4): 38
== set size for prefix (4, 8): 22
== set size for prefix (8, 12): 6

Solution 2:[2]

why not directly use the math.comb command to get the number of combinations?

Just go to the question there.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 VirtualScooter
Solution 2 Fang WU