How to use multiprocessing for sequential memory-hungry tasks?
I am trying to execute a series of tasks that use a lot of memory because of the size of the objects involved. Basically, here are the steps:
a = building_function()
saving_to_disk(a)
b = building_function(a)
saving_to_disk(b)
c = building_function(b)
saving_to_disk(c)
d = building_function(b, c)
saving_to_disk(d)
When b is built, I do not need "a" anymore, but it is still in memory. The same goes for b and c once d is built. I've tried "del a", but it doesn't free the memory.
So here's an MWE of what I'm thinking of trying. Before I spend a lot of time implementing it, is this the right approach?
import multiprocessing

def loadData():
    return 'data'

def building_function(x, y, res_queue):
    result = "using " + x + " to build " + y
    res_queue.put(result)

def saving_to_disk(res_queue):
    # Consume results until the 'END' sentinel arrives.
    res = res_queue.get()
    while res != 'END':
        print(res + " (saved)")
        res = res_queue.get()

if __name__ == '__main__':
    # The guard is required on platforms that spawn child processes.
    res_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=saving_to_disk, args=[res_queue])
    data = loadData()
    p.start()
    building_function(data, 'a', res_queue)
    building_function('a', 'b', res_queue)
    building_function('b', 'c', res_queue)
    building_function('b and c', 'd', res_queue)
    res_queue.put("END")
    p.join()
Solution 1:[1]
I really don't see what creating processes brings, in terms of memory management, to what you are trying to accomplish. What you want to do is set each reference to a memory-heavy instance to None when you no longer need it, so that it can be garbage collected, and call gc.collect() to force immediate garbage collection (although garbage collection should occur automatically as necessary):
import gc

def loadData():
    return 'data'

def building_function(x):
    # Compute result from x:
    ...
    return result

def saving_to_disk(result):
    print(result + " (saved)")

data = loadData()
a = building_function(data)
saving_to_disk(a)
# We don't need data any more:
data = None
gc.collect()
b = building_function(a)
saving_to_disk(b)
# We don't need a any more:
a = None
gc.collect()
c = building_function(b)
# We don't need b any more:
b = None
gc.collect()
saving_to_disk(c)
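
One possible reason this approach (or "del a") can appear not to work is that some other reference to the object is still alive. The sketch below illustrates that under CPython, using a hypothetical BigObject class and a hypothetical cache list that are not part of the original code. A weak reference serves as a probe to check whether the object has been reclaimed. Note that even once an object is freed, the interpreter may not hand the memory back to the operating system right away, so process-level memory figures can stay high.

```python
import gc
import weakref


class BigObject:
    """Hypothetical stand-in for a memory-hungry result."""

    def __init__(self, n):
        self.payload = bytearray(n)


a = BigObject(10_000_000)
probe = weakref.ref(a)   # lets us check whether the object is still alive
cache = [a]              # a second, easily forgotten reference (assumption)

a = None                 # drops one reference, much like `del a`
gc.collect()
still_alive = probe() is not None   # `cache` still holds the object

cache.clear()            # drop the last remaining reference
gc.collect()
freed = probe() is None  # the object can now be reclaimed

print(still_alive, freed)
```

If still_alive is true in a similar check on the real code, something else (a container, a closure, a global, an exception traceback) is probably still holding the object.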
Solution 2:[2]
gc.collect() really didn't work for me, so I decided to split my process into several functions that I call from a bash file using separate commands. In the end, it's simpler and more straightforward.
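
For readers who prefer to stay in Python rather than bash, here is a minimal sketch of the same idea: each build step runs in its own interpreter, reads its inputs from disk, writes its result to disk, and exits, so the operating system reclaims all of its memory. The WORKER script, the run_stage helper, the pickle file names and the string-building placeholder are all assumptions for illustration. They are not the code actually used in this answer.

```python
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical worker: loads its inputs, builds one result, saves it, exits.
WORKER = """
import pickle, sys
out_path, *in_paths = sys.argv[1:]
inputs = []
for path in in_paths:
    with open(path, "rb") as f:
        inputs.append(pickle.load(f))
# Placeholder for the real building_function(*inputs)
result = "built from [" + ", ".join(inputs) + "]"
with open(out_path, "wb") as f:
    pickle.dump(result, f)
"""


def run_stage(out_path, *in_paths):
    """Run one build step in a fresh interpreter and wait for it to finish."""
    cmd = [sys.executable, "-c", WORKER, str(out_path), *map(str, in_paths)]
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


workdir = Path(tempfile.mkdtemp())
a, b, c, d = (workdir / f"{name}.pkl" for name in "abcd")

run_stage(a)         # a = building_function()
run_stage(b, a)      # b = building_function(a)
run_stage(c, b)      # c = building_function(b)
run_stage(d, b, c)   # d = building_function(b, c)

with open(d, "rb") as f:
    final = pickle.load(f)
print(final)
```

The trade-off is that every intermediate result is serialized and reloaded from disk, which costs time but guarantees that no step inherits memory from the previous one.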
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Booboo |
Solution 2 | Johann |