Convert PySpark DataFrame to multi-dimensional NumPy array

I have a PySpark DataFrame df with over 100 million rows. df looks roughly like this:

row  col   ax1  ax2  value
996  965   12   8    10000
236  1015  8    8    10001
26   315   4    7    10002
...  ...   ...  ...  ...

I'm trying to convert it to a NumPy array with shape (1024, 1024, 16, 16) and save it on the driver. The expected NumPy array arr would contain:

arr[996, 965, 12, 8] = 10000
arr[236, 1015, 8, 8] = 10001
...

I've tried df.collect() to bring the data to the driver and then iterated over the collected rows to write into arr, but the driver machine simply runs out of memory.
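For reference, here is a minimal sketch of that attempt, assuming the columns row, col, ax1, ax2 are integer-typed and value fits in int64:

    import numpy as np

    # Sketch of the collect-and-iterate attempt described above.
    # Assumes df has integer columns row, col, ax1, ax2 and a numeric value column.
    arr = np.zeros((1024, 1024, 16, 16), dtype=np.int64)

    # collect() materializes all 100M+ rows as Row objects on the driver at once,
    # which is where the out-of-memory failure occurs.
    for r in df.collect():
        arr[r["row"], r["col"], r["ax1"], r["ax2"]] = r["value"]

Note that arr itself is only about 2 GiB at int64 (1024 × 1024 × 16 × 16 × 8 bytes), so it is the collected Row objects rather than the array that exhaust driver memory; streaming the rows with df.toLocalIterator() instead of collect() might avoid that spike, though it is much slower.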



Sources

Source: Stack Overflow, licensed under CC BY-SA 3.0.
