Convert PySpark DataFrame to multi-dimensional NumPy array
I have a DataFrame `df` of over 100 million rows; `df` basically looks like this:
| row | col  | ax1 | ax2 | value |
|-----|------|-----|-----|-------|
| 996 | 965  | 12  | 8   | 10000 |
| 236 | 1015 | 8   | 8   | 10001 |
| 26  | 315  | 4   | 7   | 10002 |
| ... | ...  | ... | ... | ...   |
I'm trying to convert it to a NumPy array with shape (1024, 1024, 16, 16) and save it on the driver. The expected NumPy array `arr` would contain:
```
arr[996, 965, 12, 8] = 10000
arr[236, 1015, 8, 8] = 10001
...
```
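In other words, each DataFrame row specifies one element assignment, with `row`, `col`, `ax1`, and `ax2` indexing the four dimensions of `arr`. If the five columns were already available as flat local arrays, this mapping would be a single NumPy fancy-index scatter; a minimal sketch (the local arrays here are hypothetical stand-ins built from the sample rows above):

```python
import numpy as np

# Hypothetical flat local copies of the five columns,
# using the three sample rows from the table above.
rows = np.array([996, 236, 26])
cols = np.array([965, 1015, 315])
ax1 = np.array([12, 8, 4])
ax2 = np.array([8, 8, 7])
values = np.array([10000, 10001, 10002])

arr = np.zeros((1024, 1024, 16, 16), dtype=np.int64)
arr[rows, cols, ax1, ax2] = values  # vectorized scatter of all rows
```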
I've tried df.collect() to bring the data to the driver and then iterate over the rows to write into `arr`, but the driver machine runs out of memory.
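For reference, a minimal sketch of the collect-based attempt described above, using the question's `df` and assuming the column names shown in the table (`row`, `col`, `ax1`, `ax2`, `value`); the integer dtype is an assumption:

```python
import numpy as np

# Sketch of the collect-based attempt. Column names are taken from the
# table above; dtype=np.int64 is an assumption about the value column.
arr = np.zeros((1024, 1024, 16, 16), dtype=np.int64)

# df.collect() materializes every row on the driver at once,
# which is what exhausts driver memory for 100M+ rows.
for r in df.collect():
    arr[r["row"], r["col"], r["ax1"], r["ax2"]] = r["value"]
```

For 100 million rows, the list of Row objects returned by collect() can exceed driver memory before the loop even starts, which matches the failure described.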
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.