Convert PySpark DataFrame to multi-dimensional NumPy array
I have a DataFrame `df` of over 100 million rows; `df` basically looks like this:
| row | col  | ax1 | ax2 | value |
|-----|------|-----|-----|-------|
| 996 | 965  | 12  | 8   | 10000 |
| 236 | 1015 | 8   | 8   | 10001 |
| 26  | 315  | 4   | 7   | 10002 |
| ... | ...  | ... | ... | ...   |
I'm trying to convert it to a NumPy array with shape (1024, 1024, 16, 16) and save it on the driver. The expected NumPy array `arr` would contain:
```
arr[996, 965, 12, 8] = 10000
arr[236, 1015, 8, 8] = 10001
...
```
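In other words, each DataFrame row specifies one element assignment, with `row`, `col`, `ax1`, and `ax2` indexing the four dimensions of `arr`. If the five columns were already available as flat local arrays, this mapping would be a single NumPy fancy-index scatter; a minimal sketch (the local arrays here are hypothetical stand-ins built from the sample rows above):

```python
import numpy as np

# Hypothetical flat local copies of the five columns,
# using the three sample rows from the table above.
rows = np.array([996, 236, 26])
cols = np.array([965, 1015, 315])
ax1 = np.array([12, 8, 4])
ax2 = np.array([8, 8, 7])
values = np.array([10000, 10001, 10002])

arr = np.zeros((1024, 1024, 16, 16), dtype=np.int64)
arr[rows, cols, ax1, ax2] = values  # vectorized scatter of all rows
```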
I've tried df.collect() to bring the data to the driver and then iterate over the rows to write into `arr`, but the driver machine runs out of memory.
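For reference, a minimal sketch of the collect-based attempt described above, using the question's `df` and assuming the column names shown in the table (`row`, `col`, `ax1`, `ax2`, `value`); the integer dtype is an assumption:

```python
import numpy as np

# Sketch of the collect-based attempt. Column names are taken from the
# table above; dtype=np.int64 is an assumption about the value column.
arr = np.zeros((1024, 1024, 16, 16), dtype=np.int64)

# df.collect() materializes every row on the driver at once,
# which is what exhausts driver memory for 100M+ rows.
for r in df.collect():
    arr[r["row"], r["col"], r["ax1"], r["ax2"]] = r["value"]
```

For 100 million rows, the list of Row objects returned by collect() can exceed driver memory before the loop even starts, which matches the failure described.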
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.