Convert PySpark DataFrame to multi-dimensional NumPy array
I have a DataFrame df with over 100 million rows. df basically looks like this:
| row | col | ax1 | ax2 | value |
|---|---|---|---|---|
| 996 | 965 | 12 | 8 | 10000 |
| 236 | 1015 | 8 | 8 | 10001 |
| 26 | 315 | 4 | 7 | 10002 |
| ... | ... | ... | ... | ... |
I'm trying to convert it to a NumPy array with shape (1024, 1024, 16, 16) and save it on the driver. The content of the expected NumPy array arr would be:
arr[996, 965, 12, 8] = 10000
arr[236, 1015, 8, 8] = 10001
...
I've tried using df.collect() to bring the data to the driver and iterating over the rows to fill arr, but the driver machine runs out of memory.
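For reference, a minimal sketch of that collect-and-iterate attempt is shown below (the column names row, col, ax1, ax2, value and the int64 dtype are assumptions based on the table above):

```python
import numpy as np

# Pre-allocate the target array on the driver
# (1024 * 1024 * 16 * 16 int64 cells, about 2 GiB on its own).
arr = np.zeros((1024, 1024, 16, 16), dtype=np.int64)

# collect() materializes every row as a Python object on the driver,
# which is where the out-of-memory error occurs with 100M+ rows.
for r in df.select("row", "col", "ax1", "ax2", "value").collect():
    arr[r["row"], r["col"], r["ax1"], r["ax2"]] = r["value"]
```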
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow