When is Spark groupBy preferred over reduceByKey?
My dataset is pretty big, and I would like to understand when groupBy makes sense over reduceByKey.
Solution 1:[1]
reduceByKey performs a map-side combine: values for the same key are merged within each partition before the shuffle. This reduces the amount of data sent over the network during the shuffle, and therefore also the amount of data left to reduce afterwards. Where possible, use reduceByKey.
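To make the difference concrete, here is a minimal sketch in plain Python that only simulates the two strategies; it does not use Spark. The `partitions` data and the helper names `group_by_key_style` and `reduce_by_key_style` are made up for illustration, and the "shuffle" is just a list whose length stands in for the number of records sent over the network.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1), ("b", 1)],
]


def group_by_key_style(parts):
    """Shuffle every record unchanged, then sum on the receiving side."""
    shuffled = [pair for part in parts for pair in part]  # every record is sent
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def reduce_by_key_style(parts):
    """Sum within each partition first, then shuffle only the partial sums."""
    shuffled = []
    for part in parts:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value  # the map-side combine
        shuffled.extend(partial.items())  # at most one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_totals, grouped_shuffled = group_by_key_style(partitions)
reduced_totals, reduced_shuffled = reduce_by_key_style(partitions)

print(grouped_totals, grouped_shuffled)
print(reduced_totals, reduced_shuffled)
```

Both functions produce the same totals, but the second one sends fewer records through the simulated shuffle. In PySpark the corresponding calls would be roughly `rdd.groupByKey().mapValues(sum)` versus `rdd.reduceByKey(lambda x, y: x + y)`; the saving grows with the number of repeated keys per partition.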
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Amar Singh |