How to divide two aggregate sums of a DataFrame
I want to divide the sums of two columns in PySpark. For example, I have a dataset like the one below:
A B C
1 1 2 3
2 1 2 3
3 1 2 3
What I want is the sum of column B divided by the sum of column A, as below:
6 (Sum of colB) / 3 (Sum of colA) = 2
I have tried this:
sumofA = df.groupby().sum('A')
sumofB = df.groupby().sum('B')
Result = sumofB / sumofA
but it produces this error:
TypeError: unsupported operand type(s) for /: 'DataFrame' and 'DataFrame'
Solution 1:[1]
Your idea was right, but each groupby().sum() call returns a DataFrame rather than a number, and two DataFrames cannot be divided with /. Do the division inside a single aggregation instead:
from pyspark.sql import functions as F
df.groupBy().agg(F.sum("B")/F.sum("A")).show()
+-----------------+
|(sum(B) / sum(A))|
+-----------------+
| 2.0|
+-----------------+
Or you can pull the result out as a plain Python value using collect()[0][0]:
from pyspark.sql import functions as F
a=df.groupBy().agg(F.sum("B")/F.sum("A")).collect()[0][0]
a
Out[5]: 2.0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | murtihash |