'Groupby & Sum - Create new column with added If Condition
I have the below DataFrame:
ID Start End Variance
1 100000 120000 20000
1 1 0 -1
1 7815.58 7815.58 0
1 5261 5261 0
1 138783.2 89969.37 -48813.83
1 2459.92 2459.92 0
2 101421.99 93387.45 -8034.54
2 940.04 940.04 0
2 63.06 63.06 0
2 2454.86 2454.86 0
2 830 830 0
2 299 299 0
2 14000 12000 2000
2 1500 500 1000
I want to create a new column, Overspend Total
. But I only want to sum the values that are greater than 0. The resulting DataFrame will look like this:
ID Start End Variance Overspend Total
1 100000 120000 20000 20000
1 1 0 -1 20000
1 7815.58 7815.58 0 20000
1 5261 5261 0 20000
1 138783.2 89969.37 -48813.83 20000
1 2459.92 2459.92 0 20000
2 101421.99 93387.45 -8034.54 3000
2 940.04 940.04 0 3000
2 63.06 63.06 0 3000
2 2454.86 2454.86 0 3000
2 830 830 0 3000
2 299 299 0 3000
2 14000 12000 2000 3000
2 1500 500 1000 3000
I tried the following
df['Overspend Variance'] = df[df['Variance'] > 0].groupby(df['ID']).transform('sum')
But I'm getting the below error:
ValueError: Wrong number of items passed 8, placement implies 1
I know df['Overspend Variance'] = df['Variance'].groupby(df['ID']).transform('sum')
would work without a condition, but I can't figure out how to incorporate it with an extra condition.
Solution 1:[1]
We can use Series.where
to replace the values that don't match the condition with NaN
, then just groupby transform
'sum' since NaN
values are ignored by 'sum' by default:
df['Overspend Total'] = (
df['Variance'].where(df['Variance'] > 0).groupby(df['ID']).transform('sum')
)
Or explicitly replace with the additive identity (0) which will not affect the sum:
df['Overspend Total'] = (
df['Variance'].where(df['Variance'] > 0, 0)
.groupby(df['ID']).transform('sum')
)
Or with a lambda
inside groupby transform
:
df['Overspend Total'] = df.groupby('ID')['Variance'].transform(
lambda s: s[s > 0].sum()
)
In any case df
is:
ID Start End Variance Overspend Total
0 1 100000.00 120000.00 20000.00 20000.0
1 1 1.00 0.00 -1.00 20000.0
2 1 7815.58 7815.58 0.00 20000.0
3 1 5261.00 5261.00 0.00 20000.0
4 1 138783.20 89969.37 -48813.83 20000.0
5 1 2459.92 2459.92 0.00 20000.0
6 2 101421.99 93387.45 -8034.54 3000.0
7 2 940.04 940.04 0.00 3000.0
8 2 63.06 63.06 0.00 3000.0
9 2 2454.86 2454.86 0.00 3000.0
10 2 830.00 830.00 0.00 3000.0
11 2 299.00 299.00 0.00 3000.0
12 2 14000.00 12000.00 2000.00 3000.0
13 2 1500.00 500.00 1000.00 3000.0
Solution 2:[2]
it could be done by filter the value less than 0 than group by and reassign
df = df.join(df[df.Variance>=0].groupby("ID")["Variance"].agg(sum), on="ID", rsuffix="total")
df.columns = ["ID", "Start", "End", "Variance", "Overspend Total"]
ID Start End Variance Overspend Total
0 1 100000.00 120000.00 20000.00 20000.0
1 1 1.00 0.00 -1.00 20000.0
2 1 7815.58 7815.58 0.00 20000.0
3 1 5261.00 5261.00 0.00 20000.0
4 1 138783.20 89969.37 -48813.83 20000.0
5 1 2459.92 2459.92 0.00 20000.0
6 2 101421.99 93387.45 -8034.54 3000.0
7 2 940.04 940.04 0.00 3000.0
8 2 63.06 63.06 0.00 3000.0
9 2 2454.86 2454.86 0.00 3000.0
10 2 830.00 830.00 0.00 3000.0
11 2 299.00 299.00 0.00 3000.0
12 2 14000.00 12000.00 2000.00 3000.0
13 2 1500.00 500.00 1000.00 3000.0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Henry Ecker |
Solution 2 | galaxyan |