'Add a column to pandas dataframe containing the proportions for a particular column, based on grouping column
I have some data for which I want to do the following:
- group by a set of columns G
- for each grouping find the proportion of a particular column within the group
- return the full data with the additional proportion column
I'm not sure what a decent approach to this is though, this is something that I tried:
data = pd.DataFrame(
{
"x": [1, 2, 3, 4] + [4, 5, 6, 7],
"y": ["a"] * 4 + ["b"] * 4,
}
)
gives
x y
0 1 a
1 2 a
2 3 a
3 4 a
4 4 b
5 5 b
6 6 b
7 7 b
then
pd.concat(
[
data,
data.groupby("y")
.apply(lambda df: df["x"].div(df["x"].sum()))
.reset_index()
.rename(columns={"x": "proportion"})
.drop(["y", "level_1"], axis=1),
],
axis=1,
)
gives
x y proportion
0 1 a 0.100000
1 2 a 0.200000
2 3 a 0.300000
3 4 a 0.400000
4 4 b 0.181818
5 5 b 0.227273
6 6 b 0.272727
7 7 b 0.318182
Solution 1:[1]
I think you can do it more easily with:
data["proportion"] = data["x"] / data.groupby("y")["x"].transform("sum")
print(data.to_markdown())
Prints:
x | y | proportion | |
---|---|---|---|
0 | 1 | a | 0.1 |
1 | 2 | a | 0.2 |
2 | 3 | a | 0.3 |
3 | 4 | a | 0.4 |
4 | 4 | b | 0.181818 |
5 | 5 | b | 0.227273 |
6 | 6 | b | 0.272727 |
7 | 7 | b | 0.318182 |
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Andrej Kesely |