Splitting and grouping a pandas DataFrame into intervals and calculating the mean of a different column
I have the well-known Titanic dataset and I am trying to find the survival probability of a person based on their age and sex. The inputs I am given are the number of intervals the dataset will be split into (the split is on Age), an age, and a sex. Some Age values are missing, so I should fill them with the mean of the other Age records.
The resulting dataset should look something like this.
"AgeInterval" | "Sex" | "Survival Probability" |
---|---|---|
(1.977, 13.5] | "male" | 0.21 |
(1.977, 13.5] | "female" | 0.28 |
(13.5, 25.0] | "male" | 0.10 |
(13.5, 25.0] | "female" | 0.15 |
From this, I have to find the probability based on age and sex.
So far I've tried:
df = df.fillna(df["Age"].mean())
to fill the values
df["AgeInterval"] = pd.cut(df.Age, bins=n_interval, right=True)
to create the intervals
df = df.groupby(['AgeInterval', 'Sex'])
to group the intervals along with sex,
df = df.agg({'Survived' : 'mean'})
to calculate mean of Survived
Although this is giving me some results, the results are wrong and I can't find the right solution for this problem.
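For reference, the steps described above can be sketched end to end as follows. This is only a minimal sketch under stated assumptions: the small DataFrame is made-up stand-in data (not real Titanic records), `n_interval = 3` is an arbitrary choice, and `summary` is a hypothetical name. It differs from the attempt above in two ways: only the Age column is filled (calling `fillna` on the whole frame would also fill missing values in every other column with the mean age), and the grouped result is kept under a separate name with the keys turned back into columns.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data, NOT real Titanic records.
df = pd.DataFrame({
    "Age": [2.0, 10.0, np.nan, 30.0, 40.0, 50.0, 22.0, 38.0],
    "Sex": ["male", "female", "male", "female",
            "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})
n_interval = 3  # arbitrary number of age intervals for this sketch

# Fill missing ages only in the Age column.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Cut Age into n_interval equal-width, right-closed intervals.
df["AgeInterval"] = pd.cut(df["Age"], bins=n_interval, right=True)

# Mean of Survived per (interval, sex); reset_index turns the group
# keys back into ordinary columns.
summary = (
    df.groupby(["AgeInterval", "Sex"], observed=True)["Survived"]
      .mean()
      .reset_index(name="Survival Probability")
)
print(summary)
```

Note that `pd.cut` with an integer `bins` produces equal-width intervals, so some intervals may hold far fewer passengers than others; `pd.qcut` would give equal-sized groups instead, if that is what is wanted.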
The other problem is looking up the value for a given age and sex, for which I tried the following:
result = df.loc[(df["AgeInterval"]==age)&(df["Sex"]==sex)]
But this only raises KeyError. I don't know why, because when I print df, I can see AgeInterval and Sex.
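A likely explanation for the KeyError is that after `groupby(...).agg(...)` the group keys become the index of the result rather than columns, so they show up when printing but `df["AgeInterval"]` cannot find them. In addition, comparing an interval with `==` against a single age would not match; the lookup needs to test whether the age falls inside the interval. The sketch below illustrates both points on made-up stand-in data (not real Titanic records); `grouped`, `flat`, and `survival_probability` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up stand-in data, NOT real Titanic records.
df = pd.DataFrame({
    "Age": [2.0, 10.0, 20.0, 30.0, 40.0, 50.0, 22.0, 38.0],
    "Sex": ["male", "female", "male", "female",
            "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})
df["AgeInterval"] = pd.cut(df["Age"], bins=3, right=True)

grouped = df.groupby(["AgeInterval", "Sex"], observed=True).agg({"Survived": "mean"})

# The keys are index levels here, not columns, hence the KeyError.
print(grouped.columns.tolist())  # ['Survived']
print(list(grouped.index.names))

# Turn the index levels back into ordinary columns.
flat = grouped.reset_index()


def survival_probability(flat, age, sex):
    """Return the mean of Survived for the interval containing `age`
    and the given `sex`, or None if no row matches."""
    # Interval objects support `in`, so test containment, not equality.
    in_interval = pd.Series(
        [age in interval for interval in flat["AgeInterval"]],
        index=flat.index,
    )
    match = flat.loc[in_interval & (flat["Sex"] == sex), "Survived"]
    return float(match.iloc[0]) if not match.empty else None


print(survival_probability(flat, 25, "female"))
```

Passing `as_index=False` to `groupby` is an alternative to `reset_index()` that keeps the keys as columns from the start.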
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow