How do I remove outliers from a pandas DataFrame?
If I want to remove values that do not lie between -2σ and 2σ, how do I remove outliers using the IQR?
I implemented this as follows.
iqr = df['abc'].percentile(0.75) - df['abc'].percentile(0.25)
cond1 = (df['abc'] > df['abc'].percentile(0.75) + 2 * iqr)
cond2 = (df['abc'] < df['abc'].percentile(0.25) - 2 * iqr)
df[cond1 & cond2]
Is this the right way?
Solution 1:[1]
This is not right. Your iqr is almost never equal to σ. Quartiles and standard deviations are not the same thing.
Fortunately, you can easily compute the standard deviation of a numerical Series using Series.std().
sigma = df['abc'].std()
cond1 = (df['abc'] > df['abc'].mean() - 2 * sigma)
cond2 = (df['abc'] < df['abc'].mean() + 2 * sigma)
df[cond1 & cond2]
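For comparison, here is a minimal, self-contained sketch of both approaches. The sample data and the helper names filter_sigma and filter_iqr are made up for illustration. The IQR multiplier k=1.5 is the conventional Tukey fence, an assumption here rather than something from the question, which used 2. Note that pandas exposes quartiles through Series.quantile(), and that the two conditions must describe the range you want to keep: a value can never be above the upper bound and below the lower bound at the same time, so combining those two tests with & selects nothing.

```python
import pandas as pd

# Hypothetical sample data: nine ordinary values and one obvious outlier.
df = pd.DataFrame({"abc": [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]})


def filter_sigma(frame, col, k=2.0):
    """Keep rows whose value lies strictly within mean +/- k standard deviations."""
    mean = frame[col].mean()
    sigma = frame[col].std()  # sample standard deviation (ddof=1) by default
    mask = (frame[col] > mean - k * sigma) & (frame[col] < mean + k * sigma)
    return frame[mask]


def filter_iqr(frame, col, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[col].quantile(0.25)
    q3 = frame[col].quantile(0.75)
    iqr = q3 - q1
    mask = (frame[col] >= q1 - k * iqr) & (frame[col] <= q3 + k * iqr)
    return frame[mask]


kept_sigma = filter_sigma(df, "abc")
kept_iqr = filter_iqr(df, "abc")
print(kept_sigma)
print(kept_iqr)
```

On this sample both filters drop only the row containing 100, but in general they will not agree, because the two rules measure spread differently.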
Solution 2:[2]
You can use the neulab Python library (https://pypi.org/project/neulab).
It provides several methods for detecting and removing outliers, for example the Chauvenet algorithm:
import pandas as pd
from neulab.OutlierDetection import Chauvenet
# Sample data: col1 contains a few extreme values, col2 is constant
d = {'col1': [8.02, 8.16, 3.97, 8.64, 0.84, 4.46, 0.81, 7.74, 8.78, 9.26, 20.46, 29.87, 10.38, 25.71], 'col2': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data=d)
chvn = Chauvenet(dataframe=df, info=True, autorm=True)
Output: Detected outliers: {'col1': [29.87, 25.71, 20.46, 0.84, 0.81, 3.97, 4.46, 10.38, 7.74, 9.26]}
col1 col2
0 8.02 1
1 8.16 1
3 8.64 1
8 8.78 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Benjamin Rio |
Solution 2 | kndahl |