'pandas: multiple conditions while indexing data frame - unexpected behavior
I am filtering rows in a dataframe by values in two columns.
For some reason the OR operator behaves like I would expect AND operator to behave and vice versa.
My test code:
import pandas as pd
df = pd.DataFrame({'a': range(5), 'b': range(5) })
# let's insert some -1 values
df['a'][1] = -1
df['b'][1] = -1
df['a'][3] = -1
df['b'][4] = -1
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a != -1) | (df.b != -1)]
print pd.concat([df, df1, df2], axis=1,
keys = [ 'original df', 'using AND (&)', 'using OR (|)',])
And the result:
original df using AND (&) using OR (|)
a b a b a b
0 0 0 0 0 0 0
1 -1 -1 NaN NaN NaN NaN
2 2 2 2 2 2 2
3 -1 3 NaN NaN -1 3
4 4 -1 NaN NaN 4 -1
[5 rows x 6 columns]
As you can see, the AND
operator drops every row in which at least one value equals -1
. On the other hand, the OR
operator requires both values to be equal to -1
to drop them. I would expect exactly the opposite result. Could anyone explain this behavior, please?
I am using pandas 0.13.1.
Solution 1:[1]
As you can see, the AND operator drops every row in which at least one value equals -1. On the other hand, the OR operator requires both values to be equal to -1 to drop them.
That's right. Remember that you're writing the condition in terms of what you want to keep, not in terms of what you want to drop. For df1
:
df1 = df[(df.a != -1) & (df.b != -1)]
You're saying "keep the rows in which df.a
isn't -1 and df.b
isn't -1", which is the same as dropping every row in which at least one value is -1.
For df2
:
df2 = df[(df.a != -1) | (df.b != -1)]
You're saying "keep the rows in which either df.a
or df.b
is not -1", which is the same as dropping rows where both values are -1.
PS: chained access like df['a'][1] = -1
can get you into trouble. It's better to get into the habit of using .loc
and .iloc
.
Solution 2:[2]
Solution 3:[3]
A little mathematical logic theory here:
"NOT a AND NOT b" is the same as "NOT (a OR b)", so:
"a NOT -1 AND b NOT -1" is equivalent of "NOT (a is -1 OR b is -1)", which is opposite (Complement) of "(a is -1 OR b is -1)".
So if you want exact opposite result, df1 and df2 should be as below:
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a == -1) | (df.b == -1)]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | DSM |
Solution 2 | |
Solution 3 | automa7 |