'How to subset Pandas Dataframe using an OR operator whilst avoiding "FutureWarning: elementwise comparison failed;"
I have a Pandas dataframe (tempDF) of 5 columns by N rows. Each element of the dataframe is an object (string in this case). For example, the dataframe looks like (this is fake data - not real world):
I have two tuples, each contains a collection of numbers as a string type. For example:
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
I want to retrieve a subset of the rows from the tempDF in which the columns "medcode" OR "enttype" have at least one entry in the tuples above. Thus, from the example above, I would retrieve a subset containing rows with the index 8 and 9 and 11.
Until updating some packages earlier today (too many now to work out which has started throwing the warning), this did work:
tempDF = tempDF[tempDF["medcode"].isin(codeSet) | tempDF["enttype"].isin(additionalClinicalCodes)]
But now it is throwing the warning:
FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)
Looking at the API, isin states the the condition "if ALL" is in the iterable collection. I want an "if ANY" condition.
UPDATE #1
The problem lies with using the |
operator, also the np.logical_or
method. If I remove the second isin
condition i.e., just keep tempDF[tempDF["medcode"].isin(codeSet)
then no warning is thrown but I'm only subsetting on the one possible condition.
Solution 1:[1]
import numpy as np
tempDF = tempDF[np.logical_or(tempDF["medcode"].isin(codeSet), tempDF["enttype"].isin(additionalClinicalCodes))
Solution 2:[2]
I'm unable to reproduce your warning (I assume you are using an outdated numpy version), however I believe it is related to the fact that your enttype
column is a numerical type, but you're using strings in additionalClinicalCodes
.
Solution 3:[3]
Try this:
tempDF = temp[temp["medcode"].isin(list(codeset)) | temp["enttype"].isin(list(additionalClinicalCodes))]
Solution 4:[4]
Boiling your question down to an executable example:
import pandas as pd
tempDF = pd.DataFrame({'medcode': ['6108', '6154', '95744', '98120'], 'enttype': ['99', '131', '372', '372']})
codeset = ('6108','532','98120')
additionalClinicalCodes = ('131','1','120','130')
newDF = tempDF[tempDF["medcode"].isin(codeset) | tempDF["enttype"].isin(additionalClinicalCodes)]
print(newDF)
print("Pandas Version")
print(pd.__version__)
This returns for me
medcode enttype
0 6108 99
1 6154 131
3 98120 372
Pandas Version
1.4.2
Thus I am not able to reproduce your warning.
Solution 5:[5]
This is a numpy
strange behaviour. I think the right way to do this is yours way, but if the warning bothers you, try this:
tempDF = tempDF[
(
tempDF.medcode.isin(codeset).astype(int) +
tempDF.isin(additionalClinicalCode).astype(int)
) >= 1
]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | safay |
Solution 2 | Xnot |
Solution 3 | Alex |
Solution 4 | jugi |
Solution 5 | Andrey Naradzetski |