'Search for "does-not-contain" on a DataFrame in pandas
I've done some searching and can't figure out how to filter a dataframe by
df["col"].str.contains(word)
however I'm wondering if there is a way to do the reverse: filter a dataframe by that set's compliment. eg: to the effect of
!(df["col"].str.contains(word))
Can this be done through a DataFrame
method?
Solution 1:[1]
You can use the invert (~) operator (which acts like a not for boolean data):
new_df = df[~df["col"].str.contains(word)]
where new_df
is the copy returned by RHS.
contains also accepts a regular expression...
If the above throws a ValueError, the reason is likely because you have mixed datatypes, so use na=False
:
new_df = df[~df["col"].str.contains(word, na=False)]
Or,
new_df = df[df["col"].str.contains(word) == False]
Solution 2:[2]
I was having trouble with the not (~) symbol as well, so here's another way from another StackOverflow thread:
df[df["col"].str.contains('this|that')==False]
Solution 3:[3]
You can use Apply and Lambda :
df[df["col"].apply(lambda x: word not in x)]
Or if you want to define more complex rule, you can use AND:
df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
Solution 4:[4]
I hope the answers are already posted
I am adding the framework to find multiple words and negate those from dataFrame.
Here 'word1','word2','word3','word4'
= list of patterns to search
df
= DataFrame
column_a
= A column name from DataFrame df
values_to_remove = ['word1','word2','word3','word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
Solution 5:[5]
I had to get rid of the NULL values before using the command recommended by Andy above. An example:
df = pd.DataFrame(index = [0, 1, 2], columns=['first', 'second', 'third'])
df.ix[:, 'first'] = 'myword'
df.ix[0, 'second'] = 'myword'
df.ix[2, 'second'] = 'myword'
df.ix[1, 'third'] = 'myword'
df
first second third
0 myword myword NaN
1 myword NaN myword
2 myword myword NaN
Now running the command:
~df["second"].str.contains(word)
I get the following error:
TypeError: bad operand type for unary ~: 'float'
I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
Solution 6:[6]
Additional to nanselm2's answer, you can use 0
instead of False
:
df["col"].str.contains(word)==0
Solution 7:[7]
To negate your query use ~
. Using query
has the advantage of returning the valid observations of df
directly:
df.query('~col.str.contains("word").values')
Solution 8:[8]
To compliment to the above question, if someone wants to remove all the rows with strings, one could do:
df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Shaido |
Solution 3 | |
Solution 4 | |
Solution 5 | |
Solution 6 | U12-Forward |
Solution 7 | rachwa |
Solution 8 |