Check if pair of values is in pair of columns in pandas
Basically, I have latitude and longitude (on a grid) in two different columns. I am getting fed two-element lists (could be numpy arrays) of a new coordinate set and I want to check if it is a duplicate before I add it.
For example, my data:
df = pd.DataFrame([[4,8, 'wolf', 'Predator', 10],
[5,6,'cow', 'Prey', 10],
[8, 2, 'rabbit', 'Prey', 10],
[5, 3, 'rabbit', 'Prey', 10],
[3, 2, 'cow', 'Prey', 10],
[7, 5, 'rabbit', 'Prey', 10]],
columns = ['lat', 'long', 'name', 'kingdom', 'energy'])
newcoords1 = [4,4]
newcoords2 = [7,5]
Is it possible to write one if statement to tell me whether there is already a row with that latitude and longitude? In pseudo code:
if newcoords1 in df['lat', 'long']:
    print('yes! ' + str(newcoords1))
(In the example, newcoords1 should be false and newcoords2 should be true.)
Sidenote: (newcoords1[0] in df['lat']) & (newcoords1[1] in df['long']) doesn't work because that checks them independently, but I need to know if that combination appears in a single row.
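A side detail on the expression above: `in` applied to a pandas Series tests the index labels rather than the values, so it can mislead even for a single column. The following is a minimal sketch using only the two coordinate columns from the example; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"lat": [4, 5, 8, 5, 3, 7],
                   "long": [8, 6, 2, 3, 2, 5]})

# `in` on a Series looks at the index labels (0..5 here), not the values.
label_hit = 5 in df["lat"]           # True only because 5 is an index label
label_miss = 8 in df["lat"]          # False: 8 is a value, not an index label
value_hit = 8 in df["lat"].values    # True: 8 appears among the values

# Even value-wise checks per column ignore row alignment:
# lat 4 exists (row 0) and long 5 exists (row 5), yet no row holds (4, 5).
independent = (4 in df["lat"].values) and (5 in df["long"].values)
same_row = bool(((df["lat"] == 4) & (df["long"] == 5)).any())
```

Here `independent` is true while `same_row` is false, which is exactly the gap between checking columns separately and checking a combination within one row.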
Thank you in advance!
Solution 1:[1]
You can do it this way:
In [140]: df.query('@newcoords2[0] == lat and @newcoords2[1] == long')
Out[140]:
   lat  long    name kingdom  energy
5    7     5  rabbit    Prey      10
In [146]: df.query('@newcoords2[0] == lat and @newcoords2[1] == long').empty
Out[146]: False
The following line returns the number of matching rows:
In [147]: df.query('@newcoords2[0] == lat and @newcoords2[1] == long').shape[0]
Out[147]: 1
Or, using a NumPy approach:
In [103]: df[(df[['lat','long']].values == newcoords2).all(axis=1)]
Out[103]:
   lat  long    name kingdom  energy
5    7     5  rabbit    Prey      10
This shows whether at least one matching row has been found:
In [113]: (df[['lat','long']].values == newcoords2).all(axis=1).any()
Out[113]: True
In [114]: (df[['lat','long']].values == newcoords1).all(axis=1).any()
Out[114]: False
Explanation:
In [104]: df[['lat','long']].values == newcoords2
Out[104]:
array([[False, False],
[False, False],
[False, False],
[False, False],
[False, False],
[ True, True]], dtype=bool)
In [105]: (df[['lat','long']].values == newcoords2).all(axis=1)
Out[105]: array([False, False, False, False, False, True], dtype=bool)
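The explanation above can be put together as a self-contained sketch that answers the original "one if statement" request. The helper name `coords_exist` is hypothetical and not part of pandas or NumPy; it only wraps the broadcasting comparison shown in the transcript.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[4, 8, 'wolf', 'Predator', 10],
     [5, 6, 'cow', 'Prey', 10],
     [8, 2, 'rabbit', 'Prey', 10],
     [5, 3, 'rabbit', 'Prey', 10],
     [3, 2, 'cow', 'Prey', 10],
     [7, 5, 'rabbit', 'Prey', 10]],
    columns=['lat', 'long', 'name', 'kingdom', 'energy'])


def coords_exist(frame, coords):
    """Hypothetical helper: True if any row has (lat, long) equal to coords."""
    # An (n, 2) array compared with a length-2 array broadcasts row by row.
    elementwise = frame[['lat', 'long']].to_numpy() == np.asarray(coords)
    # A row matches only if both of its columns match.
    return bool(elementwise.all(axis=1).any())


for candidate in ([4, 4], [7, 5]):
    if coords_exist(df, candidate):
        print('yes! ' + str(candidate))
```

Because the comparison goes through `np.asarray`, the helper should accept either a two-element list or a NumPy array, as described in the question.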
Solution 2:[2]
For people like me who came here searching for how to check whether several pairs of values are in a pair of columns of a big dataframe, here is an answer.
Suppose you have a list newscoord = [newscoord1, newscoord2, ...] and you want to extract the rows of df matching the elements of this list. Then, for the example above:
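# A separator keeps distinct pairs distinct: without it, (1, 23) and (12, 3) would both become '123'.
v = pd.Series( [ str(i) + ',' + str(j) for i,j in df[['lat', 'long']].values ] )
w = [ str(i) + ',' + str(j) for i,j in newscoord ]
df[ v.isin(w) ]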
This gives the same output as @MaxU's answer, but it can extract several rows at once.
On my computer, for a df with 10,000 rows, it takes 0.04s to run.
Of course, if your elements are already strings, it is simpler to use join instead of concatenation.
Furthermore, if the order of elements in the pair does not matter, you have to sort first:
v = pd.Series( [ str(i) + ',' + str(j) for i,j in np.sort( df[['lat','long']].values ) ] )
w = [ str(i) + ',' + str(j) for i,j in np.sort( newscoord ) ]
Note that if v is not converted into a Series and one uses np.isin(v, w) instead, or if w is converted into a Series, the run time increases once newscoord reaches thousands of elements.
Hope it helps.
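String keys need some care: without a separator, distinct pairs such as (1, 23) and (12, 3) would produce the same key. One way to sidestep that is to use tuples as keys instead of strings. The sketch below is an alternative to the string approach, not the answer author's method; it uses a trimmed-down version of the example data, and `newscoord` holds made-up pairs for illustration.

```python
import pandas as pd

df = pd.DataFrame({'lat': [4, 5, 8, 5, 3, 7],
                   'long': [8, 6, 2, 3, 2, 5],
                   'name': ['wolf', 'cow', 'rabbit', 'rabbit', 'cow', 'rabbit']})
newscoord = [[4, 4], [7, 5], [2, 8]]  # illustrative pairs

# One (lat, long) tuple per row; tuples keep the two values separate.
keys = pd.Series(list(zip(df['lat'].tolist(), df['long'].tolist())),
                 index=df.index)

# Ordered match: (7, 5) matches a row, (2, 8) does not match (8, 2).
wanted = {tuple(pair) for pair in newscoord}
ordered = df[keys.map(lambda k: k in wanted)]

# Order-insensitive match: sort each pair on both sides before comparing.
wanted_sorted = {tuple(sorted(pair)) for pair in newscoord}
unordered = df[keys.map(lambda k: tuple(sorted(k)) in wanted_sorted)]
```

With this data, `ordered` keeps only the (7, 5) row, while `unordered` also keeps the (8, 2) row because it matches [2, 8] once order is ignored. Set membership per row is simple to reason about, though it has not been timed against the string version here.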
Solution 3:[3]
>>> x, y = newcoords1
>>> df[(df.lat == x) & (df.long == y)].empty
True  # Coordinates are not in the dataframe, so you can add them.
>>> x, y = newcoords2
>>> df[(df.lat == x) & (df.long == y)].empty
False  # Coordinates already exist.
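Since the question is about checking for a duplicate before adding a row, the boolean-mask check can be combined with the add step. This is a minimal sketch: `add_if_new` is a hypothetical helper, and the 'fox' rows are made-up data for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 8, 'wolf', 'Predator', 10],
     [5, 6, 'cow', 'Prey', 10],
     [8, 2, 'rabbit', 'Prey', 10],
     [5, 3, 'rabbit', 'Prey', 10],
     [3, 2, 'cow', 'Prey', 10],
     [7, 5, 'rabbit', 'Prey', 10]],
    columns=['lat', 'long', 'name', 'kingdom', 'energy'])


def add_if_new(frame, row):
    """Hypothetical helper: append `row` (a dict) unless its lat/long pair exists."""
    # Both conditions must hold in the same row for it to count as a duplicate.
    exists = ((frame['lat'] == row['lat']) & (frame['long'] == row['long'])).any()
    if exists:
        return frame
    return pd.concat([frame, pd.DataFrame([row])], ignore_index=True)


# (7, 5) is already occupied, so this row is skipped.
df = add_if_new(df, {'lat': 7, 'long': 5, 'name': 'fox',
                     'kingdom': 'Predator', 'energy': 10})
# (4, 4) is free, so this row is appended.
df = add_if_new(df, {'lat': 4, 'long': 4, 'name': 'fox',
                     'kingdom': 'Predator', 'energy': 10})
```

Returning a new frame rather than mutating in place keeps the helper easy to test; for many insertions in a loop, collecting rows first and concatenating once would likely be cheaper.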
Solution 4:[4]
If you are trying to check several pairs at once, you can put the DataFrame's columns and the values into MultiIndexes and use Index.isin. I believe this is cleaner than concatenating them as strings:
df = pd.DataFrame([[4,8, 'wolf', 'Predator', 10],
[5,6,'cow', 'Prey', 10],
[8, 2, 'rabbit', 'Prey', 10],
[5, 3, 'rabbit', 'Prey', 10],
[3, 2, 'cow', 'Prey', 10],
[7, 5, 'rabbit', 'Prey', 10]],
columns = ['lat', 'long', 'name', 'kingdom', 'energy'])
new_coords = pd.MultiIndex.from_tuples([(4,4), (7,5)])
existing_coords = pd.MultiIndex.from_frame(df[["lat", "long"]])
>>> ~new_coords.isin(existing_coords)  # True marks a pair that is not yet in df
array([ True, False])
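The same two MultiIndexes can also be compared in the other direction, to pull out the rows of df whose coordinates appear in the list. A short sketch, assuming the example data; the extra pair (5, 6) in `pairs` is added only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[4, 8, 'wolf', 'Predator', 10],
     [5, 6, 'cow', 'Prey', 10],
     [8, 2, 'rabbit', 'Prey', 10],
     [5, 3, 'rabbit', 'Prey', 10],
     [3, 2, 'cow', 'Prey', 10],
     [7, 5, 'rabbit', 'Prey', 10]],
    columns=['lat', 'long', 'name', 'kingdom', 'energy'])

pairs = [(4, 4), (7, 5), (5, 6)]  # illustrative candidate coordinates
existing_coords = pd.MultiIndex.from_frame(df[['lat', 'long']])
new_coords = pd.MultiIndex.from_tuples(pairs, names=['lat', 'long'])

# Mask over df's rows: True where the row's (lat, long) is one of `pairs`.
row_mask = existing_coords.isin(new_coords)
matches = df[row_mask]

# Mask over `pairs`: True where the pair is already present in df.
already_present = new_coords.isin(existing_coords)
```

Here `row_mask` has one entry per row of df, while `already_present` has one entry per candidate pair, so the two masks answer different questions and are not interchangeable.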
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | Alexander |
Solution 4 | Xnot |