How to pass a dataframe to isin()

I want to convert a SQL query like

SELECT * FROM df WHERE id IN (SELECT id FROM an_df)

into its Dask equivalent, so I am trying this:

d = df[df['id'].isin(an_df['id'])]

But it is throwing NotImplementedError: passing a 'dask.dataframe.core.DataFrame' to 'isin'.

Then I converted an_df['id'] to a list, like:

d = df[df['id'].isin(list(an_df['id']))]

or

d = df[df['id'].isin(an_df['id'].compute())]

but both of these are very time-consuming.

I want a solution that stays fast in Dask; df has approximately 100 million rows.

Please help me with this. Thanks.



Solution 1:[1]

I recommend adding a minimal reproducible example, which will make solving this particular issue easier:

https://stackoverflow.com/help/minimal-reproducible-example
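For illustration, a minimal sketch of what such an example could look like, assuming both frames are Dask DataFrames built from small pandas stand-ins (the column names and sizes here are made up):

import pandas as pd
import dask.dataframe as dd

# toy stand-ins for the real data; column names and sizes are made up
df = dd.from_pandas(pd.DataFrame({'id': range(1000), 'value': range(1000)}), npartitions=4)
an_df = dd.from_pandas(pd.DataFrame({'id': [3, 7, 42]}), npartitions=1)

# d = df[df['id'].isin(an_df['id'])]  # reproduces the NotImplementedError from the question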

It seems like you are converting the pandas.core.series.Series object returned by an_df['id'].compute() to a list, which is not needed: isin() will take a pandas Series or DataFrame object as its argument. Please see:

https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.isin.html

In your example this should work:

series_to_match = an_df['id'].compute()

d = df[df['id'].isin(series_to_match)]

So you can drop the list() conversion. I expect this to be a little faster since that extra conversion goes away. But there are still things to consider here: depending on the size of an_df['id'].compute(), you may run into trouble, since that statement pulls the resulting Series into the memory of the machine your scheduler is running on.
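As a rough way to gauge that size before pulling the ids to one machine (a sketch, assuming an_df['id'] is a Dask Series; the reporting is purely illustrative):

# estimate how large the id Series would be before materializing it
n_ids = an_df['id'].count().compute()
id_bytes = an_df['id'].memory_usage(deep=True).compute()
print(f"{n_ids} ids, roughly {id_bytes / 1e6:.1f} MB")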

If this series is small enough, you could try using client.scatter to make sure all of your workers have that series persisted in memory; see:

http://distributed.dask.org/en/stable/locality.html
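A sketch of that approach, assuming a distributed Client is available; filter_partition is a made-up helper name, and this relies on the distributed scheduler resolving the scattered future when each task runs:

from dask.distributed import Client

client = Client()              # assumes a running distributed cluster
ids = an_df['id'].compute()    # pandas Series pulled to the client

# broadcast the series so every worker keeps a copy in memory
ids_future = client.scatter(ids, broadcast=True)

# hypothetical helper: filter one partition against the broadcast series
def filter_partition(part, ids):
    return part[part['id'].isin(ids)]

d = df.map_partitions(filter_partition, ids_future, meta=df)

Scattering once keeps the series from being serialized into every single task.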

If that series is a huge object you'll have to tackle this differently.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
[1] Solution 1: matthiasdv