How to pass a dataframe inside isin()
I want to convert a SQL query like
SELECT * FROM df WHERE id IN (SELECT id FROM an_df)
into its Dask equivalent. So I am trying this:
d=df[df['id'].isin(an_df['id'])]
But it is throwing NotImplementedError: passing a 'dask.dataframe.core.DataFrame' to 'isin'
Then I converted an_df['id'] to a list, like
d=df[df['id'].isin(list(an_df['id']))] or d=df[df['id'].isin(an_df['id'].compute())]
but this is very time consuming.
I want a solution that works as fast as Dask does. df has approximately 100 million rows.
Please help me with it. Thanks
Solution 1:[1]
I recommend adding a minimal reproducible example, which will make solving this particular issue easier:
https://stackoverflow.com/help/minimal-reproducible-example
It seems like you are converting the pandas.core.series.Series object returned by an_df['id'].compute() to a list, which is not needed. isin() will take a pandas Series or DataFrame object as its argument. Please see:
https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.isin.html
In your example this should work:
series_to_match = an_df['id'].compute()
d=df[df['id'].isin(series_to_match)]
So you can omit the list() conversion. I expect this to be a little bit faster since that type conversion can be dropped. But there are still things you need to consider here. Depending on the size of an_df['id'].compute(), you may run into trouble, since that statement pulls the resulting Series object into the memory of the machine your scheduler is running on.
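Putting the pieces together, here is a small self-contained sketch of that pattern (the data and column names are made up for illustration):
import pandas as pd
import dask.dataframe as dd

# Toy stand-ins for df and an_df (hypothetical data)
df = dd.from_pandas(pd.DataFrame({'id': range(10), 'val': range(10)}), npartitions=2)
an_df = dd.from_pandas(pd.DataFrame({'id': [2, 5, 7]}), npartitions=1)

# Materialise the (presumably small) id Series once, then filter lazily
series_to_match = an_df['id'].compute()
d = df[df['id'].isin(series_to_match)]
print(d.compute())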
If this Series is small enough, you could try using client.scatter to make sure all of your workers have it persisted in memory; see:
http://distributed.dask.org/en/stable/locality.html
If the Series is a huge object, you'll have to tackle this differently.
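As a rough sketch of the scatter idea (this assumes a dask.distributed Client is available and the Series is small enough to broadcast; the map_partitions wiring is just one way to use the scattered data, not the only one):
from dask.distributed import Client

client = Client()  # connect to (or start) a distributed scheduler

ids = an_df['id'].compute()                           # small pandas Series of ids
ids_on_workers = client.scatter(ids, broadcast=True)  # persist a copy on every worker

# Filter each partition against the scattered Series; meta tells dask that
# the output keeps df's columns and dtypes
d = df.map_partitions(lambda part, s: part[part['id'].isin(s)],
                      ids_on_workers, meta=df._meta)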
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | matthiasdv |