KeyError: "None of [Index(['', ''], dtype='object')] are in the [columns]" when trying to select columns on a dask dataframe

I am creating a dask dataframe from a pandas dataframe using the from_pandas() function. When I try to select two columns from the dask dataframe using the square brackets [[ ]], I am getting a KeyError.

According to the Dask documentation, a dask dataframe supports square-bracket column selection just like a pandas dataframe.

import dask.dataframe as ddf

# data is a pandas dataframe; split it into 30 partitions
dask_df = ddf.from_pandas(data, npartitions=30)

data = data[dask_df[['length', 'country']].apply(
           lambda x: myfunc(x, countries),
           meta=('Boolean'),
           axis=1
       ).compute()].reset_index(drop=True)

This is the error I am getting:

KeyError: "None of [Index(['length', 'country'], dtype='object')] are in the [columns]"

I thought this might have something to do with providing the correct meta for the apply, but the error suggests that the dask dataframe cannot select the two columns, which should happen before the apply.

This works perfectly if I replace "dask_df" with "data" (the pandas df) in the apply line.

Is the index not being preserved when I call from_pandas()?



Solution 1:[1]

Try loading less data at once.

I had the same issue, but when I loaded only a subset of my data, it worked.

With the large dataset, I was able to run print(dask_df.columns) and see e.g.

Index(['apple', 'orange', 'pear'], dtype='object', name='fruit').

But when I ran dask_df.compute() I would get KeyError: "None of [Index(['apple', 'orange', 'pear'], dtype='object')] are in the [columns]".

I knew that the data set was too big for my memory, and was trying dask hoping it would just figure it out for me =) I guess I have more work to do, but in any case I am glad to be in dask!

Solution 2:[2]

As the error states, the columns ['length', 'country'] do not exist in dask_df. Create them first, then run your function.
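To check which requested names are actually missing, a small helper like the one below can compare them against dask_df.columns. This is a minimal sketch: `missing_columns` is a hypothetical name, and the idea that a name might differ only by whitespace or letter case is an assumption, not a diagnosis of the asker's data. A plain list stands in for dask_df.columns so the example runs on its own.

```python
def missing_columns(available, requested):
    """Map each requested name that is absent to its near-matches in `available`."""
    available = [str(c) for c in available]
    report = {}
    for name in requested:
        if name in available:
            continue
        # Near-matches: equal once surrounding whitespace and case are ignored.
        near = [c for c in available if c.strip().lower() == name.strip().lower()]
        report[name] = near
    return report


# Plain list standing in for dask_df.columns (hypothetical values).
columns = ["Length ", "country", "other"]
print(missing_columns(columns, ["length", "country"]))
# {'length': ['Length ']}
```

An empty dict means every requested column is present, so the selection itself should not be what raises the KeyError.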

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    alh
Solution 2    Hrvoje