pandas - filter on groups which have at least one column containing non-null values in a groupby

I have the following python pandas dataframe:

df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})

>>> print(df)
  Id      A    B    C
0  1   TRUE  NaN  NaN
1  1   TRUE  NaN  NaN
2  1   TRUE  abc  NaN
3  2   TRUE  NaN  NaN
4  2   TRUE  NaN  NaN
5  3  FALSE  def  456

I want to end up with the following dataframe:

>>> print(dfout)
  Id     A    B   C
0  1  TRUE  abc NaN

The same Id value can appear on multiple rows. Each Id will either have the value TRUE or FALSE in column A consistently on all its rows. Columns B and C can have any value, including NaN.
I want one row in dfout for each Id that has A=TRUE, showing the max value seen in columns B and C. But if columns B and C are NaN on all of an Id's rows, that Id is to be excluded from dfout.

  • Id 1 has A=TRUE, and has B=abc in its third row, so it meets the requirements.
  • Id 2 has A=TRUE, but columns B and C are NaN for both its rows, so it does not.
  • Id 3 has A=FALSE, so it does not meet requirements.

I created a groupby on Id, then applied a mask to include only rows with A=TRUE. But I'm having trouble understanding how to remove the Ids whose rows are all NaN in columns B and C.

grouped = df.groupby(['Id'])
mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
df.loc[mask].reset_index(drop=True)

  Id     A    B    C
0  1  TRUE  NaN  NaN
1  1  TRUE  NaN  NaN
2  1  TRUE  abc  NaN
3  2  TRUE  NaN  NaN
4  2  TRUE  NaN  NaN

Then I tried several things along the lines of:

df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull

But I get errors like:

TypeError: unhashable type: 'list'

Using Python 3.6 and pandas 0.23.0. I looked here for help: "keep dataframe rows meeting a condition into each group of the same dataframe grouped by".



Solution 1:[1]

The solution has three parts to it.

  1. Filter the dataframe to keep rows where column A is 'TRUE' (the column holds strings, not booleans)

  2. Groupby Id and use first, which returns the first non-null value in each column per group

  3. Use dropna on the resulting dataframe with subset=['B', 'C'] and how='all', which drops Ids where both columns are entirely NaN

    df.loc[df['A'] == 'TRUE'].groupby('Id', as_index=False).first().dropna(subset=['B', 'C'], how='all')

      Id     A    B    C
    0  1  TRUE  abc  NaN
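Run end to end on the sample frame, the three steps chain into one expression (a minimal sketch; note the filter compares against the string 'TRUE', since that is what column A actually holds):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'],
                   'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'],
                   'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],
                   'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})

dfout = (df.loc[df['A'] == 'TRUE']                  # 1. keep rows where A is 'TRUE'
           .groupby('Id', as_index=False)
           .first()                                 # 2. first non-null value per column
           .dropna(subset=['B', 'C'], how='all'))   # 3. drop Ids where B and C are all NaN
print(dfout)
```

Id 3 is removed in step 1, Id 2 in step 3, leaving the single row for Id 1.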
    

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
