Pandas dataframe count values above threshold using groupby - code optimization

I have a large pandas dataframe where I want to count the number of values above a threshold (zero) in each column, grouped by the values in a 'name' column.

The code below does the job, but I wonder if it is unnecessarily slow. It takes more than 60 seconds on my computer.

import pandas as pd
import numpy as np
import time

# Set up problem (also slow, but irrelevant at this point)
n = 40000
m = 200
n_name = 1000
df = pd.DataFrame(np.random.randint(-10, 10, size=(n, m)))
df['name'] = ''
for i in range(n):
    df.loc[i, 'name'] = 'Name_' + str(np.random.randint(0, n_name))


# Slow code
t0 = time.time()
number_above_zero = df.groupby(by='name').apply(lambda x: x[x > 0].count())
t1 = time.time()
print('Computation time {} seconds.'.format(np.round(t1 - t0, 1)))


Solution 1:

Fast generate data:

n = 40000
m = 200
n_name = 1000
df = pd.DataFrame(np.random.randint(-10, 10, size=(n, m)))
# first generate random values, then add them to df all at once
df['name'] = (np.random.randint(0, n_name, n)).astype(str)
df['name'] = 'Name_' + df['name']

A (not very) fast way to count positive values by group:

df.groupby('name').apply(lambda x: (x.iloc[:, :-1] > 0).sum())

EDIT: Here is a vectorised solution that is about eight times quicker:

df.set_index('name').gt(0).groupby('name').sum()
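As a sanity check that the one-liner counts the same thing as the per-group apply, here is a minimal sketch on a small frame (the sizes and the `Name_` labels are illustrative, not from the original post):

```python
import numpy as np
import pandas as pd

# Small illustrative frame: 4 numeric columns plus a 'name' column
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(-10, 10, size=(100, 4)))
small['name'] = 'Name_' + pd.Series(rng.integers(0, 5, size=100)).astype(str)

# Vectorised: build the boolean mask once, then do one grouped sum
fast = small.set_index('name').gt(0).groupby('name').sum()

# Manual cross-check for a single group
name0 = small['name'].iloc[0]
group = small.loc[small['name'] == name0].drop(columns='name')
assert (fast.loc[name0] == (group > 0).sum()).all()
```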

Timing:

def slow_f(df):
    # Drop the string 'name' column before comparing, so the
    # element-wise `> 0` only touches numeric columns
    return df.groupby('name').apply(lambda x: (x.drop(columns='name', errors='ignore') > 0).sum())

def fast_f(df):
    return df.set_index('name').gt(0).groupby('name').sum()

%%timeit 
slow_f(df) 
360 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
fast_f(df)   
44.1 ms ± 814 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
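Outside IPython, where `%%timeit` is unavailable, the comparison can be reproduced with a plain timer. A self-contained sketch (dropping the string 'name' column inside `slow_f` is a defensive tweak so that the element-wise `> 0` never touches strings on recent pandas versions):

```python
import time

import numpy as np
import pandas as pd

# Regenerate the data with the same shape as the question
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(-10, 10, size=(40000, 200)))
df['name'] = 'Name_' + pd.Series(rng.integers(0, 1000, size=40000)).astype(str)

def slow_f(df):
    # Per-group apply: one Python-level call per group
    return df.groupby('name').apply(lambda x: (x.drop(columns='name', errors='ignore') > 0).sum())

def fast_f(df):
    # Vectorised: one boolean mask, then a single grouped sum
    return df.set_index('name').gt(0).groupby('name').sum()

t0 = time.perf_counter()
slow = slow_f(df)
t1 = time.perf_counter()
fast = fast_f(df)
t2 = time.perf_counter()

print(f'slow_f: {t1 - t0:.3f}s, fast_f: {t2 - t1:.3f}s')

# Both approaches count the same values
assert (slow.sort_index().to_numpy() == fast.sort_index().to_numpy()).all()
```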

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
