Pandas dataframe count values above threshold using groupby - code optimization
I have a large pandas dataframe in which I want to count the number of values above a threshold (zero) in each column, grouped by the values in a name column.
The code below does the job, but I wonder if it is unnecessarily slow. It takes more than 60 seconds on my computer.
import pandas as pd
import numpy as np
import time
# Set up problem (also slow, but irrelevant at this point)
n = 40000
m = 200
n_name = 1000
df = pd.DataFrame(np.random.randint(-10, 10, size=(n, m)))
df['name'] = ''
for i in range(n):
    df.loc[i, 'name'] = 'Name_' + str(np.random.randint(0, n_name))
# Slow code: groupby-apply calls a Python-level lambda once per group.
# Note: depending on the pandas version, each group x may include the
# string 'name' column, and comparing strings with 0 can raise a TypeError.
t0 = time.time()
number_above_zero = df.groupby(by='name').apply(lambda x: x[x > 0].count())
t1 = time.time()
print('Computation time {} seconds.'.format(np.round(t1 - t0, 1)))
Solution 1:[1]
Fast generate data:
n = 40000
m = 200
n_name = 1000
df = pd.DataFrame(np.random.randint(-10, 10, size=(n, m)))
# first generate random values, then add them to df all at once
df['name'] = (np.random.randint(0, n_name, n)).astype(str)
df['name'] = 'Name_' + df['name']
A (not very) fast way to find the positive values per group:
# iloc[:, :-1] drops the trailing 'name' column (groupby.apply passes it
# through in older pandas), so only the numeric columns are compared
df.groupby('name').apply(lambda x: (x.iloc[:, :-1] > 0).sum())
EDIT: Here is a vectorised solution that is about eight times quicker, since it replaces the per-group Python lambda with a single elementwise comparison:
df.set_index('name').gt(0).groupby('name').sum()
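Here, set_index('name') moves the string column into the index, so gt(0) compares only the numeric columns in one vectorised pass, and groupby('name') then groups on the index level and sums the booleans (True counts as 1). As a sketch of an equivalent formulation (my own variant, not from the original answer), the boolean frame can also be grouped by the 'name' column directly, without touching the index:
# Equivalent sketch: group the boolean frame by the external 'name'
# Series; pandas aligns it with the frame on the row index.
df.drop(columns='name').gt(0).groupby(df['name']).sum()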
Timing:
def slow_f(df):
    # one Python-level lambda call per group; iloc[:, :-1] excludes
    # the trailing string 'name' column from the comparison
    return df.groupby('name').apply(lambda x: (x.iloc[:, :-1] > 0).sum())

def fast_f(df):
    # one vectorised elementwise comparison, then a grouped sum
    return df.set_index('name').gt(0).groupby('name').sum()
%%timeit
slow_f(df)
360 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
fast_f(df)
44.1 ms ± 814 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
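As a quick sanity check (my addition, not part of the original answer), pandas' testing helper can confirm that both functions return the same per-name counts; check_dtype=False tolerates integer-dtype differences between a count and a boolean sum:
import pandas as pd

pd.testing.assert_frame_equal(slow_f(df), fast_f(df), check_dtype=False)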
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
--- | ---
Solution 1 | Stack Overflow