'Most efficient way to search over a DataFrame in Python [duplicate]
I have a DataFrame having these kind of data :
df = pd.DataFrame({
'id' : ['a', 'a', 'b', 'b', 'c', 'c'],
'alias' : ['value'+str(i) for i in range(6)],
'source' : ['src1', 'src2', 'src1', 'src2', 'src1', 'src3']
})
print(df)
output :
id alias source
0 a value0 src1
1 a value1 src2
2 b value2 src1
3 b value3 src2
4 c value4 src1
5 c value5 src3
And i want to change the structure of the dataframe to get somthing like this :
src1 src2 src3
a value0 value1 None
b value2 value3 None
c value4 None value5
So to process this task i made this function :
def process_aliases(df):
sources = set(df['source'])
indexes = list(set(df['id']))
cols = {source:[] for source in sources}
for index in indexes:
subdf = df[df['id']==index]
for source in sources:
alias = subdf[df['source'] == source]['alias']
cols[source].append(alias.iloc[0].strip() if len(alias) == 1 else None)
return pd.DataFrame(cols, index=indexes)
My problem is that my dataframe contains more than 1 000 000 lines. So this function take too much time to be processed. More than 1 hour of execution time using Google Colab.
I don't know if my function is the best way to do this. So, i'm asking for help.
What is the best way to accomplish this task ? Or, what is the fastest way to search for elements over a DataFrame ?
Solution 1:[1]
I believe you want to pivot
:
import pandas as pd
df = pd.DataFrame({
'id' : ['a', 'a', 'b', 'b', 'c', 'c'],
'alias' : ['value'+str(i) for i in range(6)],
'source' : ['src1', 'src2', 'src1', 'src2', 'src1', 'src3']
})
print(df.pivot(index='id', columns='source'))
alias
source src1 src2 src3
id
a value0 value1 NaN
b value2 value3 NaN
c value4 NaN value5
If you are strictly requiring None
instead of NaN
for some reason, you can do:
import numpy as np
print(df.pivot(index='id', columns='source').replace([np.nan], [None]))
alias
source src1 src2 src3
id
a value0 value1 None
b value2 value3 None
c value4 None value5
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | user2246849 |