How do you compare columns 'a' and 'b' to return 'c' or 'd'?
I am trying to compare two columns and then return a third value from one of two adjacent columns. I have read that using iterrows is not the right way to do this, so I tried writing my own function. The trouble is figuring out the correct syntax to apply it to the df.
import pandas as pd
d = {'a':[1,2,3], 'b':[4,1,6], 'c':[6,7,8], 'd':[8,9,0]}
df = pd.DataFrame(d)
print(df)
def area_name_final(ms1, ms2, an1, an2):
    if ms1 >= ms2:
        return an1
    else:
        return an2
df['e'] = df.apply(area_name_final(df.a, df.b, df.c, df.d), axis=1)
Error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Desired Output:
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
Solution 1:[1]
You can try np.where
import numpy as np
df['e'] = np.where(df['a'] >= df['b'], df['c'], df['d'])
print(df)
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
To fix your code, pass the function itself to apply and have it take a row rather than whole columns:
def area_name_final(row):
    if row['a'] >= row['b']:
        return row['c']
    else:
        return row['d']
df['e'] = df.apply(area_name_final, axis=1)
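For context on the ValueError in the question, here is a minimal sketch of what goes wrong. It assumes the sample frame and the four-argument function from the question. The call `area_name_final(df.a, df.b, df.c, df.d)` runs once, before apply ever sees it, and the `if` inside receives a whole boolean Series instead of a single True/False:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 1, 6], 'c': [6, 7, 8], 'd': [8, 9, 0]})

def area_name_final(ms1, ms2, an1, an2):
    if ms1 >= ms2:
        return an1
    else:
        return an2

# Comparing two columns gives one boolean per row, not a single bool.
mask = df.a >= df.b
print(mask.tolist())  # [False, True, False]

# Passing whole columns makes the `if` test a Series, which pandas refuses.
try:
    area_name_final(df.a, df.b, df.c, df.d)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

With `axis=1` and a function that takes a row, each comparison is between two scalars, so the `if` works as intended.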
Solution 2:[2]
You can use a simple where condition, which will be much more efficient (vectorized) than your custom function:
df['e'] = df['c'].where(df['a'].ge(df['b']), df['d'])
output:
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
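As a quick sanity check, the sketch below assumes the same sample frame as the question and compares the pandas `where` form with the equivalent `np.where` call. Note the argument order: `Series.where` keeps the caller's values where the condition is True and takes the replacement elsewhere.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 1, 6], 'c': [6, 7, 8], 'd': [8, 9, 0]})

# np.where(condition, value_if_true, value_if_false) returns a NumPy array.
via_numpy = np.where(df['a'] >= df['b'], df['c'], df['d'])

# Series.where keeps df['c'] where the condition holds, else uses df['d'].
via_pandas = df['c'].where(df['a'].ge(df['b']), df['d'])

print(via_numpy.tolist())   # [8, 7, 0]
print(via_pandas.tolist())  # [8, 7, 0]
```

Both give the desired column 'e'; the pandas version returns a Series that keeps the original index.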
Solution 3:[3]
Using np.where is definitely a good option. There is another way to do it without calling the numpy library:
ddf = pd.MultiIndex.from_frame(df)
result = [i[2] if i[0] >= i[1] else i[3] for i in ddf]
df['e'] = result
df
Out[9]:
a b c d e
0 1 4 6 8 8
1 2 1 7 9 7
2 3 6 8 0 0
pd.MultiIndex.from_frame turns each row of the dataframe into a tuple, so you can then compare the components of each tuple directly.
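To make the row-tuple idea concrete, here is a small sketch on the question's sample frame. It also shows `DataFrame.itertuples`, which is my own suggestion rather than part of the answer above; it yields named tuples, so the comparison can use column names instead of positions.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 1, 6], 'c': [6, 7, 8], 'd': [8, 9, 0]})

# Iterating the MultiIndex yields one tuple per row, in column order (a, b, c, d).
rows = list(pd.MultiIndex.from_frame(df))

# Positional access, as in the answer above.
by_position = [t[2] if t[0] >= t[1] else t[3] for t in rows]

# Alternative (not from the answer): named access through itertuples.
by_name = [r.c if r.a >= r.b else r.d for r in df.itertuples(index=False)]

print(by_position)
print(by_name)
```

Both lists hold the values 8, 7, 0, matching the desired column 'e'.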
Extra
However, of course, np.where will give you the result faster.
def solution_1(df):
    df['e'] = np.where(df['a'] >= df['b'], df['c'], df['d'])
def solution_2(df):
    ddf = pd.MultiIndex.from_frame(df)
    df['e'] = [i[2] if i[0] >= i[1] else i[3] for i in ddf]
%timeit solution_1(df)
268 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit solution_2(df)
1.6 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you create the pd.MultiIndex once as a global, on the other hand, the solution will be faster, though this timing leaves out the cost of building the index.
def solution_3(df):
    df['e'] = [i[2] if i[0] >= i[1] else i[3] for i in ddf]
%timeit solution_3(df)
60.5 µs ± 722 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
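The timings above come from a three-row frame, where fixed overhead dominates. A rough sketch for repeating the comparison on a larger frame is below. The 10,000-row random frame and the function names are my own assumptions, and the numbers will vary by machine and pandas version, so no results are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger test frame: 10,000 rows of random integers.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(0, 100, size=(10_000, 4)), columns=list('abcd'))

def with_np_where(frame):
    # Vectorized: one comparison over whole columns.
    return np.where(frame['a'] >= frame['b'], frame['c'], frame['d'])

def with_multiindex(frame):
    # Builds the row tuples inside the timed call, then loops in Python.
    tuples = pd.MultiIndex.from_frame(frame)
    return [t[2] if t[0] >= t[1] else t[3] for t in tuples]

for fn in (with_np_where, with_multiindex):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f"{fn.__name__}: {seconds / 5:.6f} s per call")
```

Building the MultiIndex inside the timed function keeps the comparison like for like, since the index has to be built at least once in real use.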
Solution 4:[4]
Another option, probably the closest to what you already had, would be:
df['e'] = df.apply(lambda x: area_name_final(x.a, x.b, x.c, x.d), axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Ynjxsjmh |
Solution 2 | mozway |
Solution 3 | |
Solution 4 | BeRT2me |