'Python Pandas - Lookup a variable column depending on another column's value
I'm trying to use the value of one cell to find the value of a cell in another column. The first cell value ('source') dictates which column to lookup.
import pandas as pd
df = pd.DataFrame({'A': ['John', 'Andrew', 'Bob', 'Fred'], 'B': [
'Fred', 'Simon', 'Andrew', 'Andrew'], 'source': ['A', 'B', 'A', 'B']}, )
print(df)
A B source
0 John Fred A
1 Andrew Simon B
2 Bob Andrew A
3 Fred Andrew B
My required output value in the 'output' column is a lookup of the 'source':
A B source output
0 John Fred A John
1 Andrew Simon B Simon
2 Bob Andrew A Bob
3 Fred Andrew B Andrew
Failed attempts
df['output'] = df[df['source']]
This results in a ValueError: Wrong number of items passed 4, placement implies 1 because the df['source']
passes in a Series, not a string. I tried converting to a string using:
df['output'] = df[df['source'].convertDTypes(convert_string=True)]
which gave error AttributeError: 'Series' object has no attribute 'convertDTypes'.
Working solution
I found a solution might by iterating through the rows using:
for index, row in df.iterrows():
column = df.loc[index, 'source']
df.at[index, 'output'] = df.loc[index, column]
However, this post suggests iterating is a bad idea. The code doesn't seem very elegant, either.
I feel I've missed something basic here; this really should not be that hard.
Solution 1:[1]
Let us do numpy
way since lookup
will not longer work in the future version
df['new'] = df.values[df.index,df.columns.get_indexer(df.source)]
df
Out[339]:
A B source new
0 John Fred A John
1 Andrew Simon B Simon
2 Bob Andrew A Bob
3 Fred Andrew B Andrew
Solution 2:[2]
Use numpy.where
df['output'] = np.where(df.source == 'A', df.A, df.B)
If you have more columns, use numpy.select
conditions = [df.source == 'A', df.source == 'B']
values = [df.A, df.B]
df['output'] = np.select(conditions, values)
Solution 3:[3]
Stack and then loc with multiindex for recent versions:
df['output'] = df.stack().loc[zip(df.index,df['source'])].droplevel(-1)
or:
df['output'] = (df.stack().loc[pd.MultiIndex.from_arrays((df.index,df['source']))]
.droplevel(1))
For earlier versions of pandas:
df['output'] = df.lookup(df.index,df['source'])
A B source output
0 John Fred A John
1 Andrew Simon B Simon
2 Bob Andrew A Bob
3 Fred Andrew B Andrew
Solution 4:[4]
Try this:
df['output'] = df.apply(lambda x: x[x.source], axis=1)
Output:
A B source output
0 John Fred A John
1 Andrew Simon B Simon
2 Bob Andrew A Bob
3 Fred Andrew B Andrew
Solution 5:[5]
You can also do this simply by enumerate()
,list comprehension and loc[]
accessor
df['output']=[df.loc[x,y] for x,y in enumerate(df['source'])]
Now If you print df
you will get your desired output:
A B source output
0 John Fred A John
1 Andrew Simon B Simon
2 Bob Andrew A Bob
3 Fred Andrew B Andrew
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | BENY |
Solution 2 | |
Solution 3 | |
Solution 4 | ashkangh |
Solution 5 | Anurag Dabas |