'How to select multiple columns by name that are not adjacent and combining several slicing methods?
Is there a way to select multiple columns that are not adjacent and combining several methods?
test dataframe:
test = pd.DataFrame(np.random.rand(3, 9),
columns=['ID', 'rfm_snittbeløp_gaver', 'rfm_maksbeløp_gaver', 'rfm_antall_kampanjer',
'a','b','c','d','e'])
Let's say I want the columns: ID, all columns that start with rfm, a:c, and e. In this order.
I thought something along these lines would do it, but I was not able to make it work
frames = [test.loc[:, 'ID'],
test.loc[:, test.columns.str.startswith('rfm')],
test.loc[:, 'a':'c'],
test.iloc[:, -1]]
test_sub = pd.concat(frames)
I read it resets the index and that I won't be able to control the order of the columns.
Preferably there would be something like np.r_
for .loc
to combine slices as found in this post
But I do not like using index position when I refer to columns
Any help is much appreciated
Solution 1:[1]
here is a way using np.r_
and get_loc()
and get_indexer()
:
ID= test.columns.get_loc('ID')
rfm=test.columns.get_indexer(test.columns[test.columns.str.startswith('rfm')])
a=test.columns.get_loc('a')
c=test.columns.get_loc('c')
e=test.columns.get_loc('e')
test.iloc[:,np.r_[ID,rfm,a:c+1,e]]
ID rfm_snittbeløp_gaver rfm_maksbeløp_gaver rfm_antall_kampanjer \
0 0.822275 0.155649 0.189058 0.050138
1 0.188038 0.286731 0.509774 0.171374
2 0.626211 0.477937 0.585987 0.358124
a b c e
0 0.652142 0.492184 0.464453 0.361395
1 0.242480 0.963673 0.898177 0.813195
2 0.863088 0.781858 0.924203 0.690219
Solution 2:[2]
You are pretty close, with axis=1
:
frames = [test.loc[:, 'ID'],
test.loc[:, test.columns.str.startswith('rfm')],
test.loc[:, 'a':'c'],
test.iloc[:, -1]]
test_sub = pd.concat(frames, axis=1)
Solution 3:[3]
A more general approach will be to create a class like the following:
class s_:
"""Create slices from given columns resembling numpy s_"""
def __init__(self, cols):
self.indices = pd.Series(cols, index=cols)
def __getitem__(self, item):
if isinstance(item, tuple):
result = []
for indexing in item:
try:
if isinstance(indexing, str): # treat strings as labels
result.append(self.indices[indexing])
else:
result.extend(self.indices[indexing])
except TypeError:
result.append(self.indices[indexing])
return result
else:
return self.indices[item]
You could do then:
columns = ['ID', 'rfm_snittbeløp_gaver', 'rfm_maksbeløp_gaver', 'rfm_antall_kampanjer', 'a', 'b', 'c', 'd', 'e']
test = pd.DataFrame(np.random.rand(3, 9), columns=columns)
print(test.loc[:, s_(test.columns)['ID', test.columns.str.startswith('rfm'), 'a':'c', -1]])
Output
ID rfm_snittbeløp_gaver ... c e
0 0.026803 0.603409 ... 0.819486 0.396006
1 0.791049 0.450502 ... 0.097529 0.708746
2 0.623558 0.513678 ... 0.140740 0.958713
[3 rows x 8 columns]
Note that this also works for single indexing:
print(test.loc[:, s_(test.columns)['ID']])
Output
0 0.129801
1 0.786684
2 0.839015
Name: ID, dtype: float64
Solution 4:[4]
One option for flexible column selection is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
test.select_columns('ID', 'rfm*', slice('a', 'c'), 'e')
ID rfm_snittbeløp_gaver rfm_maksbeløp_gaver rfm_antall_kampanjer a b c e
0 0.260244 0.890904 0.884521 0.600843 0.955812 0.213352 0.628574 0.428940
1 0.859678 0.330782 0.200329 0.774217 0.421452 0.494960 0.932772 0.454032
2 0.483668 0.894447 0.088583 0.806648 0.797400 0.240357 0.762199 0.069686
The *
inrfm*
ida is borrowed from unix.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | anky |
Solution 2 | Quang Hoang |
Solution 3 | Dani Mesejo |
Solution 4 | sammywemmy |