'Select two sets of columns by column names in Pandas
Take the DataFrame
in the answer of Loc vs. iloc vs. ix vs. at vs. iat? for example.
df = pd.DataFrame(
{'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']
)
Now I want all columns except 'food' and 'height'.
I thought something like df.loc[:,['age':'color', 'score':'state']]
would work, but Python returns SyntaxError: invalid syntax
.
I am aware of that there is one way to work around: df.drop(columns = ['food', 'height'])
. However, in my real life situation, I have hundreds of columns to be dropped. Typing out all column names is so inefficient.
I am expecting something similar with dplyr::select(df, -(food:height))
or dplyr::select(df, age:color, score:state)
in R language.
Also have read Selecting/Excluding sets of columns in Pandas.
Solution 1:[1]
First, find all columns lying between food
and height
(inclusive).
c = df.iloc[-1:0].loc[:, 'food':'height'].columns
Next, filter with difference
/isin
/setdiff1d
-
df[df.columns.difference(c)]
Or,
df.loc[:, ~df.columns.isin(c)]
Or,
df[np.setdiff1d(df.columns, c)]
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
Solution 2:[2]
First get positions of columns names by Index.get_loc
and then use numpy.r_
for join all slicers together:
a = np.r_[df.columns.get_loc('age'):df.columns.get_loc('color')+1,
df.columns.get_loc('score'):df.columns.get_loc('state')+1]
df = df.iloc[:, a]
print (df)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
Solution 3:[3]
One option for flexible column selection is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.select_columns(slice('age', 'color'), slice('score', 'state'))
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
df.select_columns(slice('food', 'height'), invert = True)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | sammywemmy |