'Melting pandas data frame with multiple variable names and multiple value names
How can I melt a pandas data frame using multiple variable names and values? I have the following data frame that changes its shape in a for loop. In one of the for loop iterations, it looks like this:
ID Cat Class_A Class_B Prob_A Prob_B
1 Veg 1 2 0.9 0.1
2 Veg 1 2 0.8 0.2
3 Meat 1 2 0.6 0.4
4 Meat 1 2 0.3 0.7
5 Veg 1 2 0.2 0.8
I need to melt it in such a way that it looks like this:
ID Cat Class Prob
1 Veg 1 0.9
1 Veg 2 0.1
2 Veg 1 0.8
2 Veg 2 0.2
3 Meat 1 0.6
3 Meat 2 0.4
4 Meat 1 0.3
4 Meat 2 0.7
5 Veg 1 0.2
5 Veg 2 0.8
During the for loop the data frame will contain different number of classes with their probabilities. That is why I am looking for a general approach that is applicable in all my for loop iterations. I saw this question and this but they were not helpful!
Solution 1:[1]
You need lreshape
by dict
for specify categories:
d = {'Class':['Class_A', 'Class_B'], 'Prob':['Prob_A','Prob_B']}
df = pd.lreshape(df,d)
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
More dynamic solution:
Class = [col for col in df.columns if col.startswith('Class')]
Prob = [col for col in df.columns if col.startswith('Prob')]
df = pd.lreshape(df, {'Class':Class, 'Prob':Prob})
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
EDIT:
lreshape
is now undocumented, but is possible in future will by removed (with pd.wide_to_long too).
Possible solution is merging all 3 functions to one - maybe melt
, but now it is not implementated. Maybe in some new version of pandas. Then my answer will be updated.
Solution 2:[2]
Or you can try this by using str.contain
and pd.concat
DF1=df2.loc[:,df2.columns.str.contains('_A|Cat|ID')]
name=['ID','Cat','Class','Prob']
DF1.columns=name
DF2=df2.loc[:,df2.columns.str.contains('_B|Cat|ID')]
DF2.columns=name
pd.concat([DF1,DF2],axis=0)
Out[354]:
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
0 1 Veg 2 0.1
1 2 Veg 2 0.2
2 3 Meat 2 0.4
3 4 Meat 2 0.7
4 5 Veg 2 0.8
Solution 3:[3]
The top voted answer uses the undocumented lreshape
which may at some point get deprecated because of its similarity to pd.wide_to_long
which is documented and can use directly here. By default suffix
matches only to numbers. You must change this to match characters (here I just used any character).
pd.wide_to_long(df, stubnames=['Class', 'Prob'], i=['ID', 'Cat'], j='DROPME', suffix='.')\
.reset_index()\
.drop('DROPME', axis=1)
ID Cat Class Prob
0 1 Veg 1 0.9
1 1 Veg 2 0.1
2 2 Veg 1 0.8
3 2 Veg 2 0.2
4 3 Meat 1 0.6
5 3 Meat 2 0.4
6 4 Meat 1 0.3
7 4 Meat 2 0.7
8 5 Veg 1 0.2
9 5 Veg 2 0.8
Solution 4:[4]
You could also use pd.melt
.
# Make DataFrame
df = pd.DataFrame({'ID' : [i for i in range(1,6)],
'Cat' : ['Veg']*2 + ['Meat']*2 + ['Veg'],
'Class_A' : [1]*5,
'Class_B' : [2]*5,
'Prob_A' : [0.9, 0.8, 0.6, 0.3, 0.2],
'Prob_B' : [0.1, 0.2, 0.4, 0.7, 0.8]})
# Make class dataframe and prob dataframe
df_class = df.loc[:, ['ID', 'Cat', 'Class_A', 'Class_B']]
df_prob = df.loc[:, ['ID', 'Cat', 'Prob_A', 'Prob_B']]
# Melt class dataframe and prob dataframe
df_class = df_class.melt(id_vars = ['ID',
'Cat'],
value_vars = ['Class_A',
'Class_B'],
value_name = 'Class')
df_prob = df_prob.melt(id_vars = ['ID',
'Cat'],
value_vars = ['Prob_A',
'Prob_B'],
value_name = 'Prob')
# Clean variable column so only 'A','B' is left in both dataframes
df_class.loc[:, 'variable'] = df_class.loc[:, 'variable'].str.partition('_')[2]
df_prob.loc[:, 'variable'] = df_prob.loc[:, 'variable'].str.partition('_')[2]
# Merge class dataframe with prob dataframe on 'ID', 'Cat', and 'variable';
# drop 'variable'; sort values by 'ID', 'Cat'
final = df_class.merge(df_prob,
how = 'inner',
on = ['ID',
'Cat',
'variable']).drop('variable', axis = 1).sort_values(by = ['ID',
'Cat'])
Solution 5:[5]
One option is pivot_longer from pyjanitor, which abstracts the process, and is efficient:
# pip install janitor
import janitor
df.pivot_longer(
index = ['ID', 'Cat'],
names_to = '.value',
names_pattern = '([a-zA-Z]+)_*')
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
5 1 Veg 2 0.1
6 2 Veg 2 0.2
7 3 Meat 2 0.4
8 4 Meat 2 0.7
9 5 Veg 2 0.8
The idea for this particular reshape is that whatever group in the regular expression is paired with the .value
stays as the column header.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | BENY |
Solution 3 | Ted Petrou |
Solution 4 | Ian Thompson |
Solution 5 | sammywemmy |