'reshaping the dataset in python
I have this dataset:
Account | lookup | FY11USD | FY12USD | FY11local | FY12local |
---|---|---|---|---|---|
Sales | CA | 1000 | 5000 | 800 | 4800 |
Sales | JP | 5000 | 6500 | 10 | 15 |
Trying to arrive to get the data in this format: (below example has 2 years of data but no. of years can vary)
Account | lookup | Year | USD | Local |
---|---|---|---|---|
Sales | CA | FY11 | 1000 | 800 |
Sales | CA | FY12 | 5000 | 4800 |
Sales | JP | FY11 | 5000 | 10 |
Sales | JP | FY12 | 6500 | 15 |
I tried using the below script, but it doesn't segregate USD and local for the same year. How should I go about that?
df.melt(id_vars=["Account", "lookup"],
var_name="Year",
value_name="Value")
Solution 1:[1]
One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value
placeholder ---> the .value
determines which parts of the columns remain as headers:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
index = ['Account', 'lookup'],
names_to = ('Year', '.value'),
names_pattern = r"(FY\d+)(.+)")
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales JP FY11 5000 10
2 Sales CA FY12 5000 4800
3 Sales JP FY12 6500 15
Another option is to use stack:
temp = df.set_index(['Account', 'lookup'])
temp.columns = temp.columns.str.split('(FY\d+)', expand = True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales CA FY12 5000 4800
2 Sales JP FY11 5000 10
3 Sales JP FY12 6500 15
You can also pull it off with pd.wide_to_long
after reshaping the columns:
index = ['Account', 'lookup']
temp = df.set_index(index)
temp.columns = (temp
.columns
.str.split('(FY\d+)')
.str[::-1]
.str.join('')
)
(pd.wide_to_long(
temp.reset_index(),
stubnames = ['USD', 'local'],
i = index,
j = 'Year',
suffix = '.+')
.reset_index()
)
Account lookup Year USD local
0 Sales CA FY11 1000 800
1 Sales CA FY12 5000 4800
2 Sales JP FY11 5000 10
3 Sales JP FY12 6500 15
Solution 2:[2]
You can piece it together like so:
dfn = (pd.concat(
[df[["Account", "lookup", 'FY11USD','FY12USD']].melt(id_vars=["Account", "lookup"], var_name="Year", value_name="USD"),
df[["Account", "lookup", 'FY11local','FY12local']].melt(id_vars=["Account", "lookup"], var_name="Year", value_name="Local")[['Local']]], axis=1 ))
dfn['Year'] = dfn['Year'].str[:4]
Output
Account lookup Year USD Local
0 Sales CA FY11 1000 800
1 Sales JP FY11 5000 10
2 Sales CA FY12 5000 4800
3 Sales JP FY12 6500 15
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Jonathan Leon |