'Wide to long data transform in pandas
I have a dataset in the following format:
county area pop_2006 pop_2007 pop_2008
01001 275 1037 1052 1102
01003 394 2399 2424 2438
01005 312 1638 1647 1660
And I need it in a format like this:
county year pop area
01001 2006 1037 275
01001 2007 1052 275
01001 2008 1102 275
01003 2006 2399 394
01003 2007 2424 394
...
I've tried every combination of pivot_table, stack, unstack, wide_to_long that I can think of, with no success yet. (clearly I'm mostly illiterate in Python/pandas, so please be gentle...).
Solution 1:[1]
You can use melt
for reshaping, then split
column variable
and drop
and sort_values
. I think you can cast column year
to int
by astype
and last change order of columns by subset
:
df1 = (pd.melt(df, id_vars=['county','area'], value_name='pop'))
df1[['tmp','year']] = df1.variable.str.split('_', expand=True)
df1 = df1.drop(['variable', 'tmp'],axis=1).sort_values(['county','year'])
df1['year'] = df1.year.astype(int)
df1 = df1[['county','year','pop','area']]
print (df1)
county year pop area
0 1001 2006 1037 275
3 1001 2007 1052 275
6 1001 2008 1102 275
1 1003 2006 2399 394
4 1003 2007 2424 394
7 1003 2008 2438 394
2 1005 2006 1638 312
5 1005 2007 1647 312
8 1005 2008 1660 312
print (df1.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
Another solution with set_index
, stack
and reset_index
:
df2 = df.set_index(['county','area']).stack().reset_index(name='pop')
df2[['tmp','year']] = df2.level_2.str.split('_', expand=True)
df2 = df2.drop(['level_2', 'tmp'],axis=1)
df2['year'] = df2.year.astype(int)
df2 = df2[['county','year','pop','area']]
print (df2)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
print (df2.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
Solution 2:[2]
As the question title suggests, we can use pd.wide_to_long
:
res = pd.wide_to_long(df, stubnames="pop", i=["county", "area"], j="year", sep="_")
to get
pop
county area year
1001 275 2006 1037
2007 1052
2008 1102
1003 394 2006 2399
2007 2424
2008 2438
1005 312 2006 1638
2007 1647
2008 1660
To exactly match the output format in the question, a reset_index
and reindex
(over columns) can be chained:
>>> res.reset_index().reindex(["county", "year", "pop", "area"], axis=1)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
Solution 3:[3]
One option is with pivot_longer from pyjanitor
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index = ['county', 'area'],
names_to = ('.value', 'year'),
names_sep = '_',
sort_by_appearance=True)
)
county area year pop
0 1001 275 2006 1037
1 1001 275 2007 1052
2 1001 275 2008 1102
3 1003 394 2006 2399
4 1003 394 2007 2424
5 1003 394 2008 2438
6 1005 312 2006 1638
7 1005 312 2007 1647
8 1005 312 2008 1660
For this particular reshape, any part of the columns associated with .value
remain as column headers, while the rest are transposed into columns. you can also change the dtype of the transposed columns (this can be efficient, especially for large data sizes):
(df
.pivot_longer(
index = ['county', 'area'],
names_to = ('.value', 'year'),
names_sep = '_',
names_transform = {'year':int},
sort_by_appearance=True)
)
county area year pop
0 1001 275 2006 1037
1 1001 275 2007 1052
2 1001 275 2008 1102
3 1003 394 2006 2399
4 1003 394 2007 2424
5 1003 394 2008 2438
6 1005 312 2006 1638
7 1005 312 2007 1647
8 1005 312 2008 1660
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Mustafa Ayd?n |
Solution 3 |