Reshape wide to long for many columns with a common prefix
My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
import pandas as pd

df = pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
                   'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
                   'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
                   'p1.g': {0: 6, 1: 24},
                   'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
                   'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
                   'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
                   'p2.g': {0: 18, 1: 72}})
df
   p1.a  p1.b  p1.c  p1.d  p1.e  p1.f  p1.g  p2.a  p2.b  p2.c  p2.d  p2.e  p2.f  p2.g
0     4     1     2     3     4     5     6     0     3     6     9    12    15    18
1     0     4     8    12    16    20    24     0    12    24    36    48    60    72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
   a   b   c   d   e   f   g side
0  4   1   2   3   4   5   6   p1
1  0   4   8  12  16  20  24   p1
0  0   3   6   9  12  15  18   p2
1  0  12  24  36  48  60  72   p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
   foo.p1.a  foo.p1.b  foo.p1.c  foo.p1.d  foo.p1.e  foo.p1.f  foo.p1.g  foo.p2.a  foo.p2.b  foo.p2.c  foo.p2.d  foo.p2.e  foo.p2.f  foo.p2.g
0         4         1         2         3         4         5         6         0         3         6         9        12        15        18
1         0         4         8        12        16        20        24         0        12        24        36        48        60        72
Turning into:
   foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g side
0      4      1      2      3      4      5      6   p1
1      0      4      8     12     16     20     24   p1
0      0      3      6      9     12     15     18   p2
1      0     12     24     36     48     60     72   p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
Solution 1:[1]
A couple of options to do this:
With pd.wide_to_long, you need to reorder the parts of each column name around the delimiter; in this case we move a, b, ... to the front and p1, p2 to the back before reshaping:
temp = df.copy()
# flip each name around the delimiter: 'p1.a' -> 'a.p1'
temp = temp.rename(columns = lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames = ["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i = "index",
                 j = "side")
   .droplevel('index')
   .reset_index()
)
  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p1  0   4   8  12  16  20  24
2   p2  0   3   6   9  12  15  18
3   p2  0  12  24  36  48  60  72
One limitation of pd.wide_to_long is that the column names have to be rearranged first, so that the stub comes before the delimiter. The other limitation is that the stubnames have to be explicitly specified.
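Listing the stubnames by hand gets tedious with many columns. One way around it, sketched below on a smaller frame, is to derive them from the reversed column names. The variable names are illustrative, and the sketch assumes every column follows the side.name pattern.

```python
import pandas as pd

# Smaller stand-in for the frame in the question.
df = pd.DataFrame({"p1.a": [4, 0], "p1.b": [1, 4],
                   "p2.a": [0, 0], "p2.b": [3, 12]})

# Flip "p1.a" into "a.p1" so the stub comes first.
temp = df.rename(columns=lambda col: ".".join(col.split(".")[::-1]))

# Collect the unique stubs instead of typing them out (assumption:
# the part before the first "." is always the stub).
stubnames = sorted({col.split(".")[0] for col in temp.columns})

long_df = (
    pd.wide_to_long(temp.reset_index(), stubnames=stubnames,
                    i="index", j="side", sep=".", suffix=".+")
    .droplevel("index")
    .reset_index()
)
```

This still needs the rename step; it only removes the hand-written list.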
Another option is stack: split the columns on the delimiter into a MultiIndex, then reshape:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p2  0   3   6   9  12  15  18
2   p1  0   4   8  12  16  20  24
3   p2  0  12  24  36  48  60  72
stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
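Given the caveat about non-unique indexes, one defensive variation is to replace the index with a fresh RangeIndex before stacking. This is a sketch on a smaller frame with a deliberately duplicated index; it assumes the original index carries no information you need to keep.

```python
import pandas as pd

df = pd.DataFrame({"p1.a": [4, 0], "p1.b": [1, 4],
                   "p2.a": [0, 0], "p2.b": [3, 12]},
                  index=[0, 0])  # duplicated on purpose

# Discard the duplicated index so every row has a unique label.
temp = df.reset_index(drop=True)
temp.columns = temp.columns.str.split(".", expand=True)

# Move the first name part (p1/p2) into the index, then into a column.
long_df = temp.stack(0).droplevel(0).rename_axis("side").reset_index()
```

If the original index does matter, it could be kept as an ordinary column with reset_index() instead of being dropped, at the cost of carrying it through the column split.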
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index = None,
                names_to = ("side", ".value"),
                names_sep=".")
  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p1  0   4   8  12  16  20  24
2   p2  0   3   6   9  12  15  18
3   p2  0  12  24  36  48  60  72
The worker here is .value. It tells the code that anything after the . should remain as column names, while anything before the . should be collated into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated; it abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
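To make the pd.melt remark concrete, here is a plain-pandas sketch of the same idea: melt everything, split the old column names on the delimiter, and pivot the name part back into columns. It illustrates the concept only and is not pyjanitor's actual implementation; the helper column names row, side, and name are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"p1.a": [4, 0], "p1.b": [1, 4],
                   "p2.a": [0, 0], "p2.b": [3, 12]})

# A positional row id keeps rows apart even if the index has duplicates.
melted = df.assign(row=range(len(df))).melt(id_vars="row")

# Split "p1.a" into side "p1" and name "a".
melted[["side", "name"]] = melted["variable"].str.split(".", n=1, expand=True)

# Spread the name part back out into columns, one row per (row, side).
long_df = (
    melted.pivot(index=["row", "side"], columns="name", values="value")
    .reset_index()
    .drop(columns="row")
    .rename_axis(columns=None)
)
```

Because the rows are identified by position rather than by index label, this route does not depend on the index being unique.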
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
First split the columns and convert them into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p2      0      3      6      9     12     15     18
2   p1      0      4      8     12     16     20     24
3   p2      0     12     24     36     48     60     72
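The stack steps can be folded into one small function that covers both the prefix and the substring case. The name stack_name_part and its parameters are hypothetical, chosen for this sketch; it assumes every column name has the same number of parts and that sep is a single literal character.

```python
import pandas as pd

def stack_name_part(df, level, name="side", sep="."):
    """Move one sep-separated part of the column names into a new column."""
    temp = df.copy()
    temp.columns = temp.columns.str.split(sep, expand=True)
    temp = temp.stack(level).droplevel(0).rename_axis(name)
    # With three or more name parts, the leftover parts still form a
    # MultiIndex and need to be joined back into flat names.
    if isinstance(temp.columns, pd.MultiIndex):
        temp.columns = temp.columns.map(sep.join)
    return temp.reset_index()

df = pd.DataFrame({"foo.p1.a": [4, 0], "foo.p1.b": [1, 4],
                   "foo.p2.a": [0, 0], "foo.p2.b": [3, 12]})

# The side is the second name part, so level=1.
long_df = stack_name_part(df, level=1)
```

For the prefix-only frame from the question, the same function would be called with level=0.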
With pivot_longer, one option is to reorder the parts of each column name before reshaping, so that the side comes last:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72
In the dev version, however, the column reorder is not necessary; we can simply use multiple .value entries to reshape the dataframe. Note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_pattern = r"(.+)\.(.\d)(.+)")
)
  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72
Another option with names_sep:
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_sep = r'\.(p\d)')
)
  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72
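If installing pyjanitor is not an option, the regex idea carries over to plain pandas: pull the side out of each column name, delete it from the name, and stack on the resulting MultiIndex. This is a sketch on a smaller frame, assuming the side always appears as .p followed by a single digit, as in the question; the variable names are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({"foo.p1.a": [4, 0], "foo.p1.b": [1, 4],
                   "foo.p2.a": [0, 0], "foo.p2.b": [3, 12]})

# Assumption: every column contains exactly one ".p<digit>" marker.
side_pattern = re.compile(r"\.(p\d)")

# "foo.p1.a" -> side "p1", remaining name "foo.a"
sides = [side_pattern.search(col).group(1) for col in df.columns]
names = [side_pattern.sub("", col, count=1) for col in df.columns]

temp = df.copy()
temp.columns = pd.MultiIndex.from_arrays([sides, names], names=["side", None])

# Move the side level into the index, then out into a regular column.
long_df = temp.stack("side").droplevel(0).reset_index()
```

A prefix such as p1.a has no leading dot, so the pattern would need adjusting (for example to match at the start of the name) before this works on the first frame.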
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |