Categorical column after melt in pandas
Is it possible to end up with a categorical variable column after a melt
operation in pandas?
If I set up the data like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.random.randn(3, 5),
    columns=["A", "B", "C", "D", "E"]
)
df["id"] = range(1, 4)
df
| | A | B | C | D | E | id |
|----|-----------|-----------|-----------|-----------|------------|------|
| 0 | -0.406174 | -0.686917 | -0.172913 | -0.273074 | -0.0246714 | 1 |
| 1 | 0.323783 | -1.7731 | 1.57581 | -1.15671 | -1.23926 | 2 |
| 2 | -1.1426 | -0.591279 | 1.15265 | 0.326712 | -0.86374 | 3 |
and then apply
melted_df = df.melt(id_vars="id", value_vars=["A", "B", "C", "D", "E"])
melted_df
| | id | variable | value |
|----|------|------------|------------|
| 0 | 1 | A | -0.406174 |
| 1 | 2 | A | 0.323783 |
| 2 | 3 | A | -1.1426 |
| 3 | 1 | B | -0.686917 |
| 4 | 2 | B | -1.7731 |
| 5 | 3 | B | -0.591279 |
| 6 | 1 | C | -0.172913 |
| 7 | 2 | C | 1.57581 |
| 8 | 3 | C | 1.15265 |
| 9 | 1 | D | -0.273074 |
| 10 | 2 | D | -1.15671 |
| 11 | 3 | D | 0.326712 |
| 12 | 1 | E | -0.0246714 |
| 13 | 2 | E | -1.23926 |
| 14 | 3 | E | -0.86374 |
The dtype of the variable column is object:
melted_df.dtypes
id int64
variable object
value float64
dtype: object
I'd like this to be category. I know I can convert it easily with:
melted_df["variable"].astype("category")
But for large datasets I'd like to avoid this overhead. I didn't find such an option in the documentation, but since the resulting column contains categorical data by definition, I presume there must be a possibility.
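(A side note, not a melt option: if the astype call itself is the concern, one sketch of a cheaper conversion is to build the categorical with the categories taken from the original column names, so pandas does not have to re-discover them from the long data. How much this actually saves depends on the data size.)
# Sketch only: "df" and "melted_df" are the frames from above.
categories = df.columns.drop("id")  # the original wide columns A..E
melted_df["variable"] = pd.Categorical(melted_df["variable"], categories=categories)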
Solution 1:[1]
I don't think it's possible with melt, because when it creates that column it infers the dtype, and 'category' is not a dtype that pandas currently infers. (Here's a related issue where melt doesn't correctly infer Int32 dtypes: Why is pandas.melt messing with my dtypes?)
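As a rough illustration (mine, not from the original answer): the variable column is built by repeating the plain array of column labels, and pandas' dtype inference on such an array gives object, never category.
import numpy as np
import pandas as pd
# Repeat the column labels the way melt does for a 3-row frame, then let pandas infer.
labels = np.asarray(["A", "B", "C", "D", "E"]).repeat(3)
print(pd.Series(labels).dtype)  # object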
stack will keep the categorical dtype if you first convert the columns. stack results in a slightly different row ordering than melt, but the data will be the same (an optional reordering step is shown after the output below). stack is also a bit clunkier when it comes to naming the resulting columns.
df = df.set_index('id')
# Make the column labels categorical so stack carries the dtype through.
df.columns = df.columns.astype('category')
res = (df.stack()                          # Series with an (id, column-label) MultiIndex
       .rename_axis(['id', 'variable'])    # name the index levels
       .rename('value')                    # name the values
       .reset_index())
# id variable value
#0 1 A 0.424781
#1 1 B -0.317107
#2 1 C 0.731121
#3 1 D 0.042642
#4 1 E 0.648352
#...
#13 3 D -0.889600
#14 3 E -1.822898
res.dtypes
#id int64
#variable category
#value float64
#dtype: object
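If the ordering difference matters, an optional extra step (my addition, not part of the answer) restores melt's variable-major order; sort_values keeps the categorical dtype.
# All the A rows first, then B, and so on, matching melt's output order.
res = res.sort_values(['variable', 'id'], ignore_index=True)
res.dtypes  # variable is still category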
Solution 2:[2]
One efficient option is pivot_longer from pyjanitor, using the names_transform parameter:
# pip install pyjanitor
import pandas as pd
import numpy as np  # needed for the random data below
import janitor
np.random.seed(456)
df = pd.DataFrame(
    np.random.randn(3, 5),
    columns=["A", "B", "C", "D", "E"]
)
df["id"] = range(1, 4)
df
A B C D E id
0 -0.668129 -0.498210 0.618576 0.568692 1.350509 1
1 1.629589 0.301966 0.449483 -0.345811 -0.315231 2
2 -2.015971 -1.130231 -1.111846 0.237851 -0.325130 3
result = df.pivot_longer(index = 'id', names_transform = 'category')
result
id variable value
0 1 A -0.668129
1 2 A 1.629589
2 3 A -2.015971
3 1 B -0.498210
4 2 B 0.301966
5 3 B -1.130231
6 1 C 0.618576
7 2 C 0.449483
8 3 C -1.111846
9 1 D 0.568692
10 2 D -0.345811
11 3 D 0.237851
12 1 E 1.350509
13 2 E -0.315231
14 3 E -0.325130
result.dtypes
id int64
variable category
value float64
dtype: object
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |