'pandas | list in column to binary column
I have the following dataframe:
+------------+------------------+
| item | categories |
+------------+------------------+
| blue_shirt | ['red', 'white'] |
+------------+------------------+
| red_skirt | ['blue', 'red'] |
+------------+------------------+
and I want to get this instead:
+------------+-----+-------+------+
| item | red | white | blue |
+------------+-----+-------+------+
| blue_shirt | 1 | 1 | 0 |
+------------+-----+-------+------+
| red_skirt | 1 | 0 | 1 |
+------------+-----+-------+------+
here is what I tried:
orders = orders.join(pd.get_dummies(orders['Categories'].explode()))
it creates the right columns however it creates (a lot) of additional rows too. I want one row in the end for each item like in the example above.
Solution 1:[1]
You can explode the categories and the pivot the dataframe:
print(
df.explode("categories")
.pivot_table(
index="item", columns="categories", aggfunc="size", fill_value=0
)
.reset_index()
)
Prints:
categories item blue red white
0 blue_shirt 0 1 1
1 red_skirt 1 1 0
Solution 2:[2]
You can also solve this problem with a one-liner using .str method of pandas
df['categories'].str.join('|').str.get_dummies()
The format within each cell of 'categories' column needs to be a list. If it's a string of something else, you can make it a list with an .apply function. For example, if content of 'categories' column is a list saved as a string:
df['categories'].apply(lambda x: eval(x)).str.join('|').str.get_dummies()
Solution 3:[3]
I was able to get to your desired outcome with the below approach.
Step 1: creating a list of columns that will be used in your resulting dataset.
>>> cols = list(set(df['categories'].explode())) #set makes sure we keep unique columns
>>> cols.insert(0,df.columns[0])
cols
Out[359]: ['item', 'red', 'white', 'blue']
OPTIONAL: You can make sure explode()
works by converting your column to list type: taken from here
from ast import literal_eval
df['categories'] = df['categories'].apply(literal_eval) # convert to list type
Step 2: use your code with pd.get_dummies()
, and create the 1/0 dataset. To avoid getting more rows, I added an additional step using groupby.index
:
>>> temp = pd.get_dummies(df['categories'].explode())
>>> temp_res = temp.groupby(temp.index).sum()
Out[365]:
blue red white
0 0 1 1
1 1 1 0
Step 3:, I concat
your 'item' column, with the above result, and used the column list created in the first step to get your desired result:
>>> out = pd.concat([df['item'],temp_res],axis=1,ignore_index=False)
>>> out = out[cols]
Out[368]:
item red white blue
0 blue_shirt 1 1 0
1 red_skirt 1 0 1
All the code in 1 block:
from ast import literal_eval
df['categories'] = df['categories'].apply(literal_eval) #convert to list type
cols = list(set(df['categories'].explode()))
cols.insert(0,df.columns[0])
temp = pd.get_dummies(df['categories'].explode())
temp_res = temp.groupby(temp.index).sum()
out = pd.concat([df['item'],temp_res],axis=1)
out = out[cols]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Andrej Kesely |
Solution 2 | user3486942 |
Solution 3 |