'pandas | list in column to binary column

I have the following dataframe:

+------------+------------------+
| item       | categories       |
+------------+------------------+
| blue_shirt | ['red', 'white'] |
+------------+------------------+
| red_skirt  | ['blue', 'red']  |
+------------+------------------+

and I want to get this instead:

+------------+-----+-------+------+
| item       | red | white | blue |
+------------+-----+-------+------+
| blue_shirt | 1   | 1     | 0    |
+------------+-----+-------+------+
| red_skirt  | 1   | 0     | 1    |
+------------+-----+-------+------+

here is what I tried:

orders = orders.join(pd.get_dummies(orders['Categories'].explode()))

it creates the right columns however it creates (a lot) of additional rows too. I want one row in the end for each item like in the example above.



Solution 1:[1]

You can explode the categories and the pivot the dataframe:

print(
    df.explode("categories")
    .pivot_table(
        index="item", columns="categories", aggfunc="size", fill_value=0
    )
    .reset_index()
)

Prints:

categories        item  blue  red  white
0           blue_shirt     0    1      1
1            red_skirt     1    1      0

Solution 2:[2]

You can also solve this problem with a one-liner using .str method of pandas

df['categories'].str.join('|').str.get_dummies()

The format within each cell of 'categories' column needs to be a list. If it's a string of something else, you can make it a list with an .apply function. For example, if content of 'categories' column is a list saved as a string:

df['categories'].apply(lambda x: eval(x)).str.join('|').str.get_dummies()

Solution 3:[3]

I was able to get to your desired outcome with the below approach.

Step 1: creating a list of columns that will be used in your resulting dataset.

>>> cols = list(set(df['categories'].explode())) #set makes sure we keep unique columns
>>> cols.insert(0,df.columns[0]) 

cols
Out[359]: ['item', 'red', 'white', 'blue']

OPTIONAL: You can make sure explode() works by converting your column to list type: taken from here

from ast import literal_eval
df['categories'] = df['categories'].apply(literal_eval) # convert to list type

Step 2: use your code with pd.get_dummies(), and create the 1/0 dataset. To avoid getting more rows, I added an additional step using groupby.index:

>>> temp = pd.get_dummies(df['categories'].explode())
>>> temp_res = temp.groupby(temp.index).sum()

Out[365]: 
   blue  red  white
0     0    1      1
1     1    1      0

Step 3:, I concat your 'item' column, with the above result, and used the column list created in the first step to get your desired result:

>>> out = pd.concat([df['item'],temp_res],axis=1,ignore_index=False)
>>> out = out[cols]

Out[368]: 
         item  red  white  blue
0  blue_shirt    1      1     0
1   red_skirt    1      0     1

All the code in 1 block:

from ast import literal_eval
df['categories'] = df['categories'].apply(literal_eval) #convert to list type

cols = list(set(df['categories'].explode()))
cols.insert(0,df.columns[0]) 

temp = pd.get_dummies(df['categories'].explode())
temp_res = temp.groupby(temp.index).sum()

out = pd.concat([df['item'],temp_res],axis=1)
out = out[cols]

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrej Kesely
Solution 2 user3486942
Solution 3