'Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big and because of that I cannot use the available solutions on the internet. They are very slow and memory-inefficient

Here is how my data looks like:

df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
    A       B          C           D           E
0   x1  [v1, v2]    [c1, c2]    [d1, d2]    [e1, e2]
1   x2  [v3, v4]    [c3, c4]    [d3, d4]    [e3, e4]
2   x3  [v5, v6]    [c5, c6]    [d5, d6]    [e5, e6]
3   x4  [v7, v8]    [c7, c8]    [d7, d8]    [e7, e8]

And this is the shape of my data: (441079, 12)

My desired output is:

    A       B          C           D           E
0   x1      v1         c1         d1          e1
0   x1      v2         c2         d2          e2
1   x2      v3         c3         d3          e3
1   x2      v4         c4         d4          e4
.....

EDIT: After being marked as duplicate, I would like to stress on the fact that in this question I was looking for an efficient method of exploding multiple columns. Therefore the approved answer is able to explode an arbitrary number of columns on very large datasets efficiently. Something that the answers to the other question failed to do (and that was the reason I asked this question after testing those solutions).

Solution 1:^[1]

pandas >= 0.25

Assuming all columns have the same number of lists, you can call Series.explode on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.

It's also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', 1))


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution 2:^[2]

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens==0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]

Usage:

In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

Solution 3:^[3]

Use set_index on A and on remaining columns apply and stack the values. All of this condensed into a single liner.

In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', 1))
Out[1253]:
    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

Solution 4:^[4]

Building on @cs95's answer, we can use an if clause in the lambda function, instead of setting all the other columns as the index. This has the following advantages:

Preserves column order
Lets you easily specify columns using the set you want to modify, x.name in [...], or not modify x.name not in [...].

df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)

     A   B   C   D   E
0   x1  v1  c1  d1  e1
0   x1  v2  c2  d2  e2
1   x2  v3  c3  d3  e3
1   x2  v4  c4  d4  e4
2   x3  v5  c5  d5  e5
2   x3  v6  c6  d6  e6
3   x4  v7  c7  d7  e7
3   x4  v8  c8  d8  e8

Solution 5:^[5]

As of pandas 1.3.0:

DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)

What’s new in 1.3.0 (July 2, 2021)

So now this operation is as simple as:

df.explode(['B', 'C', 'D', 'E'])

    A   B   C   D   E
0  x1  v1  c1  d1  e1
0  x1  v2  c2  d2  e2
1  x2  v3  c3  d3  e3
1  x2  v4  c4  d4  e4
2  x3  v5  c5  d5  e5
2  x3  v6  c6  d6  e6
3  x4  v7  c7  d7  e7
3  x4  v8  c8  d8  e8

Or if wanting unique indexing:

df.explode(['B', 'C', 'D', 'E'], ignore_index=True)

    A   B   C   D   E
0  x1  v1  c1  d1  e1
1  x1  v2  c2  d2  e2
2  x2  v3  c3  d3  e3
3  x2  v4  c4  d4  e4
4  x3  v5  c5  d5  e5
5  x3  v6  c6  d6  e6
6  x4  v7  c7  d7  e7
7  x4  v8  c8  d8  e8

Solution 6:^[6]

Here is my solution using 'apply' function. Main features/differences:

offer options to specify selected multiple columns or all columns
offer options to specify values to fill in the 'missing' position (through parameter fill_mode = 'external'; 'internal'; or 'trim', explanation would be long, see examples below and try yourself to change the option and check the result)

Notes: option 'trim' was developed for my need, out of scope for this question

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    jcols = [j for j,v in enumerate(row.index) if v in cols]
    if len(jcols)<1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    if not Ls[:-1]==Ls[1:]:
        vals = [v if isinstance(v,list) else [v] for v in row.values]
        if fill_mode=='external':
            vals = [[e] + [fill_value]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                    else e + [fill_value]*(max(Ls)-lenx(e))
                    for j,e in enumerate(vals)]
        elif fill_mode == 'internal':
            vals = [[e]+[e]*(max(Ls)-1) if (not j in jcols) and (isinstance(row.values[j],list))
                    else e+[e[-1]]*(max(Ls)-lenx(e)) 
                    for j,e in enumerate(vals)]
        else:
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals,index=row.index.tolist())
    return row

Examples:

df=pd.DataFrame({
    'a':[[1],2,3],
    'b':[[4,5,7],[5,4],4],
    'c':[[4,5],5,[6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)

Output:

     a          b       c
0  [1]  [4, 5, 7]  [4, 5]
1    2     [5, 4]       5
2    3          4     [6]

fill_mode='external', all columns, fill_value = 'OK'
     a  b   c
0   1  4   4
0  OK  5   5
0  OK  7  OK
1   2  5   5
1  OK  4  OK
2   3  4   6

fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
     a  b       c
0   1  4  [4, 5]
0  OK  5      OK
0  OK  7      OK
1   2  5       5
1  OK  4      OK
2   3  4       6

fill_mode='internal', cols = ['a', 'b']
    a  b       c
0  1  4  [4, 5]
0  1  5  [4, 5]
0  1  7  [4, 5]
1  2  5       5
1  2  4       5
2  3  4       6

fill_mode='trim', all columns
    a  b  c
0  1  4  4
1  2  5  5
2  3  4  6

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Community
Solution 2	MaxU - stop genocide of UA
Solution 3
Solution 4	benwshul
Solution 5	Henry Ecker
Solution 6	nphaibk

'Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

Solution 1:[1]

pandas >= 0.25

Solution 2:[2]

Solution 3:[3]

Solution 4:[4]

Solution 5:[5]

Solution 6:[6]