'Pandas split dataframe column for every character

i have multiple dataframe columns which look like this:

                         Day1
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

What i want is that every character is seperated in a own column:

     012345678910111213....
0    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4    DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD

So that "Day 1-Column" is splitted in 48 Columns and every Column has one of the Value A/B/C/D

i tried with split, but that didnt work.



Solution 1:[1]

You can call apply and for each row call pd.Series on the the list of the values:

In [16]:

df['Day1'].apply(lambda x: pd.Series(list(x)))
Out[16]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]

It looks like you have trailing spaces, remove these using str.rstrip:

df['Day1'] = df['Day1'].str.rstrip()

then do the above.

Solution 2:[2]

use Series.str.extractall() method:

In [19]: df.Day1.str.extractall('(.)', flags=re.U)[0].unstack().rename_axis(None, 1)
Out[19]:
  0  1  2  3  4  5  6  7  8  9  ... 38 39 40 41 42 43 44 45 46 47
0  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
1  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
2  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
3  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D
4  D  D  D  D  D  D  D  D  D  D ...  D  D  D  D  D  D  D  D  D  D

[5 rows x 48 columns]

Solution 3:[3]

Try this:

df['Day1'].str.split(pat ="\s*", expand = True)

It will have empty 1st and last columns so you have to trim the dataframe using df['Day1'].iloc[:,1:-1]

Solution 4:[4]

Following on from the answer from @ric-s, using list to separate the string is slightly faster when applying it outside of pandas:

In [1]: %timeit df['Day1'].apply(lambda x: pd.Series(list(x)))
1.08 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit pd.DataFrame([list(x) for x in df['Day1']])
718 µs ± 2.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Also, the following construction will create meaningful column names for the extracted features:

df[[f'Day1_{i}' for i in range(len(df['Day1'][0]))]] = pd.DataFrame([list(x) for x in df['Day1']])

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ric S
Solution 2
Solution 3 arjepak
Solution 4 njp