'Pandas split dataframe column for every character
i have multiple dataframe columns which look like this:
Day1
0 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
What i want is that every character is seperated in a own column:
012345678910111213....
0 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
1 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
2 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
3 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
4 DDDDDDDDDDBBBBBBAAAAAAAAAABBBBBBDDDDDDDDDDDDDDDD
So that "Day 1-Column" is splitted in 48 Columns and every Column has one of the Value A/B/C/D
i tried with split, but that didnt work.
Solution 1:[1]
You can call apply
and for each row call pd.Series
on the the list
of the values:
In [16]:
df['Day1'].apply(lambda x: pd.Series(list(x)))
Out[16]:
0 1 2 3 4 5 6 7 8 9 ... 38 39 40 41 42 43 44 45 46 47
0 D D D D D D D D D D ... D D D D D D D D D D
1 D D D D D D D D D D ... D D D D D D D D D D
2 D D D D D D D D D D ... D D D D D D D D D D
3 D D D D D D D D D D ... D D D D D D D D D D
4 D D D D D D D D D D ... D D D D D D D D D D
[5 rows x 48 columns]
It looks like you have trailing spaces, remove these using str.rstrip
:
df['Day1'] = df['Day1'].str.rstrip()
then do the above.
Solution 2:[2]
use Series.str.extractall() method:
In [19]: df.Day1.str.extractall('(.)', flags=re.U)[0].unstack().rename_axis(None, 1)
Out[19]:
0 1 2 3 4 5 6 7 8 9 ... 38 39 40 41 42 43 44 45 46 47
0 D D D D D D D D D D ... D D D D D D D D D D
1 D D D D D D D D D D ... D D D D D D D D D D
2 D D D D D D D D D D ... D D D D D D D D D D
3 D D D D D D D D D D ... D D D D D D D D D D
4 D D D D D D D D D D ... D D D D D D D D D D
[5 rows x 48 columns]
Solution 3:[3]
Try this:
df['Day1'].str.split(pat ="\s*", expand = True)
It will have empty 1st and last columns so you have to trim the dataframe using
df['Day1'].iloc[:,1:-1]
Solution 4:[4]
Following on from the answer from @ric-s, using list
to separate the string is slightly faster when applying it outside of pandas:
In [1]: %timeit df['Day1'].apply(lambda x: pd.Series(list(x)))
1.08 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit pd.DataFrame([list(x) for x in df['Day1']])
718 µs ± 2.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Also, the following construction will create meaningful column names for the extracted features:
df[[f'Day1_{i}' for i in range(len(df['Day1'][0]))]] = pd.DataFrame([list(x) for x in df['Day1']])
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Ric S |
Solution 2 | |
Solution 3 | arjepak |
Solution 4 | njp |