'Pandas - Duplicate Rows and Slice String
I'm trying to create duplicate rows during a dataframe on conditions.
For example, I have this Dataframe.
students = [
("a", "Ursula"),
("b", "Hayfa, Martin"),
("c", "Kato"),
("d", "Tanek, Ava, Pyto"),
("e", "Aiko"),
("f", "Hunter"),
("g", "Josiah, Derek, Uma, Nell"),
]
df = pd.DataFrame(students, columns=["team", "student"])
print(df)
team student
a Ursula
b Hayfa, Martin
c Kato
d Tanek, Ava, Pyto
e Aiko
f Hunter
g Josiah, Derek, Uma, Nell
And I would like to get the following output:
team student name remark
a Ursula Ursula
b Hayfa, Martin Hayfa with Martin
b Hayfa, Martin Martin with Hayfa
c Kato Kato
d Tanek, Ava, Pyto Tanek with Ava, Pyto
d Tanek, Ava, Pyto Ava with Tanek, Pyto
d Tanek, Ava, Pyto Pyto with Tanek, Ava
e Aiko Aiko
f Hunter Hunter
g Josiah, Derek, Uma, Nell Josiah with Derek, Uma, Nell
g Josiah, Derek, Uma, Nell Derek with Josiah, Uma, Nell
g Josiah, Derek, Uma, Nell Uma with Josiah, Derek, Nell
g Josiah, Derek, Uma, Nell Nell with Josiah, Derek, Uma
Solution 1:[1]
For pandas 0.25+ is possible use DataFrame.explode
with splitted values by Series.str.split
and for remark
column list comprehension with filtering:
students = df["student"].str.split(", ")
df = df.assign(name=students, remark=students).explode("name").reset_index(drop=True)
df["remark"] = [
"with " + ", ".join(x for x in r if x != n) if len(r) > 1 else ""
for n, r in zip(df["name"], df["remark"])
]
print (df)
And we get the following result:
team student name remark
0 a Ursula Ursula
1 b Hayfa, Martin Hayfa with Martin
2 b Hayfa, Martin Martin with Hayfa
3 c Kato Kato
4 d Tanek, Ava, Pyto Tanek with Ava, Pyto
5 d Tanek, Ava, Pyto Ava with Tanek, Pyto
6 d Tanek, Ava, Pyto Pyto with Tanek, Ava
7 e Aiko Aiko
8 f Hunter Hunter
9 g Josiah, Derek, Uma, Nell Josiah with Derek, Uma, Nell
10 g Josiah, Derek, Uma, Nell Derek with Josiah, Uma, Nell
11 g Josiah, Derek, Uma, Nell Uma with Josiah, Derek, Nell
12 g Josiah, Derek, Uma, Nell Nell with Josiah, Derek, Uma
Solution 2:[2]
Here is another way:
(df.assign(name = df['student'].str.split(','))
.explode('name')
.assign(remark = lambda x: (x['student'].str.split(', ').map(set) - x['name'].map(lambda y: set(y.split())))
.map(lambda z: 'with {}'.format(' '.join(z)) if len(z)>0 else "")))
Output:
team student name remark
0 a Ursula Ursula
1 b Hayfa, Martin Hayfa with Martin
1 b Hayfa, Martin Martin with Hayfa
2 c Kato Kato
3 d Tanek, Ava, Pyto Tanek with Ava Pyto
3 d Tanek, Ava, Pyto Ava with Tanek Pyto
3 d Tanek, Ava, Pyto Pyto with Tanek Ava
4 e Aiko Aiko
5 f Hunter Hunter
6 g Josiah, Derek, Uma, Nell Josiah with Derek Nell Uma
6 g Josiah, Derek, Uma, Nell Derek with Josiah Nell Uma
6 g Josiah, Derek, Uma, Nell Uma with Derek Josiah Nell
6 g Josiah, Derek, Uma, Nell Nell with Derek Josiah Uma
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jean-Francois T. |
Solution 2 | rhug123 |