Split corresponding column values in PySpark
The table below is the input dataframe:
col1 | col2 | col3 |
---|---|---|
1 | 12;34;56 | Aus;SL;NZ |
2 | 31;54;81 | Ind;US;UK |
3 | null | Ban |
4 | Ned | null |
Expected output dataframe (the values of col2 and col3 should be split on `;`, keeping corresponding elements together):
col1 | col2 | col3 |
---|---|---|
1 | 12 | Aus |
1 | 34 | SL |
1 | 56 | NZ |
2 | 31 | Ind |
2 | 54 | US |
2 | 81 | UK |
3 | null | Ban |
4 | Ned | null |
Solution 1:[1]
The code below works as expected:
from pyspark.sql.functions import posexplode_outer, split

data = [(1, '12;34;56', 'Aus;SL;NZ'),
        (2, '31;54;81', 'Ind;US;UK'),
        (3, None, 'Ban'),
        (4, 'Ned', None)]
columns = ['Id', 'Score', 'Countries']
df = spark.createDataFrame(data, columns)
#df.show()
df2 = df.select("*", posexplode_outer(split("Countries", ";")).alias("pos1", "value1"))
#df2.show()
df3 = df2.select("*", posexplode_outer(split("Score", ";")).alias("pos2", "value2"))
#df3.show()
# Keep rows where the two positions match, or where either side had nothing to split
df4 = df3.filter((df3.pos1 == df3.pos2) | df3.pos1.isNull() | df3.pos2.isNull())
df4 = df4.select("Id", "value2", "value1")
df4.show()  # Final output
Solution 2:[2]
You can use the PySpark function split() to convert a column with multiple values into an array, and then the function explode() to make a separate row out of each value. It may look like this:
df = df.withColumn("<columnName>", explode(split(df.<columnName>, ";")))
If you want to keep NULL values, you can use explode_outer() instead.
If you want the values of multiple exploded arrays to match up across rows, you can work with posexplode() and then filter() to the rows where the positions correspond.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Vineeth Reddy |
Solution 2 |