Concatenating strings by rows in PySpark
I have a PySpark dataframe like this:
DOCTOR | PATIENT
JOHN | SAM
JOHN | PETER
JOHN | ROBIN
BEN | ROSE
BEN | GRAY
and I need to concatenate the patient names across rows, grouped by doctor, so that I get output like:
DOCTOR | PATIENT
JOHN | SAM, PETER, ROBIN
BEN | ROSE, GRAY
Can anybody help me create this dataframe in PySpark?
Thanks in advance.
Solution 1:[1]
The simplest way I can think of is to use collect_list to gather the values of each group and concat_ws to join them into one string:
import pyspark.sql.functions as f
df.groupby("col1").agg(f.concat_ws(", ", f.collect_list(df.col2)))
Solution 2:[2]
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Sample data: one (user_id, category) pair per row
data = [
    ("U_104", "food"),
    ("U_103", "cosmetics"),
    ("U_103", "children"),
    ("U_104", "groceries"),
    ("U_103", "food")
]

schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("category", StringType(), True),
])

# getOrCreate() also sets up the underlying SparkContext if none exists yet,
# so a separate SparkContext.getOrCreate() call is not needed
spark = SparkSession.builder.appName("groupby").getOrCreate()
df = spark.createDataFrame(data, schema)

# Collect each user's categories into a list, then join the list into one string
group_df = df.groupBy(f.col("user_id")).agg(
    f.concat_ws(",", f.collect_list(f.col("category"))).alias("categories")
)

group_df.show()
+-------+--------------------+
|user_id|          categories|
+-------+--------------------+
|  U_104|      food,groceries|
|  U_103|cosmetics,childre...|
+-------+--------------------+
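The U_103 row ends in "..." because show() truncates cells longer than 20 characters by default; group_df.show(truncate=False) prints the full value. To check what the aggregation computes without starting Spark, here is a plain-Python model of the same group-and-join on the same data. It only illustrates the semantics, not how Spark executes the query, and it assumes there are no missing values. The helper name concat_by_key is made up for this sketch.

```python
from collections import defaultdict


def concat_by_key(rows, sep=","):
    """Group (key, value) pairs by key and join each group's values with sep."""
    groups = defaultdict(list)
    for key, value in rows:
        # Plays the role of collect_list: gather values per key
        groups[key].append(value)
    # Plays the role of concat_ws: join each list into a single string
    return {key: sep.join(values) for key, values in groups.items()}


data = [
    ("U_104", "food"),
    ("U_103", "cosmetics"),
    ("U_103", "children"),
    ("U_104", "groceries"),
    ("U_103", "food"),
]

result = concat_by_key(data)
print(result["U_104"])  # food,groceries
print(result["U_103"])  # cosmetics,children,food
```

Unlike Spark, this model always keeps values in input order, so it is only a reference for the content of each group, not for the ordering Spark will produce.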
There are some useful aggregation examples
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | pault |
Solution 2 | DevShepherd |