PySpark: get element from array Column of struct based on condition
I have a Spark df with the following schema:
|-- col1 : string
|-- col2 : string
|-- customer: struct
| |-- smt: string
| |-- attributes: array (nullable = true)
| | |-- element: struct
| | | |-- key: string
| | | |-- value: string
df:
#+-------+-------+---------------------------------------------------------------------------+
#|col1 |col2 |customer |
#+-------+-------+---------------------------------------------------------------------------+
#|col1_XX|col2_XX|"attributes":[[{"key": "A", "value": "123"},{"key": "B", "value": "456"}] |
#+-------+-------+---------------------------------------------------------------------------+
and the JSON input for the array looks like this:
...
"attributes": [
{
"key": "A",
"value": "123"
},
{
"key": "B",
"value": "456"
}
],
I would like to loop over the attributes array, get the element with key="B", and then select the corresponding value. I don't want to use explode because I want to avoid joining dataframes.
Is it possible to perform this kind of operation directly with Spark Column expressions?
Expected output will be:
#+-------+-------+-----+
#|col1   |col2   |B    |
#+-------+-------+-----+
#|col1_XX|col2_XX|456 |
#+-------+-------+-----+
Any help would be appreciated.
Solution 1:[1]
You can use the filter function to filter the array of structs, then get the value of the matching element:
from pyspark.sql import functions as F

# Keep only the structs whose key is 'B', take the first match, and extract its value field
df2 = df.withColumn(
    "B",
    F.expr("filter(customer.attributes, x -> x.key = 'B')")[0]["value"]
)
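For reference, here is a minimal, self-contained sketch of the same approach. The sample data, the DDL schema string, and the SparkSession setup are assumptions added for illustration; only the filter expression comes from the answer above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample row matching the schema in the question
data = [("col1_XX", "col2_XX", ("smt_val", [("A", "123"), ("B", "456")]))]
schema = "col1 string, col2 string, customer struct<smt:string, attributes:array<struct<key:string,value:string>>>"
df = spark.createDataFrame(data, schema)

# filter() keeps the structs whose key is 'B'; [0] takes the first match, ["value"] its value
df2 = df.withColumn(
    "B",
    F.expr("filter(customer.attributes, x -> x.key = 'B')")[0]["value"]
)
df2.select("col1", "col2", "B").show()
# should print something like:
# +-------+-------+---+
# |   col1|   col2|  B|
# +-------+-------+---+
# |col1_XX|col2_XX|456|
# +-------+-------+---+
On Spark 3.1+, the same logic should also be expressible without expr, e.g. F.element_at(F.filter("customer.attributes", lambda x: x.key == "B"), 1)["value"], while the SQL filter expression above works from Spark 2.4 onwards.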
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | blackbishop |