'pyspark create dictionary from data in two columns
I have a pyspark dataframe with two columns:
[Row(zip_code='58542', dma='MIN'),
Row(zip_code='58701', dma='MIN'),
Row(zip_code='57632', dma='MIN'),
Row(zip_code='58734', dma='MIN')]
How can I make a key:value pair out of the data inside the columns?
e.g.:
{
"58542":"MIN",
"58701:"MIN",
etc..
}
I would like to avoid using collect for performance reasons. I've tried a few things but can't seem to get just the values.
Solution 1:[1]
As Ankin says, you can use a MapType for this:
import pyspark
from pyspark.sql import Row
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
data = spark.createDataFrame([Row(zip_code='58542', dma='MIN'),
Row(zip_code='58701', dma='MIN'),
Row(zip_code='57632', dma='MIN'),
Row(zip_code='58734', dma='MIN')])
data.show()
Output:
+---+--------+
|dma|zip_code|
+---+--------+
|MIN| 58542|
|MIN| 58701|
|MIN| 57632|
|MIN| 58734|
+---+--------+
from pyspark.sql.functions import udf
from pyspark.sql import types as T
@udf(T.MapType(T.StringType(), T.StringType()))
def create_struct(zip_code, dma):
return {zip_code: dma}
data.withColumn('struct', create_struct(data.zip_code, data.dma)).toJSON().collect()
Output:
['{"dma":"MIN","zip_code":"58542","struct":{"58542":"MIN"}}',
'{"dma":"MIN","zip_code":"58701","struct":{"58701":"MIN"}}',
'{"dma":"MIN","zip_code":"57632","struct":{"57632":"MIN"}}',
'{"dma":"MIN","zip_code":"58734","struct":{"58734":"MIN"}}']
Solution 2:[2]
You can avoid using a udf
here using pyspark.sql.functions.struct
and pyspark.sql.functions.to_json
(Spark version 2.1 and above):
import pyspark.sql.functions as f
from pyspark.sql import Row
data = [
Row(zip_code='58542', dma='MIN'),
Row(zip_code='58701', dma='MIN'),
Row(zip_code='57632', dma='MIN'),
Row(zip_code='58734', dma='MIN')
]
df = spark.createDataFrame(data)
df.withColumn("json", f.to_json(f.struct("dma", "zip_code"))).show(truncate=False)
#+---+--------+--------------------------------+
#|dma|zip_code|json |
#+---+--------+--------------------------------+
#|MIN|58542 |{"dma":"MIN","zip_code":"58542"}|
#|MIN|58701 |{"dma":"MIN","zip_code":"58701"}|
#|MIN|57632 |{"dma":"MIN","zip_code":"57632"}|
#|MIN|58734 |{"dma":"MIN","zip_code":"58734"}|
#+---+--------+--------------------------------+
If you instead wanted the zip_code
to be the key, you can create a MapType
directly using pyspark.sql.functions.create_map
:
df.withColumn("json", f.create_map(["zip_code", "dma"])).show(truncate=False)
#+---+--------+-----------------+
#|dma|zip_code|json |
#+---+--------+-----------------+
#|MIN|58542 |Map(58542 -> MIN)|
#|MIN|58701 |Map(58701 -> MIN)|
#|MIN|57632 |Map(57632 -> MIN)|
#|MIN|58734 |Map(58734 -> MIN)|
#+---+--------+-----------------+
Solution 3:[3]
You can simply do this:
dict = {row['zipcode']:row['dma'] for row in df.collect()}
print(dict)
#{'58542': 'MIN', '58701': 'MIN', '57632': 'MIN', '58734': 'MIN'}
Solution 4:[4]
There is one more way to convert your dataframe into dict. for that you need to convert your dataframe into key-value pair rdd as it will be applicable only to key-value pair rdd. since dictionary itself a combination of key value pairs.
data = [
Row(zip_code='58542', dma='MIN'),
Row(zip_code='58701', dma='MIN'),
Row(zip_code='57632', dma='MIN'),
Row(zip_code='58734', dma='MIN')
]
>>> data.show();
+---+--------+
|dma|zip_code|
+---+--------+
|MIN| 58542|
|MIN| 58701|
|MIN| 57632|
|MIN| 58734|
+---+--------+
converting your dataframe into rdd.
newrdd = data.rdd
since you want zip_code as your key and dma as value, so have selected rdd element '1' as key and element '0' as value.
keypair_rdd = newrdd.map(lambda x : (x[1],x[0]))
once you have key-pair rdd then simply use collectAsMap to convert it into a dictonary
>>> dict = keypair_rdd.collectAsMap()
>>> print dict
{u'58542': u'MIN', u'57632': u'MIN', u'58734': u'MIN', u'58701': u'MIN'}
>>> dict.keys()
[u'58542', u'57632', u'58734', u'58701']
looks value for specific key:
>>> dict.get('58542')
u'MIN'
Solution 5:[5]
yes you can using
pyspark.sql.types.MapType(keyType, valueType, valueContainsNull=True)
please share the more info like dataframe sample output and the way you want as an output that will help in writing a code snippet for the same.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | martinarroyo |
Solution 2 | |
Solution 3 | |
Solution 4 | vikrant rana |
Solution 5 | Ankit Kumar Namdeo |