Processing data from a Kafka stream using PySpark
What the output of the Kafka console consumer looks like:
["2017-12-31 16:06:01", 12472391, 1]
["2017-12-31 16:06:01", 12472097, 1]
["2017-12-31 16:05:59", 12471979, 1]
["2017-12-31 16:05:59", 12472099, 0]
["2017-12-31 16:05:59", 12472054, 0]
["2017-12-31 16:06:00", 12472318, 0]
["2017-12-31 16:06:00", 12471979, 0]
I want to use PySpark to collect these values into a list or a DataFrame after a specified period.
What I have tried:
import sys

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName='PythonStreamingDirectKafka')
sc.setLogLevel("WARN")
spark = SparkSession(sc)
ssc = StreamingContext(sc, 10)  # 10-second batch interval

brokers, topic = sys.argv[1:]
kvs = KafkaUtils.createDirectStream(ssc, [topic],
                                    {'metadata.broker.list': brokers})
lines = kvs.map(lambda x: x[1])  # keep only the message value
text = lines.flatMap(lambda line: line.split(" "))
text.pprint()

ssc.start()
ssc.awaitTermination()
The text variable above is a DStream object, and I can't figure out how to manipulate or transform it. I've been through many blogs and docs.
I want to extract the info into a Python list or pandas DataFrame so that I can do manipulations on it.
Would be really grateful for any help. Thanks!
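For context, a minimal sketch of what per-micro-batch manipulation of the DStream from the code above could look like with the legacy DStream API (foreachRDD). The function name handle_batch is illustrative, collecting to the driver is only reasonable for small batches, and the solution below recommends Structured Streaming instead:

```python
import json

def handle_batch(rdd):
    # Each Kafka value is a JSON array such as ["2017-12-31 16:06:01", 12472391, 1];
    # collect() brings this micro-batch to the driver as a plain Python list.
    rows = rdd.map(json.loads).collect()
    for event_time, event_id, flag in rows:
        print(event_time, event_id, flag)

# 'lines' is the DStream of message values defined in the code above
lines.foreachRDD(handle_batch)
```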
Solution 1:[1]
"into a python list or pandas df"
As mentioned in the comments, Spark Structured Streaming will let you consume the data into a (Spark) DataFrame, which is ideally the format you should keep it in for as long as possible, rather than collect()-ing the data into a list or calling toPandas():
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
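A minimal Structured Streaming sketch along those lines. The broker address localhost:9092, the topic name events, and the column names event_time / event_id / flag are assumptions (the question doesn't say what the fields mean), and it assumes Spark 2.4+ with the spark-sql-kafka-0-10 package on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("KafkaStructuredStreaming").getOrCreate()

# Read the Kafka topic as a streaming DataFrame
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

# Each Kafka value is a JSON array like ["2017-12-31 16:06:01", 12472391, 1];
# parse it as an array of strings, then split and cast into named columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS value")
          .select(from_json(col("value"), ArrayType(StringType())).alias("fields"))
          .select(col("fields")[0].alias("event_time"),
                  col("fields")[1].cast("long").alias("event_id"),
                  col("fields")[2].cast("int").alias("flag")))

# Keep working on the streaming DataFrame; write it out instead of collecting it.
query = (parsed.writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())

query.awaitTermination()
```

The trigger interval plays the role of the "specified period" from the question, and any aggregation or filtering can be expressed on parsed before writing out, which avoids pulling the stream back to the driver with collect() or toPandas().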
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | OneCricketeer |