Processing data from a Kafka stream using PySpark

What the console of the Kafka consumer looks like:

["2017-12-31 16:06:01", 12472391, 1]
["2017-12-31 16:06:01", 12472097, 1]
["2017-12-31 16:05:59", 12471979, 1]
["2017-12-31 16:05:59", 12472099, 0]
["2017-12-31 16:05:59", 12472054, 0]
["2017-12-31 16:06:00", 12472318, 0]
["2017-12-31 16:06:00", 12471979, 0]

I want to use PySpark to collect these values into a Python list or a DataFrame after a specified period.

What I have tried:

import sys

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName='PythonStreamingDirectKafka')
sc.setLogLevel("WARN")
spark = SparkSession(sc)
ssc = StreamingContext(sc, 10)  # 10-second batch interval

brokers, topic = sys.argv[1:]

kvs = KafkaUtils.createDirectStream(ssc, [topic],
    {'metadata.broker.list': brokers})

# each Kafka record is a (key, value) pair; keep only the value
lines = kvs.map(lambda x: x[1])
text = lines.flatMap(lambda line: line.split(" "))
text.pprint()  # pprint() returns None, so its result must not be assigned

ssc.start()
ssc.awaitTermination()

The text variable above is a DStream object, and I can't figure out how to manipulate or transform it. I've been through many blogs and docs.

I want to extract the info into a Python list or pandas DataFrame so that I can run manipulations on it, as in the sketch below.
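
To show the kind of thing I mean, here is an untested sketch using foreachRDD (which I understand is the way to reach the records inside each micro-batch; handle_batch is just a name I made up), operating on the text DStream from the snippet above:

def handle_batch(rdd):
    # collect() pulls this micro-batch's records back as a plain Python list
    values = rdd.collect()
    if values:
        print(values)  # ...or build a pandas DataFrame from it, etc.

text.foreachRDD(handle_batch)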

I'd be really grateful for any help. Thanks!



Solution 1 [1]

"into a python list or pandas df"

As mentioned in the comments, Spark Structured Streaming will let you consume the data into a (Spark) DataFrame, which is ideally the format you should keep it in for as long as possible, rather than collect()ing it into a list or calling toPandas().

https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
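
For example, here is a minimal sketch of that approach; the broker address, topic name, and output column names are placeholders, and parsing the value as a JSON array with from_json assumes Spark 2.4+ (earlier versions only accept struct schemas there):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("KafkaStructuredStream").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "mytopic")                        # placeholder topic
      .load())

# value arrives as bytes; each record is a JSON array such as
# ["2017-12-31 16:06:01", 12472391, 1]
parsed = (df.selectExpr("CAST(value AS STRING) AS value")
          .select(from_json("value", ArrayType(StringType())).alias("arr"))
          .select(col("arr")[0].alias("event_time"),  # placeholder column names
                  col("arr")[1].cast("long").alias("id"),
                  col("arr")[2].cast("int").alias("flag")))

# Print each micro-batch to the console, like the original consumer did
query = (parsed.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()

Each micro-batch then prints as a proper three-column DataFrame, and you can aggregate on event_time with window functions directly. If you genuinely need pandas, foreachBatch (also Spark 2.4+) hands you an ordinary Spark DataFrame per micro-batch, which is safe to toPandas() while it's small.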

Sources

[1] Solution 1 by OneCricketeer, Stack Overflow. Licensed under CC BY-SA 3.0.