Spark Streaming takes a long time to read from Kafka
I built a cluster with CDH 5.14.2. It has 5 nodes, each with 130 GB of memory and 40 CPU cores. I built a Spark Streaming application that reads from about 10 Kafka topics, aggregates the messages of each topic separately, and finally saves the Kafka offsets to ZooKeeper. The Spark tasks take a long time to process the Kafka messages. The Kafka messages are not skewed, and I found that Spark spends a long time reading from Kafka.
My code:
// build an input stream for each Kafka topic
JavaInputDStream<ConsumerRecord<String, String>> stream1 = MyKafkaUtils.
buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic1, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream2 = MyKafkaUtils.
buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic2, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream3 = MyKafkaUtils.
buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic3, ssc);
...
// aggregate the Kafka messages using Spark SQL
result1 = process(stream1);
result2 = process(stream2);
result3 = process(stream3);
...
// write the results to Kafka
writeToKafka(result1);
writeToKafka(result2);
writeToKafka(result3);
// save the offsets to ZooKeeper
saveOffset(stream1);
saveOffset(stream2);
saveOffset(stream3);
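The `saveOffset` helper is not shown above. As a rough sketch only, assuming the offsets are stored under the node layout used by Kafka's older ZooKeeper-based consumers (`/consumers/<group>/offsets/<topic>/<partition>`), the bookkeeping for one topic might look like the following. The class `OffsetPaths` and both method names are hypothetical, and the actual ZooKeeper write is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the code from the question: it only computes
// which ZooKeeper node should hold which offset; it does not talk to ZooKeeper.
class OffsetPaths {

    // Builds the node path for one topic partition, assuming the
    // /consumers/<group>/offsets/<topic>/<partition> layout.
    static String offsetPath(String groupId, String topic, int partition) {
        return "/consumers/" + groupId + "/offsets/" + topic + "/" + partition;
    }

    // Maps each partition's node path to the offset to store there.
    // untilOffsets[p] is assumed to be the end offset of partition p in the batch.
    static Map<String, Long> toZkEntries(String groupId, String topic, long[] untilOffsets) {
        Map<String, Long> entries = new LinkedHashMap<>();
        for (int p = 0; p < untilOffsets.length; p++) {
            entries.put(offsetPath(groupId, topic, p), untilOffsets[p]);
        }
        return entries;
    }
}
```

With ten topics, each batch produces one such map per topic, so the number of ZooKeeper writes per batch grows with the total partition count across all topics.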
Spark web UI information: (screenshot)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow