'Spark streamming take long time read from kafka

I build a cluster use CDH5.14.2, includes 5 nodes, each node has 130G momery and 40 cpu cores. I builded the spark streamming application to read from multiple kafka topic, about 10 kafka topics, and aggregate the kafka message separately. And save the kafka offset into zookeeper finally. Finally i found spark task take long time to process kafka message. The kafka message is not skew, and i found spark take long to read from kafka.

My code script:

// build input steeam from kafka topic
JavaInputDStream<ConsumerRecord<String, String>> stream1 = MyKafkaUtils.
    buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic1, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream2 = MyKafkaUtils.
    buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic2, ssc);
JavaInputDStream<ConsumerRecord<String, String>> stream3 = MyKafkaUtils.
    buildInputStream(KafkaConfig.kafkaFlowGrouppId, topic3, ssc);
...

// aggregate kafka message use spark sql
result1 = process(stream1);
result2 = process(stream2);
result3 = process(stream3);
...

// write result to kafka kafka
writeToKafka(result1);
writeToKafka(result2);
writeToKafka(result3);

// save offset to zookeeper
saveOffset(stream1);
saveOffset(stream2);
saveOffset(stream3);

spark web ui information: enter image description here



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source