How to write Spark SQL batch job results to Apache Druid?

I want to write Spark batch result data to Apache Druid. I know Druid has native batch ingestion such as index_parallel, and that Druid runs Map-Reduce jobs in the same cluster. But I only want to use Druid as data storage. I want to aggregate the data on an external Spark cluster, then send it to the Druid cluster.

Druid has Tranquility for real-time ingestion. I can send batch data using Tranquility, but that is not efficient. How can I send batch results to Druid efficiently?



Solution 1:[1]

You can write the results to a Kafka topic and run a Kafka indexing job to index them.
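On the Druid side, a Kafka indexing job is started by submitting a supervisor spec to the Overlord. Below is a minimal sketch of such a spec built in Python. The datasource name, topic, broker address, column names and Overlord URL are all placeholders I made up for illustration, and the granularity and rollup settings are assumptions rather than recommendations. On the Spark side, the built-in Kafka sink (`df.write.format("kafka")`) is one common way to publish the rows as JSON.

```python
import json
import urllib.request


def build_kafka_supervisor_spec(datasource, topic, bootstrap_servers,
                                timestamp_column, dimensions):
    """Return a Druid Kafka supervisor spec as a plain dict.

    All argument values are supplied by the caller; nothing here is
    taken from a real cluster.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": timestamp_column, "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",  # assumption: day-sized segments
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


def submit_supervisor(spec, overlord_url):
    """POST the spec to the Overlord's supervisor endpoint."""
    request = urllib.request.Request(
        overlord_url.rstrip("/") + "/druid/indexer/v1/supervisor",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical values, for illustration only.
spec = build_kafka_supervisor_spec(
    datasource="spark_results",
    topic="spark-results",
    bootstrap_servers="kafka-broker:9092",
    timestamp_column="event_time",
    dimensions=["country", "device"],
)
```

Calling `submit_supervisor(spec, "http://overlord-host:8090")` (host and port are placeholders) would register the supervisor; the call is left out of the sketch so it can be read without a running cluster.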

We have been using this mechanism for indexing data, and it has no windowPeriod restriction like Tranquility's: it accepts even older timestamps. But if a shard is already finalized, the late data ends up creating new shards in the same segment.

For example, with day-sized segments I will get two shards in that segment: segment-11-11-2019-1 (100MB) and segment-11-11-2019-2 (10MB, for data received on 12th Nov with an event time of 11th Nov).

With auto compaction turned on, these two shards will be merged.

https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion.html

https://druid.apache.org/docs/latest/tutorials/tutorial-compaction.html
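Auto compaction is configured per datasource on the Coordinator. A minimal sketch of the configuration payload follows; the datasource name is a placeholder, and the `P1D` offset (skip the most recent day, where late data may still be arriving) is an assumption to adjust to your own late-arrival pattern.

```python
import json


def build_auto_compaction_config(datasource, skip_offset_from_latest="P1D"):
    """Return an auto compaction config for one datasource as a dict."""
    return {
        "dataSource": datasource,
        # ISO 8601 period: intervals newer than this offset are not compacted.
        "skipOffsetFromLatest": skip_offset_from_latest,
    }


# Hypothetical datasource name, for illustration only.
config = build_auto_compaction_config("spark_results")
payload = json.dumps(config, sort_keys=True)
```

The resulting `payload` would be sent with a POST request to `/druid/coordinator/v1/config/compaction` on the Coordinator.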

Alternatively, you can simply accumulate results in HDFS and then run Hadoop batch ingestion from cron jobs. Auto compaction works well for this option too.
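For the HDFS route, the cron job would submit an `index_hadoop` task to the Overlord. A rough sketch of such a task spec is below, assuming the Spark job wrote newline-delimited JSON. The HDFS path, datasource, column names and interval are hypothetical, and the empty `metricsSpec` assumes the data was already aggregated in Spark.

```python
import json


def build_hadoop_index_task(datasource, hdfs_paths, interval,
                            timestamp_column, dimensions):
    """Return a Druid Hadoop batch ingestion task spec as a dict."""
    return {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "parser": {
                    "type": "hadoopyString",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {
                            "column": timestamp_column,
                            "format": "iso",
                        },
                        "dimensionsSpec": {"dimensions": list(dimensions)},
                    },
                },
                "metricsSpec": [],  # assumption: no further rollup in Druid
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "DAY",
                    "queryGranularity": "NONE",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {"type": "static", "paths": hdfs_paths},
            },
            "tuningConfig": {"type": "hadoop"},
        },
    }


# Hypothetical values, for illustration only.
task = build_hadoop_index_task(
    datasource="spark_results",
    hdfs_paths="hdfs://namenode:8020/spark-output/2019-11-11/",
    interval="2019-11-11/2019-11-12",
    timestamp_column="event_time",
    dimensions=["country", "device"],
)
task_json = json.dumps(task)
```

The cron job would then POST `task_json` to `/druid/indexer/v1/task` on the Overlord, once per accumulated batch.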

Solution 2:[2]

(Disclaimer: I am a contributor to rovio-ingest)

The rovio-ingest library allows ingesting any Spark dataset to Druid. The processing happens on a Spark cluster (that can be separate from the Druid server, as you required). Druid will discover the new/updated segments in the metadata storage. Writing big batches with this library is efficient, because the data is partitioned so that segment writing can be parallelized by Spark.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Vikram Patil
Solution 2