I need to load data from Hadoop into Druid after applying transformations. If I use Spark, can I load data from a Spark RDD or DataFrame into Druid directly?

I have data in Hive tables, and I want to apply a bunch of transformations before loading that data into Druid. There are a couple of approaches, but I'm not sure about them:

1. Save the table after applying the transformations and then bulk load it through the Hadoop ingestion method. But I want to avoid the extra write on the server.
2. Use Tranquility. But it is for Spark Streaming and only supports Scala and Java, not Python. Am I right about this?

Is there any other way I can achieve this?



Solution 1:[1]

You can achieve this by using the Druid Kafka integration.

I think you should read the data from the Hive tables in Spark, apply your transformations, and then write the result back to a Kafka topic. Once you set up the Druid Kafka integration (a Kafka ingestion supervisor), Druid will read the data from Kafka and push it into the Druid datasource.
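For the Spark side, a minimal PySpark sketch could look like the following. The table name, topic name, broker address, and column names are placeholders, and it assumes the spark-sql-kafka connector package is available to the Spark job:

```python
# Sketch: read a Hive table, transform it, and write the rows to Kafka as JSON.
# Assumes Hive support is enabled and a Kafka broker is reachable at localhost:9092.
# Requires the spark-sql-kafka-0-10 package, e.g.
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ...
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-kafka")
         .enableHiveSupport()
         .getOrCreate())

# Read the source Hive table and apply the transformations (placeholders).
df = spark.table("mydb.events")
transformed = (df
               .filter(F.col("event_time").isNotNull())
               .withColumn("country", F.upper(F.col("country"))))

# Kafka expects a string/binary `value` column; serialize each row as JSON.
(transformed
 .select(F.to_json(F.struct(*transformed.columns)).alias("value"))
 .write
 .format("kafka")
 .option("kafka.bootstrap.servers", "localhost:9092")
 .option("topic", "druid-events")
 .save())
```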

Here is the documentation on the Druid Kafka integration: https://druid.apache.org/docs/latest/tutorials/tutorial-kafka.html
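On the Druid side, you then submit a Kafka supervisor spec to the Overlord. The sketch below is only illustrative: the datasource, topic, columns, host, and port are placeholders, and the exact spec shape should follow the linked tutorial for your Druid version:

```python
# Illustrative Kafka supervisor spec submitted to the Druid Overlord API.
# Field values (datasource, topic, columns, host/port) are placeholders.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "event_time", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "user_id"]},
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "NONE",
            },
        },
        "ioConfig": {
            "topic": "druid-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "taskCount": 1,
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the supervisor; Druid then continuously ingests the topic into the datasource.
resp = requests.post(
    "http://localhost:8081/druid/indexer/v1/supervisor",
    json=supervisor_spec,
)
resp.raise_for_status()
```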

Solution 2:[2]

(Disclaimer: I am a contributor to rovio-ingest.)

With rovio-ingest you can batch-ingest a Hive table into Druid with Spark. This avoids the extra write.
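The answer does not show usage, but since rovio-ingest exposes a Spark data source, the write follows the normal DataFrameWriter pattern. The sketch below is an assumption, not taken from the rovio-ingest docs: the format class name and option keys are hypothetical and should be checked against the rovio-ingest README:

```python
# Rough sketch only: the format name and option keys below are assumptions;
# see the rovio-ingest README for the actual configuration keys
# (Druid metadata DB connection, deep storage location, datasource name, etc.).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-to-druid")
         .enableHiveSupport()
         .getOrCreate())

# Already-transformed data; placeholder table name.
df = spark.table("mydb.events")

(df.write
   .mode("overwrite")
   .format("com.rovio.ingest.DruidSource")      # assumed data source class
   .option("druid.datasource", "events")         # hypothetical option keys --
   .option("druid.time_column", "event_time")    # replace with the keys
   .option("druid.segment_granularity", "DAY")   # documented in the README
   .save())
```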

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: jaimin03
Solution 2: (not listed)