Very slow Apache Druid query
I'm currently working with an Apache Druid warehouse that stores nearly 300 million rows and has a total size of 44 GB. We are developing a Flask API that uses Gunicorn and Celery to run SQL queries against Druid. There is a React app that generates multiple requests to the Flask API, and the API then requests data from Druid with the appropriate SQL query. Our problem is that the Druid responses take a long time: when we send around 50 requests to Druid, it can take about 1.3 minutes until the last response comes back. We have worked a lot on our front end and on API optimization; however, we suspect the problem lies in the Druid datasource.
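For context, each API request ends up as a single SQL statement posted to Druid's SQL endpoint on the Broker. A minimal sketch of such a call (the server address, datasource name, and query are placeholders, not our real workload):

curl -X POST 'http://your-druid-server:8082/druid/v2/sql/' \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT COUNT(*) AS cnt FROM your_datasource WHERE __time >= TIMESTAMPADD(DAY, -7, CURRENT_TIMESTAMP)"}'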
Our Druid datasource has the following characteristics:
- Total data size: 44.01 GB
- Segment size (rows): minimum 1, average 0.151M, maximum 0.637M
- Segment granularity: DAY
- Total rows: 295,465,723
- Avg. row size (bytes): 148
- Replicated size: 44.01 GB
- Compaction: not enabled
We then ran a query over our datasource and found that the segment with the largest number of rows has 636,688 rows and a size of 80,859,007 bytes.
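Per-segment statistics like these can be read from Druid's sys.segments metadata table via the SQL API, for example (the server address and datasource name are placeholders):

curl -X POST 'http://your-druid-server:8082/druid/v2/sql/' \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT \"segment_id\", \"num_rows\", \"size\" FROM sys.segments WHERE \"datasource\" = '\''your_datasource'\'' ORDER BY \"num_rows\" DESC LIMIT 10"}'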
I think we need to run compaction on our datasource in order to increase the number of rows per segment, in line with the recommendations about segments in the Druid documentation. Before re-ingesting our datasource, I would like to know: will compacting the segments improve query performance, or do we need to take a different approach to this issue?
Thanks a lot
Solution 1:[1]
Try querying your datasource directly via the API, just to check how fast your individual queries return.
curl -X POST 'http://your-druid-server:8082/druid/v2/?pretty' -H 'Content-Type:application/json' -H 'Accept:application/json' -d @/home/your-directory/your_query.json
You can look into optimizing your slow queries first, for example by using relevant (narrow) time intervals or other tweaks. If they are still slow (taking minutes per query), you can try compaction, but it's not guaranteed to improve your queries.
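As an illustration, the your_query.json referenced above could be a native query that restricts the scan to a tight time interval; the datasource, interval, and aggregation here are placeholders rather than a tuned query:

cat > /home/your-directory/your_query.json <<'EOF'
{
  "queryType": "timeseries",
  "dataSource": "your_datasource",
  "intervals": ["2021-06-01/2021-06-08"],
  "granularity": "day",
  "aggregations": [
    { "type": "count", "name": "rows" }
  ]
}
EOF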
Solution 2:[2]
Those are a lot of very small segments on average. Reading each segment has some overhead, so it will likely help to do some compaction and try to achieve segments of ~5 million rows. Each thread in the Historical reads a single segment at a time, so it is much more efficient if each of these segments holds a significant portion of the data (~500-700 MB).
This section of the documentation discusses the importance of segment size optimization.
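For example, auto-compaction with a larger row target per segment can be enabled by posting a compaction config to the Coordinator. The Coordinator address, datasource name, and the 5,000,000-row target below are illustrative and should be tuned to your data and cluster:

curl -X POST 'http://your-coordinator:8081/druid/coordinator/v1/config/compaction' \
  -H 'Content-Type: application/json' \
  -d '{
        "dataSource": "your_datasource",
        "skipOffsetFromLatest": "P1D",
        "tuningConfig": {
          "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 }
        }
      }'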
Also, some other thoughts on query and concurrency optimization:
Do your queries specify a time interval filter?
What are the queries attempting to do?
Is rollup enabled? What is the queryGranularity?
What time granularity is needed for the end user?
How many Historicals do you have? This will affect the parallelism of the query execution.
How are the Historicals configured? In particular I'm curious about:
a. druid.processing.numThreads
b. druid.server.http.numThreads
which are set by default based on the available CPUs and therefore determine the parallelism of execution of each Historical and the availability of threads to process communication requests (see the example after this list).
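For reference, a quick (hypothetical) way to check what a Historical is currently using; the config path depends on how your cluster is deployed, and the values in the sample output are placeholders:

grep -E 'druid\.(processing\.numThreads|server\.http\.numThreads)' \
  conf/druid/cluster/data/historical/runtime.properties
# sample output (illustrative values only):
# druid.processing.numThreads=7
# druid.server.http.numThreads=60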
We can better help you optimize the workload once we understand more about the use case and the resources available to the cluster processes.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Clover |
| Solution 2 | Sergio Ferragut |