Custom Sink to Snowflake using Snowflake JDBC driver is very slow
I am using Spring Cloud Data Flow to create a custom stream to load data into Snowflake. I have written a custom sink that loads data into Snowflake using Snowflake's JDBC driver. The methodology I used is similar to any JDBC-based database update and follows these steps:
- Create a connection pool (using HikariCP) to obtain Snowflake database connections.
- Using a prepared statement, build a batch of rows to commit all at once.
- Using a scheduled timer, commit the batch to Snowflake.
This is when I noticed that the batch updates very slowly in Snowflake, i.e. one or two records at a time, and a batch of 8K rows took well over 45 minutes to land in the Snowflake table (using an XS warehouse).
My question: is there a better/another/recommended method to stream data into Snowflake? I am aware of the Kafka connector for Snowflake and of Snowpipe (which uses an internal/external stage), but these are not options we would like to pursue.
PreparedStatement preparedStatement = null;
Connection conn = null;
// Each message is inserted as a JSON document through parse_json()
String compiledQuery = "INSERT INTO " + env.getProperty("snowtable") + " SELECT parse_json(column1) FROM VALUES (?)";
conn = DataSource.getConnection();               // connection from the HikariCP pool
preparedStatement = conn.prepareStatement(compiledQuery);
for (int i = 0; i < messageslocal.size(); i++) {
    preparedStatement.setString(1, messageslocal.get(i));
    preparedStatement.addBatch();                // queues one single-row insert per message
}
preparedStatement.executeBatch();                // sends the queued inserts
Thank you!
Solution 1:[1]
Generally speaking, Snowflake, like many column-store or hybrid-store databases, does not perform well for single-row or small-batch inserts. So the poor performance you are seeing does not look strange to me, especially on an XS warehouse.
Without knowing the context of your task, I would suggest writing to a JSON, Parquet or CSV file (stored on S3 if you're in AWS) instead of writing directly to Snowflake through JDBC. You can make that JSON/Parquet/CSV file available through a stage in Snowflake.
Then you can either write a process that copies the stage data to a table, or put a materialized view on top of the stage. The materialized view will do more or less the equivalent of triggering the extraction of the JSON/Parquet/CSV data into a Snowflake table, but it operates asynchronously without impacting your application's performance.
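For illustration, here is a minimal JDBC sketch of the stage-and-copy idea. It assumes an internal stage named my_stage (a placeholder) and a target table with a single VARIANT column, as the parse_json insert in the question suggests; with an external S3 stage you would skip the PUT and point the stage at the bucket instead. env, conn and messageslocal are reused from the question's snippet.
// Requires java.nio.file.{Files, Path} and java.sql.Statement imports.
Path batchFile = Files.createTempFile("batch-", ".json");
Files.write(batchFile, messageslocal);          // newline-delimited JSON, one document per line
try (Statement stmt = conn.createStatement()) {
    // The Snowflake JDBC driver executes PUT like any other statement and
    // uploads the local file to the internal stage.
    stmt.execute("PUT file://" + batchFile.toAbsolutePath() + " @my_stage AUTO_COMPRESS=TRUE");
    // COPY INTO then loads the staged file in bulk, which is far faster than
    // row-by-row inserts.
    stmt.execute("COPY INTO " + env.getProperty("snowtable")
            + " FROM @my_stage FILE_FORMAT = (TYPE = 'JSON') PURGE = TRUE");
} finally {
    Files.deleteIfExists(batchFile);
}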
Solution 2:[2]
In addition to the great answer by @JeromeE, you should also try using a multi-row insert. What you have in your code is a batch of individual inserts.
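As a rough sketch of the difference, the snippet below builds one INSERT statement carrying many value tuples and binds every row into it, reusing env, conn and messageslocal from the question; very large batches may need to be split into chunks.
StringBuilder sql = new StringBuilder(
        "INSERT INTO " + env.getProperty("snowtable") + " SELECT parse_json(column1) FROM VALUES ");
for (int i = 0; i < messageslocal.size(); i++) {
    sql.append(i == 0 ? "(?)" : ", (?)");          // (?), (?), (?), ...
}
try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
    for (int i = 0; i < messageslocal.size(); i++) {
        ps.setString(i + 1, messageslocal.get(i)); // one bind parameter per row
    }
    ps.executeUpdate();                            // a single round trip for the whole batch
}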
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | JeromeE
Solution 2 | Haris Osmanagić