Flink TableConfig setIdleStateRetention seems not to be working
I have a Kafka stream and a Hive table that I want to use as a lookup table to enrich the data coming from Kafka. The Hive table points to Parquet files in S3 and is updated once a day with an INSERT OVERWRITE statement, which means the older files under that S3 path are replaced by newer files once a day.
Every time the Hive table is updated, the new data from the Hive table is joined with the historical data from Kafka, which results in older Kafka data being republished. I see this is the expected behaviour from this link.
I tried to set an idle state retention of 2 days, as shown in my job code below, but it looks like Flink is not honoring it and seems to keep all of the Kafka records in the table state. I was expecting only the last 2 days of data to be republished when the Hive table is updated. My job has been running for one month and, instead, I see records as old as one month still being sent in the output. I think this will make the state grow forever and might result in an out-of-memory exception at some point.
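For reference, my understanding (an assumption on my part, I believe it holds for Flink 1.13+) is that setIdleStateRetention is a shortcut for the table.exec.state.ttl option, so the following minimal sketch should be an equivalent way of configuring the same 2-day retention:

// Sketch only, assuming setIdleStateRetention maps to table.exec.state.ttl (Flink 1.13+).
tableEnv.getConfig().getConfiguration().setString("table.exec.state.ttl", "2 d"); // 2 days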
One possible reason I can think of is that Flink keeps the state of the Kafka data keyed by the sales_customer_id field, because that is the field used to join with the Hive table, and as soon as another sale arrives for that customer id, the state expiry is extended for another 2 days (see the TTL sketch after my job code below for what I mean). I am not sure whether this is the reason, but I wanted to check with a Flink expert on what the possible problem could be here.
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
TableConfig tableConfig = tableEnv.getConfig();
Configuration configuration = tableConfig.getConfiguration();

// Keep idle state for at most 2 days.
tableConfig.setIdleStateRetention(Duration.ofHours(24 * 2));
// Needed so the OPTIONS hint in the SQL query below is allowed.
configuration.setString("table.dynamic-table-options.enabled", "true");

DataStream<Sale> salesDataStream = ....;
Table salesTable = tableEnv.fromDataStream(salesDataStream);

// Read the Hive table as a streaming source so the daily INSERT OVERWRITE is picked up.
Table customerTable = tableEnv.sqlQuery("select * from my_schema.customers" +
        " /*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.partition-order'='create-time') */");

// Regular (non-temporal) left outer join between the Kafka stream and the Hive table.
Table resultTable = salesTable.leftOuterJoin(customerTable, $("sales_customer_id").isEqual($("customer_id")));

DataStream<Sale> salesWithCustomerInfoDataStream =
        tableEnv.toRetractStream(resultTable, Row.class).map(new RowToSaleFunction());
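To illustrate what I mean by the expiry being extended: the sketch below uses the DataStream state TTL API (illustration only, not code from my job; Sale is my own POJO) to show the distinction I have in mind between OnCreateAndWrite, where the 2-day clock restarts only when an entry is written, and OnReadAndWrite, where every read of a key also restarts it.

// Illustrative sketch of keyed-state TTL update semantics (DataStream API).
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(2))
        // OnCreateAndWrite: the 2-day clock restarts only when the entry is written.
        // OnReadAndWrite: the clock also restarts on every read of the entry.
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

ValueStateDescriptor<Sale> lastSaleState = new ValueStateDescriptor<>("lastSale", Sale.class);
lastSaleState.enableTimeToLive(ttlConfig);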
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0.