How do I avoid data loss when using Kafka over Kubernetes and one of the nodes fails?
My application runs on a Kubernetes cluster of 3 nodes and uses Kafka to stream data. I am trying to check my system's ability to recover from node failure, so I deliberately fail one of the nodes for 1 minute.
Around 50% of the time, I experience the loss of a single data record after the node failure. If the controller Kafka broker was running on the failed node, I see that a new controller broker is elected as expected. When the data loss occurs, I see the following error in the new controller broker's log:
ERROR [Controller id=2 epoch=13] Controller 2 epoch 13 failed to change state for partition __consumer_offsets-45 from OfflinePartition to OnlinePartition (state.change.logger) [controller-event-thread]
I am not sure if that's the problem, but searching the web for information about this error made me suspect that I need to configure Kafka with more than 1 replica for each topic. This is what my topics/partitions/replicas configuration looks like:
My questions: Is my suspicion that more replicas are required correct?
If yes, how do I increase the number of topic replicas? I played around with a few broker parameters such as default.replication.factor and replication.factor, but I did not see the number of replicas change.
If no, what is the meaning of this error log?
Thanks!
Solution 1:[1]
Yes, if the broker hosting the single replica goes down, then you can expect that topic's partitions to go offline. If you have unclean leader election disabled, however, you shouldn't lose data that has already been persisted to the broker.
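As a minimal sketch of the settings involved (assuming the example topic one used in Solution 2 below, and assuming its replication factor has already been raised to 3), durability comes from a topic-level minimum ISR size, disabled unclean leader election, and producers that wait for every in-sync replica:

# Topic level: require at least 2 in-sync replicas and never elect an out-of-sync leader.
$ ./bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name one \
    --alter --add-config min.insync.replicas=2,unclean.leader.election.enable=false

# Producer side: acks=all means a record is only acknowledged once every
# in-sync replica has it, so losing a single node cannot lose the record.
$ ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic one \
    --producer-property acks=all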
To modify existing topics, you must use the kafka-reassign-partitions tool, not any of the broker settings, as those only apply to brand-new topics.
Kafka | Increase replication factor of multiple topics
Ideally, you should also disable auto topic creation, to force clients to use Topic CRD resources in Strimzi that include a replication factor; you can then use other k8s tools to verify that the topics have replication factors greater than 1.
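For example, with the Strimzi topic operator the replication factor can be declared in a KafkaTopic resource and checked from Kubernetes. This is a sketch under assumptions: the namespace kafka and the cluster name my-cluster are placeholders for whatever your Strimzi installation uses.

$ kubectl apply -f - <<EOF
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: one
  namespace: kafka                    # placeholder namespace
  labels:
    strimzi.io/cluster: my-cluster    # placeholder Kafka cluster name
spec:
  partitions: 3
  replicas: 3
EOF

# Verify the declared replication factor without touching the brokers:
$ kubectl get kafkatopic one -n kafka -o jsonpath='{.spec.replicas}'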
Solution 2:[2]
Yes, you're right: you need to set the replication factor to more than 1 to be able to sustain broker-level failures. Once you set this value as the default, new topics will be created with the configured number of replicas. For existing topics, however, you need to follow the walkthrough below.
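As a rough sketch of the broker-level default (assuming a plain server.properties deployment; with Strimzi the same key goes under spec.kafka.config), note that it only affects topics created afterwards. The topic name two is just an illustrative placeholder.

# server.properties
default.replication.factor=3

# A topic created after this change picks up 3 replicas automatically:
$ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic two --partitions 3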
Describe the topic
$ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic one
Topic: one  PartitionCount: 3  ReplicationFactor: 1  Configs: segment.bytes=1073741824
    Topic: one  Partition: 0  Leader: 1  Replicas: 1  Isr: 1
    Topic: one  Partition: 1  Leader: 0  Replicas: 0  Isr: 0
    Topic: one  Partition: 2  Leader: 2  Replicas: 2  Isr: 2
Create a JSON file with the topic reassignment details
$ cat >>increase.json <<EOF
{
  "version": 1,
  "partitions": [
    {"topic": "one", "partition": 0, "replicas": [0,1,2]},
    {"topic": "one", "partition": 1, "replicas": [1,0,2]},
    {"topic": "one", "partition": 2, "replicas": [2,1,0]}
  ]
}
EOF
Execute this reassignment plan
$ ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase.json --execute
Current partition replica assignment

{"version":1,"partitions":[{"topic":"one","partition":0,"replicas":[0,1,2],"log_dirs":["any","any"]},{"topic":"one","partition":1,"replicas":[1,0,2],"log_dirs":["any","any"]},{"topic":"one","partition":2,"replicas":[2,1,0],"log_dirs":["any","any"]}]}

Save this to use as the --reassignment-json-file option during rollback
Successfully started partition reassignments for one-0,one-1,one-2
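Optionally (not part of the original answer), the same tool can confirm that the reassignment has finished before you describe the topic again:

$ ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --reassignment-json-file increase.json --verify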
Describe the topic again
$ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic one
Topic: one  PartitionCount: 3  ReplicationFactor: 3  Configs: segment.bytes=1073741824
    Topic: one  Partition: 0  Leader: 0  Replicas: 0,1,2  Isr: 0,1,2
    Topic: one  Partition: 1  Leader: 1  Replicas: 1,0,2  Isr: 1,0,2
    Topic: one  Partition: 2  Leader: 2  Replicas: 2,1,0  Isr: 2,1,0
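Note that the error in the question refers to the internal __consumer_offsets topic, not a user topic. Setting offsets.topic.replication.factor=3 only helps when that topic is created for the first time; an existing single-replica offsets topic needs the same reassignment treatment. As a sketch (topics.json is a hypothetical file name), you can let the tool generate a plan for it and then feed the proposed assignment back with --execute as above:

$ cat >topics.json <<EOF
{ "version": 1, "topics": [ {"topic": "__consumer_offsets"} ] }
EOF

# Ask the tool to propose an assignment across brokers 0, 1 and 2:
$ ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --topics-to-move-json-file topics.json --broker-list "0,1,2" --generate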
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | OneCricketeer |
| Solution 2 | AP. |