How to Prevent Data Loss When a Stale ISR Replica Becomes Leader?

Context:

I’m working with Apache Kafka using the following setup:

  • replication.factor = 3
  • min.insync.replicas = 2
  • Producer is configured with acks=all
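For reference, a minimal producer sketch matching that setup (the bootstrap address and topic name are placeholders; the topic is assumed to already exist with replication.factor=3 and min.insync.replicas=2):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksAllProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader responds only after every in-sync replica has the record
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // .get() blocks until the broker acknowledges the write (or the send fails)
            producer.send(new ProducerRecord<>("test-topic", "key", "M1")).get();
        }
    }
}
```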

Let’s assume:

  • Broker 101 is the leader, and brokers 102 and 103 are followers.
  • ISR at time T0 is [101, 102, 103].
  • A new message M1 is written to Kafka. It gets replicated to 101 (leader) and 102 (follower) quickly.
  • 103 has not fetched the message yet, but it’s still in the ISR because replica.lag.time.max.ms hasn’t expired.

Now, suppose:

  • Broker 101 crashes suddenly.
  • Kafka elects a new leader from the current ISR → picks 103.
  • Since 103 never fetched M1, it becomes the leader, and the surviving follower 102 truncates its log back to the new leader’s high watermark (HW), resulting in the loss of M1 even though the producer already got a success ack!
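(A quick way to observe the leader and ISR while walking through this sequence is the AdminClient describeTopics API; a minimal sketch, assuming a recent Java client, with the bootstrap address and topic name as placeholders:)

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("test-topic"))
                    .allTopicNames().get().get("test-topic");
            // Each partition reports its current leader and ISR set
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```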

Problem:

This seems to violate the durability guarantees of acks=all. The message was acknowledged but lost because a stale ISR replica became leader.


My Questions:

  1. Is this behavior expected in Kafka’s current replication model?
  2. What’s the recommended way to prevent this type of data loss?
  • Tuning replica.lag.time.max.ms?
  • Matching min.insync.replicas to the replication.factor?
  • Any newer improvements in KRaft mode or Raft-based replication?
  3. Are there any known trade-offs between availability and durability when enforcing tighter ISR behavior?

This assumption seems off. Given that 103 has not replicated the message yet and is still in the ISR, the ack has not happened yet: with acks=all, 101 would only ack M1 back to the producer after both 102 and 103 have replicated the message.

Thus, if 103 were elected leader, the ack would never happen, and 102 would truncate M1, which is fine because the producer never got an ack. On the other hand, if 102 became leader, 103 could still replicate the message, and 102 would send the ack to the producer after 103 had received M1.
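If you still want to pin down the topic-level durability settings, this is the kind of change I’d make; a minimal AdminClient sketch (topic name and bootstrap address are placeholders, and note that unclean.leader.election.enable=false is already the default):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class DurabilitySettings {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test-topic");
            Collection<AlterConfigOp> ops = List.of(
                    // the leader acks only when at least 2 replicas (itself included) are in sync
                    new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"),
                            AlterConfigOp.OpType.SET),
                    // never elect an out-of-sync replica as leader (the default behavior)
                    new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```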
