Context:
I’m working with Apache Kafka using the following setup:
- `replication.factor = 3`
- `min.insync.replicas = 2`
- Producer is configured with `acks=all`
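For reference, the producer side looks roughly like this (a minimal sketch only; the bootstrap address, topic name, key, and serializers are placeholders, not my exact setup):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all, as described in the setup above
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Callback fires once the broker has acknowledged (or rejected) the write
            producer.send(new ProducerRecord<>("my-topic", "key", "M1"), (metadata, exception) -> {
                if (exception != null) {
                    System.err.println("send failed: " + exception);
                } else {
                    System.out.printf("acked: partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```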
Let’s assume:
- Broker 101 is the leader, and brokers 102 and 103 are followers.
- The ISR at time `T0` is `[101, 102, 103]`.
- A new message `M1` is written to Kafka. It gets replicated to 101 (leader) and 102 (follower) quickly.
- However, 103 has not fetched the message yet, but it is still in the ISR because `replica.lag.time.max.ms` hasn't expired.
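To make that ISR state concrete, this is roughly how ISR membership can be inspected with the AdminClient (a sketch; the topic name and bootstrap address are placeholders, and `allTopicNames()` assumes Kafka clients 3.1+):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("my-topic"))
                    .allTopicNames().get().get("my-topic");
            // Print the current leader and ISR for each partition
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```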
Now, suppose:
- Broker 101 crashes suddenly.
- Kafka elects a new leader from the current ISR → picks 103.
- Since 103 never fetched message `M1`, it becomes leader and truncates its log to the last known high watermark (HW), resulting in the loss of `M1`, even though the producer already received a success ack!
Problem:
This seems to violate the durability guarantee of `acks=all`. The message was acknowledged but lost because a stale ISR member became leader.
My Questions:
- Is this behavior expected in Kafka’s current replication model?
- What’s the recommended way to prevent this type of data loss?
  - Tuning `replica.lag.time.max.ms`?
  - Matching `min.insync.replicas` to the `replication.factor`? (see the config sketch after this list)
  - Any newer improvements in KRaft mode or Raft-based replication?
- Are there any known trade-offs between availability and durability when enforcing tighter ISR behavior?
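For concreteness, the "match `min.insync.replicas` to the `replication.factor`" option would amount to something like the following topic-level override (a hypothetical sketch; the topic name and bootstrap address are placeholders, and I'm not claiming this is the recommended fix, that's part of the question):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TightenIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // Raise min.insync.replicas to match replication.factor = 3
            AlterConfigOp raiseMinIsr = new AlterConfigOp(
                    new ConfigEntry("min.insync.replicas", "3"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, Collections.singletonList(raiseMinIsr)))
                    .all().get();
        }
    }
}
```

My understanding is that with `min.insync.replicas = 3` and `acks=all`, produce requests are rejected whenever any replica drops out of the ISR, so a single broker outage blocks writes entirely, which is why I'm asking about the availability-vs-durability trade-off above.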