Hey Everyone,
We recently faced an issue with one of our Apache Kafka brokers due to a disk failure. After performing maintenance, we added the broker back to the cluster using the same broker ID.
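For context, the broker's identity is pinned by the meta.properties file in its log directory, so the rebuilt node rejoins as ID 24 with an otherwise empty data directory and presumably has to re-replicate all of its partitions from the leaders, which would explain the sustained disk I/O. A rough sketch of that file (the path and cluster.id below are placeholders, not our actual values):

# /data/kafka-logs/meta.properties (illustrative)
version=0
broker.id=24
cluster.id=<our-cluster-id>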
Once re-added, the broker showed a continuous increase in disk I/O and load average. The ISR for affected partitions keeps shrinking and expanding repeatedly, and it has not stabilized despite waiting more than six hours. This is preventing producers and consumers from publishing and fetching data effectively.
Here are some of the recurring ISR shrink/expand log entries we observed on the leader (broker 24):
[2025-08-21 17:37:30,030] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,28,0 to 24,28. Leader: (highWatermark: 355649, endOffset: 355650). Out of sync replicas: (brokerId: 0, endOffset: 355649). (kafka.cluster.Partition)
[2025-08-21 17:39:23,293] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,28 to 24,28,0 (kafka.cluster.Partition)
[2025-08-21 18:23:26,325] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,28,0 to 24. Leader: (highWatermark: 355650, endOffset: 355651). Out of sync replicas: (brokerId: 28, endOffset: 355650) (brokerId: 0, endOffset: 355650). (kafka.cluster.Partition)
[2025-08-21 18:32:51,103] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24 to 24,0 (kafka.cluster.Partition)
[2025-08-21 18:41:24,220] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,0 to 24,0,28 (kafka.cluster.Partition)
[2025-08-21 18:44:19,693] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,0,28 to 24. Leader: (highWatermark: 355652, endOffset: 355653). Out of sync replicas: (brokerId: 0, endOffset: 355652) (brokerId: 28, endOffset: 355652). (kafka.cluster.Partition)
[2025-08-21 19:23:30,255] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24 to 24,0 (kafka.cluster.Partition)
[2025-08-21 19:23:30,789] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,0 to 24,0,28 (kafka.cluster.Partition)
[2025-08-21 20:13:47,365] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,0,28 to 24. Leader: (highWatermark: 355653, endOffset: 355654). Out of sync replicas: (brokerId: 0, endOffset: 355653) (brokerId: 28, endOffset: 355653). (kafka.cluster.Partition)
[2025-08-21 20:29:34,474] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24 to 24,28 (kafka.cluster.Partition)
[2025-08-21 20:29:34,524] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,28 to 24,28,0 (kafka.cluster.Partition)
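For reference, this is how we are spotting the affected partitions and the follower lag (a minimal sketch; broker hostnames and the JMX port are placeholders for our environment):

# Partitions whose ISR is currently smaller than the full replica set
bin/kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --under-replicated-partitions

# Maximum follower lag on the re-added broker via the stock JMX tool
# (assumes JMX is exposed on port 9999)
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker0:9999/jmxrmi \
  --object-name 'kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica'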
We also noticed controlled shutdown retries on the broker:
[2025-08-20 14:46:30,815] INFO [KafkaServer id=24] Remaining partitions to move: [RemainingPartition(topicName='hkg-production.event-acc-log-20220517', partitionIndex=35)] (kafka.server.KafkaServer)
[2025-08-20 14:46:30,815] INFO [KafkaServer id=24] Error from controller: NONE (kafka.server.KafkaServer)
[2025-08-20 14:46:35,815] WARN [KafkaServer id=24] Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
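In case it helps with diagnosis, the current controller and the stuck partition's leader/ISR state can be read straight from ZooKeeper (a minimal sketch; the ZooKeeper address is a placeholder):

# Which broker currently holds the controller role
bin/zookeeper-shell.sh zk1:2181 get /controller

# Leader/ISR state for the partition left over from the controlled shutdown
bin/zookeeper-shell.sh zk1:2181 \
  get /brokers/topics/hkg-production.event-acc-log-20220517/partitions/35/state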
Has anyone experienced similar ISR flapping after re-adding a broker with the same broker ID? Any guidance or best practices for stabilizing the broker and preventing continuous ISR churn would be greatly appreciated.
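For reference, these are the broker settings that govern ISR membership and replica fetching, shown with their 2.3.0 defaults (advice on whether tuning any of these is appropriate here would also help):

# server.properties (values shown are the 2.3.0 defaults)
replica.lag.time.max.ms=10000    # follower is dropped from the ISR if it has not caught up within this window
num.replica.fetchers=1           # fetcher threads used to replicate from each source broker
replica.fetch.max.bytes=1048576  # bytes to attempt to fetch per partition per request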
Apache Kafka version: 2.3.0