Hey Everyone,
We recently faced an issue with one of our Apache Kafka brokers due to a disk failure. After performing maintenance, we added the broker back to the cluster using the same broker ID.
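For context, the broker's identity is pinned by the meta.properties file in its log directory, so the rebuilt node rejoins as ID 24 with an otherwise empty data directory and presumably has to re-replicate all of its partitions from the leaders, which would explain the sustained disk I/O. A rough sketch of that file (the path and cluster.id below are placeholders, not our actual values):

# /data/kafka-logs/meta.properties (illustrative)
version=0
broker.id=24
cluster.id=<our-cluster-id>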
Once re-added, the broker showed a continuous increase in disk I/O and load average. The ISR for affected partitions keeps shrinking and expanding repeatedly, and it has not stabilized despite waiting more than six hours. This is preventing producers and consumers from publishing and fetching data effectively.
Here are some of the recurring ISR shrink/expand log entries we observed on the leader (broker 24):
[2025-08-21 17:37:30,030] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,28,0 to 24,28. Leader: (highWatermark: 355649, endOffset: 355650). Out of sync replicas: (brokerId: 0, endOffset: 355649). (kafka.cluster.Partition)
[2025-08-21 17:39:23,293] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,28 to 24,28,0 (kafka.cluster.Partition)
[2025-08-21 18:23:26,325] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,28,0 to 24. Leader: (highWatermark: 355650, endOffset: 355651). Out of sync replicas: (brokerId: 28, endOffset: 355650) (brokerId: 0, endOffset: 355650). (kafka.cluster.Partition)
[2025-08-21 18:32:51,103] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24 to 24,0 (kafka.cluster.Partition)
[2025-08-21 18:41:24,220] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,0 to 24,0,28 (kafka.cluster.Partition)
[2025-08-21 18:44:19,693] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,0,28 to 24. Leader: (highWatermark: 355652, endOffset: 355653). Out of sync replicas: (brokerId: 0, endOffset: 355652) (brokerId: 28, endOffset: 355652). (kafka.cluster.Partition)
[2025-08-21 19:23:30,255] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24 to 24,0 (kafka.cluster.Partition)
[2025-08-21 19:23:30,789] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,0 to 24,0,28 (kafka.cluster.Partition)
[2025-08-21 20:13:47,365] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Shrinking ISR from 24,0,28 to 24. Leader: (highWatermark: 355653, endOffset: 355654). Out of sync replicas: (brokerId: 0, endOffset: 355653) (brokerId: 28, endOffset: 355653). (kafka.cluster.Partition)
[2025-08-21 20:29:34,474] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24 to 24,28 (kafka.cluster.Partition)
[2025-08-21 20:29:34,524] INFO [Partition af-check-json-hkg-20221012-29 broker=24] Expanding ISR from 24,28 to 24,28,0 (kafka.cluster.Partition)
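For reference, this is how we are spotting the affected partitions and the follower lag (a minimal sketch; broker hostnames and the JMX port are placeholders for our environment):

# Partitions whose ISR is currently smaller than the full replica set
bin/kafka-topics.sh --bootstrap-server broker1:9092 \
  --describe --under-replicated-partitions

# Maximum follower lag on the re-added broker via the stock JMX tool
# (assumes JMX is exposed on port 9999)
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker0:9999/jmxrmi \
  --object-name 'kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica'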
We also noticed controlled shutdown retries on the broker:
[2025-08-20 14:46:30,815] INFO [KafkaServer id=24] Remaining partitions to move: [RemainingPartition(topicName='hkg-production.event-acc-log-20220517', partitionIndex=35)] (kafka.server.KafkaServer)
[2025-08-20 14:46:30,815] INFO [KafkaServer id=24] Error from controller: NONE (kafka.server.KafkaServer)
[2025-08-20 14:46:35,815] WARN [KafkaServer id=24] Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
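In case it helps with diagnosis, the current controller and the stuck partition's leader/ISR state can be read straight from ZooKeeper (a minimal sketch; the ZooKeeper address is a placeholder):

# Which broker currently holds the controller role
bin/zookeeper-shell.sh zk1:2181 get /controller

# Leader/ISR state for the partition left over from the controlled shutdown
bin/zookeeper-shell.sh zk1:2181 \
  get /brokers/topics/hkg-production.event-acc-log-20220517/partitions/35/state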
Has anyone experienced similar ISR flapping after re-adding a broker with the same broker ID? Any guidance or best practices for stabilizing the broker and preventing continuous ISR churn would be greatly appreciated.
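For reference, these are the broker settings that govern ISR membership and replica fetching, shown with their 2.3.0 defaults (advice on whether tuning any of these is appropriate here would also help):

# server.properties (values shown are the 2.3.0 defaults)
replica.lag.time.max.ms=10000    # follower is dropped from the ISR if it has not caught up within this window
num.replica.fetchers=1           # fetcher threads used to replicate from each source broker
replica.fetch.max.bytes=1048576  # bytes to attempt to fetch per partition per request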
Apache Kafka version: 2.3.0