Hello! We have a 3-broker Kafka cluster (KRaft); the brokers and KRaft controllers run on the same nodes:
CPU: 16
RAM: 32GB
We have 2241 topics and 107262 online partitions, with 23652 client connections. The Kafka version is 3.6.1.
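For reference, the counts above come from cluster metadata. A minimal sketch of how we pull them, assuming the confluent-kafka Python client and with broker1:9092 as a placeholder bootstrap address:

from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker1:9092"})
md = admin.list_topics(timeout=30)  # one metadata snapshot for the whole cluster

partition_count = sum(len(t.partitions) for t in md.topics.values())
print("brokers:", len(md.brokers), "controller id in metadata:", md.controller_id)
print("topics:", len(md.topics), "partitions:", partition_count)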
Yesterday we had trouble from 12:08 to 12:11.
We saw a lot of logs on all brokers indicating connection troubles (inter-node).
Here are the logs from the 1st broker:
[2024-11-29 13:08:35,199] INFO [Partition communication.notificationmanager.sendnotification.in-42 broker=0] Shrinking ISR from 1,0,2 to 0,2. Leader: (highWatermark: ...
[2024-11-29 13:08:35,207] INFO [Partition colvir.cliaqtokafka.cliaqtokafka.out-24 broker=0] Shrinking ISR from 2,1,0 to 0. Leader: (highWatermark: 655805, endOffset: ...
[2024-11-29 13:08:45,244] INFO [Partition communication.notificationmanager.sendnotification.in-42 broker=0] ISR updated to 0,2 and version updated to 57 ...
[2024-11-29 13:08:45,273] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(communication.notificationmanager.sendnotification.in-...
[2024-11-29 13:08:45,273] INFO [Partition colvir.cliaqtokafka.cliaqtokafka.out-24 broker=0] ISR updated to 0 (under-min-isr) and version updated to 59 (kafka...
[2024-11-29 13:08:45,460] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(colvir.cliaqtokafka.cliaqtokafka.out-24) (kafka.server...
[2024-11-29 13:08:53,352] INFO [GroupCoordinator 0]: Member consumer-143-52312bed-6b44-41de-b717-9ab7c86fdaa in group MIB3.0_PROD has failed, removing it f...
[2024-11-29 13:08:53,352] INFO [GroupCoordinator 0]: Preparing to rebalance group MIB3.0_PROD in state PreparingRebalance with old generation 2845 (__consum...
[2024-11-29 13:08:53,352] INFO [GroupCoordinator 0]: Group hi with generation 2846 is now empty (__consumer_offsets-5) (kafka.coordinator.group.Gro...
[2024-11-29 13:08:53,718] INFO [Partition __consumer_offsets-5 broker=0] Shrinking ISR from 0,2,1 to 0. Leader: (highWatermark: 32463865, endOffset: 32463866). Out of ...
[2024-11-29 13:08:53,747] INFO [Partition __consumer_offsets-5 broker=0] ISR updated to 0 (under-min-isr) and version updated to 137 (kafka.cluster.Partition)
[2024-11-29 13:08:53,756] WARN [GroupCoordinator 0]: Failed to write empty metadata for group hi: The coordinator is not available. (kafka.coordinator.gro...
[2024-11-29 13:08:53,964] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set(__consumer_offsets-5) (kafka.server.ReplicaFetcherMana...
[2024-11-29 13:08:53,964] INFO [GroupCoordinator 0]: Elected as the group coordinator for partition 5 in epoch 12 (kafka.coordinator.group.GroupCoordinator)
[2024-11-29 13:08:53,964] INFO [GroupMetadataManager brokerId=0] Scheduling loading of offsets and group metadata from __consumer_offsets-5 for epoch 12 (ka...
[2024-11-29 13:08:53,965] INFO [GroupMetadataManager brokerId=0] Already loading offsets and group metadata from __consumer_offsets-5 (kafka.coordinator.group.Gro...
[2024-11-29 13:09:03,047] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2024-11-29 13:09:03,047] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Cancelled in-flight FETCH request with correlation id 34677641 due to node 1 being disconnected ...
[2024-11-29 13:09:03,047] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Client requested connection close from node 1 (org.apache.kafka.clients.NetworkClient)
[2024-11-29 13:09:03,048] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error sending fetch request (sessionId=660052978, epoch=34677641) to node 1: ...
java.io.IOException: Connection to 1 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:99)
        at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
        at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:79)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:316)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
        at scala.Option.foreach(Option.scala:437)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.server.ReplicaFetcherThread.doWork(ReplicaFetcherThread.scala:98)
        at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
[2024-11-29 13:09:03,050] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=2005...
java.io.IOException: Connection to 1 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:99)
        at kafka.server.BrokerBlockingSender.sendRequest(BrokerBlockingSender.scala:113)
        at kafka.server.RemoteLeaderEndPoint.fetch(RemoteLeaderEndPoint.scala:79)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:316)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
        at scala.Option.foreach(Option.scala:437)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
        at kafka.server.ReplicaFetcherThread.doWork(ReplicaFetcherThread.scala:98)
        at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
[2024-11-29 13:09:05,589] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2024-11-29 13:09:05,589] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Cancelled in-flight FETCH request with correlation id 38762087 due to node 2 being disconnected ...
[2024-11-29 13:09:05,589] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Client requested connection close from node 2 (org.apache.kafka.clients.NetworkClient)
[2024-11-29 13:09:05,590] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error sending fetch request (sessionId=1014113766, epoch=38762087) to node 2: ...
java.io.IOException: Connection to 2 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:99)
These same logs appeared on the remaining 2 broker nodes. It seems the Kafka brokers lost connection to each other, but SSH and other traffic to/from the nodes kept working, and on the network side there were no problems. What else could have caused the loss of connection between all the brokers? Kafka itself kept running, we didn't reboot it, and the problem resolved on its own.
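In case a broker-side setting is relevant, here is a sketch of how we can dump the fetcher- and connection-related broker configs (again with the confluent-kafka Python client; broker id "0" and the bootstrap address are placeholders, and the listed names are just the standard broker configs we would look at first):

from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker1:9092"})
resource = ConfigResource(ConfigResource.Type.BROKER, "0")  # broker id as a string

# describe_configs() returns one future per requested resource
for res, future in admin.describe_configs([resource]).items():
    configs = future.result()  # dict: config name -> ConfigEntry
    for name in ("replica.lag.time.max.ms", "replica.socket.timeout.ms",
                 "replica.fetch.wait.max.ms", "num.replica.fetchers",
                 "num.network.threads", "num.io.threads", "max.connections"):
        entry = configs.get(name)
        print(name, "=", entry.value if entry else "<not returned>")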
There is no high load on the brokers in terms of CPU, RAM, or I/O, but the first broker has a load average twice as high as the other 2 nodes (the first broker is not the KRaft leader!).
But I don't see any problems related to Kafka/KRaft in the logs, only connection issues.
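To check whether the first broker simply hosts more partition leaders than the other two (which could explain the higher load average), here is a sketch that counts leaders per broker and under-replicated partitions from the same metadata; same assumptions as above about the client and the bootstrap address:

from collections import Counter
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker1:9092"})
md = admin.list_topics(timeout=30)

leaders = Counter()
under_replicated = 0
for topic in md.topics.values():
    for p in topic.partitions.values():
        leaders[p.leader] += 1              # broker id of the current leader (-1 if none)
        if len(p.isrs) < len(p.replicas):   # ISR smaller than the full replica set
            under_replicated += 1

print("partition leaders per broker:", dict(leaders))
print("under-replicated partitions right now:", under_replicated)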