Hello,
We are experiencing timeout issues in our staging single-node Kafka cluster running in KRaft combined mode (broker and controller in the same process). The cluster is hitting repeated disconnections and timeout errors, particularly on BROKER_HEARTBEAT requests.
Here are the logs from the broker:
[2025-01-12 06:41:19,297] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237778 due to node 1 being disconnected (elapsed time since creation: 4505ms, elapsed time since send: 4505ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:23,900] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:23,901] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237779 due to node 1 being disconnected (elapsed time since creation: 4503ms, elapsed time since send: 4502ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:31,223] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:31,223] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237783 due to node 1 being disconnected (elapsed time since creation: 4505ms, elapsed time since send: 4505ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:53,014] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:53,014] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237789 due to node 1 being disconnected (elapsed time since creation: 4503ms, elapsed time since send: 4503ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
I came across a post about KAFKA-14392 which mentions that KRaft brokers maintain their liveness in the cluster by sending BROKER_HEARTBEAT requests to the active controller. If a broker does not send a heartbeat within the period defined by broker.session.timeout.ms, the active controller fences the broker.
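To make sure I understand the timing, here is the back-of-the-envelope arithmetic I did with the values from the settings I list just below (only a sanity check based on my reading of the docs, nothing authoritative):

# Rough sanity check of my heartbeat/session timing (values from my settings below).
broker_heartbeat_interval_ms = 5000   # broker.heartbeat.interval.ms
broker_session_timeout_ms = 9000      # broker.session.timeout.ms

# How many heartbeat intervals fit into one session window before the
# active controller would fence the broker.
print(broker_session_timeout_ms / broker_heartbeat_interval_ms)   # -> 1.8

If I read this correctly, the controller tolerates little more than a single delayed heartbeat with these values.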
I have already updated the timeout values in the configuration, but the issue still persists. The current settings are:
controller.quorum.request.timeout.ms=5000
controller.quorum.election.timeout.ms=3000
broker.session.timeout.ms=9000
controller.socket.timeout.ms=30000
controller.quorum.fetch.timeout.ms=5000
request.timeout.ms=40000
connections.max.idle.ms=540000
rebalance.timeout.ms=60000
broker.heartbeat.interval.ms=5000
worker.sync.timeout.ms=3000
worker.unsync.backoff.ms=300000
reconnect.backoff.max.ms=1000
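To confirm the broker actually picked these values up, I read them back with confluent-kafka's AdminClient. This is only a minimal sketch: the confluent-kafka package is my choice, and the credentials and certificate paths are placeholders for my real ones.

from confluent_kafka.admin import AdminClient, ConfigResource

# SASL_SSL connection settings; secrets and certificate paths are placeholders.
admin = AdminClient({
    "bootstrap.servers": "broker01.abc.net:9094",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "XXXXXX",
    "sasl.password": "YYYYYY",
    "ssl.ca.location": "/etc/kafka/secrets/ca-cert.pem",
    # client cert/key placeholders, since the broker sets ssl.client.auth=required
    "ssl.certificate.location": "/etc/kafka/secrets/client-cert.pem",
    "ssl.key.location": "/etc/kafka/secrets/client-key.pem",
})

# Ask broker 1 for its effective configuration and print the timeout-related keys.
resource = ConfigResource(ConfigResource.Type.BROKER, "1")
configs = admin.describe_configs([resource])[resource].result()
for key in ("broker.session.timeout.ms", "broker.heartbeat.interval.ms", "request.timeout.ms"):
    entry = configs.get(key)
    print(key, "=", entry.value if entry else "<not returned>")

(I query the broker rather than grepping server.properties so that I see the values it is actually running with.)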
I also found another post regarding quorum voters in KRaft clusters. It suggested reviewing the controller.quorum.voters configuration and making sure it is set in the format {id}@{host}:{port}. I verified this in the documentation, and for a cluster with three nodes it should look like:
KAFKA_CONTROLLER_QUORUM_VOTERS=1@broker01.abc.net:29093,2@broker02.abc.net:29093,3@broker03.abc.net:29093
In my case I only have one node running (hence controller.quorum.voters=1@broker01.abc.net:29093), so I updated the server.properties configuration and restarted the Kafka container. The key change was increasing request.timeout.ms from 40000 to 60000. Here is the line I changed:
# increased from 40000
request.timeout.ms=60000
Additionally, I ran a Python script to produce and consume messages on the test-topic topic of this broker. The broker stopped after several requests, leaving the controller waiting for the broker to restart, since there is no other node to take over as leader.
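For reference, the script boils down to roughly the following (rewritten as a minimal sketch with confluent-kafka; the credentials and certificate paths are placeholders for my real ones):

from confluent_kafka import Producer, Consumer

# SASL_SSL connection settings; secrets and certificate paths are placeholders.
common = {
    "bootstrap.servers": "broker01.abc.net:9094",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "XXXXXX",
    "sasl.password": "YYYYYY",
    "ssl.ca.location": "/etc/kafka/secrets/ca-cert.pem",
    "ssl.certificate.location": "/etc/kafka/secrets/client-cert.pem",
    "ssl.key.location": "/etc/kafka/secrets/client-key.pem",
}

# Produce a batch of test messages to test-topic.
producer = Producer(common)
for i in range(100):
    producer.produce("test-topic", value=f"message-{i}".encode())
producer.flush()

# Consume them back; poll() starts returning None once the broker stops responding.
consumer = Consumer({**common, "group.id": "timeout-test", "auto.offset.reset": "earliest"})
consumer.subscribe(["test-topic"])
try:
    for _ in range(100):
        msg = consumer.poll(5.0)
        if msg is None:
            print("poll timed out")
        elif msg.error():
            print("consumer error:", msg.error())
        else:
            print("got:", msg.value())
finally:
    consumer.close()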
The broker logs show several BROKER_HEARTBEAT messages being cancelled when the broker is stopped and restarted, indicating that the issue might be related to the request and session timeouts while the broker is temporarily unavailable.
Here is the message I found in the broker logs:
Cancelled in-flight BROKER_HEARTBEAT request with correlation id 2130 due to node 1
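To get a feel for how often this happens, I count the cancelled heartbeats and their elapsed times with a small script (the log path below is simply where my container writes the server log):

import re

# Matches the cancellation lines shown above and captures the elapsed time since creation.
pattern = re.compile(
    r"Cancelled in-flight BROKER_HEARTBEAT request with correlation id (\d+)"
    r".*?elapsed time since creation: (\d+)ms"
)

elapsed = []
with open("/var/log/kafka/server.log") as log:   # path is specific to my setup
    for line in log:
        match = pattern.search(line)
        if match:
            elapsed.append(int(match.group(2)))

if elapsed:
    print(f"{len(elapsed)} cancelled heartbeats, "
          f"min {min(elapsed)}ms, max {max(elapsed)}ms, "
          f"avg {sum(elapsed) / len(elapsed):.0f}ms")
else:
    print("no cancelled BROKER_HEARTBEAT requests found")

In my runs the elapsed times all land just above the 4500 ms request timeout visible in the log lines at the top of this post.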
Finally, I noticed that the replication factor for the offsets topic and the transaction state log was set to 1. To test whether increasing the replication factor helps, I changed these values to 3 in the configuration:
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
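As far as I understand, these two settings apply to the internal __consumer_offsets and __transaction_state topics, so I check what the broker actually reports for them via the AdminClient metadata (same placeholder connection settings as in the sketches above):

from confluent_kafka.admin import AdminClient

# Same SASL_SSL connection settings as above; secrets and certificate paths are placeholders.
admin = AdminClient({
    "bootstrap.servers": "broker01.abc.net:9094",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "XXXXXX",
    "sasl.password": "YYYYYY",
    "ssl.ca.location": "/etc/kafka/secrets/ca-cert.pem",
    "ssl.certificate.location": "/etc/kafka/secrets/client-cert.pem",
    "ssl.key.location": "/etc/kafka/secrets/client-key.pem",
})

# Print the replica list of every partition of the two internal topics.
metadata = admin.list_topics(timeout=10)
for name in ("__consumer_offsets", "__transaction_state"):
    topic = metadata.topics.get(name)
    if topic is None:
        print(name, "does not exist yet")
        continue
    for pid, partition in sorted(topic.partitions.items()):
        print(f"{name} partition {pid}: replicas={partition.replicas}")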
server.properties:
sasl.enabled.mechanisms=PLAIN
sasl.mechanism.inter.broker.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="XXXXXX" password="YYYYYY";
super.users=User:admin
listeners=SASL_SSL://:9094,SSL://:9093,CONTROLLER://:29093
advertised.listeners=SASL_SSL://broker01.abc.net:9094,SSL://broker01.abc.net:9093
spring.kafka.bootstrap-servers=broker01.abc.net:9094
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
ssl.client.auth=required
ssl.key.password=XXXXXXX
ssl.keystore.location=/etc/kafka/secrets/kafka.keystore.jks
ssl.keystore.password=XXXXXX
ssl.truststore.location=/etc/kafka/secrets/kafka.truststore.jks
ssl.truststore.password=XXXXXX
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
group.initial.rebalance.delay.ms=0
process.roles=broker,controller
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
num.network.threads=3
num.io.threads=8
queued.max.requests=500
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.selector.class=kafka.server.SimplePartitioner
ssl.endpoint.identification.algorithm=
controller.quorum.request.timeout.ms=5000
controller.quorum.election.timeout.ms=3000
broker.session.timeout.ms=9000
controller.socket.timeout.ms=30000
controller.quorum.fetch.timeout.ms=5000
request.timeout.ms=60000
connections.max.idle.ms=540000
rebalance.timeout.ms=60000
broker.heartbeat.interval.ms=10000
worker.sync.timeout.ms=3000
worker.unsync.backoff.ms=300000
reconnect.backoff.max.ms=1000
kafka.properties:
replica.fetch.max.bytes=1152921504
ssl.keystore.filename=kafka.keystore.jks
super.users=User:admin
default.replication.factor=1
transaction.state.log.min.isr=1
ssl.key.credentials=kafka_ssl_key_creds
process.roles=broker,controller
security.inter.broker.protocol=SASL_SSL
controller.listener.names=CONTROLLER
controller.quorum.voters=1@broker01.abc.net:29093
message.max.bytes=1152921504
auto.create.topics.enable=false
ssl.key.password=XXXXXX
ssl.truststore.password=XXXXXX
ssl.keystore.type=JKS
log.retention.ms=604800000
advertised.listeners=SASL_SSL://broker01.abc.net:9094,SSL://broker01.abc.net:9093
sasl.enabled.mechanisms=PLAIN
kraft.mode=true
log4j.logger.org.apache.kafka=DEBUG
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
ssl.truststore.filename=kafka.truststore.jks
fetch.message.max.bytes=1152921504
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
ssl.truststore.credentials=kafka_truststore_creds
log.retention.hours=168
broker.id=1
early.start.listeners=CONTROLLER
broker.rack=dev-Testing
ssl.keystore.password=XXXXXXX
min.insync.replicas=1
transaction.state.log.replication.factor=1
listeners=SASL_SSL://:9094,SSL://:9093,CONTROLLER://:29093
ssl.keystore.location=/etc/kafka/secrets/kafka.keystore.jks
zookeeper.connect=
sasl.mechanism.inter.broker.protocol=PLAIN
ssl.truststore.location=/etc/kafka/secrets/kafka.truststore.jks
ssl.endpoint.identification.algorithm=
ssl.truststore.type=JKS
log.dirs=/var/lib/kafka/data
offsets.topic.replication.factor=1
allow.everyone.if.no.acl.found=true
ssl.client.auth=required
ssl.keystore.credentials=kafka_keystore_creds
Any suggestions or insights would be greatly appreciated!