Hello,
We are experiencing timeout issues in our staging single-node Kafka cluster running in KRaft combined mode (broker and controller in the same process). The cluster is hitting repeated disconnections and timeout errors, particularly on BROKER_HEARTBEAT requests.
Here are the logs from the broker:
[2025-01-12 06:41:19,297] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237778 due to node 1 being disconnected (elapsed time since creation: 4505ms, elapsed time since send: 4505ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:23,900] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:23,901] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237779 due to node 1 being disconnected (elapsed time since creation: 4503ms, elapsed time since send: 4502ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:31,223] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:31,223] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237783 due to node 1 being disconnected (elapsed time since creation: 4505ms, elapsed time since send: 4505ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:53,014] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Disconnecting from node 1 due to request timeout. (org.apache.kafka.clients.NetworkClient)
[2025-01-12 06:41:53,014] INFO [NodeToControllerChannelManager id=1 name=heartbeat] Cancelled in-flight BROKER_HEARTBEAT request with correlation id 237789 due to node 1 being disconnected (elapsed time since creation: 4503ms, elapsed time since send: 4503ms, request timeout: 4500ms) (org.apache.kafka.clients.NetworkClient)
I came across a post about KAFKA-14392 which mentions that KRaft brokers maintain their liveness in the cluster by sending BROKER_HEARTBEAT requests to the active controller. If a broker does not send a heartbeat within the period defined by broker.session.timeout.ms, the active controller fences the broker.
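To make sure I understand the timing, here is the back-of-the-envelope arithmetic I did with the values from the settings I list just below (only a sanity check based on my reading of the docs, nothing authoritative):

# Rough sanity check of my heartbeat/session timing (values from my settings below).
broker_heartbeat_interval_ms = 5000   # broker.heartbeat.interval.ms
broker_session_timeout_ms = 9000      # broker.session.timeout.ms

# How many heartbeat intervals fit into one session window before the
# active controller would fence the broker.
print(broker_session_timeout_ms / broker_heartbeat_interval_ms)   # -> 1.8

If I read this correctly, the controller tolerates little more than a single delayed heartbeat with these values.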
I have already updated the timeout values in the configuration, but the issue still persists. The current settings are:
controller.quorum.request.timeout.ms=5000
controller.quorum.election.timeout.ms=3000
broker.session.timeout.ms=9000
controller.socket.timeout.ms=30000
controller.quorum.fetch.timeout.ms=5000
request.timeout.ms=40000
connections.max.idle.ms=540000
rebalance.timeout.ms=60000
broker.heartbeat.interval.ms=5000
worker.sync.timeout.ms=3000
worker.unsync.backoff.ms=300000
reconnect.backoff.max.ms=1000
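To confirm the broker actually picked these values up, I read them back with confluent-kafka's AdminClient. This is only a minimal sketch: the confluent-kafka package is my choice, and the credentials and certificate paths are placeholders for my real ones.

from confluent_kafka.admin import AdminClient, ConfigResource

# SASL_SSL connection settings; secrets and certificate paths are placeholders.
admin = AdminClient({
    "bootstrap.servers": "broker01.abc.net:9094",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "XXXXXX",
    "sasl.password": "YYYYYY",
    "ssl.ca.location": "/etc/kafka/secrets/ca-cert.pem",
    # client cert/key placeholders, since the broker sets ssl.client.auth=required
    "ssl.certificate.location": "/etc/kafka/secrets/client-cert.pem",
    "ssl.key.location": "/etc/kafka/secrets/client-key.pem",
})

# Ask broker 1 for its effective configuration and print the timeout-related keys.
resource = ConfigResource(ConfigResource.Type.BROKER, "1")
configs = admin.describe_configs([resource])[resource].result()
for key in ("broker.session.timeout.ms", "broker.heartbeat.interval.ms", "request.timeout.ms"):
    entry = configs.get(key)
    print(key, "=", entry.value if entry else "<not returned>")

(I query the broker rather than grepping server.properties so that I see the values it is actually running with.)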
I also found another post regarding quorum voters in KRaft clusters. It suggested reviewing the controller.quorum.voters configuration and making sure it is set in the format {id}@{host}:{port}. I verified this in the documentation, and for a cluster with three nodes it should look like:
KAFKA_CONTROLLER_QUORUM_VOTERS=1@broker01.abc.net:29093,2@broker02.abc.net:29093,3@broker03.abc.net:29093
In my case I only have one node running (hence controller.quorum.voters=1@broker01.abc.net:29093), so I updated the server.properties configuration and restarted the Kafka container. The key change was increasing request.timeout.ms from 40000 to 60000. Here is the line I changed:
# increased from 40000
request.timeout.ms=60000
Additionally, I ran a Python script to produce and consume messages on the test-topic topic of this broker. The broker stopped after several requests, leaving the controller waiting for the broker to restart, since there is no other node to take over as leader.
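For reference, the script boils down to roughly the following (rewritten as a minimal sketch with confluent-kafka; the credentials and certificate paths are placeholders for my real ones):

from confluent_kafka import Producer, Consumer

# SASL_SSL connection settings; secrets and certificate paths are placeholders.
common = {
    "bootstrap.servers": "broker01.abc.net:9094",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "XXXXXX",
    "sasl.password": "YYYYYY",
    "ssl.ca.location": "/etc/kafka/secrets/ca-cert.pem",
    "ssl.certificate.location": "/etc/kafka/secrets/client-cert.pem",
    "ssl.key.location": "/etc/kafka/secrets/client-key.pem",
}

# Produce a batch of test messages to test-topic.
producer = Producer(common)
for i in range(100):
    producer.produce("test-topic", value=f"message-{i}".encode())
producer.flush()

# Consume them back; poll() starts returning None once the broker stops responding.
consumer = Consumer({**common, "group.id": "timeout-test", "auto.offset.reset": "earliest"})
consumer.subscribe(["test-topic"])
try:
    for _ in range(100):
        msg = consumer.poll(5.0)
        if msg is None:
            print("poll timed out")
        elif msg.error():
            print("consumer error:", msg.error())
        else:
            print("got:", msg.value())
finally:
    consumer.close()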
The broker logs show several BROKER_HEARTBEAT messages being cancelled when the broker is stopped and restarted, indicating that the issue might be related to the request and session timeouts while the broker is temporarily unavailable.
Here is the message I found in the broker logs:
Cancelled in-flight BROKER_HEARTBEAT request with correlation id 2130 due to node 1
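To get a feel for how often this happens, I count the cancelled heartbeats and their elapsed times with a small script (the log path below is simply where my container writes the server log):

import re

# Matches the cancellation lines shown above and captures the elapsed time since creation.
pattern = re.compile(
    r"Cancelled in-flight BROKER_HEARTBEAT request with correlation id (\d+)"
    r".*?elapsed time since creation: (\d+)ms"
)

elapsed = []
with open("/var/log/kafka/server.log") as log:   # path is specific to my setup
    for line in log:
        match = pattern.search(line)
        if match:
            elapsed.append(int(match.group(2)))

if elapsed:
    print(f"{len(elapsed)} cancelled heartbeats, "
          f"min {min(elapsed)}ms, max {max(elapsed)}ms, "
          f"avg {sum(elapsed) / len(elapsed):.0f}ms")
else:
    print("no cancelled BROKER_HEARTBEAT requests found")

In my runs the elapsed times all land just above the 4500 ms request timeout visible in the log lines at the top of this post.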
Finally, I noticed that the replication factor for the offsets topic and the transaction state log was set to 1. To test whether increasing the replication factor helps, I changed these values to 3 in the configuration:
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
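As far as I understand, these two settings apply to the internal __consumer_offsets and __transaction_state topics, so I check what the broker actually reports for them via the AdminClient metadata (same placeholder connection settings as in the sketches above):

from confluent_kafka.admin import AdminClient

# Same SASL_SSL connection settings as above; secrets and certificate paths are placeholders.
admin = AdminClient({
    "bootstrap.servers": "broker01.abc.net:9094",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "XXXXXX",
    "sasl.password": "YYYYYY",
    "ssl.ca.location": "/etc/kafka/secrets/ca-cert.pem",
    "ssl.certificate.location": "/etc/kafka/secrets/client-cert.pem",
    "ssl.key.location": "/etc/kafka/secrets/client-key.pem",
})

# Print the replica list of every partition of the two internal topics.
metadata = admin.list_topics(timeout=10)
for name in ("__consumer_offsets", "__transaction_state"):
    topic = metadata.topics.get(name)
    if topic is None:
        print(name, "does not exist yet")
        continue
    for pid, partition in sorted(topic.partitions.items()):
        print(f"{name} partition {pid}: replicas={partition.replicas}")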
server.properties:
sasl.enabled.mechanisms=PLAIN
sasl.mechanism.inter.broker.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="XXXXXX" password="YYYYYY";
super.users=User:admin
listeners=SASL_SSL://:9094,SSL://:9093,CONTROLLER://:29093
advertised.listeners=SASL_SSL://broker01.abc.net:9094,SSL://broker01.abc.net:9093
spring.kafka.bootstrap-servers=broker01.abc.net:9094
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
ssl.client.auth=required
ssl.key.password=XXXXXXX
ssl.keystore.location=/etc/kafka/secrets/kafka.keystore.jks
ssl.keystore.password=XXXXXX
ssl.truststore.location=/etc/kafka/secrets/kafka.truststore.jks
ssl.truststore.password=XXXXXX
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
group.initial.rebalance.delay.ms=0
process.roles=broker,controller
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
num.network.threads=3
num.io.threads=8
queued.max.requests=500
replica.fetch.max.bytes=1048576
replica.fetch.wait.max.ms=500
replica.selector.class=kafka.server.SimplePartitioner
ssl.endpoint.identification.algorithm=
controller.quorum.request.timeout.ms=5000
controller.quorum.election.timeout.ms=3000
broker.session.timeout.ms=9000
controller.socket.timeout.ms=30000
controller.quorum.fetch.timeout.ms=5000
request.timeout.ms=60000
connections.max.idle.ms=540000
rebalance.timeout.ms=60000
broker.heartbeat.interval.ms=10000
worker.sync.timeout.ms=3000
worker.unsync.backoff.ms=300000
reconnect.backoff.max.ms=1000
kafka.properties:
replica.fetch.max.bytes=1152921504
ssl.keystore.filename=kafka.keystore.jks
super.users=User:admin
default.replication.factor=1
transaction.state.log.min.isr=1
ssl.key.credentials=kafka_ssl_key_creds
process.roles=broker,controller
security.inter.broker.protocol=SASL_SSL
controller.listener.names=CONTROLLER
controller.quorum.voters=1@broker01.abc.net:29093
message.max.bytes=1152921504
auto.create.topics.enable=false
ssl.key.password=XXXXXX
ssl.truststore.password=XXXXXX
ssl.keystore.type=JKS
log.retention.ms=604800000
advertised.listeners=SASL_SSL://broker01.abc.net:9094,SSL://broker01.abc.net:9093
sasl.enabled.mechanisms=PLAIN
kraft.mode=true
log4j.logger.org.apache.kafka=DEBUG
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
ssl.truststore.filename=kafka.truststore.jks
fetch.message.max.bytes=1152921504
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
ssl.truststore.credentials=kafka_truststore_creds
log.retention.hours=168
broker.id=1
early.start.listeners=CONTROLLER
broker.rack=dev-Testing
ssl.keystore.password=XXXXXXX
min.insync.replicas=1
transaction.state.log.replication.factor=1
listeners=SASL_SSL://:9094,SSL://:9093,CONTROLLER://:29093
ssl.keystore.location=/etc/kafka/secrets/kafka.keystore.jks
zookeeper.connect=
sasl.mechanism.inter.broker.protocol=PLAIN
ssl.truststore.location=/etc/kafka/secrets/kafka.truststore.jks
ssl.endpoint.identification.algorithm=
ssl.truststore.type=JKS
log.dirs=/var/lib/kafka/data
offsets.topic.replication.factor=1
allow.everyone.if.no.acl.found=true
ssl.client.auth=required
ssl.keystore.credentials=kafka_keystore_creds
Any suggestions or insights would be greatly appreciated!