On a 3-node broker cluster, one node is always down

Using Docker Compose, I am trying to deploy a 3-node Kafka broker cluster. Two of the nodes always come up fine, but the third (which one varies between runs) keeps failing with the following error:

[2023-11-03 04:12:59,865] INFO [broker-1-to-controller-forwarding-channel-manager]: Starting (kafka.server.BrokerToControllerRequestThread)
[2023-11-03 04:12:59,959] INFO [MetadataLoader id=1] initializeNewPublishers: the loader is still catching up because we still don't know the high water mark yet. (org.apache.kafka.image.loader.MetadataLoader)
[2023-11-03 04:13:00,026] INFO [RaftManager id=1] Registered the listener org.apache.kafka.image.loader.MetadataLoader@351741309 (org.apache.kafka.raft.KafkaRaftClient)
[2023-11-03 04:13:00,061] INFO [MetadataLoader id=1] initializeNewPublishers: the loader is still catching up because we still don't know the high water mark yet. (org.apache.kafka.image.loader.MetadataLoader)
[2023-11-03 04:13:00,162] INFO [MetadataLoader id=1] initializeNewPublishers: the loader is still catching up because we still don't know the high water mark yet. (org.apache.kafka.image.loader.MetadataLoader)
[2023-11-03 04:13:00,218] ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
java.lang.IllegalStateException: Received request or response with leader OptionalInt[1] and epoch 18 which is inconsistent with current leader OptionalInt.empty and epoch 0
	at org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:1513)
	at org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:1473)
	at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1071)
	at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:1550)
	at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:1676)
	at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:2251)
	at kafka.raft.KafkaRaftManager$RaftIoThread.doWork(RaftManager.scala:64)
	at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:127)

I have 3 dedicated controller nodes running separately. Here is the broker config that I am using for node 1, which is currently down:

---
version: '2'
services:

  broker:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-broker-1
    container_name: kafka-broker-1
    ports:
      - "9092:9092"
      - "29092:29092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://vskafka-broker-1:29092'
      KAFKA_PROCESS_ROLES: 'broker'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '2@vskafka-controller-2:9093,3@vskafka-controller-3:9093'
      KAFKA_LISTENERS: 'PLAINTEXT://vskafka-broker-1:9092'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-broker-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'

This node can reach the listed controllers without any issue using telnet, and it can also talk to the other broker nodes.
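To go one step further than a plain telnet check, the kafka-metadata-quorum tool that ships with the cp-kafka image can show which controller the brokers currently see as the quorum leader and at what epoch, which is useful to compare against the epoch in the error above. A rough sketch, run against one of the brokers that came up fine (broker 2 here; adjust the container name if a different one is healthy):

# docker exec kafka-broker-2 kafka-metadata-quorum --bootstrap-server vskafka-broker-2:9092 describe --status

This should print the current quorum leader id, leader epoch and voter set as the quorum sees them.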

hey @Udayendu

hmm looks strange
could you share your complete docker-compose?

Hi @mmuehlbeyer

I have 3 controller nodes and 3 broker nodes.
I am using the same cluster ID for all 6 nodes, and every node has its own docker-compose file.

These are the steps I am following:

  • deployed all the controller nodes first
  • then started deploying the broker nodes. Sometimes node 2 fails and sometimes node 1; out of the 3 brokers, only two ever come up and the third does not.
  • for broker node 1, the controller quorum voters are controller nodes 2 and 3, and the same pattern is repeated for the other two broker nodes.

Here are the configs that I am currently using for the brokers:

---
version: '2'
services:

  broker:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-broker-1
    container_name: kafka-broker-1
    ports:
      - "9092:9092"
      - "29092:29092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://vskafka-broker-1:29092'
      KAFKA_PROCESS_ROLES: 'broker'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '2@vskafka-controller-2:9093,3@vskafka-controller-3:9093'
      KAFKA_LISTENERS: 'PLAINTEXT://vskafka-broker-1:9092'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-broker-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'  

---
version: '2'
services:

  broker:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-broker-2
    container_name: kafka-broker-2
    ports:
      - "9092:9092"
      - "29092:29092"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://vskafka-broker-2:29092'
      KAFKA_PROCESS_ROLES: 'broker'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,3@vskafka-controller-3:9093'
      KAFKA_LISTENERS: 'PLAINTEXT://vskafka-broker-2:9092'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-broker-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'

---
version: '2'
services:

  broker:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-broker-3
    container_name: kafka-broker-3
    ports:
      - "9092:9092"
      - "29092:29092"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://vskafka-broker-3:29092'
      KAFKA_PROCESS_ROLES: 'broker'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,2@vskafka-controller-2:9093'
      KAFKA_LISTENERS: 'PLAINTEXT://vskafka-broker-3:9092'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-broker-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'

And my controller configs are:

---
version: '2'
services:

  controller:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-controller-1
    container_name: kafka-controller-1
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: 'controller'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,2@vskafka-controller-2:9093,3@vskafka-controller-3:9093'
      KAFKA_LISTENERS: 'CONTROLLER://vskafka-controller-1:9093'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-controller-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'

---
version: '2'
services:

  controller:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-controller-2
    container_name: kafka-controller-2
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: 'controller'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,2@vskafka-controller-2:9093,3@vskafka-controller-3:9093'
      KAFKA_LISTENERS: 'CONTROLLER://vskafka-controller-2:9093'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-controller-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'

---
version: '2'
services:

  controller:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-controller-3
    container_name: kafka-controller-3
    ports:
      - "9093:9093"
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: 'controller'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,2@vskafka-controller-2:9093,3@vskafka-controller-3:9093'
      KAFKA_LISTENERS: 'CONTROLLER://vskafka-controller-3:9093'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-controller-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'

Let me know if you need further info from my side.

ok understood
so 3 separate nodes with a controller and a broker on each of them, correct?

and why did you not specify all 3 controllers in
KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,2@vskafka-controller-2:9093'

yes, all are on separate nodes.

When I specify all 3 controller nodes, it complains with the following error:

# docker logs kafka-broker-1
===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
Running in KRaft mode...
===> Running preflight checks ...
===> Check if /var/lib/kafka/data is writable ...
===> Running in KRaft mode, skipping Zookeeper health check...
===> Using provided cluster id mX-qLvc-T2y2OPeJ3AMRXg ...
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: If process.roles contains just the 'broker' role, the node id 1 must not be included in the set of voters controller.quorum.voters=Set(1, 2, 3)
	at scala.Predef$.require(Predef.scala:337)
	at kafka.server.KafkaConfig.validateValues(KafkaConfig.scala:2246)
	at kafka.server.KafkaConfig.<init>(KafkaConfig.scala:2160)
	at kafka.server.KafkaConfig.<init>(KafkaConfig.scala:1568)
	at kafka.tools.StorageTool$.$anonfun$main$1(StorageTool.scala:50)
	at scala.Option.flatMap(Option.scala:283)
	at kafka.tools.StorageTool$.main(StorageTool.scala:50)
	at kafka.tools.StorageTool.main(StorageTool.scala)

Say I am deploying broker 1: since it also has node ID 1, I can't list controller 1 in its quorum voters. The same conflict applies to the other two broker/controller pairs as well.

To fix this, I changed the broker IDs to 101, 102 and 103 while keeping the controllers at 1, 2 and 3. Then I added all three controllers to every broker's quorum voter list and started the services with:

KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,2@vskafka-controller-2:9093,3@vskafka-controller-3:9093'

It is working well now.
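For reference, the broker 1 compose now looks roughly like this; only KAFKA_NODE_ID and KAFKA_CONTROLLER_QUORUM_VOTERS changed compared to the file posted earlier, and brokers 2 and 3 follow the same pattern with IDs 102 and 103:

---
version: '2'
services:

  broker:
    image: confluentinc/cp-kafka:7.5.1
    hostname: vskafka-broker-1
    container_name: kafka-broker-1
    ports:
      - "9092:9092"
      - "29092:29092"
    environment:
      # broker IDs moved to 101-103 so they no longer collide with the
      # controller IDs 1-3 listed in controller.quorum.voters
      KAFKA_NODE_ID: 101
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://vskafka-broker-1:29092'
      KAFKA_PROCESS_ROLES: 'broker'
      # all three controllers are listed, so every broker can always find the current quorum leader
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@vskafka-controller-1:9093,2@vskafka-controller-2:9093,3@vskafka-controller-3:9093'
      KAFKA_LISTENERS: 'PLAINTEXT://vskafka-broker-1:9092'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_LOG_DIRS: '/tmp/kraft-broker-logs'
      CLUSTER_ID: 'mX-qLvc-T2y2OPeJ3AMRXg'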

