java.io.IOException: Packet len 9592748 is out of range! (ZooKeeper and broker not communicating)

Hi Team,

We recently encountered an issue where communication between the Kafka brokers and ZooKeeper broke down, causing internal operations to be aborted.

Broker Log Observed:
Image: confluentinc/cp-kafka:7.8.1-1-ubi8
java.io.IOException: Packet len 9592748 is out of range!

Zookeeper Log Observed:
Image: confluentinc/cp-zookeeper:7.8.1-1-ubi8
java.io.IOException: Broken pipe

This indicated that communication was failing because packets exceeded the buffer size.
Upon investigation, we identified that the default ZooKeeper buffer size (jute.maxbuffer) was too small for the packets being exchanged.

We have temporarily mitigated the issue by adding the following parameter to the Zookeeper configuration:
-Djute.maxbuffer=49107800
After this change, communication between the brokers and ZooKeeper resumed and I/O operations are functioning normally.
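For reference, a rough sketch of how the flag can be wired into a Helm/Kubernetes deployment like ours (sketch only: it assumes the cp images forward the KAFKA_OPTS environment variable to the JVM the way the stock kafka-run-class script does, and the exact Helm values keys depend on the charts):

# Sketch only: container environment entries, not our exact chart values.
# cp-zookeeper pods (server side):
KAFKA_OPTS="-Djute.maxbuffer=49107800"
# cp-kafka pods: the "Packet len ... is out of range" check in the broker
# stack trace is client-side (ClientCnxnSocket.readLength), so the broker
# JVM may need the same override as well:
KAFKA_OPTS="-Djute.maxbuffer=49107800"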

Also, can you please tell us what the default value of jute.maxbuffer is for Confluent Kafka, since we are using the Confluent images?

Following are the logs:
ZooKeeper:
WARN Close of session 0x30876fc0a100002 (org.apache.zookeeper.server.NIOServerCnxn)
java.io.IOException: Broken pipe
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.write(Unknown Source)
at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
at java.base/sun.nio.ch.IOUtil.write(Unknown Source)
at java.base/sun.nio.ch.IOUtil.write(Unknown Source)
at java.base/sun.nio.ch.SocketChannelImpl.write(Unknown Source)
at org.apache.zookeeper.server.NIOServerCnxn.handleWrite(NIOServerCnxn.java:289)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:366)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)

Broker:
WARN Session 0x30876fc0a100003 for server kafka-east-uat-cp-zookeeper-headless/<{IP}>:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn)
java.io.IOException: Packet len 9592748 is out of range!
at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:121)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:84)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)
[2025-04-03 04:38:50,620] WARN [GroupCoordinator 0]: Failed to write empty metadata for group : The group is rebalancing, so a rejoin is needed. (kafka.coordinator.group.GroupCoordinator)

Can you please help us investigate what caused the sudden increase in packet size that exceeded the ZooKeeper buffer limit? Understanding the root cause will help us apply a more permanent fix and avoid such issues in the future.
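In case it helps narrow this down, here is a rough sketch of how one could look for the oversized data from the ZooKeeper side. Oversized packets typically come either from a single very large znode (for example a big partition-reassignment znode or a topic with a huge partition assignment) or from a getChildren response on a path with very many children. The zookeeper-shell wrapper name and the paths below are assumptions based on the stock Kafka tooling, not taken from our actual environment:

# Sketch only: "stat" prints dataLength (znode size in bytes) and numChildren,
# which hint at what could exceed jute.maxbuffer.
zookeeper-shell <zk-host>:2181 ls /brokers/topics
zookeeper-shell <zk-host>:2181 stat /admin/reassign_partitions
zookeeper-shell <zk-host>:2181 stat /brokers/topics/<suspect-topic>
zookeeper-shell <zk-host>:2181 stat /config/topics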

hey @SaiKrishnaNeeli

could you share some config and details about your setup?
is this docker based?

why did you start with zookeeper? the recommended approach would be to switch to kraft instead of zookeeper.

Thanks for replying, @mmuehlbeyer.
We are using default configurations for both broker and zookeeper.

Yes, we’re currently using the Docker-based Confluent Platform images, deployed on Kubernetes using Helm charts.

We have 5 brokers and 3 zookeeper pods.
Images:
confluentinc/cp-zookeeper:7.8.1-1-ubi8
confluentinc/cp-kafka:7.8.1-1-ubi8

As part of our roadmap, we’re planning to migrate from ZooKeeper mode to KRaft mode with Apache Kafka 4.0 in the upcoming release. At the moment, we’re actively testing this migration in our development environment to ensure a seamless transition without any data loss.

I see. Are you deploying with Confluent for Kubernetes?

and could you check the topic.config.sync.interval.ms parameter?

We’re not using the CFK (Confluent for Kubernetes) operator. Instead, we deploy Confluent Platform components—Kafka, ZooKeeper, Schema Registry, Kafka REST, KSQL, and MirrorMaker 2—directly using Helm with custom configurations. For cross-cluster topic replication, we’re not using Kafka Replicator; rather, we rely on MirrorMaker 2 with the following configuration.

"sync.topic.configs.enabled": "true",
"sync.topic.configs.interval.seconds": 60,
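For context, these settings sit in a standard MirrorMaker 2 properties file alongside the cluster and replication-flow definitions, roughly like the sketch below (the cluster aliases and bootstrap servers are placeholders, not our actual values):

# Sketch only: placeholder cluster aliases and addresses.
clusters = primary, backup
primary.bootstrap.servers = <primary-broker>:9092
backup.bootstrap.servers = <backup-broker>:9092
primary->backup.enabled = true
primary->backup.topics = .*
sync.topic.configs.enabled = true
sync.topic.configs.interval.seconds = 60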

ok I see
hmm need to dig around a bit

Thanks for the update, @mmuehlbeyer. Please let me know if you need any further information or details from my side.