Brokers failing to connect to zookeeper

Hello all,
We are facing an issue when we restart both the ZooKeeper nodes and the brokers at the same time.
We have 3 broker pods and 3 ZooKeeper pods running on Kubernetes. Whenever there is activity in the Kubernetes environment, or we deploy the Helm chart after a change that causes the ZooKeeper and broker pods to be recreated, we hit this issue.
In the logs we can see that the broker tries to connect through the ZooKeeper headless service but times out.
We then have to restart the brokers and ZooKeeper pods manually before they connect again.

This causes downtime whenever we deploy a new release. Is there any way to deploy without manually restarting the pods?

Hi @naveen8384

Could you provide some details about your environment?
I assume you're using a Helm chart to start the stack? If yes, could you provide some details about the Helm chart?

best,
michael

Hi @mmuehlbeyer ,
We are using the Confluent Helm charts, with ZooKeeper image version 5.5.3 and Kafka image version 5.5.3.

These are the logs we got in the broker pod:

+ export KAFKA_BROKER_ID=0
+ exec /etc/confluent/docker/run
===> User
uid=0(root) gid=0(root) groups=0(root)
===> Configuring ...
===> Running preflight checks ...
===> Check if /var/lib/kafka/data is writable ...
===> Check if Zookeeper is healthy ...
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:53 GMT
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:host.name=kafka-cp-kafka-0.kafka-cp-kafka-headless.kafka.svc.cluster.local
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_222
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Azul Systems, Inc.
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/zulu-8-amd64/jre
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/etc/confluent/docker/docker-utils.jar
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.version=5.4.148-1.el7.elrepo.x86_64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.name=root
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.free=150MB
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.max=2276MB
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.total=153MB
[main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka-cp-zookeeper-headless:2181 sessionTimeout=40000 watcher=io.confluent.admin.utils.ZookeeperConnectionWatcher@65b3120a
[main] INFO org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
[main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
[main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
[main-SendThread(kafka-cp-zookeeper-headless:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server kafka-cp-zookeeper-headless/<ip>:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(kafka-cp-zookeeper-headless:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /<ip>:39788, server: kafka-cp-zookeeper-headless/<ip>:2181
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [kafka-cp-zookeeper-headless:2181].
[main-SendThread(kafka-cp-zookeeper-headless:2181)] WARN org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 40000ms for sessionid 0x0
[main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0
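
The key line in the log above is the preflight error from io.confluent.admin.utils.ClusterStatus. As a rough sketch, a small shell helper can flag that message in a broker pod's log so the failure is easy to spot after a deployment. The function name is made up for illustration, and the pod name and namespace in the usage comment are only inferred from the host name printed in the log.

```shell
# Hypothetical helper: reads log text on stdin and exits 0 if it contains the
# ZooKeeper preflight timeout message shown in the broker log above.
zk_preflight_timed_out() {
  grep -q "Timed out waiting for connection to Zookeeper server"
}

# Example usage (pod name and namespace are assumptions taken from host.name):
#   kubectl logs kafka-cp-kafka-0 -n kafka | zk_preflight_timed_out \
#     && echo "broker preflight check timed out waiting for ZooKeeper"
```

This only detects the symptom; it does not say why the ZooKeeper session was never established.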

hi @naveen8384

thanks for the information.
regarding

could you provide an example of what you're doing?

best,
michael

Any change that causes the broker and ZooKeeper pods to be recreated, such as changing the resource requests or limits for the broker and ZooKeeper pods.

ok I see

are you changing config like this?

helm upgrade <release name> cp-helm-charts

best,
michael

Yes, this is how we are deploying changes.

ok, did you check whether the URI/service name of the headless service changes?
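
One way to check this is to look at the headless Service and the endpoints behind it before and after the upgrade. The sketch below wraps the two kubectl calls in a function. The Service name is taken from the connectString in the broker log, the namespace is inferred from the broker's host name, and the function name is hypothetical.

```shell
# Print the headless Service and the pod IPs currently registered behind it.
# Defaults are assumptions: namespace "kafka" and Service
# "kafka-cp-zookeeper-headless", both read off the broker log.
show_headless_service() {
  ns="${1:-kafka}"
  svc="${2:-kafka-cp-zookeeper-headless}"
  kubectl get service "$svc" -n "$ns" -o wide &&
  kubectl get endpoints "$svc" -n "$ns"
}

# Example usage:
#   show_headless_service            # use the assumed defaults
#   show_headless_service my-ns my-zookeeper-headless
```

If the Service name is unchanged but the endpoint IPs differ after the pods are recreated, that matches pods being replaced behind the same headless service.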

No, the headless service name stays the same. Only the existing pods are deleted and new pods are created.
If we scale down the brokers and ZooKeeper pods and then scale them up again, everything works.
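
For reference, the scale-down and scale-up workaround could be scripted roughly as below. This is only a sketch under assumptions: the StatefulSet names kafka-cp-zookeeper and kafka-cp-kafka and the namespace kafka are guessed from the pod and service names in the log, the replica count of 3 comes from the first post, and the ordering (brokers down first, ZooKeeper up first) is a choice made here rather than something confirmed in this thread. Both function names are made up.

```shell
# Poll a StatefulSet status field until it equals the expected value.
# Arguments: namespace, StatefulSet name, status field, expected value.
wait_for_status() {
  tries=0
  until [ "$(kubectl get statefulset "$2" -n "$1" -o "jsonpath={.status.$3}")" = "$4" ]; do
    tries=$((tries + 1))
    [ "$tries" -ge 60 ] && return 1   # give up after about 5 minutes
    sleep 5
  done
}

# Scale brokers and ZooKeeper to zero, then bring ZooKeeper back before the brokers.
# All defaults below are assumptions, not values confirmed in the thread.
restart_stack() {
  ns="${1:-kafka}"
  zk="${2:-kafka-cp-zookeeper}"
  brokers="${3:-kafka-cp-kafka}"
  replicas="${4:-3}"

  kubectl scale statefulset "$brokers" --replicas=0 -n "$ns" &&
  wait_for_status "$ns" "$brokers" replicas 0 &&
  kubectl scale statefulset "$zk" --replicas=0 -n "$ns" &&
  wait_for_status "$ns" "$zk" replicas 0 &&
  kubectl scale statefulset "$zk" --replicas="$replicas" -n "$ns" &&
  wait_for_status "$ns" "$zk" readyReplicas "$replicas" &&
  kubectl scale statefulset "$brokers" --replicas="$replicas" -n "$ns" &&
  wait_for_status "$ns" "$brokers" readyReplicas "$replicas"
}

# Example usage: restart_stack kafka kafka-cp-zookeeper kafka-cp-kafka 3
```

This still means downtime; it only automates the manual restart. Note that a later helm upgrade sets the replica counts back to whatever the chart values specify.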

strange
there is a similar issue listed here

according to the docs confluent 5.5.3 ships with zookeeper 3.5.7 and the issue talks about zookeeper version 3.5.8

so you might have hit the bug

might be worth trying to test with confluent 6.x

btw: any special reasons for using this version?

best,
michael

Hi @mmuehlbeyer ,
Thanks for sharing the issue reference. I think this is the same issue we are facing.
There is no specific reason for using version 5.5.3. We started with that version and are still using it.
I have a few questions regarding the upgrade.
Can we upgrade only the ZooKeeper version to 6.1 from 3.5.8 while keeping the broker version the same?
Is version 6.1 compatible with the 3.5.8 data? We don't want to lose existing data.

hi @naveen8384

According to the docs,
Confluent 7.1 also ships with ZooKeeper 3.5.x,
so I assume it should work, but I have never tested it myself.

https://docs.confluent.io/platform/7.1.1/installation/upgrade.html#upgrade-zk

best,
michael

Hi @mmuehlbeyer ,
I tried to upgrade the ZooKeeper version to 6.2.2, but we are not able to bring the ZooKeeper pods up. In the logs we got these errors:

===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
[Errno 13] Permission denied: '/var/lib/zookeeper/data/myid'
Command [/usr/local/bin/dub template /etc/confluent/docker/myid.template /var/lib/zookeeper/data/myid] FAILED !

[2022-03-16 08:55:12,639] ERROR Disk error while locking directory /opt/kafka/data-0/logs (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.lock
[2022-03-16 08:55:14,971] INFO Loading logs from log dirs ArraySeq(/opt/kafka/data-0/logs) (kafka.log.LogManager)
[2022-03-16 08:55:14,975] INFO Skipping recovery for all logs in /opt/kafka/data-0/logs since clean shutdown file was found (kafka.log.LogManager)
[2022-03-16 08:55:14,979] ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.log.LogManager)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.kafka_cleanshutdown
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)

[2022-03-16 08:55:14,981] ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.kafka_cleanshutdown

While analyzing the logs we found that the old 5.5.3 image runs as the root user, while the new image runs as appuser (uid 1000), so the new image cannot access the old data.

===> User
uid=0(root) gid=0(root) groups=0(root)
===> Configuring ...
===> Running preflight checks ...
===> Check if /var/lib/zookeeper/data is writable ...
===> Check if /var/lib/zookeeper/log is writable ...
===> Launching ...
===> Printing /var/lib/zookeeper/data/myid
1===> Launching zookeeper ...

Can you please suggest a way to use the existing data with the new image?

Thanks
Naveen

Hi @naveen8384

Did you try changing the security context in your values.yml?

securityContext: {}
  #  runAsUser: 1000
  #  runAsGroup: 1000

best,
michael
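
Before changing the security context, it may help to confirm which user owns the existing data. The logs above show the old image running as uid 0 (root) and the new one as uid 1000 (appuser), so files written by the old image are presumably owned by uid 0. Below is a minimal sketch of a check that can be run from any container that has the volume mounted. The helper names are hypothetical.

```shell
# Print the numeric owner UID of a file or directory.
owner_uid() {
  ls -ldn "$1" | awk '{print $3}'
}

# Succeed if the path is owned by the expected UID (for example 1000 for appuser).
owned_by() {
  [ "$(owner_uid "$1")" = "$2" ]
}

# Example usage, with the data path taken from the ZooKeeper log above:
#   owned_by /var/lib/zookeeper/data 1000 \
#     || echo "data dir is owned by uid $(owner_uid /var/lib/zookeeper/data), not 1000"
```

If the directory turns out to be owned by uid 0, that is consistent with the "Permission denied" error on /var/lib/zookeeper/data/myid.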