Hello All,
We are facing an issue when we restart both the ZooKeepers and the brokers at the same time.
We have 3 broker pods and 3 ZooKeeper pods running in a Kubernetes environment. Whenever there is activity on the cluster, or when we deploy the Helm chart after making changes that cause the ZooKeeper and broker pods to be recreated, we hit this issue.
In the logs we can see that the broker is trying to connect through the ZooKeeper headless service but timing out.
We then have to manually restart the brokers and ZooKeepers to get them to connect again.
This causes downtime whenever we deploy a new release. Is there any way to deploy without manually restarting the pods?
Could you provide some details about your env?
I assume you’re using a Helm chart to start the stack? If yes, could you provide some details about the Helm chart?
Hi @mmuehlbeyer ,
We are using the Confluent Helm charts, which use ZooKeeper image version 5.5.3 and Kafka image version 5.5.3.
These are the logs we got in the broker pod:
+ export KAFKA_BROKER_ID=0
+ exec /etc/confluent/docker/run
===> User
uid=0(root) gid=0(root) groups=0(root)
===> Configuring ...
===> Running preflight checks ...
===> Check if /var/lib/kafka/data is writable ...
===> Check if Zookeeper is healthy ...
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:53 GMT
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:host.name=kafka-cp-kafka-0.kafka-cp-kafka-headless.kafka.svc.cluster.local
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_222
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Azul Systems, Inc.
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/zulu-8-amd64/jre
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/etc/confluent/docker/docker-utils.jar
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.version=5.4.148-1.el7.elrepo.x86_64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.name=root
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.free=150MB
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.max=2276MB
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.total=153MB
[main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka-cp-zookeeper-headless:2181 sessionTimeout=40000 watcher=io.confluent.admin.utils.ZookeeperConnectionWatcher@65b3120a
[main] INFO org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
[main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
[main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
[main-SendThread(kafka-cp-zookeeper-headless:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server kafka-cp-zookeeper-headless/<ip>:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(kafka-cp-zookeeper-headless:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /<ip>:39788, server: kafka-cp-zookeeper-headless/<ip>:2181
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [kafka-cp-zookeeper-headless:2181].
[main-SendThread(kafka-cp-zookeeper-headless:2181)] WARN org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 40000ms for sessionid 0x0
[main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0
No, the headless service name is the same; only the existing pods are deleted and new pods are created.
If we scale the brokers and ZooKeepers down and then scale them up again, everything works.
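For reference, the manual workaround we run today looks roughly like this (the StatefulSet names and namespace are assumptions inferred from the pod names in the logs; adjust for your release):

# Scale everything down, then bring ZooKeeper back before the brokers.
kubectl -n kafka scale statefulset kafka-cp-zookeeper kafka-cp-kafka --replicas=0
# Wait until all pods have terminated.
kubectl -n kafka scale statefulset kafka-cp-zookeeper --replicas=3
# Once the ZooKeeper ensemble is up again, bring the brokers back.
kubectl -n kafka scale statefulset kafka-cp-kafka --replicas=3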
Hi @mmuehlbeyer ,
Thanks for sharing the issue reference. I think this is the same issue we are facing.
There is no specific reason for using version 5.5.3; we started with that version and are still using it.
I have a few questions regarding the upgrade.
Can we update only the ZooKeeper image to version 6.1 (from the current ZooKeeper 3.5.8) while keeping the broker version the same? See the sketch below for what we have in mind.
Is version 6.1 compatible with the existing 3.5.8 data? We don’t want to lose the existing data.
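For the first question, something like this is what we had in mind (the release name, namespace, and the cp-zookeeper.imageTag / cp-kafka.imageTag value names are assumptions based on the cp-helm-charts defaults and the pod names in our logs; please correct us if the chart should be driven differently):

# Tentative sketch: bump only the ZooKeeper image, keep the broker image at 5.5.3.
helm upgrade kafka confluentinc/cp-helm-charts \
  --namespace kafka \
  --reuse-values \
  --set cp-zookeeper.imageTag=6.1.0 \
  --set cp-kafka.imageTag=5.5.3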
Hi @mmuehlbeyer ,
I tried to upgrade the ZooKeeper image to version 6.2.2, but we are not able to bring the ZooKeepers up. In the logs we got these errors:
===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
[Errno 13] Permission denied: '/var/lib/zookeeper/data/myid'
Command [/usr/local/bin/dub template /etc/confluent/docker/myid.template /var/lib/zookeeper/data/myid] FAILED !
[2022-03-16 08:55:12,639] ERROR Disk error while locking directory /opt/kafka/data-0/logs (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.lock
[2022-03-16 08:55:14,971] INFO Loading logs from log dirs ArraySeq(/opt/kafka/data-0/logs) (kafka.log.LogManager)
[2022-03-16 08:55:14,975] INFO Skipping recovery for all logs in /opt/kafka/data-0/logs since clean shutdown file was found (kafka.log.LogManager)
[2022-03-16 08:55:14,979] ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.log.LogManager)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.kafka_cleanshutdown
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
[2022-03-16 08:55:14,981] ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.kafka_cleanshutdown
While analyzing the logs we found that the old 5.5.3 image runs as the root user, while the new image runs as appuser, so we are not able to access the old data. For comparison, this is the startup output from the old 5.5.3 image:
===> User
uid=0(root) gid=0(root) groups=0(root)
===> Configuring ...
===> Running preflight checks ...
===> Check if /var/lib/zookeeper/data is writable ...
===> Check if /var/lib/zookeeper/log is writable ...
===> Launching ...
===> Printing /var/lib/zookeeper/data/myid
1===> Launching zookeeper ...
Can you please suggest a way to use the existing data with the new image?
Hi @mmuehlbeyer ,
We updated the security context and successfully upgraded the Confluent version, after which the ZooKeeper issue is resolved.
Now we are able to apply rolling updates without any downtime.
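For anyone who hits the same permission errors: the fix was essentially the standard pod securityContext, so that the appuser (uid 1000) in the 6.x images can access the data the 5.5.3 images created as root. We applied it through our chart values; expressed as a kubectl patch it would look roughly like this (StatefulSet names and namespace are assumptions from the pod names in the logs):

# Run the pods as uid 1000 and let Kubernetes set group ownership of the
# persistent volumes to gid 1000 (fsGroup), which makes the old root-owned
# data readable/writable for appuser.
kubectl -n kafka patch statefulset kafka-cp-zookeeper --type merge -p \
  '{"spec":{"template":{"spec":{"securityContext":{"runAsUser":1000,"fsGroup":1000}}}}}'
kubectl -n kafka patch statefulset kafka-cp-kafka --type merge -p \
  '{"spec":{"template":{"spec":{"securityContext":{"runAsUser":1000,"fsGroup":1000}}}}}'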
Once again, thank you for your support throughout the process.