Brokers failing to connect to zookeeper

naveen8384 · 17 May 2022 17:12

Hello All,
We are facing an issue when the zookeeper and broker pods restart at the same time.
We have 3 broker pods and 3 zookeeper pods running on Kubernetes. Whenever there is activity in the Kubernetes environment, or we deploy the helm chart with changes that cause the zookeeper and broker pods to be recreated, we hit this issue.
In the logs we can see that the broker tries to connect through the zookeeper headless service but times out.
We then have to manually restart the brokers and zookeepers to get them connected again.

This causes downtime whenever we deploy a new release. Is there any way to deploy without manually restarting the pods?

mmuehlbeyer · 18 May 2022 05:21

Hi @naveen8384

could you provide some details about your env?
I assume you’re using a helm chart to start the stack? If yes, could you provide some details about the helm chart?

best,
michael

naveen8384 · 18 May 2022 08:19

Hi @mmuehlbeyer ,
we are using confluent helm charts, which are using zookeeper image version 5.5.3 and kafka image version 5.5.3.

these are the logs we got in the broker pod

+ export KAFKA_BROKER_ID=0
+ exec /etc/confluent/docker/run
===> User
uid=0(root) gid=0(root) groups=0(root)
===> Configuring ...
===> Running preflight checks ...
===> Check if /var/lib/kafka/data is writable ...
===> Check if Zookeeper is healthy ...
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.5.8-f439ca583e70862c3068a1f2a7d4d068eec33315, built on 05/04/2020 15:53 GMT
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:host.name=kafka-cp-kafka-0.kafka-cp-kafka-headless.kafka.svc.cluster.local
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_222
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Azul Systems, Inc.
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/zulu-8-amd64/jre
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/etc/confluent/docker/docker-utils.jar
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.version=5.4.148-1.el7.elrepo.x86_64
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.name=root
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.free=150MB
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.max=2276MB
[main] INFO org.apache.zookeeper.ZooKeeper - Client environment:os.memory.total=153MB
[main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka-cp-zookeeper-headless:2181 sessionTimeout=40000 watcher=io.confluent.admin.utils.ZookeeperConnectionWatcher@65b3120a
[main] INFO org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
[main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
[main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
[main-SendThread(kafka-cp-zookeeper-headless:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server kafka-cp-zookeeper-headless/<ip>:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(kafka-cp-zookeeper-headless:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /<ip>:39788, server: kafka-cp-zookeeper-headless/<ip>:2181
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [kafka-cp-zookeeper-headless:2181].
[main-SendThread(kafka-cp-zookeeper-headless:2181)] WARN org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 40000ms for sessionid 0x0
[main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0

mmuehlbeyer · 18 May 2022 08:23

hi @naveen8384

thanks for the information.
regarding the changes you deploy: could you provide an example of what you’re doing?

best,
michael

naveen8384 · 18 May 2022 08:35

Any change that causes the broker and zookeeper pods to be recreated, such as changing the resource requests or limits for those pods.

mmuehlbeyer · 18 May 2022 08:43

ok I see

are you changing config like this?

helm upgrade <release name> cp-helm-charts

best,
michael

naveen8384 · 18 May 2022 08:48

Yes, this is how we are deploying changes.

mmuehlbeyer · 18 May 2022 08:57

ok did you check if the uri/servicename of the headless service changes?

naveen8384 · 18 May 2022 10:10

No, the headless service name stays the same. Only the existing pods are deleted and new pods are created.
If we scale down the brokers and zookeepers and scale them up again, everything works.
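
For readers who need the same stopgap, the scale-down/scale-up workaround could be scripted roughly as below. This is only a sketch under assumptions: the StatefulSet names `kafka-cp-kafka` and `kafka-cp-zookeeper` and the namespace `kafka` are inferred from the host names in the broker log, the replica count of 3 comes from the first post, and the ordering (brokers down first, zookeepers up first) is a guess, since the thread does not state it. The `run` and `restart_stack` helpers are hypothetical names. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround: scale everything to zero, then bring
# ZooKeeper back before the brokers.
# ASSUMPTION: namespace and StatefulSet names are inferred from the log host names.
NAMESPACE="${NAMESPACE:-kafka}"
KAFKA_STS="${KAFKA_STS:-kafka-cp-kafka}"
ZK_STS="${ZK_STS:-kafka-cp-zookeeper}"
REPLICAS="${REPLICAS:-3}"

# Print the command in dry-run mode (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

restart_stack() {
  # Stop the brokers first, then the ZooKeeper ensemble (ordering is an assumption).
  run kubectl -n "$NAMESPACE" scale statefulset "$KAFKA_STS" --replicas=0
  run kubectl -n "$NAMESPACE" scale statefulset "$ZK_STS" --replicas=0
  # Start ZooKeeper and wait for its pods to become ready before starting brokers.
  run kubectl -n "$NAMESPACE" scale statefulset "$ZK_STS" --replicas="$REPLICAS"
  run kubectl -n "$NAMESPACE" rollout status statefulset "$ZK_STS" --timeout=300s
  run kubectl -n "$NAMESPACE" scale statefulset "$KAFKA_STS" --replicas="$REPLICAS"
  run kubectl -n "$NAMESPACE" rollout status statefulset "$KAFKA_STS" --timeout=300s
}
```

Calling `restart_stack` prints the six `kubectl` commands for review; `DRY_RUN=0 restart_stack` runs them against the current cluster context.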

mmuehlbeyer · 18 May 2022 12:19

strange
there is a similar issue listed here

github.com/confluentinc/cp-helm-charts

ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [zookeeper.operator.svc.cluster.local:2181/kafka-operator].

opened 05:31PM - 26 Feb 20 UTC

tamipangadil

I'm trying to install Confluent Operator in our K8s cluster with Istio in it. Although the instruction wasn't included in their quick start guidelines, I'm hoping someone maybe came across with this problem.

Here's the steps I've done:

1. Install the cluster with minimum requirements of 10 nodes
2. Install `istio` components
3. Install the `operator` charts
4. Install `zookeeper` charts
5. Install `kafka` charts <--- **Always failed**

Errors based on kafka container:

```
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /192.168.XX.XX:45356, server: zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)
[main] ERROR io.confluent.admin.utils.ClusterStatus - Timed out waiting for connection to Zookeeper server [zookeeper.operator.svc.cluster.local:2181/kafka-operator].
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(zookeeper.operator.svc.cluster.local:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /192.168.XX.XX:41440, server: zookeeper.operator.svc.cluster.local/46.19.XXX.XXX:2181
[main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed
[main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x0
```

Tried some workaround with no luck.

1. Install VirtualService to explicitly point to the `zookeeper` service

```
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: zookeeper
spec:
  hosts:
  - zookeeper.operator.svc.cluster.local
  http:
  - name: zookeeper
    route:
    - destination:
        host: zookeeper
---
EOF
```

2. And also ServiceEntry for both `zookeeper` and `kafka`

```
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: zookeeper
spec:
  location: MESH_INTERNAL
  hosts:
  - zookeeper.operator.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
EOF
cat <<EOF | kubectl -n operator apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: kafka
spec:
  location: MESH_INTERNAL
  hosts:
  - kafka.operator.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
EOF
```

Is it related to any configuration of Confluent helm charts? Thanks!

Reference page: [Confluent Operator Quick Start](https://docs.confluent.io/current/installation/operator/co-quickstart.html).

according to the docs confluent 5.5.3 ships with zookeeper 3.5.7 and the issue talks about zookeeper version 3.5.8

so you might have hit the bug

might be worth trying to test with confluent 6.x

btw: any special reason for using this version?

best,
michael

naveen8384 · 19 May 2022 07:44

Hi @mmuehlbeyer ,
Thanks for sharing the issue reference. I think this is the same issue which we are facing.
There is no specific reason for using version 5.5.3. We started with that version and are still on it.
I have a few questions regarding the upgrade.
Can we update only the zookeeper image to version 6.1 (our current one runs zookeeper 3.5.8) while keeping the broker version the same?
Is version 6.1 compatible with the existing 3.5.8 data? We don’t want to lose existing data.

mmuehlbeyer · 19 May 2022 09:29

hi @naveen8384

according to the docs
confluent 7.1 also ships with zookeeper 3.5.x
so I assume it should work, but I have never tested it myself.

https://docs.confluent.io/platform/7.1.1/installation/upgrade.html#upgrade-zk

best,
michael

naveen8384 · 23 May 2022 06:44

Hi @mmuehlbeyer ,
I tried to upgrade the zookeeper image to version 6.2.2, but we are not able to bring the zookeepers up. In the logs we got these errors.

===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
[Errno 13] Permission denied: '/var/lib/zookeeper/data/myid'
Command [/usr/local/bin/dub template /etc/confluent/docker/myid.template /var/lib/zookeeper/data/myid] FAILED !

[2022-03-16 08:55:12,639] ERROR Disk error while locking directory /opt/kafka/data-0/logs (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.lock
[2022-03-16 08:55:14,971] INFO Loading logs from log dirs ArraySeq(/opt/kafka/data-0/logs) (kafka.log.LogManager)
[2022-03-16 08:55:14,975] INFO Skipping recovery for all logs in /opt/kafka/data-0/logs since clean shutdown file was found (kafka.log.LogManager)
[2022-03-16 08:55:14,979] ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.log.LogManager)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.kafka_cleanshutdown
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)

[2022-03-16 08:55:14,981] ERROR Error while loading log dir /opt/kafka/data-0/logs (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: /opt/kafka/data-0/logs/.kafka_cleanshutdown

While analyzing the logs we found that the old 5.5.3 image runs as the root user while the new image runs as appuser, so the new container is not able to access the old data.

===> User
uid=0(root) gid=0(root) groups=0(root)
===> Configuring ...
===> Running preflight checks ...
===> Check if /var/lib/zookeeper/data is writable ...
===> Check if /var/lib/zookeeper/log is writable ...
===> Launching ...
===> Printing /var/lib/zookeeper/data/myid
1===> Launching zookeeper ...

can you please suggest a way to use existing data with new image?

Thanks
Naveen

mmuehlbeyer · 23 May 2022 06:51

Hi @naveen8384

did you try changing the security context in your values.yml?

securityContext: {}
  #  runAsUser: 1000
  #  runAsGroup: 1000

best,
michael
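
For illustration, the suggestion above could be applied with a small override file passed to `helm upgrade`. This is a sketch, not the settings the poster ended up using (the thread does not say). It assumes the chart version in use passes a `securityContext` block from each sub-chart's values into the pod spec and that the sub-charts are keyed `cp-zookeeper` and `cp-kafka`; the release name `kafka` is inferred from the pod names in the logs, and the file name is hypothetical. Running as uid/gid 1000 matches the `appuser` seen in the newer image's log, and `fsGroup` asks Kubernetes to make the mounted volumes accessible to that group, on volume types that support ownership management.

```shell
#!/usr/bin/env bash
# ASSUMPTION: release name inferred from pod names such as kafka-cp-kafka-0.
RELEASE="${RELEASE:-kafka}"

# Write a hypothetical override file. The key names are assumptions about the
# chart version in use; check them against the chart's own values.yaml.
cat > security-override.yaml <<'EOF'
cp-zookeeper:
  securityContext:
    runAsUser: 1000   # appuser, as shown by the uid in the 6.x container log
    runAsGroup: 1000
    fsGroup: 1000     # give gid 1000 access to data written earlier by root
cp-kafka:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
EOF

# Print the upgrade command instead of running it; remove "echo" to apply.
echo helm upgrade "$RELEASE" cp-helm-charts -f security-override.yaml
```

An alternative with the same goal would be to keep the pods running as root (`runAsUser: 0`), which preserves the old image's behavior at the cost of running as a privileged user.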

naveen8384 · 11 July 2022 08:11

Hi @mmuehlbeyer ,
We updated the security context and successfully upgraded the confluent version, after which the zookeeper issue was resolved.
Now we are able to apply rolling updates without any downtime.

Once again, thank you for your support throughout the process.
