Cannot open channel to 2 at election address: ZooKeeper quorum in a two-node Docker Swarm

Hi,

I’m trying to deploy a Confluent stack with Docker Swarm on two nodes. I’m starting with the ZooKeeper containers and will deploy the rest of the containers afterwards.

I’ve created the swarm called “kafka-swarm” with two nodes:

$ docker node ls
ID                           HOSTNAME               STATUS  AVAILABILITY  MANAGER STATUS
c8kr8lxxjzhdcjah0gef30rk7    kafka2                 Ready   Active        
zr5rhmnb70m5rixsdh0bjyxza *  kafka1                 Ready   Active        Leader

Both nodes are virtual machines, with IPs 136.1.1.116 and 136.1.1.117 respectively, and both are attached to a Docker overlay network.

I have this docker-compose.yml which, I’m not gonna lie, is a bit of a Frankenstein of several pieces of code I found all over the internet:

---
version: '3'

services:

  zookeeper-1:
    image: 'confluentinc/cp-zookeeper:5.5.1'
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.id==zr5rhmnb70m5rixsdh0bjyxza
    ports:
      - "22181:2181"
      - "22888:2888"
      - "23888:3888"
    environment:
      ZOOKEEPER_SERVER_ID: 1
      ZOOKEEPER_QUORUM_LISTEN_ON_ALL_IPS: 'true'
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_PEER_PORT: 2888
      ZOOKEEPER_LEADER_PORT: 3888
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
      ZOOKEEPER_SERVERS: 'kafka-swarm_zookeeper-1:22888:23888;kafka-swarm_zookeeper-2:32888:33888'
      ZOOKEEPER_CURRENT_NODE_HOSTNAME: kafka-swarm_zookeeper-1
      ZOOKEEPER_ELECTION_PORT_BIND_RETRY: 0
      KAFKA_OPTS: "-Dzookeeper.4lw.commands.whitelist=*"

  zookeeper-2:
    image: 'confluentinc/cp-zookeeper:5.5.1'
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.id==c8kr8lxxjzhdcjah0gef30rk7
    ports:
      - "32181:2181"
      - "32888:2888"
      - "33888:3888"
    environment:
      ZOOKEEPER_SERVER_ID: 2
      ZOOKEEPER_QUORUM_LISTEN_ON_ALL_IPS: 'true'
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_PEER_PORT: 2888
      ZOOKEEPER_LEADER_PORT: 3888
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
      ZOOKEEPER_SERVERS: 'kafka-swarm_zookeeper-1:22888:23888;kafka-swarm_zookeeper-2:32888:33888'
      ZOOKEEPER_CURRENT_NODE_HOSTNAME: kafka-swarm_zookeeper-2
      ZOOKEEPER_ELECTION_PORT_BIND_RETRY: 0
      KAFKA_OPTS: "-Dzookeeper.4lw.commands.whitelist=*"
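
To make the ZOOKEEPER_SERVERS value above easier to read: each semicolon-separated entry has the form host:peer_port:election_port, so the string advertises ports 22888/23888 for the first server and 32888/33888 for the second. The small Python sketch below only splits the string into its parts. The function name parse_servers is made up for illustration, and numbering the servers by their position in the list is an assumption (it happens to match the ZOOKEEPER_SERVER_ID values 1 and 2 used here).

```python
# Illustrative only: split a ZOOKEEPER_SERVERS value into its parts.
# Assumption: server ids follow the position of each entry in the list.
def parse_servers(value):
    """Return a list of (server_id, host, peer_port, election_port) tuples."""
    servers = []
    for server_id, entry in enumerate(value.split(";"), start=1):
        host, peer_port, election_port = entry.split(":")
        servers.append((server_id, host, int(peer_port), int(election_port)))
    return servers


servers = parse_servers(
    "kafka-swarm_zookeeper-1:22888:23888;kafka-swarm_zookeeper-2:32888:33888"
)
for server_id, host, peer_port, election_port in servers:
    print(f"server.{server_id}: host={host} peer={peer_port} election={election_port}")
```

Note that these are not the container-side ports 2888 and 3888 from the ports sections, but the published ones.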

So, when I deploy it with:

$ docker stack deploy kafka-swarm --compose-file docker-compose.yml --with-registry-auth

I can see that both nodes are running a ZooKeeper container, but when I check the logs I find that there is no quorum between the ZooKeeper instances. Here is the log up to the first exception:

[2021-12-10 10:54:56,770] INFO Reading configuration from: /etc/kafka/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2021-12-10 10:54:56,780] INFO clientPortAddress is 0.0.0.0:2181 (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2021-12-10 10:54:56,780] INFO secureClientPort is not set (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2021-12-10 10:54:56,788] WARN No server failure will be tolerated. You need at least 3 servers. (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2021-12-10 10:54:56,793] INFO autopurge.snapRetainCount set to 3 (org.apache.zookeeper.server.DatadirCleanupManager)
[2021-12-10 10:54:56,794] INFO autopurge.purgeInterval set to 0 (org.apache.zookeeper.server.DatadirCleanupManager)
[2021-12-10 10:54:56,794] INFO Purge task is not scheduled. (org.apache.zookeeper.server.DatadirCleanupManager)
[2021-12-10 10:54:56,799] INFO Log4j 1.2 jmx support found and enabled. (org.apache.zookeeper.jmx.ManagedUtil)
[2021-12-10 10:54:56,812] INFO Starting quorum peer (org.apache.zookeeper.server.quorum.QuorumPeerMain)
[2021-12-10 10:54:56,820] INFO Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory (org.apache.zookeeper.server.ServerCnxnFactory)
[2021-12-10 10:54:56,823] INFO Configuring NIO connection handler with 10s sessionless connection timeout, 1 selector thread(s), 4 worker threads, and 64 kB direct buffers. (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2021-12-10 10:54:56,827] INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2021-12-10 10:54:56,853] INFO Logging initialized @484ms to org.eclipse.jetty.util.log.Slf4jLog (org.eclipse.jetty.util.log)
[2021-12-10 10:54:56,988] WARN o.e.j.s.ServletContextHandler@34cd072c{/,null,UNAVAILABLE} contextPath ends with /* (org.eclipse.jetty.server.handler.ContextHandler)
[2021-12-10 10:54:56,988] WARN Empty contextPath (org.eclipse.jetty.server.handler.ContextHandler)
[2021-12-10 10:54:57,006] INFO Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation (org.apache.zookeeper.common.X509Util)
[2021-12-10 10:54:57,010] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
[2021-12-10 10:54:57,035] INFO Local sessions disabled (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,035] INFO Local session upgrading disabled (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,035] INFO tickTime set to 3000 (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,035] INFO minSessionTimeout set to 6000 (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,035] INFO maxSessionTimeout set to 60000 (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,035] INFO initLimit set to 10 (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,050] INFO zookeeper.snapshotSizeFactor = 0.33 (org.apache.zookeeper.server.ZKDatabase)
[2021-12-10 10:54:57,052] INFO Using insecure (non-TLS) quorum communication (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,052] INFO Port unification disabled (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,052] INFO QuorumPeer communication is not secured! (SASL auth disabled) (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,052] INFO quorum.cnxn.threads.size set to 20 (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,068] INFO Snapshotting: 0x0 to /var/lib/zookeeper/data/version-2/snapshot.0 (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
[2021-12-10 10:54:57,074] INFO currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,080] INFO acceptedEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,090] INFO jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 1.8.0_212-b04 (org.eclipse.jetty.server.Server)
[2021-12-10 10:54:57,141] INFO DefaultSessionIdManager workerName=node0 (org.eclipse.jetty.server.session)
[2021-12-10 10:54:57,141] INFO No SessionScavenger set, using defaults (org.eclipse.jetty.server.session)
[2021-12-10 10:54:57,143] INFO node0 Scavenging every 660000ms (org.eclipse.jetty.server.session)
[2021-12-10 10:54:57,153] INFO Started o.e.j.s.ServletContextHandler@34cd072c{/,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler)
[2021-12-10 10:54:57,161] INFO Started ServerConnector@370736d9{HTTP/1.1,[http/1.1]}{0.0.0.0:8080} (org.eclipse.jetty.server.AbstractConnector)
[2021-12-10 10:54:57,161] INFO Started @792ms (org.eclipse.jetty.server.Server)
[2021-12-10 10:54:57,161] INFO Started AdminServer on address 0.0.0.0, port 8080 and command URL /commands (org.apache.zookeeper.server.admin.JettyAdminServer)
[2021-12-10 10:54:57,169] INFO Election port bind maximum retries is 3 (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2021-12-10 10:54:57,179] INFO 1 is accepting connections now, my election bind port: kafka-swarm_zookeeper-1/10.0.1.2:23888 (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2021-12-10 10:54:57,190] INFO LOOKING (org.apache.zookeeper.server.quorum.QuorumPeer)
[2021-12-10 10:54:57,191] INFO New election. My id =  1, proposed zxid=0x0 (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2021-12-10 10:54:57,194] INFO Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version) (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2021-12-10 10:54:57,396] INFO Notification time out: 400 (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2021-12-10 10:54:57,796] INFO Notification time out: 800 (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2021-12-10 10:54:58,596] INFO Notification time out: 1600 (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2021-12-10 10:55:00,197] INFO Notification time out: 3200 (org.apache.zookeeper.server.quorum.FastLeaderElection)
[2021-12-10 10:55:02,200] WARN Cannot open channel to 2 at election address kafka-swarm_zookeeper-2/10.0.1.4:33888 (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:373)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$QuorumConnectionReqThread.run(QuorumCnxManager.java:436)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

If I force the containers to be on the same node, there are no exceptions and there is quorum, but when they run on different nodes I get this exception all the time.
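
One way to narrow this down would be to check, from inside one container, whether the election port of the other server is reachable at all over the overlay network. Below is a minimal sketch in Python, assuming a Python interpreter is available wherever it is run. The helper name can_connect is made up for illustration, and the host and port are the ones from the warning in the log above.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the "Cannot open channel" warning.
    print(can_connect("kafka-swarm_zookeeper-2", 33888))
```

A False result here only says that the TCP connection could not be established. It does not say whether the cause is name resolution, a firewall between the nodes, or the port not being bound.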

Has anyone experienced something similar, or does anyone have an example of a working Docker Swarm Confluent stack that I could use to fix mine?

Thanks! Regards