Hello,
Since few days we are facing a wired issue in our brokers. We have 3 brokers configuration and only leader broker pod is publishing metrics for other brokers we see below error in jmx-exporter container.
SEVERE: JMX scrape failed: java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException
: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)]
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:369)
at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270)
at io.prometheus.jmx.JmxScraper.doScrape(JmxScraper.java:94)
at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:456)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:183)
at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:147)
at io.prometheus.client.CollectorRegistry.filteredMetricFamilySamples(CollectorRegistry.java:134)
at io.prometheus.client.exporter.HTTPServer$HTTPMetricHandler.handle(HTTPServer.java:60)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)]
at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:136)
at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:205)
at javax.naming.InitialContext.lookup(InitialContext.java:417)
at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1955)
at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1922)
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287)
... 16 more
Caused by: java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:338)
at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:112)
at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:132)
... 21 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40)
at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:148)
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613)
... 26 more
In brokers we see errors like
ERROR [Controller id=2 epoch=145] Controller 2 epoch 145 failed to change state for partition topic-name from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition topic-name under strategy OfflinePartitionLeaderElectionStrategy(false)
ERROR [Controller id=2 epoch=145] Controller 2 epoch 145 failed to change state for partition topic from OfflinePartition to OnlinePartition (state.change.logger)
liveLeaders=(cp-kafka-headless.kafka:9092 (id: 0 rack: null))) to broker cp-kafka-headless.kafka:9092 (id: 0 rack: null). Reconnecting to broker. (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:253)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
but when I check from AKHQ everything is working fine, I can publish and consume messages from all 3 brokers.
How to check why Prometheus is not able to connect to broker?
Thanks