Prometheus not publishing broker metrics

Hello,
Since few days we are facing a wired issue in our brokers. We have 3 brokers configuration and only leader broker pod is publishing metrics for other brokers we see below error in jmx-exporter container.

SEVERE: JMX scrape failed: java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException
: Connection refused to host: localhost; nested exception is:
        java.net.ConnectException: Connection refused (Connection refused)]
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:369)
        at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270)
        at io.prometheus.jmx.JmxScraper.doScrape(JmxScraper.java:94)
        at io.prometheus.jmx.JmxCollector.collect(JmxCollector.java:456)
        at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:183)
        at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.<init>(CollectorRegistry.java:147)
        at io.prometheus.client.CollectorRegistry.filteredMetricFamilySamples(CollectorRegistry.java:134)
        at io.prometheus.client.exporter.HTTPServer$HTTPMetricHandler.handle(HTTPServer.java:60)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
        at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
        at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
        java.net.ConnectException: Connection refused (Connection refused)]
        at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:136)
        at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:205)
        at javax.naming.InitialContext.lookup(InitialContext.java:417)
        at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1955)
        at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1922)
        at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287)
        ... 16 more
Caused by: java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
        java.net.ConnectException: Connection refused (Connection refused)
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
        at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
        at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
        at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:338)
        at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:112)
        at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:132)
        ... 21 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at java.net.Socket.connect(Socket.java:538)
        at java.net.Socket.<init>(Socket.java:434)
        at java.net.Socket.<init>(Socket.java:211)
        at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40)
        at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:148)
        at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613)
        ... 26 more

In brokers we see errors like

ERROR [Controller id=2 epoch=145] Controller 2 epoch 145 failed to change state for partition topic-name from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition topic-name under strategy OfflinePartitionLeaderElectionStrategy(false)

ERROR [Controller id=2 epoch=145] Controller 2 epoch 145 failed to change state for partition topic from OfflinePartition to OnlinePartition (state.change.logger)

liveLeaders=(cp-kafka-headless.kafka:9092 (id: 0 rack: null))) to broker cp-kafka-headless.kafka:9092 (id: 0 rack: null). Reconnecting to broker. (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:253)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

but when I check from AKHQ everything is working fine, I can publish and consume messages from all 3 brokers.

How to check why Prometheus is not able to connect to broker?

Thanks

when I check from AKHQ everything is working fine, I can publish and consume messages from all 3 brokers.

How to check why Prometheus is not able to connect to broker?

Prometheus doesn’t connect to kafka. It connects to the exporter. You will want to exec into the pod, run ps aux and look at the Java process args to verify the JVM arguments are defined as you expect. Then you can use curl to hit the exporter’s metrics endpoint on your own.

After that’s verified as working, go look at the ServiceMonitor definition.

We found the problem.
The issue is Prometheus getting timeout after 30 seconds and due to that metrics are not seen when we run Prometheus query. We increased the CPU for jmx-exporter container and now everything is working.