Null pointer exception on any command in broker

Hi,

I had a(nother) weird issue in our test cluster just now.
I was trying to look at the broker/topic default values to find the reason for our performance issue.

When connecting to the broker and running any command (kafka-configs, kafka-topics, anything), I got a Java NullPointerException:

sh-4.4$ ./kafka-topics --list
java.lang.NullPointerException
        at java.base/java.util.Objects.requireNonNull(Unknown Source)
        at java.base/sun.nio.fs.UnixFileSystem.getPath(Unknown Source)
        at java.base/java.nio.file.Path.of(Unknown Source)
        at java.base/java.nio.file.Paths.get(Unknown Source)
        at java.base/jdk.internal.platform.CgroupUtil.lambda$readStringValue$1(Unknown Source)
        at java.base/java.security.AccessController.doPrivileged(Unknown Source)
        at java.base/jdk.internal.platform.CgroupUtil.readStringValue(Unknown Source)
        at java.base/jdk.internal.platform.CgroupSubsystemController.getStringValue(Unknown Source)
        at java.base/jdk.internal.platform.CgroupSubsystemController.getLongValue(Unknown Source)
        at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getLongValue(Unknown Source)
        at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getHierarchical(Unknown Source)
        at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.initSubSystem(Unknown Source)
        at java.base/jdk.internal.platform.cgroupv1.CgroupV1Subsystem.getInstance(Unknown Source)
        at java.base/jdk.internal.platform.CgroupSubsystemFactory.create(Unknown Source)
        at java.base/jdk.internal.platform.CgroupSubsystemFactory.create(Unknown Source)
        at java.base/jdk.internal.platform.CgroupMetrics.getInstance(Unknown Source)
        at java.base/jdk.internal.platform.SystemMetrics.instance(Unknown Source)
        at java.base/jdk.internal.platform.Metrics.systemMetrics(Unknown Source)
        at java.base/jdk.internal.platform.Container.metrics(Unknown Source)
        at jdk.management/com.sun.management.internal.OperatingSystemImpl.<init>(Unknown Source)
        at jdk.management/com.sun.management.internal.PlatformMBeanProviderImpl.getOperatingSystemMXBean(Unknown Source)
        at jdk.management/com.sun.management.internal.PlatformMBeanProviderImpl$3.nameToMBeanMap(Unknown Source)
        at java.management/java.lang.management.ManagementFactory.lambda$getPlatformMBeanServer$0(Unknown Source)
        at java.base/java.util.stream.ReferencePipeline$7$1.accept(Unknown Source)
        at java.base/java.util.stream.ReferencePipeline$2$1.accept(Unknown Source)
        at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(Unknown Source)
        at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source)
        at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(Unknown Source)
        at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(Unknown Source)
        at java.base/java.util.stream.AbstractPipeline.evaluate(Unknown Source)
        at java.base/java.util.stream.ReferencePipeline.forEach(Unknown Source)
        at java.management/java.lang.management.ManagementFactory.getPlatformMBeanServer(Unknown Source)
        at jdk.management.agent/sun.management.jmxremote.ConnectorBootstrap.startLocalConnectorServer(Unknown Source)
        at jdk.management.agent/jdk.internal.agent.Agent.startLocalManagementAgent(Unknown Source)
        at jdk.management.agent/jdk.internal.agent.Agent.startAgent(Unknown Source)
        at jdk.management.agent/jdk.internal.agent.Agent.startAgent(Unknown Source)
Exception thrown by the agent : java.lang.NullPointerException
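
Every frame in the trace above is JDK code (the JMX agent starting up and reading cgroup v1 metrics); none of it is Kafka code. Below is a minimal sketch for exercising the same path outside Kafka, assuming Java 11 or newer. `CgroupProbe` and its method names are made up for illustration. `ManagementFactory.getPlatformMBeanServer()` is the call that shows up in the trace. Dumping `/proc/self/cgroup` and `/proc/self/mountinfo` rests on the assumption that the JDK's container detection reads those files.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Kafka or the JDK.
class CgroupProbe {

    // Returns the lines of `file` that contain `mustContain` (an empty string
    // matches every line). Returns an empty list if the file cannot be read,
    // for example on a non-Linux host.
    static List<String> readLines(String file, String mustContain) {
        try {
            return Files.readAllLines(Path.of(file)).stream()
                    .filter(line -> line.contains(mustContain))
                    .collect(Collectors.toList());
        } catch (IOException | SecurityException e) {
            return List.of();
        }
    }

    // Triggers the same initialization seen in the stack trace
    // (platform MBean server -> OperatingSystemMXBean -> container metrics).
    // Returns false and prints the stack trace if the JVM fails here.
    static boolean platformMBeansLoad() {
        try {
            ManagementFactory.getPlatformMBeanServer();
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            System.out.println("OperatingSystemMXBean loaded, CPUs=" + os.getAvailableProcessors());
            return true;
        } catch (RuntimeException | LinkageError e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("--- /proc/self/cgroup");
        readLines("/proc/self/cgroup", "").forEach(System.out::println);
        System.out.println("--- cgroup entries in /proc/self/mountinfo");
        readLines("/proc/self/mountinfo", "cgroup").forEach(System.out::println);
        System.out.println(platformMBeansLoad() ? "platform MBeans: OK" : "platform MBeans: FAILED");
    }
}
```

If the image ships a full JDK, this can be run as `java CgroupProbe.java` inside the affected container, with the same JVM the Kafka scripts use. If it fails the same way, the problem lies in how that JVM reads the container's cgroup setup rather than in the broker itself, and the two dumps are worth attaching to a bug report.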

I thought it might be a networking problem or some other error, so I checked that and the broker logs, but found nothing. I didn't see any other issue: the client app was running fine and metrics were being gathered fine.

I started searching the internet, but no really simple solution came up except “restarted everything and it worked again”, so that's what I did: I restarted the broker on the affected box.
And it helped; it's working fine again.
So the question now is: what the *** is the problem here?
I mean, a central component like Kafka should not become unstable by itself; that's not leaving a good impression.

I upgraded to "release": "7.8.1-37" sometime last week to see if it had any fixes for the perf issue (it didn't).

Any idea what might have happened here?
Thanks

Hi @Rand,
anything in the broker or controller logs?

best,
michael

Hi,

No.
The controller only logs this upon broker restart (how that can happen with identical images is still an unanswered question, btw):

[2025-02-14 14:00:33,764] WARN [QuorumController id=1] Broker 6 registered with feature metadata.version that is unknown to the controller (org.apache.kafka.controller.ClusterControlManager)
[2025-02-17 09:21:58,400] WARN [QuorumController id=1] Broker 4 registered with feature metadata.version that is unknown to the controller (org.apache.kafka.controller.ClusterControlManager)

Broker:

[2025-02-14 14:00:05,909] WARN [ReplicaFetcher replicaId=4, leaderId=6, fetcherId=0] Partition transientChatter-events-5 marked as failed (kafka.server.ReplicaFetcherThread)
[2025-02-14 14:00:05,909] WARN [ReplicaFetcher replicaId=4, leaderId=6, fetcherId=0] Partition transientChatter-events-0 marked as failed (kafka.server.ReplicaFetcherThread)
===> User
uid=1000(appuser) gid=1000(appuser) groups=1000(appuser)
===> Configuring ...
Running in KRaft mode...
SSL is enabled.
===> Running preflight checks ...
===> Check if /var/lib/kafka/data is writable ...
===> Running in KRaft mode, skipping Zookeeper health check...
===> Using provided cluster id <id> ...
2025-02-17 09:21:50.754 | main | INFO | io.prometheus.jmx.JavaAgent | Starting ...
2025-02-17 09:21:51.071 | main | INFO | io.prometheus.jmx.JavaAgent | HTTP enabled [true]
2025-02-17 09:21:51.071 | main | INFO | io.prometheus.jmx.JavaAgent | HTTP host:port [0.0.0.0:8091]
2025-02-17 09:21:51.071 | main | INFO | io.prometheus.jmx.JavaAgent | OpenTelemetry enabled [false]
2025-02-17 09:21:51.120 | main | INFO | io.prometheus.jmx.JavaAgent | Running ...
Log directory /data/cpkafka-data is already formatted. Use --ignore-formatted to ignore this directory and format the others.
===> Launching ...
===> Launching kafka ...
2025-02-17 09:21:52.230 | main | INFO | io.prometheus.jmx.JavaAgent | Starting ...
2025-02-17 09:21:52.540 | main | INFO | io.prometheus.jmx.JavaAgent | HTTP enabled [true]
2025-02-17 09:21:52.540 | main | INFO | io.prometheus.jmx.JavaAgent | HTTP host:port [0.0.0.0:8091]
2025-02-17 09:21:52.540 | main | INFO | io.prometheus.jmx.JavaAgent | OpenTelemetry enabled [false]
2025-02-17 09:21:52.589 | main | INFO | io.prometheus.jmx.JavaAgent | Running ...

Nothing indicating any issue at all…